The current payload in use is a simple open/memfd_create/sendfile/fexecve program
I’ve never heard before about memfd_create or fexecve… that’s why this sentence caught my attention.
In this paper we are going to talk about how to use these functions to develop a super-stealthy dropper. You could consider it as a malware development tutorial… but you know that it is illegal to develop and also to deploy malware. This means that, this paper is only for educational purposes… because, after all, a malware analyst needs to know how malware developers do their stuff in order to identify it, neutralise it and do what is needed to keep systems safe.
memfd_create and fexecve
So, after reading that intriguing sentence, I googled those two functions and I saw they were pretty cool. The first one is actually pretty awesome, it allows us to create a file in memory. We have quickly talked about this in [a previous paper] (Running binaries without leaving tracks), but for that we were just using /dev/shm to store our file. That folder is actually stored in memory so, whatever we write there does not end up in the hard-drive (unless we run out of memory and we start swapping). However, the file was visible with a simple ls.
memfd_create does the same, but the memory disk it uses is not mapped into the file system and therefore you cannot find the file with a simple ls. 😮
The second one, fexecve is also pretty awesome. It allows us to execute a program (exactly the same way that execve), but we reference the program to run using a file descriptor, instead of the full path. And this one matches perfectly with memfd_create.
But there is a caveat with this function calls. They are relatively new. memfd_create was introduced in kernel 3.17 and fexecve is a libc function available since version 2.3.2. While, fexecve can be easily implemented when not available (we will see that in a sec), memfd_create is just not there on old kernels…
What does this means?. It means that, at least nowadays, the technique we are going to describe will not work on embedded devices that usually run old kernels and have stripped-down versions of libc. Although, I haven’t checked the availability of fexecve in for instance some routers or Android phones, I believe it is very likely that they are not available. If anybody knows, please drop a line in the comments.
A simple dropper
In order to figure out how these two little guys work, I wrote a simple dropper. Well, it is actually a program able to download some binary from a remote server and run it directly into memory, without dropping it in the disk.
Before continuing, let’s check the Hajime case we described [towards the end of this post] (IoT Malware Droppers (Mirai and Hajime)). There you will find a cryptic shell line that basically creates a file with execution permissions to drop into it another file which is downloaded from the net. Then the downloaded program gets executed and deleted from the disk. In case you don’t want to open the link again, this is the line I’m talking about:
It is pretty short and simple, isn’t it?. But there are a couple of things we have to say about it.
The first thing we have to comment is that, there is no libc wrapper to the memfd_create system call. You would find this information in the memfd_create manpage’s NOTES section 42. That means that we have to write our own wrapper.
First, we need to figure out the syscall index for memfd_create. Just use any on-line syscall list. Remember that the indexes changes with the architecture, so if you plan to use the code on an ARM or a MIPS, you may need to use a different index. The index we used 319 is for x86_64.
You can see the wrapper at the very beginning of the code (just after the #include directives), using the syscall libc function.
Then, the program just does the following:
Create a normal TCP socket
Connect to port 0x1111 on 127.0.0.1 using family AF_INET… We have packed all this information in a long variable to make the code shorter… but you can easily modify this information taking into account that:
addr = 0100007f 11110002;
1. 0. 0.12711110002;
IP Address | Port | Family
Of course this is not standard and whenever the struct sockaddr_in change, the code will break down… but it was cool to write it like this
Creates a memory file
Reads data from the socket and writes it into the memory file
Runs the memory file once all the data has been transferred.
That’s it… very simple and straightforward.
So, now it is time to test it. According to our long constant in the main function, the dropper will connect to port 0x1111 on localhost (127.0.0.1). So we will improvise a file server with netcat.
In one console we just run this command:
$ cat /usr/bin/xeyes | nc -l $((0x1111))
You can chose whatever binary you prefer. I like those little eyes following my mouse pointer all over the place
Then in another console we run the dropper, and those funny xeyes should pop-up in your screen. Let’s see which tracks we can find after running the remote code.
Detecting the dropper
Spotting the process is difficult because we have given it a funny name (kworker/u!0). Note the ! character that is just there to allow me to quickly identify the process for debugging purposes. In reality, you would like to use a : so the process looks like one of those kernel workers. But, let’s look at the ps output.
You can see the output for a legit kworker process in the first line, and then you find our doggy program in the second line… which is associated to a pseudo-terminal!!!.. I think this can be easily avoided… but I will leave this to you to sharp your UNIX development skills
However, even if you detach the process from the pseudo-terminal…
We mentioned that memfd_create will create a file in a RAM filesystem that is not mapped into the normal filesystem tree… at least, if it is mapped, I couldn’t find where. So far this looks like a pretty stealth way to drop a file!!
However, let’s face it, if there is a file somewhere, there should be a way to find it… shouldn’t it? Of course it is. But, when you are in this kind of troubles… who you gonna call?.. Sure… Ghostbusters!. And you know what?, for GNU/Linux systems the way to bust ghosts is using lsof :).
So, we can easily find any memfd file in the system using lsof. Note that lsof will also indicate the associated PID so we can also easily pin point the dropped process even when it is using some name camouflage and it is not associated to a pseudo-terminal!!!
What if memfd_open is not available?
We have mentioned that memfd_open is only available on kernels 3.17 or higher. What can be done for other kernels?. In this case we will be a bit less stealthy but we can still do pretty well.
Our best option in this case is to use shm_open (SHared Memory Open). This function basically creates a file under /dev/shm… however, this one will be visible with ls, but at least we avoid writing to the disk. The only difference between using shm_open or just open is that shm_open will create the files directly under /dev/shm. While, when using open we have to provide the whole path.
To modify the dropper to use shm_open we have to do two things.
First we have to substitute the memfd_create call by a shm_open call like this:
The second thing is that we need to close the file and re-open it read-only in order to be able to execute it with fexecve. So, after the while loop that populates the file we have to close and re-open the file:
if ((fd = shm_open("a", O_RDONLY, 0)) < 0) exit (1);
However note that, now it does not make much sense to use fexecve and we can avoid reopening the file read-only and just call execve on the file created at /dev/shm/ which is effectively the same and it is also shorter.
… and what if fexecve is not available?
This one is pretty easy, whenever you get to know how fexecve works. How can you figure out how the function works?.. just google for its source code!!!. A hint is provided in the man page tho:
On Linux, fexecve() is implemented using the proc(5) file system, so /proc needs to be mounted and available at the time of the call.
So, what it does is to just use execve but providing as file path the file descriptor entry under /proc. Let’s elaborate this a bit more. You know that each open file is identified by an integer and you also know that each process in your GNU/Linux system exports all its related information under the proc pseudo file system in a folder named against its PID (supposing the proc file system is mounted). Well, inside that folder you will find another folder named fd containing a file per each file descriptor opened by the process. Each file is named against its actual file descriptor, that is, the integer number.
Knowing all this, we can run a file identified by a file descriptor just passing the path to the right file under /proc/PID/fd/THE_FD to execve. A basic implementation of fexecve will look like this:
This implementation of fexecve is completely equivalent to the standard one… well it is missing some sanity checks but, after all, we’re living in the edge :P.
As mentioned before, this is very convenient to be used together with memfd_open that returns to us a file descriptor and does not require the close/open sequence. Otherwise, when there is a file somewhere, even in memory, it is even faster to just use execve as you can infer from the implementation above.
Well, this is it. Hope you have found this interesting. It was interesting for me. Now, after having read this paper, you should be able to figure out what the open/memfd_create/sendfile/fexecve we mentioned at the beginning means…
We have also seen a quite stealthy technique to drop files in a remote system. And we have also learn how to detect the dropper even when it may look invisible at first glance.
The main focus of this post, and particularly the associated table of artifacts, is to serve as a reference and reminder of what evidence sources may be available on a particular system during analysis.
On to the main event. The table below details some of the artifacts which evidence program execution and whether they are available for different versions of the Windows Operating System.
Too Small?… It’s a hyperlink!
Cells in Green are where the artifact is available by default, note some artifacts may not be available despite a Green cell (e.g. instances where prefetch is disabled due to an SSD)
Cells in yellow indicate that the artifact is associated with a feature that is disabled by default but that may be enabled by an administrator (e.g. Prefetch on a Windows Server OS) or added through the application of a patch or update (e.g. The introduction of BAM to Windows 10 in 1709+ or back-porting of Amcache to Windows 7 in the optional update KB2952664+)
Cells in Red indicate that the artifact is not available in that version of the OS.
Cells in Grey (containing «TBC») indicate that I’m not 100% sure at the time of writing whether the artifact is present in a particular OS version, that I have more work to do, and that it would be great if you could let me know if you already know the answer!
It is my hope that this table will be helpful to others. It will be updated and certainly at this stage it may be subject to errors as I am reliant upon research and memory of artifacts without having the opportunity to double check each entry through testing. Feedback, both in the form of suggested additions and any required corrections is very much appreciated and encouraged.
Summary of Artifacts
What follows below is brief details on the availability of these artifacts, some useful resources for additional information and tools for parsing them. It is not my intention to go into detail as to the functioning of the artifacts as this is generally already well covered within the references.
Prefetch has historically been the go to indication of process execution. If enabled, it can provide a wealth of useful data in an investigation or incident response. However, since Windows 7, systems with an SSD installed as the OS volume have had prefetch disabled by default during installation. With that said, I have seen plenty of systems with SSDs which have still had prefetch enabled (particularaly in businesses which push a standard image) so it is always worth checking for. Windows Server installations also have Prefetch disabled by default, but the same applies.
The following registry key can be used to determine if it is enabled:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters\EnablePrefetcher
0 = Disabled
1 = Only Application launch prefetching enabled
2 = Only Boot prefetching enabled
3 = Both Application launch and Boot prefetching enabled
It should be noted that the presence of an entry for an executable within the ShimCache doesn’t always mean it was executed as merely navigating to it can cause it to be listed. Additionally Windows XP ShimCache is limited to 96 entries all versions since then retain up to 1024 entries.
ShimCache has one further notable drawback. The information is retained in memory and is only written to the registry when the system is shutdown. Data can be retrieved from a memory image if available.
Amcache.hve within Windows 8+ and RecentFileCache.bcf within Windows 7 are two distinct artifacts which are used by the same mechanism in Windows to track application compatibility issues with different executables. As such it can be used to determine when executables were first run.
Both of these system logs are related to the Application Experience and Compatibility features implemented in modern versions of Windows.
At the time of testing I find none of my desktop systems have the Inventory log populated, while the Telemetry log seems to contain useful information. I have however seen various discussion online indicating that the Inventory log is populated in Windows 10. It is likely that my disabling of all tracking and reporting functions on my personal systems and VMs may be the cause… more testing required.
The Background Activity Monitor (BAM) and (DAM) registry keys within the SYSTEM registry hive, however as it records them under the SID of the associated user it is user attributable. The key details the path of executable files that have been executed and last execution date/time
It was introduced to Windows 10 in 1709 (Fall Creators update).
In Windows 10 1803 (April 2018) Update, Microsoft introduced the Timeline feature, and all forensicators did rejoice. This artifact is a goldmine for user activity analysis and the associated data is stored within an ActivitiesCache.db located within each users profile.
Event ID 7035 within the System event log is recorded by the Service Control Manager when a Service starts or stops. As such it can be an indication of execution if the associated process is registered as a service.
The RecentApps key is located in the NTUSER.DAT associated with each user and contains a record of their… Recent Applications. The presence of keys associated with a particular executable evidence the fact that this user ran the executable.
Implemented in Windows 7, Jumplists are a mechanism by which Windows records and presents recent documents and applications to users. Located within individual users profiles the presence of references to executable(s) within the ‘Recent\AutomaticDestinations’ can be used to evidence the fact that they were run by the user.
The RunMRU is a list of all commands typed into the Run box on the Start menu and is recorded within the NTUSER.DAT associated with each user. Commands referencing executables can be used to determine if, how and when the executable was run and which user account was associated with running it.
Hooking Linux Kernel Functions, how to Hook Functions with Ftrace
Ftrace is a Linux kernel framework for tracing Linux kernel functions. But our team managed to find a new way to use ftrace when trying to enable system activity monitoring to be able to block suspicious processes. It turns out that ftrace allows you to install hooks from a loadable GPL module without rebuilding the kernel. This approach works for Linux kernel versions 3.19 and higher for the x86_64 architecture.
A new approach: Using ftrace for Linux kernel hooking
What is an ftrace? Basically, ftrace is a framework used for tracing the kernel on the function level. This framework has been in development since 2008 and has quite an impressive feature set. What data can you usually get when you trace your kernel functions with ftrace? Linux ftrace displays call graphs, tracks the frequency and length of function calls, filters particular functions by templates, and so on. Further down this article you’ll find references to official documents and sources you can use to learn more about the capabilities of ftrace.
The implementation of ftrace is based on the compiler options -pg and -mfentry. These kernel options insert the call of a special tracing function — mcount() or __fentry__() — at the beginning of every function. In user programs, profilers use this compiler capability for tracking calls of all functions. In the kernel, however, these functions are used for implementing the ftrace framework.
Calling ftrace from every function is, of course, pretty costly. This is why there’s an optimization available for popular architectures — dynamic ftrace. If ftrace isn’t in use, it nearly doesn’t affect the system because the kernel knows where the calls mcount() or __fentry__() are located and replaces the machine code with nop (a specific instruction that does nothing) at an early stage. And when Linux kernel trace is on, ftrace calls are added back to the necessary functions.
Description of necessary functions
The following structure can be used for describing each hooked function:
* struct ftrace_hook describes the hooked function
* @name: the name of the hooked function
* @function: the address of the wrapper function that will be called instead
* of the hooked function
* @original: a pointer to the place where the address
* of the hooked function should be stored, filled out during installation
* of the hook
* @address: the address of the hooked function, filled out during installation
* of the hook
* @ops: ftrace service information, initialized by zeros;
* initialization is finished during installation of the hook
There are only three fields that the user needs to fill in: name, function, and original. The rest of the fields are considered to be implementation details. You can put the description of all hooked functions together and use macros to make the code more compact:
Now, hooked functions have a minimum of extra code. The only thing requiring special attention is the function signatures. They must be completely identical; otherwise, the arguments will be passed on incorrectly and everything will go wrong. This isn’t as important for hooking system calls, though, since their handlers are pretty stable and, for performance reasons, the system call ABI and function call ABI use the same layout of arguments in registers. However, if you’re going to hook other functions, remember that the kernel has no stable interfaces.
Our first step is finding and saving the hooked function address. As you probably know, when using ftrace, Linux kernel tracing can be performed by the function name. However, we still need to know the address of the original function in order to call it.
You can use kallsyms — a list of all kernel symbols — to get the address of the needed function. This list includes not only symbols exported for the modules but actually all symbols. This is what the process of getting the hooked function address can look like:
Next, we need to initialize the ftrace_ops structure. Here we have one necessary field, func, pointing to the callback. However, some critical flags are needed:
intfh_install_hook (structftrace_hook *hook)
err = resolve_hook_address(hook);
hook->ops.func = fh_ftrace_thunk;
hook->ops.flags = FTRACE_OPS_FL_SAVE_REGS
/* ... */
The fh_ftrace_thunk () feature is our callback that ftrace will call when tracing the function. We’ll talk about this callback later. The flags are needed for hooking — they command ftrace to save and restore the processor registers whose contents we’ll be able to change in the callback.
Now we’re ready to turn on the hook. First, we use ftrace_set_filter_ip() to turn on the ftrace utility for the needed function. Second, we use register_ftrace_function() to give ftrace permission to call our callback:
When the unregister_ftrace_function() call is over, it’s guaranteed that there won’t be any activations of the installed callback or our wrapper in the system. We can unload the hook module without worrying that our functions are still being executed somewhere in the system. Next, we provide a detailed description of the function hooking process.
Hooking functions with ftrace
So how can you configure kernel function hooking? The process is pretty simple: ftrace is able to alter the register state after exiting the callback. By changing the register %rip — a pointer to the next executed instruction — we can change the function executed by the processor. In other words, we can force the processor to make an unconditional jump from the current function to ours and take over control.
We get the address of struct ftrace_hook for our function using a macro container_of() and the address of struct ftrace_ops embedded in struct ftrace_hook. Next, we substitute the value of the register %rip in the struct pt_regs structure with our handler’s address. For architectures other than x86_64, this register can have a different name (like PC or IP). The basic idea, however, still applies.
Note that the notrace specifier added for the callback requires special attention. This specifier can be used for marking functions that are prohibited for Linux kernel tracing with ftrace. For instance, you can mark ftrace functions that are used in the tracing process. By using this specifier, you can prevent the system from hanging if you accidentally call a function from your ftrace callback that’s currently being traced by ftrace.
The ftrace callback is usually called with a disabled preemption (just like kprobes), although there might be some exceptions. But in our case, this limitation wasn’t important since we only needed to replace eight bytes of %rip value in the pt_regs structure.
Since the wrapper function and the original are executed in the same context, both functions have the same restrictions. For instance, if you hook an interrupt handler, then sleeping in the wrapper is still out of the question.
Protection from recursive calls
There’s one catch in the code we gave you before: when the wrapper calls the original function, the original function will be traced by ftrace again, thus causing an endless recursion. We came up with a pretty neat way of breaking this cycle by using parent_ip — one of the ftrace callback arguments that contains the return address to the function that called the hooked one. Usually, this argument is used for building function call graphs. However, we can use this argument to distinguish the first traced function call from the repeated calls.
The difference is significant: during the first call, the argument parent_ip will point to some place in the kernel, while during the repeated call it will only point inside our wrapper. You should pass control only during the first function call. All other calls must let the original function be executed.
We can run the entry test by comparing the address to the boundaries of the current module with our functions. However, this approach works only if the module doesn’t contain anything other than the wrapper that calls the hooked function. Otherwise, you’ll need to be more picky.
So this is what a correct ftrace callback looks like:
/* Skip the function calls from the current module. */
regs->ip = (unsigned long) hook->function;
This approach has three main advantages:
Low overhead costs. You need to perform only several comparisons and subtractions without grabbing any spinlocks or iterating through lists.
It doesn’t have to be global. Since there’s no synchronization, this approach is compatible with preemption and isn’t tied to the global process list. As a result, you can trace even interrupt handlers.
There are no limitations for functions. This approach doesn’t have the main kretprobes drawback and can support any number of trace function activations (including recursive) out of the box. During recursive calls, the return address is still located outside of our module, so the callback test works correctly.
In the next section, we take a more detailed look at the hooking process and describe how ftrace works.
The scheme of the hooking process
So, how does ftrace work? Let’s take a look at a simple example: you’ve typed the command Is in the terminal to see the list of files in the current directory. The command-line interpreter (say, Bash) launches a new process using the common functions fork() plus execve() from the standard C library. Inside the system, these functions are implemented through system calls clone() and execve() respectively. Let’s suggest that we hook the execve() system call to gain control over launching new processes.
Figure 1 below gives an ftrace example and illustrates the process of hooking a handler function.
Figure 1. Linux kernel hooking with ftrace.
In this image, we can see how a user process (blue) executes a system call to the kernel (red) where the ftrace framework (violet) calls functions from our module (green).
Below, we give a more detailed description of each step of the process:
The SYSCALL instruction is executed by the user process. This instruction allows switching to the kernel mode and puts the low-level system call handler entry_SYSCALL_64() in charge. This handler is responsible for all system calls of 64-bit programs on 64-bit kernels.
A specific handler receives control. The kernel accomplishes all low-level tasks implemented on the assembler pretty fast and hands over control to the high-level do_syscall_64 () function, which is written in C. This function reaches the system call handler table sys_call_table and calls a particular handler by the system call number. In our case, it’s the function sys_execve ().
Calling ftrace. There’s an __fentry__() function call at the beginning of every kernel function. This function is implemented by the ftrace framework. In the functions that don’t need to be traced, this call is replaced with the instruction nop. However, in the case of the sys_execve() function, there’s no such call.
Ftrace calls our callback. Ftrace calls all registered trace callbacks, including ours. Other callbacks won’t interfere since, at each particular place, only one callback can be installed that changes the value of the %rip register.
The callback performs the hooking. The callback looks at the value of parent_ip leading inside the do_syscall_64() function — since it’s the particular function that called the sys_execve() handler — and decides to hook the function, changing the values of the register %rip in the pt_regs structure.
Ftrace restores the state of the registers. Following the FTRACE_SAVE_REGS flag, the framework saves the register state in the pt_regs structure before it calls the handlers. When the handling is over, the registers are restored from the same structure. Our handler changes the register %rip — a pointer to the next executed function — which leads to passing control to a new address.
Wrapper function receives control. An unconditional jump makes it look like the activation of the sys_execve() function has been terminated. Instead of this function, control goes to our function, fh_sys_execve(). Meanwhile, the state of both processor and memory remains the same, so our function receives the arguments of the original handler and returns control to the do_syscall_64() function.
The original function is called by our wrapper. Now, the system call is under our control. After analyzing the context and arguments of the system call, the fh_sys_execve() function can either permit or prohibit execution. If execution is prohibited, the function returns an error code. Otherwise, the function needs to repeat the call to the original handler and sys_execve() is called again through the real_sys_execve pointer that was saved during the hook setup.
The callback gets control. Just like during the first call of sys_execve(), control goes through ftrace to our callback. But this time, the process ends differently.
The callback does nothing. The sys_execve() function was called not by the kernel from do_syscall_64() but by our fh_sys_execve() function. Therefore, the registers remain unchanged and the sys_execve() function is executed as usual. The only problem is that ftrace sees the entry to sys_execve() twice.
The wrapper gets back control. The system call handler sys_execve() gives control to our fh_sys_execve() function for the second time. Now, the launch of a new process is nearly finished. We can see if the execve() call finished with an error, study the new process, make some notes to the log file, and so on.
The kernel receives control. Finally, the fh_sys_execve() function is finished and control returns to the do_syscall_64() function. The function sees the call as one that was completed normally, and the kernel proceeds as usual.
Control goes to the user process. In the end, the kernel executes the IRET instruction (or SYSRET, but for execve() there can be only IRET), installing the registers for a new user process and switching the processor into user code execution mode. The system call is over and so is the launch of the new process.
As you can see, the process of hooking Linux kernel function calls with ftrace isn’t that complex.
Even though the main purpose of ftrace is to trace Linux kernel function calls rather than hook them, our innovative approach turned out to be both simple and effective. However, the approach we describe above works only for kernel versions 3.19 and higher and only for the x86_64 architecture.
In the final part of our series, we’ll tell you about the main ftrace pros and cons and some unexpected surprises that might be waiting for you if you decide to implement this approach. Meanwhile, you can read about another unusual solution for installing hooks — by using the GCC attribute constructor with LD_PRELOAD.
What Are the Main Pros and Cons of Ftrace?
Ftrace is a Linux utility that ’s usually used for tracing kernel functions. But as we looked for a useful solution that would allow us to enable system activity monitoring and block suspicious processes, we discovered that Linux ftrace can also be used for hooking function calls.
Pros and cons of using ftrace
Ftrace makes hooking Linux kernel functions much easier and has several crucial advantages.
A mature API and simple code. Leveraging ready-to-use interfaces in the kernel significantly reduces code complexity. You can hook your kernel functions with ftrace by making only a couple of function calls, filling in two structure fields, and adding a bit of magic in the callback. The rest of the code is just business logic executed around the traced function.
Ability to trace any function by name. Linux kernel tracing with ftrace is quite a simple process – writing the function name in a regular string is enough to point to the one you need. You don’t need to struggle with the linker, scan the memory, or investigate internal kernel data structures. As long as you know their names, you can trace your kernel functions with ftrace even if those functions aren’t exported for the modules.
But just like the other approaches that we’ve described in this series, ftrace has a couple of drawbacks.
Kernel configuration requirements. There are several kernel requirements needed to ensure successful ftrace Linux kernel tracing:
The list of kallsyms symbols for searching functions by name
The ftrace framework as a whole for performing tracing
Ftrace options crucial for hooking functions
All these features can be disabled in the kernel configuration since they aren’t critical for the system’s functioning. Usually, however, the kernels used by popular distributions still contain all these kernel options as they don’t affect system performance significantly and may be useful for debugging. Still, you’d better keep these requirements in mind in case you need to support some particular kernels.
Overhead costs. Since ftrace doesn’t use breakpoints, it has lower overhead costs than kprobes. However, the overhead costs are higher than for splicing manually. In fact, dynamic ftrace is a variation of splicing which executes the unneeded ftrace code and other callbacks.
Functions are wrapped as a whole. Just as with usual splicing, ftrace wraps the functions as a whole. And while splicing technically can be executed in any part of the function, ftrace works only at the entry point. You can see this limitation as a disadvantage, but usually it doesn’t cause any complications.
Double ftrace calls. As we’ve explained before, using the parent_ip pointer for analysis leads to calling ftrace twice for the same hooked function. This adds some overhead costs and can disrupt the readings of other traces because they’ll see twice as many calls. This issue can be fixed by moving the original function address five bytes further (the length of the call instruction) so you can basically spring over ftrace.
Let’s take a closer look at some of these disadvantages.
Kernel configuration requirements
The kernel has to support both ftrace and kallsyms. This requires enabling two configuration options:
Next, ftrace has to support a dynamic register modification, which is the responsibility of the following option:
To access the FTRACE_OPS_FL_IPMODIFY flag, the kernel you use has to be based on version 3.19 or higher. Older kernel versions can still modify the register %rip, but from version 3.19, this register can be modified only after setting the flag. In older versions of the kernel, the presence of this flag will lead to a compilation error. For newer versions, the absence of this flag means a non-operating hook.
Last but not least, we need to pay attention to the ftrace call location inside the function. The ftrace call must be located at the beginning of the function, before the function prologue (where the stack frame is formed and the space for local variables is allocated). The following option takes this feature into account:
While the x86_64 architecture does support this option, the i386 architecture doesn’t. The compiler can’t insert an ftrace call before the function prologue due to ABI limitations of the i386 architecture. As a result, by the time you perform an ftrace call the function stack has already been modified, and changing the value of the register isn’t enough for hooking the function. You’ll also need to undo the actions executed in the prologue, which differ from function to function.
This is why ftrace function hooking doesn’t support a 32-bit x86 architecture. In theory, you can still implement this approach by generating and executing an anti-prologue, for instance, but it’ll significantly boost the technical complexity.
Unexpected surprises when using ftrace
At the testing stage, we faced one particular peculiarity: hooking functions on some distributions led to the permanent hanging of the system. Of course, this problem occurred only on systems that were different from those used by our developers. We also couldn’t reproduce the problem with the initial hooking prototype on any distributions or kernel versions.
According to debugging, the system got stuck inside the hooked function. For some unknown reason, the parent_ip still pointed to the kernel instead of the function wrapper when calling the original function inside the ftrace callback. This launched an endless loop wherein ftrace called our wrapper again and again while doing nothing useful.
Fortunately, we had both working and broken code and eventually discovered what was causing the problem. When we unified the code and got rid of the pieces we didn’t need at the moment, we narrowed down the differences between the two versions of the wrapper function code.
How can the logging level possibly affect system behavior? Surprisingly enough, when we took a closer look at the machine code of these two functions, it became obvious that the reason behind these problems was the compiler.
It turns out that the pr_devel() calls are expanded into no-op. This printk-macro version is used for logging at the development stage. And since these logs pose no interest at the operating stage, the system simply cuts them out of the code automatically unless you activate the DEBUG macro. After that, the compiler sees the function like this:
And this is where optimizations take the stage. In our case, the so-called tail call optimization was activated. If a function calls another and returns its value immediately, this optimization lets the compiler replace a function call instruction with a cheaper direct jump to the function’s body. This is what this call looks like in machine code:
0: e8 00 00 00 00 callq 5 <fh_sys_execve+0x5>
5: ff 15 00 00 00 00 callq *0x0(%rip)
b: f3 c3 repz retq </fh_sys_execve>
And this is an example of the broken call:
0: e8 00 00 00 00 callq 5 <fh_sys_execve+0x5>
5: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax
c: ff e0 jmpq *%rax </fh_sys_execve>
The first CALL instruction is the exact same __fentry__() call that the compiler inserts at the beginning of all functions. But after that, the broken and the stable code act differently. In the stable code, we can see the real_sys_execve call (via a pointer stored in memory) performed by the CALL instruction, which is followed by fh_sys_execve() with the help of the RET instruction. In the broken code, however, there’s a direct jump to the real_sys_execve() function performed by JMP.
The tail call optimization allows you to save some time by not allocating a useless stack frame that includes the return address that the CALL instruction stores in the stack. But since we’re using parent_ip to decide whether we need to hook, the accuracy of the return address is crucial for us. After optimization, the fh_sys_execve() function doesn’t save the new address on the stack anymore, so there’s only the old one leading to the kernel. And this is why the parent_ip keeps pointing inside the kernel and that endless loop appears in the first place.
This is also the main reason why the problem appeared only on some distributions. Different distributions use different sets of compilation flags for compiling the modules. And in all the problem distributions, the tail call optimization was active by default.
We managed to solve this problem by turning off tail call optimization for the entire file with the wrapper functions:
For further hooking experiments, you can use the full kernel module code from GitHub.
While developers typically use ftrace to trace Linux kernel function calls, this utility showed itself to be rather useful for hooking Linux kernel functions as well. And even though this approach has some disadvantages, it gives you one crucial benefit: overall simplicity of both the code and the hooking process.