# Modlishka — An Open Source Phishing Tool With 2FA Authentication

Modlishka is a flexible and powerful reverse proxy, that will take your phishing campaigns to the next level (with minimal effort required from your side).
Enjoy 🙂

Features
Some of the most important ‘Modlishka’ features :

• Support for majority of 2FA authentication schemes (by design).
• No website templates (just point Modlishka to the target domain — in most cases, it will be handled automatically).
• Full control of «cross» origin TLS traffic flow from your victims browsers.
• Flexible and easily configurable phishing scenarios through configuration options.
• Pattern based JavaScript payload injection.
• Striping website from all encryption and security headers (back to 90’s MITM style).
• User credential harvesting (with context based on URL parameter passed identifiers).
• Can be extended with your ideas through plugins.
• Stateless design. Can be scaled up easily for an arbitrary number of users — ex. through a DNS load balancer.
• Web panel with a summary of collected credentials and user session impersonation (beta).
• Written in Go.

Action
«A picture is worth a thousand words»:
Modlishka in action against an example 2FA (SMS) enabled authentication scheme:

Note: google.com was chosen here just as a POC.

Installation
Latest source code version can be fetched from here (zip) or here (tar).
Fetch the code with ‘go get’ :

$go get -u github.com/drk1wi/Modlishka Compile the binary and you are ready to go: $ cd $GOPATH/src/github.com/drk1wi/Modlishka/$ make
# ./dist/proxy -h

Usage of ./dist/proxy:

-cert string
base64 encoded TLS certificate

-certKey string
base64 encoded TLS certificate key

-certPool string
base64 encoded Certification Authority certificate

-config string
JSON configuration file. Convenient instead of using command line switches.

-credParams string

-debug
Print debug information

-disableSecurity
Disable security features like anti-SSRF. Disable at your own risk.

-jsRules string
Comma separated list of URL patterns and JS base64 encoded payloads that will be injected.

-listeningPort string
Listening port (default "443")

-log string
Local file to which fetched requests will be written (appended)

-phishing string
Phishing domain to create - Ex.: target.co

-plugins string
Comma seperated list of enabled plugin names (default "all")

-postOnly
Log only HTTP POST requests

-rules string
Comma separated list of 'string' patterns and their replacements.

-target string
Main target to proxy - Ex.: https://target.com

-targetRes string
Comma separated list of target subdomains that need to pass through the  proxy

-terminateTriggers string
Comma separated list of URLs from target's origin which will trigger session termination

-terminateUrl string
URL to redirect the client after session termination triggers

-tls
Enable TLS (default false)

Name of the HTTP cookie used to track the victim (default "id")

-trackingParam string
Name of the HTTP parameter used to track the victim (default "id")

Usage

• Check out the wiki page for a more detailed overview of the tool usage.
• Blog post

Credits
Thanks for helping with the code go to Giuseppe Trotta (@Giutro)

Реклама

# Introduction

Hello and welcome back to the fifth part of the “Hypervisor From Scratch” tutorial series. Today we will be configuring our previously allocated Virtual Machine Control Structure (VMCS) and in the last, we execute VMLAUNCH and enter to our hardware-virtualized world! Before reading the rest of this part, you have to read the previous parts as they are really dependent.

The full source code of this tutorial is available on GitHub :

[https://github.com/SinaKarvandi/Hypervisor-From-Scratch]

Most of this topic derived from Chapter 24 – (VIRTUAL MACHINE CONTROL STRUCTURES) & Chapter 26 – (VM ENTRIES) available at Intel 64 and IA-32 architectures software developer’s manual combined volumes 3. Of course, for more information, you can read the manual as well.

• Introduction
• VMX Instructions
• VMPTRST
• VMCLEAR
• VMPTRLD
• Enhancing VM State Structure
• Preparing to launch VM
• VMX Configurations
• Saving a return point
• Returning to the previous state
• VMLAUNCH
• VMX Controls
• VM-Execution Controls
• VM-entry Control Bits
• VM-exit Control Bits
• PIN-Based Execution Control
• Interruptibility State
• Configuring VMCS
• Gathering Machine state for VMCS
• Setting up VMCS
• Checking VMCS Layout
• VM-Exit Handler
• Resume to next instruction
• VMRESUME
• Let’s Test it!
• Conclusion
• References

This part is highly inspired from Hypervisor For Beginner and some of methods are exactly like what implemented in that project.

# VMX Instructions

In part 3, we implemented VMXOFF function now let’s implement other VMX instructions function. I also make some changes in calling VMXON and VMPTRLD functions to make it more modular.

# VMPTRST

VMPTRST stores the current-VMCS pointer into a specified memory address. The operand of this instruction is always 64 bits and it’s always a location in memory.

The following function is the implementation of VMPTRST:

 12345678910 UINT64 VMPTRST(){    PHYSICAL_ADDRESS vmcspa;    vmcspa.QuadPart = 0;    __vmx_vmptrst((unsigned __int64 *)&vmcspa);     DbgPrint(«[*] VMPTRST %llx\n», vmcspa);     return 0;}

# VMCLER

This instruction applies to the VMCS which VMCS region resides at the physical address contained in the instruction operand. The instruction ensures that VMCS data for that VMCS (some of these data may be currently maintained on the processor) are copied to the VMCS region in memory. It also initializes some parts of the VMCS region (for example, it sets the launch state of that VMCS to clear).

 123456789101112131415 BOOLEAN Clear_VMCS_State(IN PVirtualMachineState vmState) {     // Clear the state of the VMCS to inactive    int status = __vmx_vmclear(&vmState->VMCS_REGION);     DbgPrint(«[*] VMCS VMCLAEAR Status is : %d\n», status);    if (status)    {        // Otherwise terminate the VMX        DbgPrint(«[*] VMCS failed to clear with status %d\n», status);        __vmx_off();        return FALSE;    }    return TRUE;}

# VMPTRLD

It marks the current-VMCS pointer valid and loads it with the physical address in the instruction operand. The instruction fails if its operand is not properly aligned, sets unsupported physical-address bits, or is equal to the VMXON pointer. In addition, the instruction fails if the 32 bits in memory referenced by the operand do not match the VMCS revision identifier supported by this processor.

 12345678910 BOOLEAN Load_VMCS(IN PVirtualMachineState vmState) {     int status = __vmx_vmptrld(&vmState->VMCS_REGION);    if (status)    {        DbgPrint(«[*] VMCS failed with status %d\n», status);        return FALSE;    }    return TRUE;}

In order to implement VMRESUME you need to know about some VMCS fields so the implementation of VMRESUME is after we implement VMLAUNCH. (Later in this topic)

# Enhancing VM State Structure

As I told you in earlier parts, we need a structure to save the state of our virtual machine in each core separately. The following structure is used in the newest version of our hypervisor, each field will be described in the rest of this topic.

 123456789 typedef struct _VirtualMachineState{    UINT64 VMXON_REGION;                    // VMXON region    UINT64 VMCS_REGION;                     // VMCS region    UINT64 EPTP;                            // Extended-Page-Table Pointer    UINT64 VMM_Stack;                       // Stack for VMM in VM-Exit State    UINT64 MSRBitMap;                       // MSRBitMap Virtual Address    UINT64 MSRBitMapPhysical;               // MSRBitMap Physical Address} VirtualMachineState, *PVirtualMachineState;

Note that its not the final _VirtualMachineState structure and we’ll enhance it in future parts.

# Preparing to launch VM

In this part, we’re just trying to test our hypervisor in our driver, in the future parts we add some user-mode interactions with our driver so let’s start with modifying our DriverEntry as it’s the first function that executes when our driver is loaded.

Below all the preparation from Part 2, we add the following lines to use our Part 4 (EPT) structures :

 123 // Initiating EPTP and VMX PEPTP EPTP = Initialize_EPTP(); Initiate_VMX();

I added an export to a global variable called “VirtualGuestMemoryAddress” that holds the address of where our guest code starts.

Now let’s fill our allocated pages with \xf4 which stands for HLT instruction. I choose HLT because with some special configuration (described below) it’ll cause VM-Exit and return the code to the Host handler.

Let’s create a function which is responsible for running our virtual machine on a specific core.

 1 void LaunchVM(int ProcessorID , PEPTP EPTP);

I set the ProcessorID to 0, so we’re in the 0th logical processor.

Keep in mind that every logical core has its own VMCS and if you want your guest code to run in other logical processor, you should configure them separately.

Now we should set the affinity to the specific logical processor using Windows KeSetSystemAffinityThread function and make sure to choose the specific core’s vmState as each core has its own separate VMXON and VMCS region.

Now, we should allocate a specific stack so that every time a VM-Exit occurs then we can save the registers and calling other Host functions.

I prefer to allocate a separate location for stack instead of using current RSP of the driver but you can use current stack (RSP) too.

The following lines are for allocating and zeroing the stack of our VM-Exit handler.

 12345678910 // Allocate stack for the VM Exit Handler. UINT64 VMM_STACK_VA = ExAllocatePoolWithTag(NonPagedPool, VMM_STACK_SIZE, POOLTAG); vmState[ProcessorID].VMM_Stack = VMM_STACK_VA;  if (vmState[ProcessorID].VMM_Stack == NULL) { DbgPrint(«[*] Error in allocating VMM Stack.\n»); return; } RtlZeroMemory(vmState[ProcessorID].VMM_Stack, VMM_STACK_SIZE);

Same as above, allocating a page for MSR Bitmap and adding it to vmState, I’ll describe about them later in this topic.

 1234567891011 // Allocate memory for MSRBitMap vmState[ProcessorID].MSRBitMap = MmAllocateNonCachedMemory(PAGE_SIZE);  // should be aligned if (vmState[ProcessorID].MSRBitMap == NULL) { DbgPrint(«[*] Error in allocating MSRBitMap.\n»); return; } RtlZeroMemory(vmState[ProcessorID].MSRBitMap, PAGE_SIZE); vmState[ProcessorID].MSRBitMapPhysical = VirtualAddress_to_PhysicalAddress(vmState[ProcessorID].MSRBitMap);

Now it’s time to clear our VMCS state and load it as the current VMCS in the specific processor (in our case the 0th logical processor).

The Clear_VMCS_State and Load_VMCS are described above :

 123456789101112 // Clear the VMCS State if (!Clear_VMCS_State(&vmState[ProcessorID])) { goto ErrorReturn; }  // Load VMCS (Set the Current VMCS) if (!Load_VMCS(&vmState[ProcessorID])) { goto ErrorReturn; }

Now it’s time to setup VMCS, A detailed explanation of VMCS setup is available later in this topic.

 1234 DbgPrint(«[*] Setting up VMCS.\n»); Setup_VMCS(&vmState[ProcessorID], EPTP);

The last step is to execute the VMLAUNCH but we shouldn’t forget about saving the current state of the stack (RSP & RBP) because during the execution of Guest code and after returning from VM-Exit, we have to now the current state and return from it. It’s because if you leave the driver with wrong RSP & RBP then you definitely see a BSOD.

 12 Save_VMXOFF_State();

# Saving a return point

For Save_VMXOFF_State() , I declared two global variables called g_StackPointerForReturningg_BasePointerForReturning. No need to save RIP as the return address is always available in the stack. Just EXTERN it in the assembly file :

 123 EXTERN g_StackPointerForReturning:QWORDEXTERN g_BasePointerForReturning:QWORD

The implementation of Save_VMXOFF_State :

 123456 Save_VMXOFF_State PROC PUBLICMOV g_StackPointerForReturning,rspMOV g_BasePointerForReturning,rbpret Save_VMXOFF_State ENDP

# Returning to the previous state

As we saved the current state, if we want to return to the previous state, we have to restore RSP & RBP and clear the stack position and eventually a RET instruction. (I Also add a VMXOFF because it should be executed before return.)

 123456789101112131415161718192021222324 Restore_To_VMXOFF_State PROC PUBLIC VMXOFF  ; turn it off before existing MOV rsp, g_StackPointerForReturningMOV rbp, g_BasePointerForReturning ; make rsp point to a correct return pointADD rsp,8 ; return Truexor rax,raxmov rax,1 ; return section mov     rbx, [rsp+28h+8h]mov     rsi, [rsp+28h+10h]add     rsp, 020hpop     rdi ret Restore_To_VMXOFF_State ENDP

The “return section” is defined like this because I saw the return section of LaunchVM in IDA Pro.

One important thing that can’t be easily ignored from the above picture is I have such a gorgeous, magnificent & super beautiful IDA PRO theme. I always proud of myself for choosing themes like this !

# VMLAUNCH

Now it’s time to executed the VMLAUNCH.

 12345678910 __vmx_vmlaunch();  // if VMLAUNCH succeed will never be here ! ULONG64 ErrorCode = 0; __vmx_vmread(VM_INSTRUCTION_ERROR, &ErrorCode); __vmx_off(); DbgPrint(«[*] VMLAUNCH Error : 0x%llx\n», ErrorCode); DbgBreakPoint();

As the comment describes, if we VMLAUNCH succeed we’ll never execute the other lines. If there is an error in the state of VMCS (which is a common problem) then we have to run VMREAD and read the error code from VM_INSTRUCTION_ERROR field of VMCS, also VMXOFF and print the error. DbgBreakPoint is just a debug breakpoint (int 3) and it can be useful only if you’re working with a remote kernel Windbg Debugger. It’s clear that you can’t test it in your system because executing a cc in the kernel will freeze your system as long as there is no debugger to catch it so it’s highly recommended to create a remote Kernel Debugging machine and test your codes.

Also, It can’t be tested on a remote VMWare debugging (and other virtual machine debugging tools) because nested VMX is not supported in current Intel processors.

Remember we’re still in LaunchVM function and __vmx_vmlaunch() is the intrinsic function for VMLAUNCH & __vmx_vmread is for VMREAD instruction.

Now it’s time to read some theories before configuring VMCS.

# VM-Execution Controls

In order to control our guest features, we have to set some fields in our VMCS. The following tables represent the Primary Processor-Based VM-Execution Controls and Secondary Processor-Based VM-Execution Controls.

We define the above table like this:

 123456789101112131415161718192021 #define CPU_BASED_VIRTUAL_INTR_PENDING        0x00000004#define CPU_BASED_USE_TSC_OFFSETING           0x00000008#define CPU_BASED_HLT_EXITING                 0x00000080#define CPU_BASED_INVLPG_EXITING              0x00000200#define CPU_BASED_MWAIT_EXITING               0x00000400#define CPU_BASED_RDPMC_EXITING               0x00000800#define CPU_BASED_RDTSC_EXITING               0x00001000#define CPU_BASED_CR3_LOAD_EXITING            0x00008000#define CPU_BASED_CR3_STORE_EXITING           0x00010000#define CPU_BASED_CR8_LOAD_EXITING            0x00080000#define CPU_BASED_CR8_STORE_EXITING           0x00100000#define CPU_BASED_TPR_SHADOW                  0x00200000#define CPU_BASED_VIRTUAL_NMI_PENDING         0x00400000#define CPU_BASED_MOV_DR_EXITING              0x00800000#define CPU_BASED_UNCOND_IO_EXITING           0x01000000#define CPU_BASED_ACTIVATE_IO_BITMAP          0x02000000#define CPU_BASED_MONITOR_TRAP_FLAG           0x08000000#define CPU_BASED_ACTIVATE_MSR_BITMAP         0x10000000#define CPU_BASED_MONITOR_EXITING             0x20000000#define CPU_BASED_PAUSE_EXITING               0x40000000#define CPU_BASED_ACTIVATE_SECONDARY_CONTROLS 0x80000000

In the earlier versions of VMX, there is nothing like Secondary Processor-Based VM-Execution Controls. Now if you want to use the secondary table you have to set the 31st bit of the first table otherwise it’s like the secondary table field with zeros.

The definition of the above table is this (we ignore some bits, you can define them if you want to use them in your hypervisor):

 12345 #define CPU_BASED_CTL2_ENABLE_EPT            0x2#define CPU_BASED_CTL2_RDTSCP                0x8#define CPU_BASED_CTL2_ENABLE_VPID            0x20#define CPU_BASED_CTL2_UNRESTRICTED_GUEST    0x80#define CPU_BASED_CTL2_ENABLE_VMFUNC        0x2000

# VM-entry Control Bits

The VM-entry controls constitute a 32-bit vector that governs the basic operation of VM entries.

 12345 // VM-entry Control Bits #define VM_ENTRY_IA32E_MODE             0x00000200#define VM_ENTRY_SMM                    0x00000400#define VM_ENTRY_DEACT_DUAL_MONITOR     0x00000800#define VM_ENTRY_LOAD_GUEST_PAT         0x00004000

# VM-exit Control Bits

The VM-exit controls constitute a 32-bit vector that governs the basic operation of VM exits.

 12345 // VM-exit Control Bits #define VM_EXIT_IA32E_MODE              0x00000200#define VM_EXIT_ACK_INTR_ON_EXIT        0x00008000#define VM_EXIT_SAVE_GUEST_PAT          0x00040000#define VM_EXIT_LOAD_HOST_PAT           0x00080000

# PIN-Based Execution Control

The pin-based VM-execution controls constitute a 32-bit vector that governs the handling of asynchronous events (for example: interrupts). We’ll use it in the future parts, but for now let define it in our Hypervisor.

 123456 // PIN-Based Execution#define PIN_BASED_VM_EXECUTION_CONTROLS_EXTERNAL_INTERRUPT                 0x00000001#define PIN_BASED_VM_EXECUTION_CONTROLS_NMI_EXITING                         0x00000004#define PIN_BASED_VM_EXECUTION_CONTROLS_VIRTUAL_NMI                         0x00000010#define PIN_BASED_VM_EXECUTION_CONTROLS_ACTIVE_VMX_TIMER                 0x00000020 #define PIN_BASED_VM_EXECUTION_CONTROLS_PROCESS_POSTED_INTERRUPTS        0x00000040

# Interruptibility State

The guest-state area includes the following fields that characterize guest state but which do not correspond to processor registers:
Activity state (32 bits). This field identifies the logical processor’s activity state. When a logical processor is executing instructions normally, it is in the active state. Execution of certain instructions and the occurrence of certain events may cause a logical processor to transition to an inactive state in which it ceases to execute instructions.
The following activity states are defined:
— 0: Active. The logical processor is executing instructions normally.

— 1: HLT. The logical processor is inactive because it executed the HLT instruction.
— 2: Shutdown. The logical processor is inactive because it incurred a triple fault1 or some other serious error.
— 3: Wait-for-SIPI. The logical processor is inactive because it is waiting for a startup-IPI (SIPI).

• Interruptibility state (32 bits). The IA-32 architecture includes features that permit certain events to be blocked for a period of time. This field contains information about such blocking. Details and the format of this field are given in Table below.

# Gathering Machine state for VMCS

In order to configure our Guest-State & Host-State we need to have details about current system state, e.g Global Descriptor Table Address, Interrupt Descriptor Table Add and Read all the Segment Registers.

These functions describe how all of these data can be gathered.

GDT Base :

 123456 Get_GDT_Base PROC    LOCAL   gdtr[10]:BYTE    sgdt    gdtr    mov     rax, QWORD PTR gdtr[2]    retGet_GDT_Base ENDP

CS segment register:

 1234 GetCs PROC    mov     rax, cs    retGetCs ENDP

DS segment register:

 1234 GetDs PROC    mov     rax, ds    retGetDs ENDP

ES segment register:

 1234 GetEs PROC    mov     rax, es    retGetEs ENDP

SS segment register:

 1234 GetSs PROC    mov     rax, ss    retGetSs ENDP

FS segment register:

 1234 GetFs PROC    mov     rax, fs    retGetFs ENDP

GS segment register:

 1234 GetGs PROC    mov     rax, gs    retGetGs ENDP

LDT:

 1234 GetLdtr PROC    sldt    rax    retGetLdtr ENDP

 1234 GetTr PROC    str rax    retGetTr ENDP

Interrupt Descriptor Table:

 1234567 Get_IDT_Base PROC    LOCAL   idtr[10]:BYTE     sidt    idtr    mov     rax, QWORD PTR idtr[2]    retGet_IDT_Base ENDP

GDT Limit:

 1234567 Get_GDT_Limit PROC    LOCAL   gdtr[10]:BYTE     sgdt    gdtr    mov     ax, WORD PTR gdtr[0]    retGet_GDT_Limit ENDP

IDT Limit:

 1234567 Get_IDT_Limit PROC    LOCAL   idtr[10]:BYTE     sidt    idtr    mov     ax, WORD PTR idtr[0]    retGet_IDT_Limit ENDP

RFLAGS:

 12345 Get_RFLAGS PROC    pushfq    pop     rax    retGet_RFLAGS ENDP

# Setting up VMCS

Let’s get down to business (We have a long way to go).

This section starts with defining a function called Setup_VMCS.

 1 BOOLEAN Setup_VMCS(IN PVirtualMachineState vmState, IN PEPTP EPTP);

This function is responsible for configuring all of the options related to VMCS and of course the Guest & Host state.

These task needs a special instruction called “VMWRITE”.

VMWRITE, writes the contents of a primary source operand (register or memory) to a specified field in a VMCS. In VMX root operation, the instruction writes to the current VMCS. If executed in VMX non-root operation, the instruction writes to the VMCS referenced by the VMCS link pointer field in the current VMCS.

The VMCS field is specified by the VMCS-field encoding contained in the register secondary source operand.

Ok, let’s continue with our configuration.

The next step is configuring host Segment Registers.

 1234567 __vmx_vmwrite(HOST_ES_SELECTOR, GetEs() & 0xF8); __vmx_vmwrite(HOST_CS_SELECTOR, GetCs() & 0xF8); __vmx_vmwrite(HOST_SS_SELECTOR, GetSs() & 0xF8); __vmx_vmwrite(HOST_DS_SELECTOR, GetDs() & 0xF8); __vmx_vmwrite(HOST_FS_SELECTOR, GetFs() & 0xF8); __vmx_vmwrite(HOST_GS_SELECTOR, GetGs() & 0xF8); __vmx_vmwrite(HOST_TR_SELECTOR, GetTr() & 0xF8);

Keep in mind, those fields that start with HOST_ are related to the state in which the hypervisor sets whenever a VM-Exit occurs and those which start with GUEST_ are related to to the state in which the hypervisor sets for guest when a VMLAUNCH executed.

The purpose of & 0xF8 is that Intel mentioned that the three less significant bits must be cleared and otherwise it leads to error when you execute VMLAUNCH with Invalid Host State error.

 12 // Setting the link pointer to the required value for 4KB VMCS. __vmx_vmwrite(VMCS_LINK_POINTER, ~0ULL);

The rest of this topic, intends to perform the VMX instructions in the current state of machine, so must of the guest and host configurations should be the same. In the future parts we’ll configure them to a separate guest layout.

Let’s configure GUEST_IA32_DEBUGCTL.

The IA32_DEBUGCTL MSR provides bit field controls to enable debug trace interrupts, debug trace stores, trace messages enable, single stepping on branches, last branch record recording, and to control freezing of LBR stack.

In short : LBR is a mechanism that provides processor with some recording of registers.

We don’t use them but let’s configure them to the current machine’s MSR_IA32_DEBUGCTL and you can see that __readmsr is the intrinsic function for RDMSR.

For configuring TSC you should modify the following values, I don’t have a precise explanation about it, so let them be zeros.

Note that, values that we put Zero on them can be ignored and if you don’t modify them, it’s like you put zero on them.

 123456789101112 /* Time-stamp counter offset */ __vmx_vmwrite(TSC_OFFSET, 0); __vmx_vmwrite(TSC_OFFSET_HIGH, 0);  __vmx_vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0); __vmx_vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0);  __vmx_vmwrite(VM_EXIT_MSR_STORE_COUNT, 0); __vmx_vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);  __vmx_vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0); __vmx_vmwrite(VM_ENTRY_INTR_INFO_FIELD, 0);

This time, we’ll configure Segment Registers and other GDT for our Host (When VM-Exit occurs).

 12345678910 GdtBase = Get_GDT_Base();  FillGuestSelectorData((PVOID)GdtBase, ES, GetEs()); FillGuestSelectorData((PVOID)GdtBase, CS, GetCs()); FillGuestSelectorData((PVOID)GdtBase, SS, GetSs()); FillGuestSelectorData((PVOID)GdtBase, DS, GetDs()); FillGuestSelectorData((PVOID)GdtBase, FS, GetFs()); FillGuestSelectorData((PVOID)GdtBase, GS, GetGs()); FillGuestSelectorData((PVOID)GdtBase, LDTR, GetLdtr()); FillGuestSelectorData((PVOID)GdtBase, TR, GetTr());

Get_GDT_Base is defined above, in the process of gathering information for our VMCS.

FillGuestSelectorData is responsible for setting the GUEST selector, attributes, limit, and base for VMCS. It implemented as below :

 123456789101112131415161718192021 void FillGuestSelectorData( __in PVOID GdtBase, __in ULONG Segreg, __in USHORT Selector){ SEGMENT_SELECTOR SegmentSelector = { 0 }; ULONG            uAccessRights;  GetSegmentDescriptor(&SegmentSelector, Selector, GdtBase); uAccessRights = ((PUCHAR)& SegmentSelector.ATTRIBUTES)[0] + (((PUCHAR)& SegmentSelector.ATTRIBUTES)[1] << 12);  if (!Selector) uAccessRights |= 0x10000;  __vmx_vmwrite(GUEST_ES_SELECTOR + Segreg * 2, Selector); __vmx_vmwrite(GUEST_ES_LIMIT + Segreg * 2, SegmentSelector.LIMIT); __vmx_vmwrite(GUEST_ES_AR_BYTES + Segreg * 2, uAccessRights); __vmx_vmwrite(GUEST_ES_BASE + Segreg * 2, SegmentSelector.BASE); }

The function body for GetSegmentDescriptor :

 123456789101112131415161718192021222324252627282930313233 BOOLEAN GetSegmentDescriptor(IN PSEGMENT_SELECTOR SegmentSelector, IN USHORT Selector, IN PUCHAR GdtBase){ PSEGMENT_DESCRIPTOR SegDesc;  if (!SegmentSelector) return FALSE;  if (Selector & 0x4) { return FALSE; }  SegDesc = (PSEGMENT_DESCRIPTOR)((PUCHAR)GdtBase + (Selector & ~0x7));  SegmentSelector->SEL = Selector; SegmentSelector->BASE = SegDesc->BASE0 | SegDesc->BASE1 << 16 | SegDesc->BASE2 << 24; SegmentSelector->LIMIT = SegDesc->LIMIT0 | (SegDesc->LIMIT1ATTR1 & 0xf) << 16; SegmentSelector->ATTRIBUTES.UCHARs = SegDesc->ATTR0 | (SegDesc->LIMIT1ATTR1 & 0xf0) << 4;  if (!(SegDesc->ATTR0 & 0x10)) { // LA_ACCESSED ULONG64 tmp; // this is a TSS or callgate etc, save the base high part tmp = (*(PULONG64)((PUCHAR)SegDesc + 8)); SegmentSelector->BASE = (SegmentSelector->BASE & 0xffffffff) | (tmp << 32); }  if (SegmentSelector->ATTRIBUTES.Fields.G) { // 4096-bit granularity is enabled for this segment, scale the limit SegmentSelector->LIMIT = (SegmentSelector->LIMIT << 12) + 0xfff; }  return TRUE;}

Also, there is another MSR called IA32_KERNEL_GS_BASE that is used to set the kernel GS base. whenever you run instructions like SYSCALL and enter to the ring 0, you need to change the current GS register and that can be done using SWAPGS. This instruction copies the content of IA32_KERNEL_GS_BASE into the IA32_GS_BASE and now it’s used in the kernel when you want to re-enter user-mode, you should change the user-mode GS Base. MSR_FS_BASE on the other hand, don’t have a kernel base because it used in 32-Bit mode while you have a 64-bit (long mode) kernel.

The GUEST_INTERRUPTIBILITY_INFO & GUEST_ACTIVITY_STATE.

 12 __vmx_vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0); __vmx_vmwrite(GUEST_ACTIVITY_STATE, 0);   //Active state

Now we reach to the most important part of our VMCS and it’s the configuration of CPU_BASED_VM_EXEC_CONTROL and SECONDARY_VM_EXEC_CONTROL.

These fields enable and disable some important features of guest, e.g you can configure VMCS to cause a VM-Exit whenever an execution of HLT instruction detected (in Guest). Please check the VM-Execution Controls parts above for a detailed description.

 123 __vmx_vmwrite(CPU_BASED_VM_EXEC_CONTROL, AdjustControls(CPU_BASED_HLT_EXITING | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS, MSR_IA32_VMX_PROCBASED_CTLS)); __vmx_vmwrite(SECONDARY_VM_EXEC_CONTROL, AdjustControls(CPU_BASED_CTL2_RDTSCP /* | CPU_BASED_CTL2_ENABLE_EPT*/, MSR_IA32_VMX_PROCBASED_CTLS2));

As you can see we set CPU_BASED_HLT_EXITING that will cause the VM-Exit on HLT and activate secondary controls using CPU_BASED_ACTIVATE_SECONDARY_CONTROLS.

In the secondary controls, we used CPU_BASED_CTL2_RDTSCP and for now comment CPU_BASED_CTL2_ENABLE_EPT because we don’t need to deal with EPT in this part. In the future parts, I describe using EPT or Extended Page Table that we configured in the 4th part.

The description of PIN_BASED_VM_EXEC_CONTROLVM_EXIT_CONTROLS and VM_ENTRY_CONTROLS is available above but for now, let zero them.

Also, the AdjustControls is defined like this:

 123456789 ULONG AdjustControls(IN ULONG Ctl, IN ULONG Msr){ MSR MsrValue = { 0 };  MsrValue.Content = __readmsr(Msr); Ctl &= MsrValue.High;     /* bit == 0 in high word ==> must be zero */ Ctl |= MsrValue.Low;      /* bit == 1 in low word  ==> must be one  */ return Ctl;}

Next step is setting Control Register for guest and host, we set them to the same value using intrinsic functions.

The next part is setting up IDT and GDT’s Base and Limit for our guest.

 1234 __vmx_vmwrite(GUEST_GDTR_BASE, Get_GDT_Base()); __vmx_vmwrite(GUEST_IDTR_BASE, Get_IDT_Base()); __vmx_vmwrite(GUEST_GDTR_LIMIT, Get_GDT_Limit()); __vmx_vmwrite(GUEST_IDTR_LIMIT, Get_IDT_Limit());

Set the RFLAGS.

 1 __vmx_vmwrite(GUEST_RFLAGS, Get_RFLAGS());

If you want to use SYSENTER in your guest then you should configure the following MSRs. It’s not important to set these values in x64 Windows because Windows doesn’t support SYSENTER in x64 versions of Windows, It uses SYSCALL instead and for 32-bit processes, first change the current execution mode to long-mode (using Heaven’s Gate technique) but in 32-bit processors these fields are mandatory.

Don’t forget to configure HOST_FS_BASEHOST_GS_BASEHOST_GDTR_BASEHOST_IDTR_BASEHOST_TR_BASE.

 12345678 GetSegmentDescriptor(&SegmentSelector, GetTr(), (PUCHAR)Get_GDT_Base()); __vmx_vmwrite(HOST_TR_BASE, SegmentSelector.BASE);  __vmx_vmwrite(HOST_FS_BASE, __readmsr(MSR_FS_BASE)); __vmx_vmwrite(HOST_GS_BASE, __readmsr(MSR_GS_BASE));  __vmx_vmwrite(HOST_GDTR_BASE, Get_GDT_Base()); __vmx_vmwrite(HOST_IDTR_BASE, Get_IDT_Base());

The next important part is to set the RIP and RSP of the guest when a VMLAUNCH executes it starts with RIP you configured in this part and RIP and RSP of the host when a VM-Exit occurs. It’s pretty clear that Host RIP should point to a function that is responsible for managing VMX Events based on return code and decide to execute a VMRESUME or turn off hypervisor using VMXOFF.

 123456789 // left here just for test __vmx_vmwrite(0, (ULONG64)VirtualGuestMemoryAddress);     //setup guest sp __vmx_vmwrite(GUEST_RIP, (ULONG64)VirtualGuestMemoryAddress);     //setup guest ip    __vmx_vmwrite(HOST_RSP, ((ULONG64)vmState->VMM_Stack + VMM_STACK_SIZE — 1)); __vmx_vmwrite(HOST_RIP, (ULONG64)VMExitHandler);

HOST_RSP points to VMM_Stack that we allocated above and HOST_RIP points to VMExitHandler (an assembly written function that described below). GUEST_RIP points to VirtualGuestMemoryAddress(the global variable that we configured during EPT initialization) and GUEST_RSP to zero because we don’t put any instruction that uses stack so for a real-world example it should point to writeable different address.

Setting these fields to a Host Address will not cause a problem as long as we have a same CR3 in our guest state so all the addresses are mapped exactly the same as the host.

Done ! Our VMCS is almost ready.

# Checking VMCS Layout

Unfortunatly, checking VMCS Layout is not as straight as the other parts, you have to control all the checklists described in [CHAPTER 26] VM ENTRIES from Intel’s 64 and IA-32 Architectures Software Developer’s Manual including the following sections:

• 26.2 CHECKS ON VMX CONTROLS AND HOST-STATE AREA
• 26.5 EVENT INJECTION
• 26.6 SPECIAL FEATURES OF VM ENTRY
• 26.8 MACHINE-CHECK EVENTS DURING VM ENTRY

The hardest part of this process is when you have no idea about the incorrect part of your VMCS layout or on the other hand when you miss something that eventually causes the failure.

This is because Intel just gives an error number without any further details about what’s exactly wrong in your VMCS Layout.

The errors shown below.

To solve this problem, I created a user-mode application called VmcsAuditor. As its name describes, if you have any error and don’t have any idea about solving the problem then it can be a choice.

Keep in mind that VmcsAuditor is a tool based on Bochs emulator support for VMX so all the checks come from Bochs and it’s not a 100% reliable tool that solves all the problem as we don’t know what exactly happening inside processor but it can be really useful and time saver.

The source code and executable files available on GitHub :

[https://github.com/SinaKarvandi/VMCS-Auditor]

Further description available here.

# VM-Exit Handler

When our guest software exits and give the handle back to the host, its VM-exit reasons can be defined in the following definitions.

 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960 #define EXIT_REASON_EXCEPTION_NMI       0#define EXIT_REASON_EXTERNAL_INTERRUPT  1#define EXIT_REASON_TRIPLE_FAULT        2#define EXIT_REASON_INIT                3#define EXIT_REASON_SIPI                4#define EXIT_REASON_IO_SMI              5#define EXIT_REASON_OTHER_SMI           6#define EXIT_REASON_PENDING_VIRT_INTR   7#define EXIT_REASON_PENDING_VIRT_NMI    8#define EXIT_REASON_TASK_SWITCH         9#define EXIT_REASON_CPUID               10#define EXIT_REASON_GETSEC              11#define EXIT_REASON_HLT                 12#define EXIT_REASON_INVD                13#define EXIT_REASON_INVLPG              14#define EXIT_REASON_RDPMC               15#define EXIT_REASON_RDTSC               16#define EXIT_REASON_RSM                 17#define EXIT_REASON_VMCALL              18#define EXIT_REASON_VMCLEAR             19#define EXIT_REASON_VMLAUNCH            20#define EXIT_REASON_VMPTRLD             21#define EXIT_REASON_VMPTRST             22#define EXIT_REASON_VMREAD              23#define EXIT_REASON_VMRESUME            24#define EXIT_REASON_VMWRITE             25#define EXIT_REASON_VMXOFF              26#define EXIT_REASON_VMXON               27#define EXIT_REASON_CR_ACCESS           28#define EXIT_REASON_DR_ACCESS           29#define EXIT_REASON_IO_INSTRUCTION      30#define EXIT_REASON_MSR_READ            31#define EXIT_REASON_MSR_WRITE           32#define EXIT_REASON_INVALID_GUEST_STATE 33#define EXIT_REASON_MSR_LOADING         34#define EXIT_REASON_MWAIT_INSTRUCTION   36#define EXIT_REASON_MONITOR_TRAP_FLAG   37#define EXIT_REASON_MONITOR_INSTRUCTION 39#define EXIT_REASON_PAUSE_INSTRUCTION   40#define EXIT_REASON_MCE_DURING_VMENTRY  41#define EXIT_REASON_TPR_BELOW_THRESHOLD 43#define EXIT_REASON_APIC_ACCESS         44#define EXIT_REASON_ACCESS_GDTR_OR_IDTR 46#define EXIT_REASON_ACCESS_LDTR_OR_TR   47#define EXIT_REASON_EPT_VIOLATION       48#define EXIT_REASON_EPT_MISCONFIG       49#define EXIT_REASON_INVEPT              50#define EXIT_REASON_RDTSCP              51#define EXIT_REASON_VMX_PREEMPTION_TIMER_EXPIRED     52#define EXIT_REASON_INVVPID             53#define EXIT_REASON_WBINVD              54#define EXIT_REASON_XSETBV              55#define EXIT_REASON_APIC_WRITE          56#define EXIT_REASON_RDRAND              57#define EXIT_REASON_INVPCID             58#define EXIT_REASON_RDSEED              61#define EXIT_REASON_PML_FULL            62#define EXIT_REASON_XSAVES              63#define EXIT_REASON_XRSTORS             64#define EXIT_REASON_PCOMMIT             65

VMX Exit handler should be a pure assembly function because calling a compiled function needs some preparing and some register modification and the most important thing in VMX Handler is saving the registers state so that you can continue, other time.

I create a sample function for saving the registers and returning the state but in this function we call another C function.

 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061 PUBLIC VMExitHandler  EXTERN MainVMExitHandler:PROCEXTERN VM_Resumer:PROC .code _text VMExitHandler PROC     push r15    push r14    push r13    push r12    push r11    push r10    push r9    push r8            push rdi    push rsi    push rbp    push rbp ; rsp    push rbx    push rdx    push rcx    push rax    mov rcx, rsp ;GuestRegs sub rsp, 28h  ;rdtsc call MainVMExitHandler add rsp, 28h    pop rax    pop rcx    pop rdx    pop rbx    pop rbp ; rsp    pop rbp    pop rsi    pop rdi     pop r8    pop r9    pop r10    pop r11    pop r12    pop r13    pop r14    pop r15   sub rsp, 0100h ; to avoid error in future functions JMP VM_Resumer  VMExitHandler ENDP end

The main VM-Exit handler is a switch-case function that has different decisions over the VMCS VM_EXIT_REASON and EXIT_QUALIFICATION.

In this part, we’re just performing an action over EXIT_REASON_HLT and just print the result and restore the previous state.

From the following code, you can clearly see what event cause the VM-exit. Just keep in mind that some reasons only lead to VM-Exit if the VMCS’s control execution fields (described above) allows for it. For instance, the execution of HLT in guest software will cause VM-Exit if the 7th bit of the Primary Processor-Based VM-Execution Controls allows it.

 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293 VOID MainVMExitHandler(PGUEST_REGS GuestRegs){ ULONG ExitReason = 0; __vmx_vmread(VM_EXIT_REASON, &ExitReason);   ULONG ExitQualification = 0; __vmx_vmread(EXIT_QUALIFICATION, &ExitQualification);  DbgPrint(«\nVM_EXIT_REASION 0x%x\n», ExitReason & 0xffff); DbgPrint(«\EXIT_QUALIFICATION 0x%x\n», ExitQualification);   switch (ExitReason) { // // 25.1.2  Instructions That Cause VM Exits Unconditionally // The following instructions cause VM exits when they are executed in VMX non-root operation: CPUID, GETSEC, // INVD, and XSETBV. This is also true of instructions introduced with VMX, which include: INVEPT, INVVPID, // VMCALL, VMCLEAR, VMLAUNCH, VMPTRLD, VMPTRST, VMRESUME, VMXOFF, and VMXON. //  case EXIT_REASON_VMCLEAR: case EXIT_REASON_VMPTRLD: case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD: case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE: case EXIT_REASON_VMXOFF: case EXIT_REASON_VMXON: case EXIT_REASON_VMLAUNCH: { break; } case EXIT_REASON_HLT: { DbgPrint(«[*] Execution of HLT detected… \n»);  // DbgBreakPoint();  // that’s enough for now 😉 Restore_To_VMXOFF_State();  break; } case EXIT_REASON_EXCEPTION_NMI: { break; }  case EXIT_REASON_CPUID: { break; }  case EXIT_REASON_INVD: { break; }  case EXIT_REASON_VMCALL: { break; }  case EXIT_REASON_CR_ACCESS: { break; }  case EXIT_REASON_MSR_READ: { break; }  case EXIT_REASON_MSR_WRITE: { break; }  case EXIT_REASON_EPT_VIOLATION: { break; }  default: { // DbgBreakPoint(); break;  } }}

# Resume to next instruction

If a VM-Exit occurs (e.g the guest executed a CPUID instruction), the guest RIP remains constant and it’s up to you to change the Guest RIP or not so if you don’t have a special function for managing this situation then you execute a VMRESUME and it’s like an infinite loop of executing CPUID and VMRESUME because you didn’t change the RIP.

In order to solve this problem you have to read a VMCS field called VM_EXIT_INSTRUCTION_LEN that stores the length of the instruction that caused the VM-Exit so you have to first, read the GUEST current RIP, second the VM_EXIT_INSTRUCTION_LEN and third add it to GUEST RIP. Now your GUEST RIP points to the next instruction and you’re good to go.

The following function is for this purpose.

 12345678910111213 VOID ResumeToNextInstruction(VOID){ PVOID ResumeRIP = NULL; PVOID CurrentRIP = NULL; ULONG ExitInstructionLength = 0;  __vmx_vmread(GUEST_RIP, &CurrentRIP); __vmx_vmread(VM_EXIT_INSTRUCTION_LEN, &ExitInstructionLength);  ResumeRIP = (PCHAR)CurrentRIP + ExitInstructionLength;  __vmx_vmwrite(GUEST_RIP, (ULONG64)ResumeRIP);}

# VMRESUME

VMRESUME is like VMLAUNCH but it’s used in order to resume the Guest.

• VMLAUNCH fails if the launch state of current VMCS is not “clear”. If the instruction is successful, it sets the launch state to “launched.”
• VMRESUME fails if the launch state of the current VMCS is not “launched.”

So it’s clear that if you executed VMLAUNCH before, then you can’t use it anymore to resume to the Guest code and in this condition VMRESUME is used.

The following code is the implementation of VMRESUME.

 12345678910111213141516 VOID VM_Resumer(VOID){  __vmx_vmresume();  // if VMRESUME succeed will never be here !  ULONG64 ErrorCode = 0; __vmx_vmread(VM_INSTRUCTION_ERROR, &ErrorCode); __vmx_off(); DbgPrint(«[*] VMRESUME Error : 0x%llx\n», ErrorCode);  // It’s such a bad error because we don’t where to go ! // prefer to break DbgBreakPoint();}

# Let’s Test it !

Well, we have done with configuration and now its time to run our driver using OSR Driver Loader, as always, first you should disable driver signature enforcement then run your driver.

As you can see from the above picture (in launching VM area), first we set the current logical processor to 0, next we clear our VMCS status using VMCLEAR instruction then we set up our VMCS layout and finally execute a VMLAUNCH instruction.

Now, our guest code is executed and as we configured our VMCS to exit on the execution of HLT(CPU_BASED_HLT_EXITING), so it’s successfully executed and our VM-EXIT handler function called, then it calls the main VM-Exit handler and as the VMCS exit reason is 0xc (EXIT_REASON_HLT), our VM-Exit handler detects an execution of HLT in guest and now it captures the execution.

After that our machine state saving mechanism executed and we successfully turn off hypervisor using VMXOFF and return to the first caller with a successful (RAX = 1) status.

That’s it ! Wasn’t it easy ?!

# Conclusion

In this part, we get familiar with configuring Virtual Machine Control Structure and finally run our guest code. The future parts would be an enhancement to this configuration like entering protected-mode,interrupt injectionpage modification logging, virtualizing the current machine and so on thus making sure to visit the blog more frequently for future parts and if you have any question or problem you can use the comments section below.

# References

[1] Vol 3C – Chapter 24 – (VIRTUAL MACHINE CONTROL STRUCTURES) (https://software.intel.com/en-us/articles/intel-sdm)

[2] Vol 3C – Chapter 26 – (VM ENTRIES) (https://software.intel.com/en-us/articles/intel-sdm)

[3] Segmentation (https://wiki.osdev.org/Segmentation)

[4] x86 memory segmentation (https://en.wikipedia.org/wiki/X86_memory_segmentation)

[5] VmcsAuditor – A Bochs-Based Hypervisor Layout Checker (https://rayanfam.com/topics/vmcsauditor-a-bochs-based-hypervisor-layout-checker/)

[6] Rohaaan/Hypervisor For Beginners (https://github.com/rohaaan/hypervisor-for-beginners)

[7] SWAPGS — Swap GS Base Register (https://www.felixcloutier.com/x86/SWAPGS.html)

[8] Knockin’ on Heaven’s Gate – Dynamic Processor Mode Switching (http://rce.co/knockin-on-heavens-gate-dynamic-processor-mode-switching/)

## Hypervisor From Scratch – Part 4: Address Translation Using Extended Page Table (EPT)

Welcome to the fourth part of the “Hypervisor From Scratch”. This part is primarily about translating guest address through Extended Page Table (EPT) and its implementation. We also see how shadow tables work and other cool stuff.

First of all, make sure to read the earlier parts before reading this topic as these parts are really dependent on each other also you should have a basic understanding of paging mechanism and how page tables work. A good article is here for paging tables.

Most of this topic derived from  Chapter 28 – (VMX SUPPORT FOR ADDRESS TRANSLATION) available at Intel 64 and IA-32 architectures software developer’s manual combined volumes 3.

The full source code of this tutorial is available on GitHub :

[https://github.com/SinaKarvandi/Hypervisor-From-Scratch]

Before starting, I should give my thanks to Petr Beneš, as this part would never be completed without his help.

# Introduction

Second Level Address Translation (SLAT) or nested paging, is an extended layer in the paging mechanism that is used to map hardware-based virtualization virtual addresses into the physical memory.

AMD implemented SLAT through the Rapid Virtualization Indexing (RVI) technology known as Nested Page Tables (NPT) since the introduction of its third-generation Opteron processors and microarchitecture code name BarcelonaIntel also implemented SLAT in Intel® VT-x technologiessince the introduction of microarchitecture code name Nehalem and its known as Extended Page Table (EPT) and is used in  Core i9, Core i7, Core i5, and Core i3 processors.

ARM processors also have some kind of implementation known as known as Stage-2 page-tables.

There are two methods, the first one is Shadow Page Tables and the second one is Extended Page Tables.

# Software-assisted paging (Shadow Page Tables)

Shadow page tables are used by the hypervisor to keep track of the state of physical memory in which the guest thinks that it has access to physical memory but in the real world, the hardware prevents it to access hardware memory otherwise it will control the host and it is not what it intended to be.

In this case, VMM maintains shadow page tables that map guest-virtual pages directly to machine pages and any guest modifications to V->P tables synced to VMM V->M shadow page tables.

By the way, using Shadow Page Table is not recommended today as always lead to VMM traps (which result in a vast amount of VM-Exits) and losses the performance due to the TLB flush on every switch and another caveat is that there is a memory overhead due to shadow copying of guest page tables.

# Hardware-assisted paging (Extended Page Table)

To reduce the complexity of Shadow Page Tables and avoiding the excessive vm-exits and reducing the number of TLB flushes, EPT, a hardware-assisted paging strategy implemented to increase the performance.

According to a VMware evaluation paper: “EPT provides performance gains of up to 48% for MMU-intensive benchmarks and up to 600% for MMU-intensive microbenchmarks”.

EPT implemented one more page table hierarchy, to map Guest-Virtual Address to Guest-Physical address which is valid in the main memory.

In EPT,

• One page table is maintained by guest OS, which is used to generate the guest-physical address.
• The other page table is maintained by VMM, which is used to map guest physical address to host physical address.

so for each memory access operation, EPT MMU directly gets the guest physical address from the guest page table and then gets the host physical address by the VMM mapping table automatically.

# Extended Page Table vs Shadow Page Table

EPT:

• Appropriate to programs that have a large amount of page table miss when executing
• Less chance to exit VM (less context switch)
• Two-layer EPT
• Means each access needs to walk two tables
• Easier to develop
• Many particular registers
• Hardware helps guest OS to notify the VMM

SPT:

• Only walk when SPT entry miss
• Appropriate to programs that would access only some addresses frequently
• Every access might be intercepted by VMM (many traps)
• One reference
• Fast and convenient when page hit
• Hard to develop
• Two-layer structure
• Complicated reverse map
• Permission emulation

# Detecting Support for EPT, NPT

If you want to see whether your system supports EPT on Intel processor or NPT on AMD processor without using assembly (CPUID), you can download coreinfo.exe from Sysinternals, then run it. The last line will show you if your processor supports EPT or NPT.

# EPT Translation

EPT defines a layer of address translation that augments the translation of linear addresses.

The extended page-table mechanism (EPT) is a feature that can be used to support the virtualization of physical memory. When EPT is in use, certain addresses that would normally be treated as physical addresses (and used to access memory) are instead treated as guest-physical addresses. Guest-physical addresses are translated by traversing a set of EPT paging structures to produce physical addresses that are used to access memory.

EPT is used when the “enable EPT” VM-execution control is 1. It translates the guest-physical addresses used in VMX non-root operation and those used by VM entry for event injection.

EPT translation is exactly like regular paging translation but with some minor differences. In paging, the processor translates Virtual Address to Physical Address while in EPT translation you want to translate a Guest Virtual Address to Host Physical Address.

If you’re familiar with paging, the 3rd control register (CR3) is the base address of PML4 Table (in an x64 processor or more generally it points to root paging directory), in EPT guest is not aware of EPT Translation so it has CR3 too but this CR3 is used to convert Guest Virtual Address to Guest Physical Address, whenever you find your target Guest Physical Address, it’s EPT mechanism that treats your Guest Physical Address like a virtual address and the EPTP is the CR3

Just think about the above sentence one more time!

So your target physical address should be divided into 4 part, the first 9 bits points to EPT PML4E (note that PML4 base address is in EPTP), the second 9 bits point the EPT PDPT Entry (the base address of PDPT comes from EPT PML4E), the third 9 bits point to EPT PD Entry (the base address of PD comes from EPT PDPTE) and the last 9 bit of the guest physical address point to an entry in EPT PT table (the base address of PT comes form EPT PDE) and now the EPT PT Entry points to the host physical address of the corresponding page.

You might ask, as a simple Virtual to Physical Address translation involves accessing 4 physical address, so what happens ?!

The answer is the processor internally translates all tables physical address one by one, that’s why paging and accessing memory in a guest software is slower than regular address translation. The following picture illustrates the operations for a Guest Virtual Address to Host Physical Address.

If you want to think about x86 EPT virtualization,  assume, for example, that CR4.PAE = CR4.PSE = 0. The translation of a 32-bit linear address then operates as follows:

• Bits 31:22 of the linear address select an entry in the guest page directory located at the guest-physical address in CR3. The guest-physical address of the guest page-directory entry (PDE) is translated through EPT to determine the guest PDE’s physical address.
• Bits 21:12 of the linear address select an entry in the guest page table located at the guest-physical address in the guest PDE. The guest-physical address of the guest page-table entry (PTE) is translated through EPT to determine the guest PTE’s physical address.
• Bits 11:0 of the linear address is the offset in the page frame located at the guest-physical address in the guest PTE. The guest physical address determined by this offset is translated through EPT to determine the physical address to which the original linear address translates.

Note that PAE stands for Physical Address Extension which is a memory management feature for the x86 architecture that extends the address space and PSE stands for Page Size Extension that refers to a feature of x86 processors that allows for pages larger than the traditional 4 KiB size.

In addition to translating a guest-physical address to a host physical address, EPT specifies the privileges that software is allowed when accessing the address. Attempts at disallowed accesses are called EPT violations and cause VM-exits.

Keep in mind that address never translates through EPT, when there is no access. That your guest-physical address is never used until there is access (Read or Write) to that location in memory.

# Implementing Extended Page Table (EPT)

Now that we know some basics, let’s implement what we’ve learned before. Based on Intel manual we should write (VMWRITE) EPTP or Extended-Page-Table Pointer to the VMCS. The EPTP structure described below.

The above tables can be described using the following structure :

 123456789101112 // See Table 24-8. Format of Extended-Page-Table Pointertypedef union _EPTP { ULONG64 All; struct { UINT64 MemoryType : 3; // bit 2:0 (0 = Uncacheable (UC) — 6 = Write — back(WB)) UINT64 PageWalkLength : 3; // bit 5:3 (This value is 1 less than the EPT page-walk length) UINT64 DirtyAndAceessEnabled : 1; // bit 6  (Setting this control to 1 enables accessed and dirty flags for EPT) UINT64 Reserved1 : 5; // bit 11:7 UINT64 PML4Address : 36; UINT64 Reserved2 : 16; }Fields;}EPTP, *PEPTP;

Each entry in all EPT tables is 64 bit long. EPT PML4E and EPT PDPTE and EPT PD are the same but EPT PTE has some minor differences.

An EPT entry is something like this :

Ok, Now we should implement tables and the first table is PML4. The following table shows the format of an EPT PML4 Entry (PML4E).

PML4E can be a structure like this :

 1234567891011121314151617 // See Table 28-1. typedef union _EPT_PML4E { ULONG64 All; struct { UINT64 Read : 1; // bit 0 UINT64 Write : 1; // bit 1 UINT64 Execute : 1; // bit 2 UINT64 Reserved1 : 5; // bit 7:3 (Must be Zero) UINT64 Accessed : 1; // bit 8 UINT64 Ignored1 : 1; // bit 9 UINT64 ExecuteForUserMode : 1; // bit 10 UINT64 Ignored2 : 1; // bit 11 UINT64 PhysicalAddress : 36; // bit (N-1):12 or Page-Frame-Number UINT64 Reserved2 : 4; // bit 51:N UINT64 Ignored3 : 12; // bit 63:52 }Fields;}EPT_PML4E, *PEPT_PML4E;

As long as we want to have a 4-level paging, the second table is EPT Page-Directory-Pointer-Table (PDTP), the following picture illustrates the format of PDPTE :

PDPTE’s structure is like this :

 1234567891011121314151617 // See Table 28-3typedef union _EPT_PDPTE { ULONG64 All; struct { UINT64 Read : 1; // bit 0 UINT64 Write : 1; // bit 1 UINT64 Execute : 1; // bit 2 UINT64 Reserved1 : 5; // bit 7:3 (Must be Zero) UINT64 Accessed : 1; // bit 8 UINT64 Ignored1 : 1; // bit 9 UINT64 ExecuteForUserMode : 1; // bit 10 UINT64 Ignored2 : 1; // bit 11 UINT64 PhysicalAddress : 36; // bit (N-1):12 or Page-Frame-Number UINT64 Reserved2 : 4; // bit 51:N UINT64 Ignored3 : 12; // bit 63:52 }Fields;}EPT_PDPTE, *PEPT_PDPTE;

For the third table of paging we should implement an EPT Page-Directory Entry (PDE) as described below:

PDE’s structure:

 1234567891011121314151617 // See Table 28-5typedef union _EPT_PDE { ULONG64 All; struct { UINT64 Read : 1; // bit 0 UINT64 Write : 1; // bit 1 UINT64 Execute : 1; // bit 2 UINT64 Reserved1 : 5; // bit 7:3 (Must be Zero) UINT64 Accessed : 1; // bit 8 UINT64 Ignored1 : 1; // bit 9 UINT64 ExecuteForUserMode : 1; // bit 10 UINT64 Ignored2 : 1; // bit 11 UINT64 PhysicalAddress : 36; // bit (N-1):12 or Page-Frame-Number UINT64 Reserved2 : 4; // bit 51:N UINT64 Ignored3 : 12; // bit 63:52 }Fields;}EPT_PDE, *PEPT_PDE;

The last page is EPT which is described below.

PTE will be :

Note that you have, EPTMemoryType, IgnorePAT, DirtyFlag and SuppressVE in addition to the above pages.

 1234567891011121314151617181920 // See Table 28-6typedef union _EPT_PTE { ULONG64 All; struct { UINT64 Read : 1; // bit 0 UINT64 Write : 1; // bit 1 UINT64 Execute : 1; // bit 2 UINT64 EPTMemoryType : 3; // bit 5:3 (EPT Memory type) UINT64 IgnorePAT : 1; // bit 6 UINT64 Ignored1 : 1; // bit 7 UINT64 AccessedFlag : 1; // bit 8 UINT64 DirtyFlag : 1; // bit 9 UINT64 ExecuteForUserMode : 1; // bit 10 UINT64 Ignored2 : 1; // bit 11 UINT64 PhysicalAddress : 36; // bit (N-1):12 or Page-Frame-Number UINT64 Reserved : 4; // bit 51:N UINT64 Ignored3 : 11; // bit 62:52 UINT64 SuppressVE : 1; // bit 63 }Fields;}EPT_PTE, *PEPT_PTE;

There are other types of implementing page walks ( 2 or 3 level paging) and if you set the 7th bit of PDPTE (Maps 1 GB) or the 7th bit of PDE (Maps 2 MB) so instead of implementing 4 level paging (like what we want to do for the rest of the topic) you set those bits but keep in mind that the corresponding tables are different. These tables described in (Table 28-4. Format of an EPT Page-Directory Entry (PDE) that Maps a 2-MByte Page) and (Table 28-2. Format of an EPT Page-Directory-Pointer-Table Entry (PDPTE) that Maps a 1-GByte Page). Alex Ionescu’s SimpleVisor is an example of implementing in this way.

An important note is almost all the above structures have a 36-bit Physical Address which means our hypervisor supports only 4-level paging. It is because every page table (and every EPT Page Table) consist of 512 entries which means you need 9 bits to select an entry and as long as we have 4 level tables, we can’t use more than 36 (4 * 9) bits. Another method with wider address range is not implemented in all major OS like Windows or Linux. I’ll describe EPT PML5E briefly later in this topic but we don’t implement it in our hypervisor as it’s not popular yet!

By the way, N is the physical-address width supported by the processor. CPUID with 80000008H in EAX gives you the supported width in EAX bits 7:0.

Let’s see the rest of the code, the following code is the Initialize_EPTP function which is responsible for allocating and mapping EPTP.

Note that the PAGED_CODE() macro ensures that the calling thread is running at an IRQL that is low enough to permit paging.

 1234 UINT64 Initialize_EPTP(){ PAGED_CODE();        …

First of all, allocating EPTP and put zeros on it.

 1234567 // Allocate EPTP PEPTP EPTPointer = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, POOLTAG);  if (!EPTPointer) { return NULL; } RtlZeroMemory(EPTPointer, PAGE_SIZE);

Now, we need a blank page for our EPT PML4 Table.

 1234567 // Allocate EPT PML4 PEPT_PML4E EPT_PML4 = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, POOLTAG); if (!EPT_PML4) { ExFreePoolWithTag(EPTPointer, POOLTAG); return NULL; } RtlZeroMemory(EPT_PML4, PAGE_SIZE);

And another empty page for PDPT.

 12345678 // Allocate EPT Page-Directory-Pointer-Table PEPT_PDPTE EPT_PDPT = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, POOLTAG); if (!EPT_PDPT) { ExFreePoolWithTag(EPT_PML4, POOLTAG); ExFreePoolWithTag(EPTPointer, POOLTAG); return NULL; } RtlZeroMemory(EPT_PDPT, PAGE_SIZE);

Of course its true about Page Directory Table.

 12345678910 // Allocate EPT Page-Directory PEPT_PDE EPT_PD = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, POOLTAG);  if (!EPT_PD) { ExFreePoolWithTag(EPT_PDPT, POOLTAG); ExFreePoolWithTag(EPT_PML4, POOLTAG); ExFreePoolWithTag(EPTPointer, POOLTAG); return NULL; } RtlZeroMemory(EPT_PD, PAGE_SIZE);

The last table is a blank page for EPT Page Table.

 1234567891011 // Allocate EPT Page-Table PEPT_PTE EPT_PT = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, POOLTAG);  if (!EPT_PT) { ExFreePoolWithTag(EPT_PD, POOLTAG); ExFreePoolWithTag(EPT_PDPT, POOLTAG); ExFreePoolWithTag(EPT_PML4, POOLTAG); ExFreePoolWithTag(EPTPointer, POOLTAG); return NULL; } RtlZeroMemory(EPT_PT, PAGE_SIZE);

Now that we have all of our pages available, let’s allocate two page (2*4096) continuously because we need one of the pages for our RIP to start and one page for our Stack (RSP). After that, we need two EPT Page Table Entries (PTEs) with permission to executereadwrite. The physical address should be divided by 4096 (PAGE_SIZE) because if we dived a hex number by 4096 (0x1000) 12 digits from the right (which are zeros) will disappear and these 12 digits are for choosing between 4096 bytes.

By the way, we let stack be executable too and that’s because, in a regular VM, we should put RWX to all pages because its the responsibility of internal page tables to set or clear NX bit. We need to change them from EPT Tables for special purposes (e.g intercepting instruction fetch for a special page). Changing from EPT tables will lead to EPT-Violation, in this way we can intercept these events.

The actual need is two page but we need to build page tables inside our guest software thus we allocate up to 10 page.

I’ll explain about intercepting pages from EPT, later in these series.

 123456789101112131415161718192021 // Setup PT by allocating two pages Continuously // We allocate two pages because we need 1 page for our RIP to start and 1 page for RSP 1 + 1 and other paages for paging  const int PagesToAllocate = 10; UINT64 Guest_Memory = ExAllocatePoolWithTag(NonPagedPool, PagesToAllocate * PAGE_SIZE, POOLTAG); RtlZeroMemory(Guest_Memory, PagesToAllocate * PAGE_SIZE);  for (size_t i = 0; i < PagesToAllocate; i++) { EPT_PT[i].Fields.AccessedFlag = 0; EPT_PT[i].Fields.DirtyFlag = 0; EPT_PT[i].Fields.EPTMemoryType = 6; EPT_PT[i].Fields.Execute = 1; EPT_PT[i].Fields.ExecuteForUserMode = 0; EPT_PT[i].Fields.IgnorePAT = 0; EPT_PT[i].Fields.PhysicalAddress = (VirtualAddress_to_PhysicalAddress( Guest_Memory + ( i * PAGE_SIZE ))/ PAGE_SIZE ); EPT_PT[i].Fields.Read = 1; EPT_PT[i].Fields.SuppressVE = 0; EPT_PT[i].Fields.Write = 1;  }

Note: EPTMemoryType can be either 0 (for uncached memory) or 6 (write-back) memory and as we want our memory to be cacheable so put 6 on it.

The next table is PDE. PDE should point to PTE base address so we just put the address of the first entry from the EPT PTE as the physical address for Page Directory Entry.

 123456789101112 // Setting up PDE EPT_PD->Fields.Accessed = 0; EPT_PD->Fields.Execute = 1; EPT_PD->Fields.ExecuteForUserMode = 0; EPT_PD->Fields.Ignored1 = 0; EPT_PD->Fields.Ignored2 = 0; EPT_PD->Fields.Ignored3 = 0; EPT_PD->Fields.PhysicalAddress = (VirtualAddress_to_PhysicalAddress(EPT_PT) / PAGE_SIZE); EPT_PD->Fields.Read = 1; EPT_PD->Fields.Reserved1 = 0; EPT_PD->Fields.Reserved2 = 0; EPT_PD->Fields.Write = 1;

Next step is mapping PDPT. PDPT Entry should point to the first entry of Page-Directory.

 123456789101112 // Setting up PDPTE EPT_PDPT->Fields.Accessed = 0; EPT_PDPT->Fields.Execute = 1; EPT_PDPT->Fields.ExecuteForUserMode = 0; EPT_PDPT->Fields.Ignored1 = 0; EPT_PDPT->Fields.Ignored2 = 0; EPT_PDPT->Fields.Ignored3 = 0; EPT_PDPT->Fields.PhysicalAddress = (VirtualAddress_to_PhysicalAddress(EPT_PD) / PAGE_SIZE); EPT_PDPT->Fields.Read = 1; EPT_PDPT->Fields.Reserved1 = 0; EPT_PDPT->Fields.Reserved2 = 0; EPT_PDPT->Fields.Write = 1;

The last step is configuring PML4E which points to the first entry of the PTPT.

 123456789101112 // Setting up PML4E EPT_PML4->Fields.Accessed = 0; EPT_PML4->Fields.Execute = 1; EPT_PML4->Fields.ExecuteForUserMode = 0; EPT_PML4->Fields.Ignored1 = 0; EPT_PML4->Fields.Ignored2 = 0; EPT_PML4->Fields.Ignored3 = 0; EPT_PML4->Fields.PhysicalAddress = (VirtualAddress_to_PhysicalAddress(EPT_PDPT) / PAGE_SIZE); EPT_PML4->Fields.Read = 1; EPT_PML4->Fields.Reserved1 = 0; EPT_PML4->Fields.Reserved2 = 0; EPT_PML4->Fields.Write = 1;

We’ve almost done! Just set up the EPTP for our VMCS by putting 0x6 as the memory type (which is write-back) and we walk 4 times so the page walk length is 4-1=3 and PML4 address is the physical address of the first entry in the PML4 table.

I’ll explain about DirtyAndAcessEnabled field later in this topic.

 1234567 // Setting up EPTP EPTPointer->Fields.DirtyAndAceessEnabled = 1; EPTPointer->Fields.MemoryType = 6; // 6 = Write-back (WB) EPTPointer->Fields.PageWalkLength = 3;  // 4 (tables walked) — 1 = 3 EPTPointer->Fields.PML4Address = (VirtualAddress_to_PhysicalAddress(EPT_PML4) / PAGE_SIZE); EPTPointer->Fields.Reserved1 = 0; EPTPointer->Fields.Reserved2 = 0;

and the last step.

 12 DbgPrint(«[*] Extended Page Table Pointer allocated at %llx»,EPTPointer); return EPTPointer;

All the above page tables should be aligned to 4KByte boundaries but as long as we allocate >= PAGE_SIZE (One PFN record) so it’s automatically 4kb-aligned.

Our implementation consist of 4 tables, therefore, the full layout is like this:

# Accessed and Dirty Flags in EPTP

In EPTP, you’ll decide whether enable accessed and dirty flags for EPT or not using the 6th bit of the extended-page-table pointer (EPTP). Setting this flag causes processor accesses to guest paging structure entries to be treated as writes.

For any EPT paging-structure entry that is used during guest-physical-address translation, bit 8 is the accessed flag. For an EPT paging-structure entry that maps a page (as opposed to referencing another EPT paging structure), bit 9 is the dirty flag.

Whenever the processor uses an EPT paging-structure entry as part of the guest-physical-address translation, it sets the accessed flag in that entry (if it is not already set).

Whenever there is a write to a guest-physical address, the processor sets the dirty flag (if it is not already set) in the EPT paging-structure entry that identifies the final physical address for the guest-physical address (either an EPT PTE or an EPT paging-structure entry in which bit 7 is 1).

These flags are “sticky,” meaning that, once set, the processor does not clear them; only software can clear them.

# 5-Level EPT Translation

Intel suggests a new table in translation hierarchy, called PML5 which extends the EPT into a 5-layer table and guest operating systems can use up to 57 bit for the virtual-addresses while the classic 4-level EPT is limited to translating 48-bit guest-physical
addresses. None of the modern OSs use this feature yet.

PML5 is also applying to both EPT and regular paging mechanism.

Translation begins by identifying a 4-KByte naturally aligned EPT PML5 table. It is located at the physical address specified in bits 51:12 of EPTP. An EPT PML5 table comprises 512 64-bit entries (EPT PML5Es). An EPT PML5E is selected using the physical address defined as follows.

• Bits 63:52 are all 0.
• Bits 51:12 are from EPTP.
• Bits 11:3 are bits 56:48 of the guest-physical address.
• Bits 2:0 are all 0.
• Because an EPT PML5E is identified using bits 56:48 of the guest-physical address, it controls access to a 256-TByte region of the linear address space.

The only difference is you should put PML5 physical address instead of the PML4 address in EPTP.

# Invalidating Cache (INVEPT)

Well, Intel’s explanation about Cache invalidating is really vague and I couldn’t understand it completely but I asked Petr and he explains me in this way:

• VMX-specific TLB-management instructions:
• INVEPT – Invalidate cached Extended Page Table (EPT) mappings in the processor to synchronize address translation in virtual machines with memory-resident EPT pages.
• INVVPID – Invalidate cached mappings of address translation based on the Virtual Processor ID (VPID).

Imagine we access guest-physical-address 0x1000,it’ll get translated to host-physical-address 0x5000. Next time, if we access 0x1000, the CPU won’t send the request to the memory bus but uses cached memory instead. it’s faster. Now let’s say we change EPT_PDPT->PhysicalAddress to point to different EPT PD or change the attributes of one of the EPT tables, now we have to tell the processor that your cache is invalid and that’s what exactly INVEPT performs.

Now we have two terms here, Single-Context and All-Context.

Single-Context means, that you invalidate all EPT-derived translations based on a single EPTP (in short: for single VM).

All-Context means that you invalidate all EPT-derived translations. (for every-VM).

So in case if you wouldn’t perform INVEPT after changing EPT’s structures, you would be risking that the CPU would reuse old translations.

Basically, any change to EPT structure needs INVEPT but switching EPT (or VMCS) doesn’t need INVEPT because that translation will be “tagged” with the changed EPTP in the cache.

The following assembly function is responsible for INVEPT.

 12345678910111213 INVEPT_Instruction PROC PUBLIC        invept  rcx, oword ptr [rdx]        jz @jz        jc @jc        xor     rax, rax        ret @jz:    mov     rax, VMX_ERROR_CODE_FAILED_WITH_STATUS        ret @jc:    mov     rax, VMX_ERROR_CODE_FAILED        retINVEPT_Instruction ENDP

Note that VMX_ERROR_CODE_FAILED_WITH_STATUS and VMX_ERROR_CODE_FAILED define like this.

 123 VMX_ERROR_CODE_SUCCESS              = 0    VMX_ERROR_CODE_FAILED_WITH_STATUS   = 1    VMX_ERROR_CODE_FAILED               = 2

Now, we implement INVEPT.

 12345678910 unsigned char INVEPT(UINT32 type, INVEPT_DESC* descriptor){ if (!descriptor) { static INVEPT_DESC zero_descriptor = { 0 }; descriptor = &zero_descriptor; }  return INVEPT_Instruction(type, descriptor);}

To invalidate all the contexts use the following function.

 1234 unsigned char INVEPT_ALL_CONTEXTS(){ return INVEPT(all_contexts ,NULL);}

And the last step is for Single-Context INVEPT which needs an EPTP.

 12345 unsigned char INVEPT_SINGLE_CONTEXT(EPTP ept_pointer){ INVEPT_DESC descriptor = { ept_pointer, 0 }; return INVEPT(single_context, &descriptor);}

Using the above functions in a modification state, tell the processor to invalidate its cache.

# Conclusion

In this part, we see how to initialize the Extended Page Table and map guest physical address to host physical address then we build the EPTP based on the allocated addresses.

The future part would be about building the VMCS and implementing other VMX instructions. Don’t forget to check the blog for the future posts.

Have a good time!

# References

[1] Vol 3C – 28.2 THE EXTENDED PAGE TABLE MECHANISM (EPT) (https://software.intel.com/en-us/articles/intel-sdm)

[2] Performance Evaluation of Intel EPT Hardware Assist (https://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf)

[4] Memory Virtualization (http://www.cs.nthu.edu.tw/~ychung/slides/Virtualization/VM-Lecture-2-2-SystemVirtualizationMemory.pptx)  [5] Best Practices for Paravirtualization Enhancements from Intel® Virtualization Technology: EPT and VT-d (https://software.intel.com/en-us/articles/best-practices-for-paravirtualization-enhancements-from-intel-virtualization-technology-ept-and-vt-d)[6] 5-Level Paging and 5-Level EPT (https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf) [7] Xen Summit November 2007 – Jun Nakajima (http://www-archive.xenproject.org/files/xensummit_fall07/12_JunNakajima.pdf) [8] gipervizor against rutkitov: as it works (http://developers-club.com/posts/133906/) [9] Intel SGX Explained (https://www.semanticscholar.org/paper/Intel-SGX-Explained-Costan-Devadas/2d7f3f4ca3fbb15ae04533456e5031e0d0dc845a) [10] Intel VT-x (https://github.com/tnballo/notebook/wiki/Intel-VTx) [11] Introduction to IA-32e hardware paging (https://www.triplefault.io/2017/07/introduction-to-ia-32e-hardware-paging.html)

## Hypervisor From Scratch – Part 2: Entering VMX Operation

Hi guys,

It’s the second part of a multiple series of a tutorial called “Hypervisor From Scratch”, First I highly recommend to read the first part (Basic Concepts & Configure Testing Environment) before reading this part, as it contains the basic knowledge you need to know in order to understand the rest of this tutorial.

In this section, we will learn about Detecting Hypervisor Support for our processor, then we simply config the basic stuff to Enable VMX and Entering VMX Operation and a lot more thing about Window Driver Kit (WDK).

## Configuring Our IRP Major Functions

Beside our kernel-mode driver (“MyHypervisorDriver“), I created a user-mode application called “MyHypervisorApp“, first of all (The source code is available in my GitHub), I should encourage you to write most of your codes in user-mode rather than kernel-mode and that’s because you might not have handled exceptions so it leads to BSODs, or on the other hand, running less code in kernel-mode reduces the possibility of putting some nasty kernel-mode bugs.

If you remember from the previous part, we create some Windows Driver Kit codes, now we want to develop our project to support more IRP Major Functions.

IRP Major Functions are located in a conventional Windows table that is created for every device, once you register your device in Windows, you have to introduce these functions in which you handle these IRP Major Functions. That’s like every device has a table of its Major Functions and everytime a user-mode application calls any of these functions, Windows finds the corresponding function (if device driver supports that MJ Function) based on the device that requested by the user and calls it then pass an IRP pointer to the kernel driver.

Now its responsibility of device function to check the privileges or etc.

The following code creates the device :

 12345678910111213 NTSTATUS NtStatus = STATUS_SUCCESS; UINT64 uiIndex = 0; PDEVICE_OBJECT pDeviceObject = NULL; UNICODE_STRING usDriverName, usDosDeviceName;  DbgPrint(«[*] DriverEntry Called.»);   RtlInitUnicodeString(&usDriverName, L»\\Device\\MyHypervisorDevice»); RtlInitUnicodeString(&usDosDeviceName, L»\\DosDevices\\MyHypervisorDevice»);  NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject); NTSTATUS NtStatusSymLinkResult = IoCreateSymbolicLink(&usDosDeviceName, &usDriverName);

Note that our device name is “\Device\MyHypervisorDevice.

After that, we need to introduce our Major Functions for our device.

 1234567891011121314151617 if (NtStatus == STATUS_SUCCESS && NtStatusSymLinkResult == STATUS_SUCCESS) { for (uiIndex = 0; uiIndex < IRP_MJ_MAXIMUM_FUNCTION; uiIndex++) pDriverObject->MajorFunction[uiIndex] = DrvUnsupported;  DbgPrint(«[*] Setting Devices major functions.»); pDriverObject->MajorFunction[IRP_MJ_CLOSE] = DrvClose; pDriverObject->MajorFunction[IRP_MJ_CREATE] = DrvCreate; pDriverObject->MajorFunction[IRP_MJ_DEVICE_CONTROL] = DrvIOCTLDispatcher; pDriverObject->MajorFunction[IRP_MJ_READ] = DrvRead; pDriverObject->MajorFunction[IRP_MJ_WRITE] = DrvWrite;  pDriverObject->DriverUnload = DrvUnload; } else { DbgPrint(«[*] There was some errors in creating device.»); }

You can see that I put “DrvUnsupported” to all functions, this is a function to handle all MJ Functions and told the user that it’s not supported. The main body of this function is like this:

 12345678910 NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] This function is not supported 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;}

We also introduce other major functions that are essential for our device, we’ll complete the implementation in the future, let’s just leave them alone.

 12345678910111213141516171819202122232425262728293031323334353637383940414243 NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject,IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;}

Now let’s see IRP MJ Functions list and other types of Windows Driver Kit handlers routine.

## IRP Major Functions List

This is a list of IRP Major Functions which we can use in order to perform different operations.

 123456789101112131415161718192021222324252627282930 #define IRP_MJ_CREATE                   0x00#define IRP_MJ_CREATE_NAMED_PIPE        0x01#define IRP_MJ_CLOSE                    0x02#define IRP_MJ_READ                     0x03#define IRP_MJ_WRITE                    0x04#define IRP_MJ_QUERY_INFORMATION        0x05#define IRP_MJ_SET_INFORMATION          0x06#define IRP_MJ_QUERY_EA                 0x07#define IRP_MJ_SET_EA                   0x08#define IRP_MJ_FLUSH_BUFFERS            0x09#define IRP_MJ_QUERY_VOLUME_INFORMATION 0x0a#define IRP_MJ_SET_VOLUME_INFORMATION   0x0b#define IRP_MJ_DIRECTORY_CONTROL        0x0c#define IRP_MJ_FILE_SYSTEM_CONTROL      0x0d#define IRP_MJ_DEVICE_CONTROL           0x0e#define IRP_MJ_INTERNAL_DEVICE_CONTROL  0x0f#define IRP_MJ_SHUTDOWN                 0x10#define IRP_MJ_LOCK_CONTROL             0x11#define IRP_MJ_CLEANUP                  0x12#define IRP_MJ_CREATE_MAILSLOT          0x13#define IRP_MJ_QUERY_SECURITY           0x14#define IRP_MJ_SET_SECURITY             0x15#define IRP_MJ_POWER                    0x16#define IRP_MJ_SYSTEM_CONTROL           0x17#define IRP_MJ_DEVICE_CHANGE            0x18#define IRP_MJ_QUERY_QUOTA              0x19#define IRP_MJ_SET_QUOTA                0x1a#define IRP_MJ_PNP                      0x1b#define IRP_MJ_PNP_POWER                IRP_MJ_PNP      // Obsolete….#define IRP_MJ_MAXIMUM_FUNCTION         0x1b

Every major function will only trigger if we call its corresponding function from user-mode. For instance, there is a function (in user-mode) called CreateFile (And all its variants like CreateFileA and CreateFileW for ASCII and Unicode) so everytime we call CreateFile the function that registered as IRP_MJ_CREATE will be called or if we call ReadFile then IRP_MJ_READ and WriteFile then IRP_MJ_WRITE  will be called. You can see that Windows treats its devices like files and everything we need to pass from user-mode to kernel-mode is available in PIRP Irp as a buffer when the function is called.

In this case, Windows is responsible to copy user-mode buffer to kernel mode stack.

Don’t worry we use it frequently in the rest of the project but we only support IRP_MJ_CREATE in this part and left others unimplemented for our future parts.

## IRP Minor Functions

IRP Minor functions are mainly used for PnP manager to notify for a special event, for example,The PnP manager sends IRP_MN_START_DEVICE  after it has assigned hardware resources, if any, to the device or The PnP manager sends IRP_MN_STOP_DEVICE to stop a device so it can reconfigure the device’s hardware resources.

We will need these minor functions later in these series.

A list of IRP Minor Functions is available below:

## Fast I/O

For optimizing VMM, you can use Fast I/O which is a different way to initiate I/O operations that are faster than IRP. Fast I/O operations are always synchronous.

According to MSDN:

Fast I/O is specifically designed for rapid synchronous I/O on cached files. In fast I/O operations, data is transferred directly between user buffers and the system cache, bypassing the file system and the storage driver stack. (Storage drivers do not use fast I/O.) If all of the data to be read from a file is resident in the system cache when a fast I/O read or write request is received, the request is satisfied immediately.

When the I/O Manager receives a request for synchronous file I/O (other than paging I/O), it invokes the fast I/O routine first. If the fast I/O routine returns TRUE, the operation was serviced by the fast I/O routine. If the fast I/O routine returns FALSE, the I/O Manager creates and sends an IRP instead.

The definition of Fast I/O Dispatch table is:

 123456789101112131415161718192021222324252627282930 typedef struct _FAST_IO_DISPATCH {  ULONG                                  SizeOfFastIoDispatch;  PFAST_IO_CHECK_IF_POSSIBLE             FastIoCheckIfPossible;  PFAST_IO_READ                          FastIoRead;  PFAST_IO_WRITE                         FastIoWrite;  PFAST_IO_QUERY_BASIC_INFO              FastIoQueryBasicInfo;  PFAST_IO_QUERY_STANDARD_INFO           FastIoQueryStandardInfo;  PFAST_IO_LOCK                          FastIoLock;  PFAST_IO_UNLOCK_SINGLE                 FastIoUnlockSingle;  PFAST_IO_UNLOCK_ALL                    FastIoUnlockAll;  PFAST_IO_UNLOCK_ALL_BY_KEY             FastIoUnlockAllByKey;  PFAST_IO_DEVICE_CONTROL                FastIoDeviceControl;  PFAST_IO_ACQUIRE_FILE                  AcquireFileForNtCreateSection;  PFAST_IO_RELEASE_FILE                  ReleaseFileForNtCreateSection;  PFAST_IO_DETACH_DEVICE                 FastIoDetachDevice;  PFAST_IO_QUERY_NETWORK_OPEN_INFO       FastIoQueryNetworkOpenInfo;  PFAST_IO_ACQUIRE_FOR_MOD_WRITE         AcquireForModWrite;  PFAST_IO_MDL_READ                      MdlRead;  PFAST_IO_MDL_READ_COMPLETE             MdlReadComplete;  PFAST_IO_PREPARE_MDL_WRITE             PrepareMdlWrite;  PFAST_IO_MDL_WRITE_COMPLETE            MdlWriteComplete;  PFAST_IO_READ_COMPRESSED               FastIoReadCompressed;  PFAST_IO_WRITE_COMPRESSED              FastIoWriteCompressed;  PFAST_IO_MDL_READ_COMPLETE_COMPRESSED  MdlReadCompleteCompressed;  PFAST_IO_MDL_WRITE_COMPLETE_COMPRESSED MdlWriteCompleteCompressed;  PFAST_IO_QUERY_OPEN                    FastIoQueryOpen;  PFAST_IO_RELEASE_FOR_MOD_WRITE         ReleaseForModWrite;  PFAST_IO_ACQUIRE_FOR_CCFLUSH           AcquireForCcFlush;  PFAST_IO_RELEASE_FOR_CCFLUSH           ReleaseForCcFlush;} FAST_IO_DISPATCH, *PFAST_IO_DISPATCH;

I created the following headers (source.h) for my driver.

 12345678910111213141516171819202122232425262728293031323334 #pragma once#include #include #include  extern void inline Breakpoint(void);extern void inline Enable_VMX_Operation(void);  NTSTATUS DriverEntry(PDRIVER_OBJECT  pDriverObject, PUNICODE_STRING  pRegistryPath);VOID DrvUnload(PDRIVER_OBJECT  DriverObject);NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvIOCTLDispatcher(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp); VOID PrintChars(_In_reads_(CountChars) PCHAR BufferAddress, _In_ size_t CountChars);VOID PrintIrpInfo(PIRP Irp); #pragma alloc_text(INIT, DriverEntry)#pragma alloc_text(PAGE, DrvUnload)#pragma alloc_text(PAGE, DrvCreate)#pragma alloc_text(PAGE, DrvRead)#pragma alloc_text(PAGE, DrvWrite)#pragma alloc_text(PAGE, DrvClose)#pragma alloc_text(PAGE, DrvUnsupported)#pragma alloc_text(PAGE, DrvIOCTLDispatcher)   // IOCTL Codes and Its meanings#define IOCTL_TEST 0x1 // In case of testing

In order to load our driver (MyHypervisorDriver) first download OSR Driver Loader, then run Sysinternals DbgView as administrator make sure that your DbgView captures the kernel (you can check by going Capture -> Capture Kernel).

After that open the OSR Driver Loader (go to OsrLoader -> kit-> WNET -> AMD64 -> FRE) and open OSRLOADER.exe (in an x64 environment). Now if you built your driver, find .sys file (in MyHypervisorDriver\x64\Debug\ should be a file named: “MyHypervisorDriver.sys”), in OSR Driver Loader click to browse and select (MyHypervisorDriver.sys) and then click to “Register Service” after the message box that shows your driver registered successfully, you should click on “Start Service”.

Please note that you should have WDK installed for your Visual Studio in order to be able building your project.

Now come back to DbgView, then you should see that your driver loaded successfully and a message “[*] DriverEntry Called. ” should appear.

If there is no problem then you’re good to go, otherwise, if you have a problem with DbgView you can check the next step.

Keep in mind that now you registered your driver so you can use SysInternals WinObj in order to see whether “MyHypervisorDevice” is available or not.

## The Problem with DbgView

Unfortunately, for some unknown reasons, I’m not able to view the result of DbgPrint(), If you can see the result then you can skip this step but if you have a problem, then perform the following steps:

As I mentioned in part 1:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Debug Print Filter

Under that , add a DWORD value named IHVDRIVER with a value of 0xFFFF

Reboot the machine and you’ll good to go.

It always works for me and I tested on many computers but my MacBook seems to have a problem.

In order to solve this problem, you need to find a Windows Kernel Global variable called, nt!Kd_DEFAULT_Mask, this variable is responsible for showing the results in DbgView, it has a mask that I’m not aware of so I just put a 0xffffffff in it to simply make it shows everything!

To do this, you need a Windows Local Kernel Debugging using Windbg.

1. Open a Command Prompt window as Administrator. Enter bcdedit /debug on
2. If the computer is not already configured as the target of a debug transport, enter bcdedit /dbgsettings local
3. Reboot the computer.

After that you need to open Windbg with UAC Administrator privilege, go to File > Kernel Debug > Local > press OK and in you local Windbg find the nt!Kd_DEFAULT_Mask using the following command :

 12 prlkd> x nt!kd_Default_Maskfffff801f5211808 nt!Kd_DEFAULT_Mask =

Now change it value to 0xffffffff.

 1 lkd> eb fffff801f5211808 ff ff ff ff

After that, you should see the results and now you’ll good to go.

Remember this is an essential step for the rest of the topic, because if we can’t see any kernel detail then we can’t debug.

## Detecting Hypervisor Support

Discovering support for vmx is the first thing that you should consider before enabling VT-x, this is covered in Intel Software Developer’s Manual volume 3C in section 23.6 DISCOVERING SUPPORT FOR VMX.

You could know the presence of VMX using CPUID if CPUID.1:ECX.VMX[bit 5] = 1, then VMX operation is supported.

First of all, we need to know if we’re running on an Intel-based processor or not, this can be understood by checking the CPUID instruction and find vendor string “GenuineIntel“.

The following function returns the vendor string form CPUID instruction.

 12345678910111213141516171819202122232425262728293031323334353637 string GetCpuID(){ //Initialize used variables char SysType[13]; //Array consisting of 13 single bytes/characters string CpuID; //The string that will be used to add all the characters to   //Starting coding in assembly language _asm { //Execute CPUID with EAX = 0 to get the CPU producer XOR EAX, EAX CPUID //MOV EBX to EAX and get the characters one by one by using shift out right bitwise operation. MOV EAX, EBX MOV SysType[0], al MOV SysType[1], ah SHR EAX, 16 MOV SysType[2], al MOV SysType[3], ah //Get the second part the same way but these values are stored in EDX MOV EAX, EDX MOV SysType[4], al MOV SysType[5], ah SHR EAX, 16 MOV SysType[6], al MOV SysType[7], ah //Get the third part MOV EAX, ECX MOV SysType[8], al MOV SysType[9], ah SHR EAX, 16 MOV SysType[10], al MOV SysType[11], ah MOV SysType[12], 00 } CpuID.assign(SysType, 12); return CpuID;}

The last step is checking for the presence of VMX, you can check it using the following code :

 1234567891011121314151617181920 bool VMX_Support_Detection(){  bool VMX = false; __asm { xor    eax, eax inc    eax cpuid bt     ecx, 0x5 jc     VMXSupport VMXNotSupport : jmp     NopInstr VMXSupport : mov    VMX, 0x1 NopInstr : nop }  return VMX;}

As you can see it checks CPUID with EAX=1 and if the 5th (6th) bit is 1 then the VMX Operation is supported. We can also perform the same thing in Kernel Driver.

All in all, our main code should be something like this:

 123456789101112131415161718192021222324252627 int main(){ string CpuID; CpuID = GetCpuID(); cout << «[*] The CPU Vendor is : » << CpuID << endl; if (CpuID == «GenuineIntel») { cout << «[*] The Processor virtualization technology is VT-x. \n»; } else { cout << «[*] This program is not designed to run in a non-VT-x environemnt !\n»; return 1; } if (VMX_Support_Detection()) { cout << «[*] VMX Operation is supported by your processor .\n»; } else { cout << «[*] VMX Operation is not supported by your processor .\n»; return 1; } _getch();    return 0;}

The final result:

## Enabling VMX Operation

If our processor supports the VMX Operation then its time to enable it. As I told you above, IRP_MJ_CREATE is the first function that should be used to start the operation.

Form Intel Software Developer’s Manual (23.7 ENABLING AND ENTERING VMX OPERATION):

Before system software can enter VMX operation, it enables VMX by setting CR4.VMXE[bit 13] = 1. VMX operation is then entered by executing the VMXON instruction. VMXON causes an invalid-opcode exception (#UD) if executed with CR4.VMXE = 0. Once in VMX operation, it is not possible to clear CR4.VMXE. System software leaves VMX operation by executing the VMXOFF instruction. CR4.VMXE can be cleared outside of VMX operation after executing of VMXOFF.
VMXON is also controlled by the IA32_FEATURE_CONTROL MSR (MSR address 3AH). This MSR is cleared to zero when a logical processor is reset. The relevant bits of the MSR are:

•  Bit 0 is the lock bit. If this bit is clear, VMXON causes a general-protection exception. If the lock bit is set, WRMSR to this MSR causes a general-protection exception; the MSR cannot be modified until a power-up reset condition. System BIOS can use this bit to provide a setup option for BIOS to disable support for VMX. To enable VMX support in a platform, BIOS must set bit 1, bit 2, or both, as well as the lock bit.
•  Bit 1 enables VMXON in SMX operation. If this bit is clear, execution of VMXON in SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support both VMX operation and SMX operation cause general-protection exceptions.
•  Bit 2 enables VMXON outside SMX operation. If this bit is clear, execution of VMXON outside SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support VMX operation cause general-protection exceptions.

## Setting CR4 VMXE Bit

Do you remember the previous part where I told you how to create an inline assembly in Windows Driver Kit x64

Now you should create some function to perform this operation in assembly.

 1 extern void inline Enable_VMX_Operation(void);

Then in assembly file (in my case SourceAsm.asm) add this function (Which set the 13th (14th) bit of Cr4).

 1234567891011 Enable_VMX_Operation PROC PUBLICpush rax ; Save the state xor rax,rax ; Clear the RAXmov rax,cr4or rax,02000h         ; Set the 14th bitmov cr4,rax pop rax ; Restore the stateretEnable_VMX_Operation ENDP

Also, declare your function in the above of SourceAsm.asm.

 1 PUBLIC Enable_VMX_Operation

The above function should be called in DrvCreate:

 123456 NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ Enable_VMX_Operation(); // Enabling VMX Operation DbgPrint(«[*] VMX Operation Enabled Successfully !»); return STATUS_SUCCESS;}

At last, you should call the following function from the user-mode:

 123456789 HANDLE hWnd = CreateFile(L»\\\\.\\MyHypervisorDevice», GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, /// lpSecurityAttirbutes OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED, NULL); /// lpTemplateFile

If you see the following result, then you completed the second part successfully.

Important Note: Please consider that your .asm file should have a different name from your driver main file (.c file) for example if your driver file is “Source.c” then using the name “Source.asm” causes weird linking errors in Visual Studio, you should change the name of you .asm file to something like “SourceAsm.asm” to avoid these kinds of linker errors.

## Conclusion

In this part, you learned about basic stuff you to know in order to create a Windows Driver Kit program and then we entered to our virtual environment so we build a cornerstone for the rest of the parts.

In the third part, we’re getting deeper with Intel VT-x and make our driver even more advanced so wait, it’ll be ready soon!

The source code of this topic is available at :

[https://github.com/SinaKarvandi/Hypervisor-From-Scratch/]

## References

[1] Intel® 64 and IA-32 architectures software developer’s manual combined volumes 3 (https://software.intel.com/en-us/articles/intel-sdm

[2] IRP_MJ_DEVICE_CONTROL (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/irp-mj-device-control)

[3]  Windows Driver Kit Samples (https://github.com/Microsoft/Windows-driver-samples/blob/master/general/ioctl/wdm/sys/sioctl.c)

[4] Setting Up Local Kernel Debugging of a Single Computer Manually (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/setting-up-local-kernel-debugging-of-a-single-computer-manually)

[5] Obtain processor manufacturer using CPUID (https://www.daniweb.com/programming/software-development/threads/112968/obtain-processor-manufacturer-using-cpuid)

[6] Plug and Play Minor IRPs (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/plug-and-play-minor-irps)

[7] _FAST_IO_DISPATCH structure (https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/ns-wdm-_fast_io_dispatch)

[8] Filtering IRPs and Fast I/O (https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/filtering-irps-and-fast-i-o)

[9] Windows File System Filter Driver Development (https://www.apriorit.com/dev-blog/167-file-system-filter-driver)

## Hypervisor From Scratch – Part 1: Basic Concepts & Configure Testing Environment

Hello everyone!

Welcome to the first part of a multi-part series of tutorials called “Hypervisor From Scratch”. As the name implies, this course contains technical details to create a basic Virtual Machine based on hardware virtualization. If you follow the course, you’ll be able to create your own virtual environment and you’ll get an understanding of how VMWare, VirtualBox, KVM and other virtualization softwares use processors’ facilities to create a virtual environment.

## Introduction

Both Intel and AMD support virtualization in their modern CPUs. Intel introduced (VT-x technology) that previously codenamed “Vanderpool” on November 13, 2005, in Pentium 4 series. The CPU flag for VT-xcapability is “vmx” which stands for Virtual Machine eXtension.

AMD, on the other hand, developed its first generation of virtualization extensions under the codename “Pacifica“, and initially published them as AMD Secure Virtual Machine (SVM), but later marketed them under the trademark AMD Virtualization, abbreviated AMD-V.

There two types of the hypervisor. The hypervisor type 1 called “bare metal hypervisor” or “native” because it runs directly on a bare metal physical server, a type 1 hypervisor has direct access to the hardware. With a type 1 hypervisor, there is no operating system to load as the hypervisor.

Contrary to a type 1 hypervisor, a type 2 hypervisor loads inside an operating system, just like any other application. Because the type 2 hypervisor has to go through the operating system and is managed by the OS, the type 2 hypervisor (and its virtual machines) will run less efficiently (slower) than a type 1 hypervisor.

Even more of the concepts about Virtualization is the same, but you need different considerations in VT-x and AMD-V. The rest of these tutorials mainly focus on VT-x because Intel CPUs are more popular and more widely used. In my opinion, AMD describes virtualization more clearly in its manuals but Intel somehow makes the readers confused especially in Virtualization documentation.

## Hypervisor and Platform

These concepts are platform independent, I mean you can easily run the same code routine in both Linux or Windows and expect the same behavior from CPU but I prefer to use Windows as its more easily debuggable (at least for me.) but I try to give some examples for Linux systems whenever needed. Personally, as Linux kernel manages faults like #GP and other exceptions and tries to avoid kernel panic and keep the system up so it’s better for testing something like hypervisor or any CPU-related affair. On the other hand, Windows never tries to manage any unexpected exception and it leads to a blue screen of death whenever you didn’t manage your exceptions, thus you might get lots of BSODs.By the way, you’d better test it on both platforms (and other platforms too.).

At last, I might (and definitely) make mistakes like wrong implementation or misinformation or forget about mentioning some important description so I should say sorry in advance if I make any faults and I’ll be glad for every comment that tells me my mistakes in the technical information or misinformation.

That’s enough, Let’s get started!

## The Tools you’ll need

You should have a Visual Studio with WDK installed. you can get Windows Driver Kit (WDK) here.

The best way to debug Windows and any kernel mode affair is using Windbg which is available in Windows SDK here. (If you installed WDK with default installing options then you probably install WDK and SDK together so you can skip this step.)

You should be able to debug your OS (in this case Windows) using Windbg, more information here.

Hex-rays IDA Pro is an important part of this tutorial.

SysInternals DebugView for printing the DbgPrint() results.

## Creating a Test Environment

Almost all of the codes in this tutorial have to run in kernel level and you must set up either a Linux Kernel Module or Windows Driver Kit (WDK) module. As configuring VMM involves lots of assembly code, you should know how to run inline assembly within you kernel project. In Linux, you shouldn’t do anything special but in the case of  Windows, WDK no longer supports inline assembly in an x64 environment so if you didn’t work on this problem previously then you might have struggle creating a simple x64 inline project but don’t worry in one of my post I explained it step by step so I highly recommend seeing this topic to solve the problem before continuing the rest of this part.

Now its time to create a driver!

There is a good article here if you want to start with Windows Driver Kit (WDK).

The whole driver is this :

 123456789101112131415161718192021222324252627282930313233343536373839404142434445 #include #include #include  extern void inline AssemblyFunc1(void);extern void inline AssemblyFunc2(void); VOID DrvUnload(PDRIVER_OBJECT  DriverObject);NTSTATUS DriverEntry(PDRIVER_OBJECT  pDriverObject, PUNICODE_STRING  pRegistryPath); #pragma alloc_text(INIT, DriverEntry)#pragma alloc_text(PAGE, Example_Unload) NTSTATUS DriverEntry(PDRIVER_OBJECT  pDriverObject, PUNICODE_STRING  pRegistryPath){ NTSTATUS NtStatus = STATUS_SUCCESS; UINT64 uiIndex = 0; PDEVICE_OBJECT pDeviceObject = NULL; UNICODE_STRING usDriverName, usDosDeviceName;  DbgPrint(«DriverEntry Called.»);  RtlInitUnicodeString(&usDriverName, L»\Device\MyHypervisor»); RtlInitUnicodeString(&usDosDeviceName, L»\DosDevices\MyHypervisor»);  NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject);  if (NtStatus == STATUS_SUCCESS) { pDriverObject->DriverUnload = DrvUnload; pDeviceObject->Flags |= IO_TYPE_DEVICE; pDeviceObject->Flags &= (~DO_DEVICE_INITIALIZING); IoCreateSymbolicLink(&usDosDeviceName, &usDriverName); } return NtStatus;} VOID DrvUnload(PDRIVER_OBJECT  DriverObject){ UNICODE_STRING usDosDeviceName; DbgPrint(«DrvUnload Called rn»); RtlInitUnicodeString(&usDosDeviceName, L»\DosDevices\MyHypervisor»); IoDeleteSymbolicLink(&usDosDeviceName); IoDeleteDevice(DriverObject->DeviceObject);}

AssemblyFunc1 and AssemblyFunc2 are two external functions that defined as inline x64 assembly code.

Our driver needs to register a device so that we can communicate with our virtual environment from User-Mode code, on the hand, I defined DrvUnload which use PnP Windows driver feature and you can easily unload your driver and remove device then reload and create a new device.

The following code is responsible for creating a new device :

 123456789101112 RtlInitUnicodeString(&usDriverName, L»\Device\MyHypervisor»); RtlInitUnicodeString(&usDosDeviceName, L»\DosDevices\MyHypervisor»);  NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject);  if (NtStatus == STATUS_SUCCESS) { pDriverObject->DriverUnload = DrvUnload; pDeviceObject->Flags |= IO_TYPE_DEVICE; pDeviceObject->Flags &= (~DO_DEVICE_INITIALIZING); IoCreateSymbolicLink(&usDosDeviceName, &usDriverName); }

If you use Windows, then you should disable Driver Signature Enforcement to load your driver, that’s because Microsoft prevents any not verified code to run in Windows Kernel (Ring 0).

To do this, press and hold the shift key and restart your computer. You should see a new Window, then

2. On the new Window Click Startup Settings.
3. Click on Restart.
4. On the Startup Settings screen press 7 or F7 to disable driver signature enforcement.

The latest thing I remember is enabling Windows Debugging messages through registry, in this way you can get DbgPrint() results through SysInternals DebugView.

Just perform the following steps:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Debug Print Filter

Under that , add a DWORD value named IHVDRIVER with a value of 0xFFFF

Reboot the machine and you’ll good to go.

## Some thoughts before the start

There are some keywords that will be frequently used in the rest of these series and you should know about them (Most of the definitions derived from Intel software developer’s manual, volume 3C).

Virtual Machine Monitor (VMM): VMM acts as a host and has full control of the processor(s) and other platform hardware. A VMM is able to retain selective control of processor resources, physical memory, interrupt management, and I/O.

Guest Software: Each virtual machine (VM) is a guest software environment.

VMX Root Operation and VMX Non-root Operation: A VMM will run in VMX root operation and guest software will run in VMX non-root operation.

VMX transitions: Transitions between VMX root operation and VMX non-root operation.

VM entries: Transitions into VMX non-root operation.

Extended Page Table (EPT): A modern mechanism which uses a second layer for converting the guest physical address to host physical address.

VM exits: Transitions from VMX non-root operation to VMX root operation.

Virtual machine control structure (VMCS): is a data structure in memory that exists exactly once per VM, while it is managed by the VMM. With every change of the execution context between different VMs, the VMCS is restored for the current VM, defining the state of the VM’s virtual processor and VMM control Guest software using VMCS.

The VMCS consists of six logical groups:

•  Guest-state area: Processor state saved into the guest state area on VM exits and loaded on VM entries.
•  Host-state area: Processor state loaded from the host state area on VM exits.
•  VM-execution control fields: Fields controlling processor operation in VMX non-root operation.
•  VM-exit control fields: Fields that control VM exits.
•  VM-entry control fields: Fields that control VM entries.
•  VM-exit information fields: Read-only fields to receive information on VM exits describing the cause and the nature of the VM exit.

I found a great work which illustrates the VMCS, The PDF version is also available here

Don’t worry about the fields, I’ll explain most of them clearly in the later parts, just remember VMCS Structure varies between different version of a processor.

## VMX Instructions

VMX introduces the following new instructions.

Intel/AMD MnemonicDescription
INVEPTInvalidate Translations Derived from EPT
INVVPIDInvalidate Translations Based on VPID
VMCALLCall to VM Monitor
VMCLEARClear Virtual-Machine Control Structure
VMFUNCInvoke VM function
VMLAUNCHLaunch Virtual Machine
VMRESUMEResume Virtual Machine
VMPTRLDLoad Pointer to Virtual-Machine Control Structure
VMPTRSTStore Pointer to Virtual-Machine Control Structure
VMWRITEWrite Field to Virtual-Machine Control Structure
VMXOFFLeave VMX Operation
VMXONEnter VMX Operation

### Life Cycle of VMM Software

• The following items summarize the life cycle of a VMM and its guest software as well as the interactions between them:
• Software enters VMX operation by executing a VMXON instruction.
• Using VM entries, a VMM can then turn guests into VMs (one at a time). The VMM effects a VM entry using instructions VMLAUNCH and VMRESUME; it regains control using VM exits.
• VM exits transfer control to an entry point specified by the VMM. The VMM can take action appropriate to the cause of the VM exit and can then return to the VM using a VM entry.
• Eventually, the VMM may decide to shut itself down and leave VMX operation. It does so by executing the VMXOFF instruction.

That’s enough for now!

In this part, I explained about general keywords that you should be aware and we create a simple lab for our future tests. In the next part, I will explain how to enable VMX on your machine using the facilities we create above, then we survey among the rest of the virtualization so make sure to check the blog for the next part.

## References

[1] Intel® 64 and IA-32 architectures software developer’s manual combined volumes 3 (https://software.intel.com/en-us/articles/intel-sdm

[2] Hardware-assisted Virtualization (http://www.cs.cmu.edu/~412/lectures/L04_VTx.pdf)

[3] Writing Windows Kernel Driver (https://resources.infosecinstitute.com/writing-a-windows-kernel-driver/)

[4] What Is a Type 1 Hypervisor? (http://www.virtualizationsoftware.com/type-1-hypervisors/)

[5] Intel / AMD CPU Internals (https://github.com/LordNoteworthy/cpu-internals)

[7] Instruction Set Mapping » VMX Instructions (https://docs.oracle.com/cd/E36784_01/html/E36859/gntbx.html)

## A look under the hood of a decentralised VPN Application.

MysteriumVPN is the client application of Mysterium Network, a project focused on providing security and privacy to web 3 applications.

In this article, we will discuss the architecture of MysteriumVPN and how it integrates with Mysterium Node to ensure an encrypted end to end flow of data through Mysterium Network.

#### Cross-platform architecture

Usually, you need separate builds for each platform. Now that cross-platform technology has improved, this is no longer the case.

#### For desktop:

Electron is a framework which allows us to build cross-platform applications using common web technologies such as HTMLCSS and Javascript. We are using Electron which allows us to develop one application for two platforms for desktop — Windows and Mac OSLinux coming soon. Download our alpha.

Under the hood of an Electron application, sits a Chromium browser; A website, rendered by an embedded browser.

#### For mobile:

We are kicking off our mobile development for MysteriumVPN, with Android versions set to release shortly.

For this, we are using React Native for cross-platform applications.

Most of MysteriumVPN is written in Javascript, which is run in a separate process. Javascript generates the virtual structure of the user interface. This Javascript process communicates to native mobile processes which are responsible for rendering the actual user interface as you see it.

#### How MysteriumVPN works on desktop:

Since we are using Electron, we have two processes, MAIN and RENDERER.

MAIN is the first process which is started when the application starts. It is a NodeJS process which is responsible for managing the following functions:

• Application state and internal operations
• Tray
• Kicking off the RENDERER process

The second process is RENDERER and it is responsible for displaying the graphical user interface for the application.

#### Communication between processes:

Both the MAIN and RENDERER processes need to communicate with each other to stay in sync. For this reason, we are using a standard approach of Inter-Process Communication (IPC).

Javascript is not type-safe, which isn’t very reliable. We use Flow static type checker which adds type-safety for Javascript. This especially applies to syncing data between processes — it becomes less reliable when using out-of-the-box IPC. To improve that, with custom implementation on top to have type-safety.

MessageTransport describes a single typed message which is sent between processes. It creates alignment between both processes by introducing sender and receiver objects, ensuring that both sides expect the same arguments of this message.

Here is an implementation:

class MessageTransport<T> {
_channel: string
_messageBus: MessageBus
constructor (channel: string, messageBus: MessageBus) {
this._channel = channel
this._messageBus = messageBus
}
buildSender (): MessageSender<T> {
return new MessageSender(this._channel, this._messageBus)
}
buildReceiver (): MessageReceiver<T> {
}
}
class MessageSender<T> {
_channel: string
_messageBus: MessageBus
constructor (channel: string, messageBus: MessageBus) {
this._channel = channel
this._messageBus = messageBus
}
send (data: T) {
this._messageBus.send(this._channel, data)
}
}
class MessageReceiver<T> {
_channel: string
_messageBus: MessageBus
constructor (channel: string, messageBus: MessageBus) {
this._channel = channel
this._messageBus = messageBus
}
on (callback: T => void) {
this._messageBus.on(this._channel, callback)
}
removeCallback (callback: T => void) {
this._messageBus.removeCallback(this._channel, callback)
}
}

Here is an example of communication between both these MAIN and RENDERER processes:

Example: communicating country proposal updates between processes:

MAIN process is managing country proposals internally and it sends all updates:

this._countryList.onUpdate(countries => {
this._communication.countryUpdate.send(countries)
})

RENDERER process listens for country updates,

this.rendererCommunication.countryUpdate.on(this.onCountriesUpdate)
...
onCountriesUpdate (countries) {
this.countryList = countries
}

Having such an abstraction layer ensures that communication is type-safe, reliable and features around it are simple to test.

#### How do we integrate Mysterium Node with MysteriumVPN Application?

Once we’ve rendered the application layer, we still need to connect MysteriumVPN to Mysterium NodeMysterium Nodeis a software that connects you to Mysterium Network where you are able to exchange value for bandwidth.

MysteriumVPN is a client application of Mysterium Network. The successful running of our dVPN on the network will attract other use cases from existing or future businesses that require end-to-end encryption of data, thereby expanding Mysterium Network’s ecosystem.

We require specific information to ensure the successful running of our dVPNservice.

Operation System Service
Since we are running Mysterium Node under the MysteriumVPN application we need to supervise the Mysterium Node to ensure that it works.

Our Data Protection Policy
We make a clear distinction between personal data and usage data. We do not collect information on who you are. We collect data on session and connection inputs and outputs. This is important data for us as it gives us visibility on how our technology fares against the realities of cyber oppression. Check out our privacy policy for more information.

Logging
Since we are integrating Mysterium Node into the MysteriumVPN application, the application itself gets quite complex. That’s why we have to be prepared to log errors from everywhere, — our application, Mysterium Node, and from Electron.

That means that there are three sources of inputs. When we are inspecting something, we need to understand that these errors can happen in three different places. We need to synchronise those and collect all relevant data from these sources.

Data management in the era of web 3 is complex and we hope to do so in an ethical and fair manner. Check out how our no logs policy protects your personal data.

#### Build on Mysterium Network

We have an npm package that allows for you to connect to Mysterium Nodeeasily. This is the same package that the MysteriumVPN uses to connect to Mysterium Network. This can be used for any application — it’s literally plug and play.

Interested in contributing to Mysterium Network? We are an open source project focused on bringing privacy, security and freedom to web 3. Check out our Github.

## Malware on Steroids Part 3: Machine Learning & Sandbox Evasion

It’s been a busy month for me and I was not able to save time to write the final part of the series on Malware Development. But I am receiving too many DMs on Twitter accounts lately to publish the final part. So here we are.

If you are reading this blog, I am basically assuming that you know C/C++ and Windows API by now. If you don’t, then you should go back and read my other blogs on Static AV Evasion and Malware Development using WINAPI (basics).

In this post, we will be using multiple ways to evade endpoint detection mechanisms and sandboxes. Machine Learning is applied at two major levels in most organization. One is at the network level where it tries to identify anomalies based on the behavior of network connections, proxy logs and pattern of connections over time. Most Network ML Solutions tend to analyze beacons of malwares and DPI (deep packet inspection) to identify the malware. This is something that Microsoft ATA (Advanced Threat Analytics), or FireEye sandboxes do. On the other hand, we have Endpoint agents like Symantec EP, Crowdstrike, Endgame, Microsoft Cloud Defender and similar monitoring tools which perform behavioral analysis of the code along with signature detection to detect malicious processes.

I will purely be focusing on multiple ways where we can make our malware behave like a legitimate executable or try to confuse the Endpoint agent to evade detection. I’ve used the methods mentioned in this blog to successfully evade Crowdstrike Agent, Symantec EP and Microsoft Windows Cloud Defender, the videos of the latter which I have already posted in my previous blogs. However, you might need to modify or add new techniques as this might become detectable over time. One of the best ways to avoid AV is to disable the Process creation altogether and just use WINAPI. But that would mean carefully crafting your payloads and it would be difficult to port them for shellcoding. That’s the main reason malware authors write their malwares in C, and only selected payloads in shellcode. A combination of these two makes malwares unbeatable on all fronts.

Each of the techniques mentioned below creates a unique signature which most AVs won’t have. It’s more of a trail and error to check which AVs detect which techniques. Also remember that we can use stubs and packers for encryption, but that’s for a different blog post that I will do later.

P.S.: This blog is exclusive of shellcodes, reason being I will be writing a separate blog series on windows Shellcoding later. I will be using encrypted functions during the shellcoding part and not in this post. This post is specifically how Malware authors use C to perform evasions. You can also use the same APIs and code snippets mentioned below to craft a custom malware for Red Teaming.

# main():

So, before we start let’s try to get a based understanding of how Machine learning works. Machine learning is purely focused on the behaviour of the user (in case of endpoints). In short, if we sign our malware and try to make it act like a legitimate executable, it becomes really easy to evade ML. I’ve seen people using PowerShell to write reverse shells, but they get easy detectable due to Microsoft’s AMSI (Anti-Malware Scan Interface) which consistently keeps on checking (including and mainly PowerShell) to detect malicious process executions and connections.  For those of you who don’t know, Microsoft uses DMTK(Microsoft Distributed Machine Learning Toolkit) framework which is basically a decision tree based algorithm which specifies whether a file is malicious or not. PowerShell is very tightly controlled by Microsoft and it gets harder over time to evade ML when using PowerShell.

This is the reason I decided to switch to C and C++ to get reverse shells over network so that I could have flexibility at a lower level to do whatever I want. We will be using a lot of windows APIs, encrypted variables and a lot of decision tree of our own to evade ML. This it supposed to work till Microsoft doesn’t start using CNTK framework which is a much better framework than DMTK, but harder to apply at the same time.

## Encrypted Host & Process Names

So, the first thing to do is to encrypt our hostname. We can possibly use something as simple as XOR, or any custom complicated mathematical equation to decrypt our encrypted variable to get the hostname. I created a python script which takes a hostname and a character and returns a Xor’d Array:

As you can see, it gives the Key value in integer of the Xor Key, the length of the encrypted array and the whole Encrypted array which we can simply use in a C integer or char array.

The next step is to decrypt this array at runtime and we need to hardcode the key inside the executable. This is the only key that we would be hardcoding into the code. Also, to make it complicated for the reverse engineer, we will write a C function to automatically detect that the last integer is the key and use that to loop through the array to decrypt the encrypted string. Below is how it would look like

So, we are creating a char buffer of the size of EncryptedHost on heap. We are then passing the host, length and decrypted host variable to the Decrypter function. Below is how the Decrypter function looks:

To explain in short, it creates an Encrypted Integer array of our char array  and xors them back again using the key to convert the encrypted value to the original value and stores them in the DecryptedData array we created previously. With the help of this, if someone runs strings, they wouldn’t be able to see any host in the executable. They would need to understand the math and set a proper breakpoint in Debugger to fetch the C2 host. You can create more complicated mathematical equations to decrypt host if required. We can now use this DecryptedData array within our sockets to connect to the remote host.

P.S.: Reverse Engineers & Sandboxes can fetch the C2 names with the help of packet captures and DNS Name Resolutions. It is better to send raw packets to multiple hosts to confuse which one is the real C2 server. But at the same time, this can lead to easy  detection of the malware. Check my Legitimate Domain Routing technique below which is much better than using this.

If you’ve read my previous post, then you know that I created a cmd.exe process using the CreateProcessW winAPI. We can do what we did above for Creating Processes as well. But instead of hardcoding the Encrypted array for the Process to be executed, we will send the process name as an array over network once the executable connects to the C2 Server along with the host. We can also use authentication on C2 server, and only allow it to connect if it sends a proper key. Below is the Code for Creating Processes using Encrypted Char array over sockets

In this way, when a system sandboxes our executable, it won’t know that what process are we executing beforehand inside a sandbox. Below is a much clearer description of what we are doing:

1. Decrypt C2 host at runtime and connect to host
3. If the key is right, wait for 5 seconds to receive encrypted array(process name) over socket
4. Decrypt the received Process and run it using CreateProcessW API

With the help of the above technique, if our C2 is down, then the sandbox/analyst will not be able to find what we are executing since we have not hardcoded any processes to execute.

## Code Signing with Spoofed Certs

I wrote a Script in python which can fetch and create duplicate certificates from any website which we can use for code signing. One thing I noticed is that Antiviruses don’t check and verify the whole chain of the certificate. They don’t even verify the authenticity. The main reason being not every antivirus can connect to internet in every organization to fetch and verify the ceritificates for every third party application installed. You can find the Certificate spoofing python script on my GitHub profile here.

And this is the scan results of Windows ML Defender after Signing:

Next thing is we will try to add a few features to our malware to detect if we are running in a sandbox or inside a virtual machine. We will try to evade Sandboxes as much as possible and kill our executable as soon as we find anything suspicious. We need to make sure that our malware doesn’t even look suspicious. Because if it does, then the sandbox will quarantine it and send an alert that there is a suspicious process running. This is worse than detection because this is where most SOC detects the malware and the Red Teaming gets detected.

## Legitimate Domain Routing (Evade Proxy Categorization Detection and Endpoint Detection)

This is one of the best techniques I’ve found out till date which almost works every time. Let’s say I buy a C2 domain named abc.com. I will modify the A records so that it points to Microsoft.com or some similar legitimate site for a month or so. When the malware executes on the vicim’s system, it will connect to this domain which will send a normal HTTP reply from Microsoft and the malware will go to sleep for a few hours and then loop into doing the same thing. Now whenever I want to get a reverse shell of my malware, I will simply change the A records of abc.com to my C2 hosting server and it will send a key in HTTP to the malware which will trigger it to fetch shellcode or send a shell back to my C2. This way, our abc.com will also get categorized as a legitimate domain instead of malicious or phishing site. And even the Endpoint systems will not block it since it is contacting a legitimate domain. Over time I’ve also used Symantec’s website to connect as a temporary domain, later changing it to my malicious C2 server.

## Check System Uptime & Idletime (Evades Virtual Machine Sandboxes)

If our executable is running in a virtual machine, the uptime will be pretty short since it will boot up, perform analysis on our binary and then shutdown. So, we can check the uptime of the machine and sleep till it reaches 20-30 minutes and then run it. Make sure to use NTP to check the time with external domain, else Sandboxes can fast-forward system time for process executions. Checking via NTP will make sure that correct time is checked. Below is the code to check uptime of a system and also idle time in case required.

Idletime:

Uptime:

## Check Mac Address of Virtual Machine (Known OUIs)

Vmware, Virtual box, MS Hyper-v and a lot of virtual machine providers use a fixed MAC Unique identifier which can be used to run in a loop to check if current mac address matches to any of those mentioned in the list. If it is, then it is highly possible that the malware is running in a virtual environment, mostly for the purpose of sandboxing and reverse engineering. Below are the OUIs that I know for the moment. If there are more, do let me know in the comments.

 Company and Products MAC unique identifier (s) VMware ESX 3, Server, Workstation, Player 00-50-56, 00-0C-29, 00-05-69 Microsoft Hyper-V, Virtual Server, Virtual PC 00-03-FF Parallels Desktop, Workstation, Server, Virtuozzo 00-1C-42 Virtual Iron 4 00-0F-4B Red Hat Xen 00-16-3E Oracle VM 00-16-3E XenSource 00-16-3E Novell Xen 00-16-3E Sun xVM VirtualBox 08-00-27

Below is the C code to detect mac address of a Windows machine:

## Execute shellcode when a specific key is pressed. (Sleep & hook method)

Here, we are only executing our shellcode/malicious process when the user presses a specific key. For this, we can hook the keyboard and create a list of multiple keys that specify what kind of shellcode needs to be executed. This is basically polymorphism. Every time a different shellcode depending on the key will confuse the Antivirus, and secondly in a sandbox, no one presses any key. So, our malware won’t execute in a sandbox. Below is the Code to hook the keyboard and check the key pressed.

P.S.: Below code can also be used for Keylogging 😉

## Check number of files in Temp and Recent Files

Whenever a malware is running in a sandbox, the sandbox will have the minimum number of recent files in the virtual machine reason being sandboxes are not used for usual work. So, we can run a loop to check the number of recent files and also files in temp directory to check if we are running in a virtual machine. If the number of recent files are less than 10-15, just sleep or suspend itself. Below is a code I wrote which loops to check all files and folders in a directory:

Now I can keep on going like this, but the blog will just get lengthier with this. Besides, below are a few things you can code to check if we are running in a sandbox:

1. Check if the hard disk size is greater than 60 GB (Default Virtual Machine Sandbox Size is <100GB)
2. Check if Packet Capture Driver is installed in the registry (To check if Wireshark or similar is running for packet analysis)
3. Check if Virtual Box additions/extension pack is installed
4. WannaCry DNS Sinkhole Method

This is another method which WannaCry used. So basically, the malware will try to connect to a domain that doesn’t exist. If it does, it means the malware is running in a sandbox, since Sandboxes will reply to a NX Domain too to check if that’s a C2 Server. If we get a NX domain in reply, then we can directly connect to the C2 host. BEWARE, that DNS Sinkholes can prevent your malware from executing at all. Instead you can buy a certain domain and check for a customized response to check if you are running in a sandbox environment.

Now, there are much more different ways to evade ML and AV detection and they aren’t really that hard. Evading ML based AVs are not rocket science as people say. It’s just that it requires more of free time to sit and understand how the underlying architecture works and find flaws to evade it.

It’s much better to invest in a highly technical Threat Hunter for detecting suspicious behaviors in your environment’s and logs rather than buying a high-end Sandbox or Antivirus Solution, though the latter is also useful in it’s own sense too.

## Unofficial opensource place-and-route for Xilinx Coolrunner-II CPLDs

Recently, I have been working on an open-source tool for producing bitstreams for Xilinx Coolrunner-II CPLDs that I have named xc2par. It is finally at a point where I no longer feel embarrassed to have other people try and use it.

# Quick start

The first tool you will need is yosys. You need a version that contains commit 14e49fb05737cfb02217b2aaf14d9f5d9e1859da, but it is recommended to use the latest master branch. Follow the instructions in the yosys README to install it (normally I would point people to my pre-built binaries, but those are currently broken).

Next, you will need to install the compiler for the Rust programming language.

After installing Rust, clone the «openfpga» repository. Change directories into the src/xc2par directory and run cargo build --release. When this completes, the final command-line tool will be in target/release/xc2par.

At this point you are ready to try compiling some HDL! Unfortunately, xc2par does not currently accept exactly the same code as the proprietary Xilinx tools, and there is also currently no documentation on what is or is not accepted. You will have to either guess or read the source code to figure out what currently works. Most notably, .ucf constraint files are not supported (you must use LOC attributes, and they must use the function block and macrocell number rather than a package pin), and automatic insertion of BUFGs is not supported. However, to get started, the following is some example Verilog for an LED blinker:

module top(led0, led1, led2, led3, clk_);

// NOTE: Must use LOC attributes rather than a .ucf file
(* LOC = "FB1_9" *)
output led0;
(* LOC = "FB1_10" *)
output led1;
(* LOC = "FB1_11" *)
output led2;
(* LOC = "FB1_12" *)
output led3;
(* LOC = "FB2_5" *)
input clk_;

// NOTE: Must manually instantiate BUFG
wire clk;
BUFG bufg0 (
.I(clk_),
.O(clk),
);

reg [23:0] counter;

assign {led3, led2, led1, led0} = counter[23:20];

always @(posedge clk)
counter <= counter + 1;

endmodule


To compile this code, use the following commands:

yosys -p "synth_coolrunner2 -json blinky.json" blinky.v


At this point, if everything worked correctly, there will be a blinky.jed file in the same directory as the blinky.json file. It should now be possible to program this file into a CPLD to test it out. However, embarrassingly, there currently isn’t any code in the xc2par project that can help with that. You will either need to use

• ISE/iMPACT (either to generate a .svf file or to program parts directly)
• Andrew Zonenberg’s jtaghal (only supports the 32A and 64A parts)
• Some other JTAG programmer that can read Xilinx .jed files

# The Origins of xc2par

In the beginning, there was only Andrew Zonenberg. Back in 2014, Andrew took some CPLDs and exposed them to a slew of nasty chemicals and some high-energy electrons. This culminated in a talk presented at REcon 2015. At this point in time, the PLA (AND and OR gates) and interconnect were understood but the macrocell configuration was not. Although Andrew had always wanted to write an open-source compiler for these chips, yosys did not have support for sum-of-products architectures at this point in time. This combined with Life™ led to the project being temporarily shelved.

# Robert Has Joined the Party

Being a «hacker» with extremely varied interests, I had been lurking Andrew’s work for years (since at least 2012). I had an interest in both programmable logic devices and reverse engineering. However, since I was pretty busy with Life™ of my own, I had never considered ever contacting Andrew or collaborating. However, one of Andrew’s projects combined with perfect timing changed this.

## Quick Detour: Introducing GreenPak

I first heard about these interesting tiny programmable logic devices from an article on Hackaday. An interesting feature of these parts is that the bitstream format is completely documented in the datasheet. After reading the article, I purchased a GreenPak development kit with the idea of making «some kind of domain-specific-language» for programming these parts (I was not ambitious enough to consider supporting a traditional HDL, and I was not yet aware that yosys existed at the time). However, since I was still working on Life™ (being a university student), this project also got shelved.

However, in the meantime, Andrew was working on his open-source Verilog flow for GreenPak. This tool just so happened to be released almost exactly around the time that I finished my undergrad degree. Since I had planned to take a year off to remain «funemployed,» I decided to introduce myself to Andrew and joined the ##openfpga IRC community on Freenode somewhere around this time. Andrew eventually suggested that I could work on a place-and-route for Coolrunner-II parts (yosys support for sum-of-products had also gotten added around this time).

# Finishing the Reverse Engineering

When Andrew had originally set aside the Coolrunner-II toolchain project back in 2015, the macrocell configuration was not yet understood. Andrew suggested that I ignore the macrocell-related parts and write tooling to handle the other parts of the device first. Macrocells would be configured by copying the entire block of bits from an existing ISE-generated design. Dissatisfied with this answer, I decided to schedule an IRL hackathon with Andrew to figure out all of the remaining bits. This resulted in the following:

Note that this diagram is not accurate and has errors. A more accurate diagram needs to be drawn eventually.

Now with all the bits understood, we could begin writing tooling that did not require any strange hacks.

# xc2bit and xc2par are Born

At the end of May 2017, the first commit of the «xc2bit» code is created. This is a Rust library for performing low-level manipulations of Coolrunner-II bitstreams. This is analogous to Andrew’s never-finished libcrowbar library. The first commit of xc2par is created a few days later.

## First Attempt: xbpar

The first version of xc2par was supposed to use xbpar, Andrew’s generic C++ framework for writing place-and-route tools for «crossbar interconnect» architectures. The internals of xbpar are described in Andrew’s blog post about the GreenPak Verilog tooling, but the general idea is that the algorithm takes as input two graphs, one modeling the target device and one modeling the user’s design, and attempts to check if the user’s design graph is a subgraph of the device structure graph. It does this by attempting to «pair up» nodes of the two graphs and then checking if the edges between the nodes of the design graph exist in the device graph. If any edges do not, it attempts to find something better using simulated annealing.

Trying to use this library for xc2par had a number of issues. The first issue was «ideological.» I wanted to use Rust for xc2par because I did not like C++ but wanted more powerful libraries and abstractions than are available in C. Unfortunately, xbpar is a C++ library that isn’t very friendly to being interfaced with a different programming language. Among other things, device-specific functionality is implemented by deriving from an xbpar base class and overriding some methods. However, I persevered and attempted to write glue code between the two languages. Unfortunately, all of this glue code ended up being more lines of code than xbpar itself.

A second issue was that the specific mechanism I had used for describing graphs for xbpar caused extremely poor behavior. I had described every «enable-able or disable-able» feature of the macrocell as an individual node in xbpar (so the XOR gate of a macrocell would be a node, the register would be a different node, and the IO pad would be yet another different node). However, this would cause the following behavior:

1. The greedy initial placement algorithm packs together «the wrong» set of «macrocell pieces.» For example, it would place the IO pad for pin a and the XOR gate for pin b both at function block 1 macrocell 1. Meanwhile, the XOR gate for pin a and the IO pad for pin b both get placed at function block 1 macrocell 2.
2. The algorithm would notice that there are not proper edges between the nodes. In the above example, there is an edge in the device graph between the XOR and IO pad of FB1_1, but there are no edges between the IO pad of FB1_1 with the XOR gate of FB1_2 or between the XOR gate of FB1_1 with the IO pad of FB1_2. Therefore, all components of pins a and b will get added to the list of nodes contributing to the bad scores.
3. The simulated annealing step picks one of the «bad» nodes and tries to move it. It ends up moving e.g. the XOR gate of one pin to a completely different random spot. This does not improve the situation since the XOR gate needs to be in the same location as all of the other «macrocell bits» for the same pin (but it has just been moved somewhere random and unrelated).
4. The cycle repeats without ever making progress.

One possible solution for this problem is «why not just model the entire macrocell as one giant node with a bunch of attributes?» This solution might work, but it has difficulties with optimal packing of buried nodes. The Coolrunner-II architecture has the following quirk:

• A macrocell that needs to output to a pin cannot share a macrocell site with anything else
• A buried combinatorial feedback node can share a macrocell site with both a direct pin input as well as a pin input that goes through the register
• A buried registered feedback node can share a macrocell site with a direct input pin only, and it can only share this site if the registered feedback node is not also simultaneously a combinatorial feedback node.

It was not clear to me how to express this quirk in a way that was actually useful at fixing the above behavior. Since this also required me to write a lot of device-specific code, I was wondering if I could somehow avoid having to do that.

## Second Attempt: SAT/SMT

While I was encountering the above problems, I discussed them with one of my housemates. They suggested that I should try feeding the place-and-route problem into a SAT/SMT solver. The idea behind this was that:

• Modern SAT/SMT solvers have quite a few advanced algorithms in them
• The problem size seemed «small»

With this suggestion, I quickly hacked together a naive lowering of the place-and-route problem into a SMT problem. Since I added this hack into xbpar, it was possible to test it for both GreenPak designs and Coolrunner-II designs.

Unfortunately, although I did not keep very detailed measurements, this approach did not work very well. A major problem was that this was my first attempt at using SMT solvers. Firstly, I selected the QF_LIA theory because I was expecting that I might use it to eventually implement timing-driven place-and-route. Unfortunately, this restricted me to using the SMT solvers that tended to be relatively slower. Secondly, the particular lowering that I chose was relatively naive and probably did not help the SMT solver to try to optimize the problem. Overall, the results were that Coolrunner-II designs never completed in a reasonable amount of time. GreenPak designs would complete relatively quickly if they did indeed fit in the device (solver returns SAT) but would take forever if they did not actually fit in the device (solver returns UNSAT).

### Second-and-a-Half Attempt: Naive backtracking search

Since using a SMT solver did not appear to be working, I decided to try other «brute-force» techniques. As part of this, I wrote a naive backtracking search solver to assign design graph nodes to device graph nodes. I implemented only the «min-remaining-values» optimization. This actually completed incredibly quickly for GreenPak designs that fit. However, GreenPak designs that did not fit were still somewhat slow, and Coolrunner-II designs still did not work at all.

Since this solver was a completely standalone program and written in Python, I could quickly add hacks to it and observe how it behaved. This ended up giving me the following insights:

• There are symmetries or «shortcuts» that naive solvers cannot easily take advantage of (even with methods like forward checking or arc consistency). For example, every product term in a PLA is accessible to every OR gate, and so it does not «really matter» where these product terms are placed. However, backtracking search cannot easily discover this. The same goes for the ZIA — every ZIA row mux output is accessible to every product term, and it does not matter which specific rows are chosen.
• Following on from the above, the only placements that «really matter» are the placements of the macrocells. Everything else can be «mostly» done with greedy assignments. (However, note the scare quotes. Read below a less simplified explanation)
• Backtracking search cannot be used even for just placing macrocells. An approximation must be used here. This is because, at worst, backtracking search can take exponential time and explore every single possibility. If the algorithm happens to be working on the XC2C512 and is trying to place every macrocell into every possible site, this is 512!23875512!≈23875 possibilities. Even if we try to get a much lower bound by pretending that all macrocells in the user design are identical and can be freely swapped, we still have too many worst-case possibilities. To count how many, we would like to know the «number of ways to put identical (unlabeled) objects into labeled bins, where all of the bins have a maximum size.» Unfortunately my combinatorics is a bit rusty, but @eggleroy managed to help me out with the following formula:
f(N,X,S)=⎧⎩⎨⎪⎪⎪⎪⎪⎪010Sk=0f(Nk,X1,S)if N<0if N=0if N>0,X=0otherwisef(N,X,S)={0if N<01if N=00if N>0,X=0∑k=0Sf(N−k,X−1,S)otherwise

where NN is the number of objects (macrocells), XX is the number of bins (function blocks), and SS is the maximum size of each bin (macrocells per function block). NN depends on the size of the user design while XX depends on the specific CPLD device chosen. SS is fixed for all devices in the family. Somewhat unintuitively (at least to me initially), this function is maximized (for our purposes) when N=256,X=32,S=16N=256,X=32,S=16. Computing this yields 33926222943232064651934548544483314385212533926222943232064651934548544483314385≈2125.

If anybody who is more familiar with SAT/SMT solvers would ever like to continue to play with these ideas, there is some code here.

## Detour: xc2jed2json

While working on reverse engineering efforts, I wrote a tool that can «decompile» Coolrunner-II bitstreams back into a .json netlist that can be loaded into yosys. After some (not yet ready for production) steps, this can be turned into something vaguely understandable. Andrew has given a talk about this. More on this in the future.

## Final Attempt: Explicitly Modeling Shortcuts

At this point I was rather frustrated with working with xbpar. Given the insights I obtained from the backtracking search implementation, I decided to write a completely new place-and-route tool. This tool explicitly models all of the «shortcuts» that can be taken within an individual function block but still uses heuristics for placing macrocells. The algorithm works mostly as follows:

1. There is a subroutine that takes as input an assignment of macrocells to macrocell sites and outputs a placement of product terms and a map of ZIA usage. For each function block, this subroutine does:
1. Finds all of the product terms referenced by macrocells in this function block. The algorithm knows about two categories of product terms: those that have some kind of constraint on their placement (because they go into the dedicated product terms (either the per-macrocell PTA/B/C or the per-function-block CTx control terms) or because they have an explicit LOC constraint), and those that have no such constraints and just go into the OR array. The «shortcut» being taken here is that only those product terms that have a constraint need to be specifically handled. Everything else can be greedily assigned.
2. For the first type of product terms (those that have restrictions on their placement), store all the candidate locations for each of them. Possibly filter by the explicit LOC constraints.
3. Use backtracking search to assign the locations of these product terms. If it does not succeed, return an error. This backtracking search is much smaller because macrocells will almost always use the per-macrocell PTA/B/C terms. The only terms that can be contended are the per-function-block CTx control terms. There are only 4 of them, and at worst the algorithm can try assigning one product term from each macrocell into each of the control terms. This is 164=216164=216 which can complete almost instantly.
4. Take the remaining product terms (that have no restrictions on their placement) and assign each one to the first open site. If there are no open sites left, return an error.
5. Independently of the product term placement logic, gather up all inputs into the product terms. The ZIA muxes need to be programmed to provide these as inputs into the function block. Because we are starting with a complete macrocell placement, we know which specific choices we want to search for in the ZIA.
6. Search the ZIA data table for all of the rows that can possibly give each output. Then use backtracking search to pick a feasible assignment. I do not currently have a complexity estimate for this step, but the ZIA was originally designed such that «almost all» combinations should be able to be routed. The «shortcut» here (that was also the insight originally used by Xilinx to design the ZIA) is that it does not matter which row of the ZIA is used to obtain each specific input. It only matters that the entire set of inputs is somehow present (because AND is commutative).
2. The actual PAR tool first reads a yosys .json file and parses it.
3. We create a data structure consisting of «nodes» and «nets» connecting them. The nodes are of a type such as «AND gate,» «OR gate,» «register,» or «XOR gate,» but they are not distinguished from each other (they are all a generic «node» data type). Net objects store their source and sink nodes, but nets can refer to any nodes whatsoever. Essentially, this step takes the yosys data model and filters/parses the individual cell types without any checking of connectivity between them. It turned out that this data structure was too difficult to work with in subsequent steps, so there is also a second data structure.
4. The second data structure no longer models nets. Instead, there are multiple distinguished types of nodes that directly refer to each other. For example, a node corresponding to an OR gate will directly refer to the AND gates that feed it. All of the functionality related to macrocells is also packed into a large «macrocell» node with subcomponents for «the register part,» «the IO part,» and «the combinatorial (XOR) part.» This step does most of the checking for proper connectivity.
5. Macrocells are initially placed with a greedy placement algorithm. One observation from earlier experiments was that most simple designs can be placed with just greedy placement.
6. Run the subroutine described in 1. If it succeeds, the place-and-route is done!
7. If it did not succeed, something needs to be improved. Instead of using simulated annealing, xc2par uses «min-conflicts.» When the subroutine described in part 1 fails, it also tries to calculate which macrocells in the placement contribute most to the failing. To do so, it simply brute-force deassigns each macrocell in the design and checks how many conflicts are left. The min-conflicts algorithm then chooses a «bad» macrocell to move weighted by the number of conflicts each macrocell causes.
8. Once a macrocell to move is chosen, the algorithm tries every location in the device and tries swapping the macrocell into that location (ignoring those that have LOC constraints on them). The algorithm is looking for a site that improves the number of conflicts the most.
9. If such a site is found, commit to the given swap. If not, randomly perform a swap. If an iteration limit is exceeded, give up.

After this is complete, xc2par then outputs a programming file.

# Future work

Since this is an early prototype, there is always room for improvement. Hop into ##openfpga on Freenode to discuss and report bugs.

Notable areas needing improvement are:

• More tests
• Support for IO standards (currently hardcoded)
• Directly generating .svf files (or other means of programming chips)
• Ability to use the XOR-with-PTC functionality more effectively
• Extra passes to handle Xilinx-isms (e.g. things described in the «Xilinx CPLD Libraries Guide»)
• Devices other than the XC2C32/A

## .NET payloads in Windows Notification Facility (WNF) state names using low-level Windows Kernel API calls.

PoC for persisting .NET payloads in Windows Notification Facility (WNF) state names using low-level Windows Kernel API calls.  — PoC example

With billions of people around the globe using Facebook services, our infrastructure engineers have created a range of systems to optimize traffic and to enable fast, reliable access for everyone. Today, we are open-sourcing a component of this work by releasing the Katran forwarding plane software library, which powers the network load balancer used in Facebook’s infrastructure. Katran offers a software-based solution to load balancing with a completely reengineered forwarding plane that takes advantage of two recent innovations in kernel engineering: eXpress Data Path (XDP) and the eBPF virtual machine. Katran is deployed today on backend servers in Facebook’s points of presence (PoPs), and it has helped us improve the performance and scalability of network load balancing and reduce inefficiencies such as busy loops when there are no incoming packets. By sharing it with the open source community, we hope others can improve the performance of their load balancers and also use Katran as a foundation for future work.

## The challenge of serving requests at Facebook scale

To manage traffic at Facebook scale, we have deployed a globally distributed network of points of presence to act as proxies for our data centers. Given the extremely high volume of requests, both PoPs and data centers confront the challenge of making the large fleet of (backend) servers appear as a single virtual unit to the outside world and also distributing the workload efficiently among those backend servers.

These challenges are typically addressed by announcing a virtual IP address (VIP) to the internet at each location. Packets destined to the VIP are then seamlessly distributed among the backend servers. The distribution algorithm, however, needs to account for the fact that the backend servers typically operate at an application layer and terminate the TCP connections. This responsibility is handled by a network load balancer (often called a layer 4 load balancer, or an L4LB, because it operates on packets rather than serving application level requests). Figure 1 illustrates the role of an L4LB in relation to the other network components.

## Requirements for a high-performance load balancer

An L4LB’s performance is especially important for managing latency and scaling the number of backend servers, because the L4LBs are on a path that needs to process every incoming packet. Performance is typically measured as peak packets per second (pps) that the L4LB can process. Traditionally, engineers have preferred hardware-based solutions for this task because they typically use accelerators such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) to reduce the burden on the main CPU. However, one of the drawbacks of a hardware-centric approach is that it limits the system’s flexibility. To effectively serve Facebook’s needs, a network load balancer must:

• Run on commodity Linux servers. This allows us to run the load balancer on part or all of the large fleet of currently deployed servers. A software-based load balancer satisfies this criteria.
• Coexist with other services on a given server. This removes the need for dedicated servers that run the load balancer exclusively, thereby increasing fault tolerance.
• Allow low-disruption maintenance. Facebook’s software must be able to evolve quickly in order to support new or improved products and services. Maintenance and upgrades are a norm, not exceptions, for the load balancer and backend layers. Minimizing disruption during these events allows us to iterate faster.
• Offer easy instrumentation and debugging. All large distributed infrastructures must contend with anomalies and unexpected events, so reducing the time to debug and troubleshoot issues is important. The load balancer needs to be instrumentable and friendly to standard tools like tcpdump.

In order to solve for these requirements, we designed a high-performance software network load balancer. The first generation of our L4LB was based on the IPVS kernel module and served Facebook’s needs for well over four years. However, it fell short on the goal of coexistence with other services, specifically the backends. In the second iteration, we leveraged the eXpress Data Path (XDP) framework and the new BPF virtual machine (eBPF) to run the software load balancer together with the backends on a large number of machines. Figure 2 shows the key difference between the two generations.

## Our First-generation L4LB: Building on OSS Software

With our first-generation L4LB, we leaned heavily on existing open source components to implement most of the functionality. This approach helped us replace a hardware-based solution across a large deployment in only a few months. The design has four major components:

• VIP announcement: This component simply announces the virtual IP addresses that the L4LB is responsible for to the world by peering with the network element (typically a switch) in front of the L4LB. The switch then uses an equal-cost multipath (ECMP) mechanism to distribute packets among the L4LBs announcing the VIP. We used ExaBGP for the VIP announcement because of its lightweight, flexible design.
• Backend server selection: In order to send all packets from a client to the same backend, the L4LBs use a consistent hash that depends on the 5-tuple (source address, source port, destination address, destination port, and protocol) of the incoming packet. The use of a consistent hash ensures that all packets that belong to a transport connection are sent to the same backend irrespective of the L4LB receiving the packet. This removes the need for any state synchronization across multiple L4LBs. The consistent hash also guarantees minimal disruption to existing connections when a backend leaves or joins the pool of backends.
• Forwarding plane: Once the L4LB picks the appropriate backend, the packets need to be forwarded to that host. To avoid restrictions such as keeping L4LB and backend hosts on the same L2 domain, we use a simple IP-in-IP encapsulation. This allows us to place L4LB and backend hosts in different racks. We used the IPVS kernel module for the encapsulation. The backends are configured to have the corresponding VIP on their loopback interface. This allows the backend to send packets on the return path directly to the client (instead of the L4LB). This optimization, often called direct server return (DSR), allows the L4LB to be constrained only by the incoming packet volume.
• Control plane: This component performs various functions, including performing health checks on the backend servers, providing a simple interface (via a configuration file) to add or remove VIPs, and providing simple APIs to examine the state of the L4LB and backend servers. We developed this component in-house.

Each L4LB also stores the backend choice for each 5-tuple as a lookup table to avoid duplicate computation of the hash on future packets. This state is a pure optimization and is not necessary for correctness. This design met several requirements of Facebook’s workload listed above, but there was one major drawback: Colocating the L4LB and a backend on a single host increased the chance of device failure. Even with the local state, the L4LB was a CPU-intensive component. To separate the failure domains, we ran the L4LBs and backend servers on a disjointed set of machines. There were fewer L4LBs than backend servers in this setup, which made the L4LBs more vulnerable to a sudden increase in load. The fact that packets had to traverse the regular Linux network stack before being handled by the L4LB exacerbated the problem.

## Katran: Reimagining the forwarding plane

Katran, our second-generation L4LB, significantly improves upon the previous version with a completely reengineered forwarding plane. Two recent developments in the kernel world powered the new design:

• The XDP provides a fast, programmable network data path without resorting to a full-fledged kernel bypass method and works in conjunction with the Linux networking stack. (A detailed overview of XDP is available here.)
• The eBPF virtual machine provides a flexible, efficient, and more reliable way to interact with the Linux kernel and to extend its functionality by running user-space supplied programs at specific points in the kernel. eBPF has already brought dramatic improvements to several areas, including tracing and filtering. (More details are available here.)

The overall architecture of the system is similar to that of the first-generation L4LB: First, ExaBGP announces to the world which VIPS a particular Katran instance is responsible for. Second, packets destined to a VIP are sent to Katran instances using an ECMP mechanism. Finally, Katran selects a backend and forwards the packet to the correct backend server. The main differences are in the last step.

Early and efficient packet handling: Katran uses XDP in combination with a BPF program for packet forwarding. When XDP is enabled in driver mode, a packet handling routine (BPF program) is run immediately after a packet is received by the network interface card (NIC) and before the kernel intercepts it. XDP invokes the BPF program on every incoming packet. If the NIC has multiple queues, the program is invoked in parallel for each one them. The BPF program used for handling packets is lockless and uses a per-CPU version of BPF maps. Due to this parallelism, performance scales linearly with the number of the NIC’s RX queues. Katran also supports the “generic XDP” mode (instead of driver mode) of operation, at a performance cost.

Inexpensive and more stable hashing: Katran uses an extended version of the Maglev hash to select the backend server. A few features of the extended hash are resilience to backend server failures, more uniform distribution of load, and the ability to set unequal weights for different backend servers. The last of these is an important feature that allows us to handle hardware refreshes in our PoPs and data centers easily: We can absorb the newer generation hardware by simply setting appropriate weights. Despite its being more expressive, the code for computing this hash is small enough to fit entirely in the L1 cache.

More resilient local state: Katran’s efficiency at handling packets and computing the hash results in an interesting interaction with the local state table. We observed that, quite often, computing the hash is computationally easier than looking up the local state table for the 5-tuple to backend server choice. This is more visible for cases where the local state table lookup traverses all the way to the shared last level cache. In order to take advantage of this phenomenon in a natural way, we implemented the lookup table as an LRU-evicting cache. The LRU cache size is configurable at startup time and acts as a tunable parameter to strike a balance between computation and lookup. We picked these values empirically to optimize for pps. In addition, Katran provides a runtime “compute only” switch to ignore the LRU cache altogether in the event of catastrophic memory pressure on the host.

RSS-friendly encapsulation: Received Side Scaling (RSS) is an important optimization in NICs that aims to spread load across CPUs uniformly by steering packets from each flow to a separate CPU. Katran crafts its encapsulation to work in conjunction with RSS. Instead of using the same outer source for every IP-in-IP packet, packets in different flows (e.g., with different 5-tuples) are encapsulated using a different outer source IP, but packets in the same flow are always assigned the same outer source IP.

These features dramatically enhance performance, flexibility and scalability of the L4LB. Katran’s design also gets rid of busy loops on receive path barely consuming any CPU if there are no incoming packets. In contrast to a full-fledged Kernel Bypass solution (such as DPDK), using XDP allows us to run Katran alongside any application without any performance penalties on the same host. Katran today runs alongside the backend servers in our PoPs with an improved L4LB-to-backend ratio. This increases resilience to load spikes, host failures, and maintenance, as well. The reengineered forwarding plane was central to this shift. We believe other systems can benefit by using our forwarding plane, so we are open-sourcing our code and including several examples of how to use it to craft an L4LB.