Modlishka — An Open Source Phishing Tool With 2FA Authentication

Original text by Lydecher black


Modlishka — An Open Source Phishing Tool With 2FA Authentication

Modlishka is a flexible and powerful reverse proxy, that will take your phishing campaigns to the next level (with minimal effort required from your side).
Enjoy 🙂

Features
Some of the most important ‘Modlishka’ features :

  • Support for majority of 2FA authentication schemes (by design).
  • No website templates (just point Modlishka to the target domain — in most cases, it will be handled automatically).
  • Full control of «cross» origin TLS traffic flow from your victims browsers.
  • Flexible and easily configurable phishing scenarios through configuration options.
  • Pattern based JavaScript payload injection.
  • Striping website from all encryption and security headers (back to 90’s MITM style).
  • User credential harvesting (with context based on URL parameter passed identifiers).
  • Can be extended with your ideas through plugins.
  • Stateless design. Can be scaled up easily for an arbitrary number of users — ex. through a DNS load balancer.
  • Web panel with a summary of collected credentials and user session impersonation (beta).
  • Written in Go.

Action
«A picture is worth a thousand words»:
Modlishka in action against an example 2FA (SMS) enabled authentication scheme:

Note: google.com was chosen here just as a POC.

Installation
Latest source code version can be fetched from here (zip) or here (tar).
Fetch the code with ‘go get’ :

$ go get -u github.com/drk1wi/Modlishka

Compile the binary and you are ready to go:

$ cd $GOPATH/src/github.com/drk1wi/Modlishka/
$ make
# ./dist/proxy -h


Usage of ./dist/proxy:
      
  -cert string
     base64 encoded TLS certificate
  
  -certKey string
     base64 encoded TLS certificate key
  
  -certPool string
     base64 encoded Certification Authority certificate
  
  -config string
     JSON configuration file. Convenient instead of using command line switches.
  
  -credParams string
       Credential regexp collector with matching groups. Example: base64(username_regex),base64(password_regex)

  -debug
     Print debug information
  
  -disableSecurity
     Disable security features like anti-SSRF. Disable at your own risk.
  
  -jsRules string
     Comma separated list of URL patterns and JS base64 encoded payloads that will be injected. 
  
  -listeningAddress string
     Listening address (default "127.0.0.1")
  
  -listeningPort string
     Listening port (default "443")
  
  -log string
     Local file to which fetched requests will be written (appended)
  
  -phishing string
     Phishing domain to create - Ex.: target.co
  
  -plugins string
     Comma seperated list of enabled plugin names (default "all")
  
  -postOnly
     Log only HTTP POST requests
  
  -rules string
     Comma separated list of 'string' patterns and their replacements. 
  
  -target string
     Main target to proxy - Ex.: https://target.com
  
  -targetRes string
     Comma separated list of target subdomains that need to pass through the  proxy 
  
  -terminateTriggers string
     Comma separated list of URLs from target's origin which will trigger session termination
  
  -terminateUrl string
     URL to redirect the client after session termination triggers
  
  -tls
     Enable TLS (default false)
  
  -trackingCookie string
     Name of the HTTP cookie used to track the victim (default "id")
  
  -trackingParam string
     Name of the HTTP parameter used to track the victim (default "id")

Usage

  • Check out the wiki page for a more detailed overview of the tool usage.
  • FAQ (Frequently Asked Questions)
  • Blog post

Credits
Thanks for helping with the code go to Giuseppe Trotta (@Giutro)

Реклама

Hypervisor From Scratch – Part 2: Entering VMX Operation

Original text bySinaei )

Hi guys,

It’s the second part of a multiple series of a tutorial called “Hypervisor From Scratch”, First I highly recommend to read the first part (Basic Concepts & Configure Testing Environment) before reading this part, as it contains the basic knowledge you need to know in order to understand the rest of this tutorial.

In this section, we will learn about Detecting Hypervisor Support for our processor, then we simply config the basic stuff to Enable VMX and Entering VMX Operation and a lot more thing about Window Driver Kit (WDK).

Configuring Our IRP Major Functions

Beside our kernel-mode driver (“MyHypervisorDriver“), I created a user-mode application called “MyHypervisorApp“, first of all (The source code is available in my GitHub), I should encourage you to write most of your codes in user-mode rather than kernel-mode and that’s because you might not have handled exceptions so it leads to BSODs, or on the other hand, running less code in kernel-mode reduces the possibility of putting some nasty kernel-mode bugs.

If you remember from the previous part, we create some Windows Driver Kit codes, now we want to develop our project to support more IRP Major Functions.

IRP Major Functions are located in a conventional Windows table that is created for every device, once you register your device in Windows, you have to introduce these functions in which you handle these IRP Major Functions. That’s like every device has a table of its Major Functions and everytime a user-mode application calls any of these functions, Windows finds the corresponding function (if device driver supports that MJ Function) based on the device that requested by the user and calls it then pass an IRP pointer to the kernel driver.

Now its responsibility of device function to check the privileges or etc.

The following code creates the device :

12345678910111213 NTSTATUS NtStatus = STATUS_SUCCESS; UINT64 uiIndex = 0; PDEVICE_OBJECT pDeviceObject = NULL; UNICODE_STRING usDriverName, usDosDeviceName;  DbgPrint(«[*] DriverEntry Called.»);   RtlInitUnicodeString(&usDriverName, L»\\Device\\MyHypervisorDevice»); RtlInitUnicodeString(&usDosDeviceName, L»\\DosDevices\\MyHypervisorDevice»);  NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject); NTSTATUS NtStatusSymLinkResult = IoCreateSymbolicLink(&usDosDeviceName, &usDriverName);

Note that our device name is “\Device\MyHypervisorDevice.

After that, we need to introduce our Major Functions for our device.

1234567891011121314151617 if (NtStatus == STATUS_SUCCESS && NtStatusSymLinkResult == STATUS_SUCCESS) { for (uiIndex = 0; uiIndex < IRP_MJ_MAXIMUM_FUNCTION; uiIndex++) pDriverObject->MajorFunction[uiIndex] = DrvUnsupported;  DbgPrint(«[*] Setting Devices major functions.»); pDriverObject->MajorFunction[IRP_MJ_CLOSE] = DrvClose; pDriverObject->MajorFunction[IRP_MJ_CREATE] = DrvCreate; pDriverObject->MajorFunction[IRP_MJ_DEVICE_CONTROL] = DrvIOCTLDispatcher; pDriverObject->MajorFunction[IRP_MJ_READ] = DrvRead; pDriverObject->MajorFunction[IRP_MJ_WRITE] = DrvWrite;  pDriverObject->DriverUnload = DrvUnload; } else { DbgPrint(«[*] There was some errors in creating device.»); }

You can see that I put “DrvUnsupported” to all functions, this is a function to handle all MJ Functions and told the user that it’s not supported. The main body of this function is like this:

12345678910NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] This function is not supported 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;}

We also introduce other major functions that are essential for our device, we’ll complete the implementation in the future, let’s just leave them alone.

12345678910111213141516171819202122232425262728293031323334353637383940414243NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject,IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;}

Now let’s see IRP MJ Functions list and other types of Windows Driver Kit handlers routine.

IRP Major Functions List

This is a list of IRP Major Functions which we can use in order to perform different operations.

123456789101112131415161718192021222324252627282930#define IRP_MJ_CREATE                   0x00#define IRP_MJ_CREATE_NAMED_PIPE        0x01#define IRP_MJ_CLOSE                    0x02#define IRP_MJ_READ                     0x03#define IRP_MJ_WRITE                    0x04#define IRP_MJ_QUERY_INFORMATION        0x05#define IRP_MJ_SET_INFORMATION          0x06#define IRP_MJ_QUERY_EA                 0x07#define IRP_MJ_SET_EA                   0x08#define IRP_MJ_FLUSH_BUFFERS            0x09#define IRP_MJ_QUERY_VOLUME_INFORMATION 0x0a#define IRP_MJ_SET_VOLUME_INFORMATION   0x0b#define IRP_MJ_DIRECTORY_CONTROL        0x0c#define IRP_MJ_FILE_SYSTEM_CONTROL      0x0d#define IRP_MJ_DEVICE_CONTROL           0x0e#define IRP_MJ_INTERNAL_DEVICE_CONTROL  0x0f#define IRP_MJ_SHUTDOWN                 0x10#define IRP_MJ_LOCK_CONTROL             0x11#define IRP_MJ_CLEANUP                  0x12#define IRP_MJ_CREATE_MAILSLOT          0x13#define IRP_MJ_QUERY_SECURITY           0x14#define IRP_MJ_SET_SECURITY             0x15#define IRP_MJ_POWER                    0x16#define IRP_MJ_SYSTEM_CONTROL           0x17#define IRP_MJ_DEVICE_CHANGE            0x18#define IRP_MJ_QUERY_QUOTA              0x19#define IRP_MJ_SET_QUOTA                0x1a#define IRP_MJ_PNP                      0x1b#define IRP_MJ_PNP_POWER                IRP_MJ_PNP      // Obsolete….#define IRP_MJ_MAXIMUM_FUNCTION         0x1b

Every major function will only trigger if we call its corresponding function from user-mode. For instance, there is a function (in user-mode) called CreateFile (And all its variants like CreateFileA and CreateFileW for ASCII and Unicode) so everytime we call CreateFile the function that registered as IRP_MJ_CREATE will be called or if we call ReadFile then IRP_MJ_READ and WriteFile then IRP_MJ_WRITE  will be called. You can see that Windows treats its devices like files and everything we need to pass from user-mode to kernel-mode is available in PIRP Irp as a buffer when the function is called.

In this case, Windows is responsible to copy user-mode buffer to kernel mode stack.

Don’t worry we use it frequently in the rest of the project but we only support IRP_MJ_CREATE in this part and left others unimplemented for our future parts.

IRP Minor Functions

IRP Minor functions are mainly used for PnP manager to notify for a special event, for example,The PnP manager sends IRP_MN_START_DEVICE  after it has assigned hardware resources, if any, to the device or The PnP manager sends IRP_MN_STOP_DEVICE to stop a device so it can reconfigure the device’s hardware resources.

We will need these minor functions later in these series.

A list of IRP Minor Functions is available below:

1234567891011121314151617181920212223IRP_MN_START_DEVICEIRP_MN_QUERY_STOP_DEVICEIRP_MN_STOP_DEVICEIRP_MN_CANCEL_STOP_DEVICEIRP_MN_QUERY_REMOVE_DEVICEIRP_MN_REMOVE_DEVICEIRP_MN_CANCEL_REMOVE_DEVICEIRP_MN_SURPRISE_REMOVALIRP_MN_QUERY_CAPABILITIES IRP_MN_QUERY_PNP_DEVICE_STATEIRP_MN_FILTER_RESOURCE_REQUIREMENTSIRP_MN_DEVICE_USAGE_NOTIFICATIONIRP_MN_QUERY_DEVICE_RELATIONSIRP_MN_QUERY_RESOURCESIRP_MN_QUERY_RESOURCE_REQUIREMENTSIRP_MN_QUERY_IDIRP_MN_QUERY_DEVICE_TEXTIRP_MN_QUERY_BUS_INFORMATIONIRP_MN_QUERY_INTERFACEIRP_MN_READ_CONFIGIRP_MN_WRITE_CONFIGIRP_MN_DEVICE_ENUMERATEDIRP_MN_SET_LOCK

Fast I/O

For optimizing VMM, you can use Fast I/O which is a different way to initiate I/O operations that are faster than IRP. Fast I/O operations are always synchronous.

According to MSDN:

Fast I/O is specifically designed for rapid synchronous I/O on cached files. In fast I/O operations, data is transferred directly between user buffers and the system cache, bypassing the file system and the storage driver stack. (Storage drivers do not use fast I/O.) If all of the data to be read from a file is resident in the system cache when a fast I/O read or write request is received, the request is satisfied immediately. 

When the I/O Manager receives a request for synchronous file I/O (other than paging I/O), it invokes the fast I/O routine first. If the fast I/O routine returns TRUE, the operation was serviced by the fast I/O routine. If the fast I/O routine returns FALSE, the I/O Manager creates and sends an IRP instead.

The definition of Fast I/O Dispatch table is:

123456789101112131415161718192021222324252627282930typedef struct _FAST_IO_DISPATCH {  ULONG                                  SizeOfFastIoDispatch;  PFAST_IO_CHECK_IF_POSSIBLE             FastIoCheckIfPossible;  PFAST_IO_READ                          FastIoRead;  PFAST_IO_WRITE                         FastIoWrite;  PFAST_IO_QUERY_BASIC_INFO              FastIoQueryBasicInfo;  PFAST_IO_QUERY_STANDARD_INFO           FastIoQueryStandardInfo;  PFAST_IO_LOCK                          FastIoLock;  PFAST_IO_UNLOCK_SINGLE                 FastIoUnlockSingle;  PFAST_IO_UNLOCK_ALL                    FastIoUnlockAll;  PFAST_IO_UNLOCK_ALL_BY_KEY             FastIoUnlockAllByKey;  PFAST_IO_DEVICE_CONTROL                FastIoDeviceControl;  PFAST_IO_ACQUIRE_FILE                  AcquireFileForNtCreateSection;  PFAST_IO_RELEASE_FILE                  ReleaseFileForNtCreateSection;  PFAST_IO_DETACH_DEVICE                 FastIoDetachDevice;  PFAST_IO_QUERY_NETWORK_OPEN_INFO       FastIoQueryNetworkOpenInfo;  PFAST_IO_ACQUIRE_FOR_MOD_WRITE         AcquireForModWrite;  PFAST_IO_MDL_READ                      MdlRead;  PFAST_IO_MDL_READ_COMPLETE             MdlReadComplete;  PFAST_IO_PREPARE_MDL_WRITE             PrepareMdlWrite;  PFAST_IO_MDL_WRITE_COMPLETE            MdlWriteComplete;  PFAST_IO_READ_COMPRESSED               FastIoReadCompressed;  PFAST_IO_WRITE_COMPRESSED              FastIoWriteCompressed;  PFAST_IO_MDL_READ_COMPLETE_COMPRESSED  MdlReadCompleteCompressed;  PFAST_IO_MDL_WRITE_COMPLETE_COMPRESSED MdlWriteCompleteCompressed;  PFAST_IO_QUERY_OPEN                    FastIoQueryOpen;  PFAST_IO_RELEASE_FOR_MOD_WRITE         ReleaseForModWrite;  PFAST_IO_ACQUIRE_FOR_CCFLUSH           AcquireForCcFlush;  PFAST_IO_RELEASE_FOR_CCFLUSH           ReleaseForCcFlush;} FAST_IO_DISPATCH, *PFAST_IO_DISPATCH;

Defined Headers

I created the following headers (source.h) for my driver.

12345678910111213141516171819202122232425262728293031323334#pragma once#include <ntddk.h>#include <wdf.h>#include <wdm.h> extern void inline Breakpoint(void);extern void inline Enable_VMX_Operation(void);  NTSTATUS DriverEntry(PDRIVER_OBJECT  pDriverObject, PUNICODE_STRING  pRegistryPath);VOID DrvUnload(PDRIVER_OBJECT  DriverObject);NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvIOCTLDispatcher(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp); VOID PrintChars(_In_reads_(CountChars) PCHAR BufferAddress, _In_ size_t CountChars);VOID PrintIrpInfo(PIRP Irp); #pragma alloc_text(INIT, DriverEntry)#pragma alloc_text(PAGE, DrvUnload)#pragma alloc_text(PAGE, DrvCreate)#pragma alloc_text(PAGE, DrvRead)#pragma alloc_text(PAGE, DrvWrite)#pragma alloc_text(PAGE, DrvClose)#pragma alloc_text(PAGE, DrvUnsupported)#pragma alloc_text(PAGE, DrvIOCTLDispatcher)   // IOCTL Codes and Its meanings#define IOCTL_TEST 0x1 // In case of testing

Now just compile your driver.

Loading Driver and Check the presence of Device

In order to load our driver (MyHypervisorDriver) first download OSR Driver Loader, then run Sysinternals DbgView as administrator make sure that your DbgView captures the kernel (you can check by going Capture -> Capture Kernel).

Enable Capturing Event

After that open the OSR Driver Loader (go to OsrLoader -> kit-> WNET -> AMD64 -> FRE) and open OSRLOADER.exe (in an x64 environment). Now if you built your driver, find .sys file (in MyHypervisorDriver\x64\Debug\ should be a file named: “MyHypervisorDriver.sys”), in OSR Driver Loader click to browse and select (MyHypervisorDriver.sys) and then click to “Register Service” after the message box that shows your driver registered successfully, you should click on “Start Service”.

Please note that you should have WDK installed for your Visual Studio in order to be able building your project.

Load Driver in OSR Driver Loader

Now come back to DbgView, then you should see that your driver loaded successfully and a message “[*] DriverEntry Called. ” should appear.

If there is no problem then you’re good to go, otherwise, if you have a problem with DbgView you can check the next step.

Keep in mind that now you registered your driver so you can use SysInternals WinObj in order to see whether “MyHypervisorDevice” is available or not.

WinObj

The Problem with DbgView

Unfortunately, for some unknown reasons, I’m not able to view the result of DbgPrint(), If you can see the result then you can skip this step but if you have a problem, then perform the following steps:

As I mentioned in part 1:

In regedit, add a key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Debug Print Filter

Under that , add a DWORD value named IHVDRIVER with a value of 0xFFFF

Reboot the machine and you’ll good to go.

It always works for me and I tested on many computers but my MacBook seems to have a problem.

In order to solve this problem, you need to find a Windows Kernel Global variable called, nt!Kd_DEFAULT_Mask, this variable is responsible for showing the results in DbgView, it has a mask that I’m not aware of so I just put a 0xffffffff in it to simply make it shows everything!

To do this, you need a Windows Local Kernel Debugging using Windbg.

  1. Open a Command Prompt window as Administrator. Enter bcdedit /debug on
  2. If the computer is not already configured as the target of a debug transport, enter bcdedit /dbgsettings local
  3. Reboot the computer.

After that you need to open Windbg with UAC Administrator privilege, go to File > Kernel Debug > Local > press OK and in you local Windbg find the nt!Kd_DEFAULT_Mask using the following command :

12prlkd> x nt!kd_Default_Maskfffff801`f5211808 nt!Kd_DEFAULT_Mask = <no type information>

Now change it value to 0xffffffff.

1lkd> eb fffff801`f5211808 ff ff ff ff
kd_DEFAULT_Mask

After that, you should see the results and now you’ll good to go.

Remember this is an essential step for the rest of the topic, because if we can’t see any kernel detail then we can’t debug.

DbgView

Detecting Hypervisor Support

Discovering support for vmx is the first thing that you should consider before enabling VT-x, this is covered in Intel Software Developer’s Manual volume 3C in section 23.6 DISCOVERING SUPPORT FOR VMX.

You could know the presence of VMX using CPUID if CPUID.1:ECX.VMX[bit 5] = 1, then VMX operation is supported.

First of all, we need to know if we’re running on an Intel-based processor or not, this can be understood by checking the CPUID instruction and find vendor string “GenuineIntel“.

The following function returns the vendor string form CPUID instruction.

12345678910111213141516171819202122232425262728293031323334353637string GetCpuID(){ //Initialize used variables char SysType[13]; //Array consisting of 13 single bytes/characters string CpuID; //The string that will be used to add all the characters to   //Starting coding in assembly language _asm { //Execute CPUID with EAX = 0 to get the CPU producer XOR EAX, EAX CPUID //MOV EBX to EAX and get the characters one by one by using shift out right bitwise operation. MOV EAX, EBX MOV SysType[0], al MOV SysType[1], ah SHR EAX, 16 MOV SysType[2], al MOV SysType[3], ah //Get the second part the same way but these values are stored in EDX MOV EAX, EDX MOV SysType[4], al MOV SysType[5], ah SHR EAX, 16 MOV SysType[6], al MOV SysType[7], ah //Get the third part MOV EAX, ECX MOV SysType[8], al MOV SysType[9], ah SHR EAX, 16 MOV SysType[10], al MOV SysType[11], ah MOV SysType[12], 00 } CpuID.assign(SysType, 12); return CpuID;}

The last step is checking for the presence of VMX, you can check it using the following code :

1234567891011121314151617181920bool VMX_Support_Detection(){  bool VMX = false; __asm { xor    eax, eax inc    eax cpuid bt     ecx, 0x5 jc     VMXSupport VMXNotSupport : jmp     NopInstr VMXSupport : mov    VMX, 0x1 NopInstr : nop }  return VMX;}

As you can see it checks CPUID with EAX=1 and if the 5th (6th) bit is 1 then the VMX Operation is supported. We can also perform the same thing in Kernel Driver.

All in all, our main code should be something like this:

123456789101112131415161718192021222324252627int main(){ string CpuID; CpuID = GetCpuID(); cout << «[*] The CPU Vendor is : » << CpuID << endl; if (CpuID == «GenuineIntel») { cout << «[*] The Processor virtualization technology is VT-x. \n»; } else { cout << «[*] This program is not designed to run in a non-VT-x environemnt !\n»; return 1; } if (VMX_Support_Detection()) { cout << «[*] VMX Operation is supported by your processor .\n»; } else { cout << «[*] VMX Operation is not supported by your processor .\n»; return 1; } _getch();    return 0;}

The final result:

User-mode app

Enabling VMX Operation

If our processor supports the VMX Operation then its time to enable it. As I told you above, IRP_MJ_CREATE is the first function that should be used to start the operation.

Form Intel Software Developer’s Manual (23.7 ENABLING AND ENTERING VMX OPERATION):

Before system software can enter VMX operation, it enables VMX by setting CR4.VMXE[bit 13] = 1. VMX operation is then entered by executing the VMXON instruction. VMXON causes an invalid-opcode exception (#UD) if executed with CR4.VMXE = 0. Once in VMX operation, it is not possible to clear CR4.VMXE. System software leaves VMX operation by executing the VMXOFF instruction. CR4.VMXE can be cleared outside of VMX operation after executing of VMXOFF.
VMXON is also controlled by the IA32_FEATURE_CONTROL MSR (MSR address 3AH). This MSR is cleared to zero when a logical processor is reset. The relevant bits of the MSR are:

  •  Bit 0 is the lock bit. If this bit is clear, VMXON causes a general-protection exception. If the lock bit is set, WRMSR to this MSR causes a general-protection exception; the MSR cannot be modified until a power-up reset condition. System BIOS can use this bit to provide a setup option for BIOS to disable support for VMX. To enable VMX support in a platform, BIOS must set bit 1, bit 2, or both, as well as the lock bit.
  •  Bit 1 enables VMXON in SMX operation. If this bit is clear, execution of VMXON in SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support both VMX operation and SMX operation cause general-protection exceptions.
  •  Bit 2 enables VMXON outside SMX operation. If this bit is clear, execution of VMXON outside SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support VMX operation cause general-protection exceptions.

Setting CR4 VMXE Bit

Do you remember the previous part where I told you how to create an inline assembly in Windows Driver Kit x64

Now you should create some function to perform this operation in assembly.

Just in Header File (in my case Source.h) declare your function:

1extern void inline Enable_VMX_Operation(void);

Then in assembly file (in my case SourceAsm.asm) add this function (Which set the 13th (14th) bit of Cr4).

1234567891011Enable_VMX_Operation PROC PUBLICpush rax ; Save the state xor rax,rax ; Clear the RAXmov rax,cr4or rax,02000h         ; Set the 14th bitmov cr4,rax pop rax ; Restore the stateretEnable_VMX_Operation ENDP

Also, declare your function in the above of SourceAsm.asm.

1PUBLIC Enable_VMX_Operation

The above function should be called in DrvCreate:

123456NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ Enable_VMX_Operation(); // Enabling VMX Operation DbgPrint(«[*] VMX Operation Enabled Successfully !»); return STATUS_SUCCESS;}

At last, you should call the following function from the user-mode:

123456789 HANDLE hWnd = CreateFile(L»\\\\.\\MyHypervisorDevice», GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, /// lpSecurityAttirbutes OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED, NULL); /// lpTemplateFile

If you see the following result, then you completed the second part successfully.

Final Show

Important Note: Please consider that your .asm file should have a different name from your driver main file (.c file) for example if your driver file is “Source.c” then using the name “Source.asm” causes weird linking errors in Visual Studio, you should change the name of you .asm file to something like “SourceAsm.asm” to avoid these kinds of linker errors.

Conclusion

In this part, you learned about basic stuff you to know in order to create a Windows Driver Kit program and then we entered to our virtual environment so we build a cornerstone for the rest of the parts.

In the third part, we’re getting deeper with Intel VT-x and make our driver even more advanced so wait, it’ll be ready soon!

The source code of this topic is available at :

[https://github.com/SinaKarvandi/Hypervisor-From-Scratch/]

References

[1] Intel® 64 and IA-32 architectures software developer’s manual combined volumes 3 (https://software.intel.com/en-us/articles/intel-sdm

[2] IRP_MJ_DEVICE_CONTROL (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/irp-mj-device-control)

[3]  Windows Driver Kit Samples (https://github.com/Microsoft/Windows-driver-samples/blob/master/general/ioctl/wdm/sys/sioctl.c)

[4] Setting Up Local Kernel Debugging of a Single Computer Manually (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/setting-up-local-kernel-debugging-of-a-single-computer-manually)

[5] Obtain processor manufacturer using CPUID (https://www.daniweb.com/programming/software-development/threads/112968/obtain-processor-manufacturer-using-cpuid)

[6] Plug and Play Minor IRPs (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/plug-and-play-minor-irps)

[7] _FAST_IO_DISPATCH structure (https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/ns-wdm-_fast_io_dispatch)

[8] Filtering IRPs and Fast I/O (https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/filtering-irps-and-fast-i-o)

[9] Windows File System Filter Driver Development (https://www.apriorit.com/dev-blog/167-file-system-filter-driver)

Unofficial opensource place-and-route for Xilinx Coolrunner-II CPLDs

( Original text by Robert Ou )

Recently, I have been working on an open-source tool for producing bitstreams for Xilinx Coolrunner-II CPLDs that I have named xc2par. It is finally at a point where I no longer feel embarrassed to have other people try and use it.

Quick start

The first tool you will need is yosys. You need a version that contains commit 14e49fb05737cfb02217b2aaf14d9f5d9e1859da, but it is recommended to use the latest master branch. Follow the instructions in the yosys README to install it (normally I would point people to my pre-built binaries, but those are currently broken).

Next, you will need to install the compiler for the Rust programming language.

After installing Rust, clone the «openfpga» repository. Change directories into the src/xc2par directory and run cargo build --release. When this completes, the final command-line tool will be in target/release/xc2par.

At this point you are ready to try compiling some HDL! Unfortunately, xc2par does not currently accept exactly the same code as the proprietary Xilinx tools, and there is also currently no documentation on what is or is not accepted. You will have to either guess or read the source code to figure out what currently works. Most notably, .ucf constraint files are not supported (you must use LOC attributes, and they must use the function block and macrocell number rather than a package pin), and automatic insertion of BUFGs is not supported. However, to get started, the following is some example Verilog for an LED blinker:

module top(led0, led1, led2, led3, clk_);

// NOTE: Must use LOC attributes rather than a .ucf file
(* LOC = "FB1_9" *)
output led0;
(* LOC = "FB1_10" *)
output led1;
(* LOC = "FB1_11" *)
output led2;
(* LOC = "FB1_12" *)
output led3;
(* LOC = "FB2_5" *)
input clk_;

// NOTE: Must manually instantiate BUFG
wire clk;
BUFG bufg0 (
    .I(clk_),
    .O(clk),
);

reg [23:0] counter;

assign {led3, led2, led1, led0} = counter[23:20];

always @(posedge clk)
    counter <= counter + 1;

endmodule

To compile this code, use the following commands:

yosys -p "synth_coolrunner2 -json blinky.json" blinky.v
./target/release/xc2par -p xc2c32a-4-vq44 blinky.json

At this point, if everything worked correctly, there will be a blinky.jed file in the same directory as the blinky.json file. It should now be possible to program this file into a CPLD to test it out. However, embarrassingly, there currently isn’t any code in the xc2par project that can help with that. You will either need to use

  • ISE/iMPACT (either to generate a .svf file or to program parts directly)
  • Andrew Zonenberg’s jtaghal (only supports the 32A and 64A parts)
  • Some other JTAG programmer that can read Xilinx .jed files

The Origins of xc2par

In the beginning, there was only Andrew Zonenberg. Back in 2014, Andrew took some CPLDs and exposed them to a slew of nasty chemicals and some high-energy electrons. This culminated in a talk presented at REcon 2015. At this point in time, the PLA (AND and OR gates) and interconnect were understood but the macrocell configuration was not. Although Andrew had always wanted to write an open-source compiler for these chips, yosys did not have support for sum-of-products architectures at this point in time. This combined with Life™ led to the project being temporarily shelved.

Robert Has Joined the Party

Being a «hacker» with extremely varied interests, I had been lurking Andrew’s work for years (since at least 2012). I had an interest in both programmable logic devices and reverse engineering. However, since I was pretty busy with Life™ of my own, I had never considered ever contacting Andrew or collaborating. However, one of Andrew’s projects combined with perfect timing changed this.

Quick Detour: Introducing GreenPak

I first heard about these interesting tiny programmable logic devices from an article on Hackaday. An interesting feature of these parts is that the bitstream format is completely documented in the datasheet. After reading the article, I purchased a GreenPak development kit with the idea of making «some kind of domain-specific-language» for programming these parts (I was not ambitious enough to consider supporting a traditional HDL, and I was not yet aware that yosys existed at the time). However, since I was still working on Life™ (being a university student), this project also got shelved.

However, in the meantime, Andrew was working on his open-source Verilog flow for GreenPak. This tool just so happened to be released almost exactly around the time that I finished my undergrad degree. Since I had planned to take a year off to remain «funemployed,» I decided to introduce myself to Andrew and joined the ##openfpga IRC community on Freenode somewhere around this time. Andrew eventually suggested that I could work on a place-and-route for Coolrunner-II parts (yosys support for sum-of-products had also gotten added around this time).

Finishing the Reverse Engineering

When Andrew had originally set aside the Coolrunner-II toolchain project back in 2015, the macrocell configuration was not yet understood. Andrew suggested that I ignore the macrocell-related parts and write tooling to handle the other parts of the device first. Macrocells would be configured by copying the entire block of bits from an existing ISE-generated design. Dissatisfied with this answer, I decided to schedule an IRL hackathon with Andrew to figure out all of the remaining bits. This resulted in the following:

Whiteboard drawing of Coolrunner-II macrocell

Note that this diagram is not accurate and has errors. A more accurate diagram needs to be drawn eventually.

Now with all the bits understood, we could begin writing tooling that did not require any strange hacks.

xc2bit and xc2par are Born

At the end of May 2017, the first commit of the «xc2bit» code is created. This is a Rust library for performing low-level manipulations of Coolrunner-II bitstreams. This is analogous to Andrew’s never-finished libcrowbar library. The first commit of xc2par is created a few days later.

First Attempt: xbpar

The first version of xc2par was supposed to use xbpar, Andrew’s generic C++ framework for writing place-and-route tools for «crossbar interconnect» architectures. The internals of xbpar are described in Andrew’s blog post about the GreenPak Verilog tooling, but the general idea is that the algorithm takes as input two graphs, one modeling the target device and one modeling the user’s design, and attempts to check if the user’s design graph is a subgraph of the device structure graph. It does this by attempting to «pair up» nodes of the two graphs and then checking if the edges between the nodes of the design graph exist in the device graph. If any edges do not, it attempts to find something better using simulated annealing.

Trying to use this library for xc2par had a number of issues. The first issue was «ideological.» I wanted to use Rust for xc2par because I did not like C++ but wanted more powerful libraries and abstractions than are available in C. Unfortunately, xbpar is a C++ library that isn’t very friendly to being interfaced with a different programming language. Among other things, device-specific functionality is implemented by deriving from an xbpar base class and overriding some methods. However, I persevered and attempted to write glue code between the two languages. Unfortunately, all of this glue code ended up being more lines of code than xbpar itself.

A second issue was that the specific mechanism I had used for describing graphs for xbpar caused extremely poor behavior. I had described every «enable-able or disable-able» feature of the macrocell as an individual node in xbpar (so the XOR gate of a macrocell would be a node, the register would be a different node, and the IO pad would be yet another different node). However, this would cause the following behavior:

  1. The greedy initial placement algorithm packs together «the wrong» set of «macrocell pieces.» For example, it would place the IO pad for pin a and the XOR gate for pin b both at function block 1 macrocell 1. Meanwhile, the XOR gate for pin a and the IO pad for pin b both get placed at function block 1 macrocell 2.
  2. The algorithm would notice that there are not proper edges between the nodes. In the above example, there is an edge in the device graph between the XOR and IO pad of FB1_1, but there are no edges between the IO pad of FB1_1 with the XOR gate of FB1_2 or between the XOR gate of FB1_1 with the IO pad of FB1_2. Therefore, all components of pins a and b will get added to the list of nodes contributing to the bad scores.
  3. The simulated annealing step picks one of the «bad» nodes and tries to move it. It ends up moving e.g. the XOR gate of one pin to a completely different random spot. This does not improve the situation since the XOR gate needs to be in the same location as all of the other «macrocell bits» for the same pin (but it has just been moved somewhere random and unrelated).
  4. The cycle repeats without ever making progress.

One possible solution for this problem is «why not just model the entire macrocell as one giant node with a bunch of attributes?» This solution might work, but it has difficulties with optimal packing of buried nodes. The Coolrunner-II architecture has the following quirk:

  • A macrocell that needs to output to a pin cannot share a macrocell site with anything else
  • A buried combinatorial feedback node can share a macrocell site with both a direct pin input as well as a pin input that goes through the register
  • A buried registered feedback node can share a macrocell site with a direct input pin only, and it can only share this site if the registered feedback node is not also simultaneously a combinatorial feedback node.

It was not clear to me how to express this quirk in a way that was actually useful at fixing the above behavior. Since this also required me to write a lot of device-specific code, I was wondering if I could somehow avoid having to do that.

Second Attempt: SAT/SMT

While I was encountering the above problems, I discussed them with one of my housemates. They suggested that I should try feeding the place-and-route problem into a SAT/SMT solver. The idea behind this was that:

  • Modern SAT/SMT solvers have quite a few advanced algorithms in them
  • The problem size seemed «small»

With this suggestion, I quickly hacked together a naive lowering of the place-and-route problem into a SMT problem. Since I added this hack into xbpar, it was possible to test it for both GreenPak designs and Coolrunner-II designs.

Unfortunately, although I did not keep very detailed measurements, this approach did not work very well. A major problem was that this was my first attempt at using SMT solvers. Firstly, I selected the QF_LIA theory because I was expecting that I might use it to eventually implement timing-driven place-and-route. Unfortunately, this restricted me to using the SMT solvers that tended to be relatively slower. Secondly, the particular lowering that I chose was relatively naive and probably did not help the SMT solver to try to optimize the problem. Overall, the results were that Coolrunner-II designs never completed in a reasonable amount of time. GreenPak designs would complete relatively quickly if they did indeed fit in the device (solver returns SAT) but would take forever if they did not actually fit in the device (solver returns UNSAT).

Second-and-a-Half Attempt: Naive backtracking search

Since using a SMT solver did not appear to be working, I decided to try other «brute-force» techniques. As part of this, I wrote a naive backtracking search solver to assign design graph nodes to device graph nodes. I implemented only the «min-remaining-values» optimization. This actually completed incredibly quickly for GreenPak designs that fit. However, GreenPak designs that did not fit were still somewhat slow, and Coolrunner-II designs still did not work at all.

Since this solver was a completely standalone program and written in Python, I could quickly add hacks to it and observe how it behaved. This ended up giving me the following insights:

  • There are symmetries or «shortcuts» that naive solvers cannot easily take advantage of (even with methods like forward checking or arc consistency). For example, every product term in a PLA is accessible to every OR gate, and so it does not «really matter» where these product terms are placed. However, backtracking search cannot easily discover this. The same goes for the ZIA — every ZIA row mux output is accessible to every product term, and it does not matter which specific rows are chosen.
  • Following on from the above, the only placements that «really matter» are the placements of the macrocells. Everything else can be «mostly» done with greedy assignments. (However, note the scare quotes. Read below a less simplified explanation)
  • Backtracking search cannot be used even for just placing macrocells. An approximation must be used here. This is because, at worst, backtracking search can take exponential time and explore every single possibility. If the algorithm happens to be working on the XC2C512 and is trying to place every macrocell into every possible site, this is 512!23875512!≈23875 possibilities. Even if we try to get a much lower bound by pretending that all macrocells in the user design are identical and can be freely swapped, we still have too many worst-case possibilities. To count how many, we would like to know the «number of ways to put identical (unlabeled) objects into labeled bins, where all of the bins have a maximum size.» Unfortunately my combinatorics is a bit rusty, but @eggleroy managed to help me out with the following formula:
    f(N,X,S)=⎧⎩⎨⎪⎪⎪⎪⎪⎪010Sk=0f(Nk,X1,S)if N<0if N=0if N>0,X=0otherwisef(N,X,S)={0if N<01if N=00if N>0,X=0∑k=0Sf(N−k,X−1,S)otherwise

    where NN is the number of objects (macrocells), XX is the number of bins (function blocks), and SS is the maximum size of each bin (macrocells per function block). NN depends on the size of the user design while XX depends on the specific CPLD device chosen. SS is fixed for all devices in the family. Somewhat unintuitively (at least to me initially), this function is maximized (for our purposes) when N=256,X=32,S=16N=256,X=32,S=16. Computing this yields 33926222943232064651934548544483314385212533926222943232064651934548544483314385≈2125.

If anybody who is more familiar with SAT/SMT solvers would ever like to continue to play with these ideas, there is some code here.

Detour: xc2jed2json

While working on reverse engineering efforts, I wrote a tool that can «decompile» Coolrunner-II bitstreams back into a .json netlist that can be loaded into yosys. After some (not yet ready for production) steps, this can be turned into something vaguely understandable. Andrew has given a talk about this. More on this in the future.

Final Attempt: Explicitly Modeling Shortcuts

At this point I was rather frustrated with working with xbpar. Given the insights I obtained from the backtracking search implementation, I decided to write a completely new place-and-route tool. This tool explicitly models all of the «shortcuts» that can be taken within an individual function block but still uses heuristics for placing macrocells. The algorithm works mostly as follows:

  1. There is a subroutine that takes as input an assignment of macrocells to macrocell sites and outputs a placement of product terms and a map of ZIA usage. For each function block, this subroutine does:
    1. Finds all of the product terms referenced by macrocells in this function block. The algorithm knows about two categories of product terms: those that have some kind of constraint on their placement (because they go into the dedicated product terms (either the per-macrocell PTA/B/C or the per-function-block CTx control terms) or because they have an explicit LOC constraint), and those that have no such constraints and just go into the OR array. The «shortcut» being taken here is that only those product terms that have a constraint need to be specifically handled. Everything else can be greedily assigned.
    2. For the first type of product terms (those that have restrictions on their placement), store all the candidate locations for each of them. Possibly filter by the explicit LOC constraints.
    3. Use backtracking search to assign the locations of these product terms. If it does not succeed, return an error. This backtracking search is much smaller because macrocells will almost always use the per-macrocell PTA/B/C terms. The only terms that can be contended are the per-function-block CTx control terms. There are only 4 of them, and at worst the algorithm can try assigning one product term from each macrocell into each of the control terms. This is 164=216164=216 which can complete almost instantly.
    4. Take the remaining product terms (that have no restrictions on their placement) and assign each one to the first open site. If there are no open sites left, return an error.
    5. Independently of the product term placement logic, gather up all inputs into the product terms. The ZIA muxes need to be programmed to provide these as inputs into the function block. Because we are starting with a complete macrocell placement, we know which specific choices we want to search for in the ZIA.
    6. Search the ZIA data table for all of the rows that can possibly give each output. Then use backtracking search to pick a feasible assignment. I do not currently have a complexity estimate for this step, but the ZIA was originally designed such that «almost all» combinations should be able to be routed. The «shortcut» here (that was also the insight originally used by Xilinx to design the ZIA) is that it does not matter which row of the ZIA is used to obtain each specific input. It only matters that the entire set of inputs is somehow present (because AND is commutative).
  2. The actual PAR tool first reads a yosys .json file and parses it.
  3. We create a data structure consisting of «nodes» and «nets» connecting them. The nodes are of a type such as «AND gate,» «OR gate,» «register,» or «XOR gate,» but they are not distinguished from each other (they are all a generic «node» data type). Net objects store their source and sink nodes, but nets can refer to any nodes whatsoever. Essentially, this step takes the yosys data model and filters/parses the individual cell types without any checking of connectivity between them. It turned out that this data structure was too difficult to work with in subsequent steps, so there is also a second data structure.
  4. The second data structure no longer models nets. Instead, there are multiple distinguished types of nodes that directly refer to each other. For example, a node corresponding to an OR gate will directly refer to the AND gates that feed it. All of the functionality related to macrocells is also packed into a large «macrocell» node with subcomponents for «the register part,» «the IO part,» and «the combinatorial (XOR) part.» This step does most of the checking for proper connectivity.
  5. Macrocells are initially placed with a greedy placement algorithm. One observation from earlier experiments was that most simple designs can be placed with just greedy placement.
  6. Run the subroutine described in 1. If it succeeds, the place-and-route is done!
  7. If it did not succeed, something needs to be improved. Instead of using simulated annealing, xc2par uses «min-conflicts.» When the subroutine described in part 1 fails, it also tries to calculate which macrocells in the placement contribute most to the failing. To do so, it simply brute-force deassigns each macrocell in the design and checks how many conflicts are left. The min-conflicts algorithm then chooses a «bad» macrocell to move weighted by the number of conflicts each macrocell causes.
  8. Once a macrocell to move is chosen, the algorithm tries every location in the device and tries swapping the macrocell into that location (ignoring those that have LOC constraints on them). The algorithm is looking for a site that improves the number of conflicts the most.
  9. If such a site is found, commit to the given swap. If not, randomly perform a swap. If an iteration limit is exceeded, give up.

After this is complete, xc2par then outputs a programming file.

Future work

Since this is an early prototype, there is always room for improvement. Hop into ##openfpga on Freenode to discuss and report bugs.

Notable areas needing improvement are:

  • More tests
  • Support for IO standards (currently hardcoded)
  • Directly generating .svf files (or other means of programming chips)
  • Ability to use the XOR-with-PTC functionality more effectively
  • Extra passes to handle Xilinx-isms (e.g. things described in the «Xilinx CPLD Libraries Guide»)
  • Devices other than the XC2C32/A

Open-sourcing Katran, a scalable network load balancer (Facebook libs)

With billions of people around the globe using Facebook services, our infrastructure engineers have created a range of systems to optimize traffic and to enable fast, reliable access for everyone. Today, we are open-sourcing a component of this work by releasing the Katran forwarding plane software library, which powers the network load balancer used in Facebook’s infrastructure. Katran offers a software-based solution to load balancing with a completely reengineered forwarding plane that takes advantage of two recent innovations in kernel engineering: eXpress Data Path (XDP) and the eBPF virtual machine. Katran is deployed today on backend servers in Facebook’s points of presence (PoPs), and it has helped us improve the performance and scalability of network load balancing and reduce inefficiencies such as busy loops when there are no incoming packets. By sharing it with the open source community, we hope others can improve the performance of their load balancers and also use Katran as a foundation for future work.

The challenge of serving requests at Facebook scale

To manage traffic at Facebook scale, we have deployed a globally distributed network of points of presence to act as proxies for our data centers. Given the extremely high volume of requests, both PoPs and data centers confront the challenge of making the large fleet of (backend) servers appear as a single virtual unit to the outside world and also distributing the workload efficiently among those backend servers.

These challenges are typically addressed by announcing a virtual IP address (VIP) to the internet at each location. Packets destined to the VIP are then seamlessly distributed among the backend servers. The distribution algorithm, however, needs to account for the fact that the backend servers typically operate at an application layer and terminate the TCP connections. This responsibility is handled by a network load balancer (often called a layer 4 load balancer, or an L4LB, because it operates on packets rather than serving application level requests). Figure 1 illustrates the role of an L4LB in relation to the other network components.

Figure 1: A network load balancer fronts several backend servers running a backend application and consistently sends all packets from each client connection to a unique backend server.

Requirements for a high-performance load balancer

An L4LB’s performance is especially important for managing latency and scaling the number of backend servers, because the L4LBs are on a path that needs to process every incoming packet. Performance is typically measured as peak packets per second (pps) that the L4LB can process. Traditionally, engineers have preferred hardware-based solutions for this task because they typically use accelerators such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) to reduce the burden on the main CPU. However, one of the drawbacks of a hardware-centric approach is that it limits the system’s flexibility. To effectively serve Facebook’s needs, a network load balancer must:

  • Run on commodity Linux servers. This allows us to run the load balancer on part or all of the large fleet of currently deployed servers. A software-based load balancer satisfies this criteria.
  • Coexist with other services on a given server. This removes the need for dedicated servers that run the load balancer exclusively, thereby increasing fault tolerance.
  • Allow low-disruption maintenance. Facebook’s software must be able to evolve quickly in order to support new or improved products and services. Maintenance and upgrades are a norm, not exceptions, for the load balancer and backend layers. Minimizing disruption during these events allows us to iterate faster.
  • Offer easy instrumentation and debugging. All large distributed infrastructures must contend with anomalies and unexpected events, so reducing the time to debug and troubleshoot issues is important. The load balancer needs to be instrumentable and friendly to standard tools like tcpdump.

In order to solve for these requirements, we designed a high-performance software network load balancer. The first generation of our L4LB was based on the IPVS kernel module and served Facebook’s needs for well over four years. However, it fell short on the goal of coexistence with other services, specifically the backends. In the second iteration, we leveraged the eXpress Data Path (XDP) framework and the new BPF virtual machine (eBPF) to run the software load balancer together with the backends on a large number of machines. Figure 2 shows the key difference between the two generations.

Figure 2: Differences between the two generations of L4LBs. Note that both are software load balancers running on backend servers. Katran (right) allows us to colocate the load balancer with backend application, thus increasing the load balancer capacity.

Our First-generation L4LB: Building on OSS Software

With our first-generation L4LB, we leaned heavily on existing open source components to implement most of the functionality. This approach helped us replace a hardware-based solution across a large deployment in only a few months. The design has four major components:

  • VIP announcement: This component simply announces the virtual IP addresses that the L4LB is responsible for to the world by peering with the network element (typically a switch) in front of the L4LB. The switch then uses an equal-cost multipath (ECMP) mechanism to distribute packets among the L4LBs announcing the VIP. We used ExaBGP for the VIP announcement because of its lightweight, flexible design.
  • Backend server selection: In order to send all packets from a client to the same backend, the L4LBs use a consistent hash that depends on the 5-tuple (source address, source port, destination address, destination port, and protocol) of the incoming packet. The use of a consistent hash ensures that all packets that belong to a transport connection are sent to the same backend irrespective of the L4LB receiving the packet. This removes the need for any state synchronization across multiple L4LBs. The consistent hash also guarantees minimal disruption to existing connections when a backend leaves or joins the pool of backends.
  • Forwarding plane: Once the L4LB picks the appropriate backend, the packets need to be forwarded to that host. To avoid restrictions such as keeping L4LB and backend hosts on the same L2 domain, we use a simple IP-in-IP encapsulation. This allows us to place L4LB and backend hosts in different racks. We used the IPVS kernel module for the encapsulation. The backends are configured to have the corresponding VIP on their loopback interface. This allows the backend to send packets on the return path directly to the client (instead of the L4LB). This optimization, often called direct server return (DSR), allows the L4LB to be constrained only by the incoming packet volume.
  • Control plane: This component performs various functions, including performing health checks on the backend servers, providing a simple interface (via a configuration file) to add or remove VIPs, and providing simple APIs to examine the state of the L4LB and backend servers. We developed this component in-house.

Each L4LB also stores the backend choice for each 5-tuple as a lookup table to avoid duplicate computation of the hash on future packets. This state is a pure optimization and is not necessary for correctness. This design met several requirements of Facebook’s workload listed above, but there was one major drawback: Colocating the L4LB and a backend on a single host increased the chance of device failure. Even with the local state, the L4LB was a CPU-intensive component. To separate the failure domains, we ran the L4LBs and backend servers on a disjointed set of machines. There were fewer L4LBs than backend servers in this setup, which made the L4LBs more vulnerable to a sudden increase in load. The fact that packets had to traverse the regular Linux network stack before being handled by the L4LB exacerbated the problem.

Figure 3: Overview of our first-generation L4LB. Note that the load balancer and the backend application run on different machines. Different load balancers make consistent decisions without any state synchronization. Using packet encapsulation allows the servers running the load balancer and the backend application to be placed in different racks. In a typical deployment, the ratio of the number of L4LBs to the number of backend application servers is very small.

Katran: Reimagining the forwarding plane

Katran, our second-generation L4LB, significantly improves upon the previous version with a completely reengineered forwarding plane. Two recent developments in the kernel world powered the new design:

  • The XDP provides a fast, programmable network data path without resorting to a full-fledged kernel bypass method and works in conjunction with the Linux networking stack. (A detailed overview of XDP is available here.)
  • The eBPF virtual machine provides a flexible, efficient, and more reliable way to interact with the Linux kernel and to extend its functionality by running user-space supplied programs at specific points in the kernel. eBPF has already brought dramatic improvements to several areas, including tracing and filtering. (More details are available here.)

The overall architecture of the system is similar to that of the first-generation L4LB: First, ExaBGP announces to the world which VIPS a particular Katran instance is responsible for. Second, packets destined to a VIP are sent to Katran instances using an ECMP mechanism. Finally, Katran selects a backend and forwards the packet to the correct backend server. The main differences are in the last step.

Early and efficient packet handling: Katran uses XDP in combination with a BPF program for packet forwarding. When XDP is enabled in driver mode, a packet handling routine (BPF program) is run immediately after a packet is received by the network interface card (NIC) and before the kernel intercepts it. XDP invokes the BPF program on every incoming packet. If the NIC has multiple queues, the program is invoked in parallel for each one them. The BPF program used for handling packets is lockless and uses a per-CPU version of BPF maps. Due to this parallelism, performance scales linearly with the number of the NIC’s RX queues. Katran also supports the “generic XDP” mode (instead of driver mode) of operation, at a performance cost.

Inexpensive and more stable hashing: Katran uses an extended version of the Maglev hash to select the backend server. A few features of the extended hash are resilience to backend server failures, more uniform distribution of load, and the ability to set unequal weights for different backend servers. The last of these is an important feature that allows us to handle hardware refreshes in our PoPs and data centers easily: We can absorb the newer generation hardware by simply setting appropriate weights. Despite its being more expressive, the code for computing this hash is small enough to fit entirely in the L1 cache.

More resilient local state: Katran’s efficiency at handling packets and computing the hash results in an interesting interaction with the local state table. We observed that, quite often, computing the hash is computationally easier than looking up the local state table for the 5-tuple to backend server choice. This is more visible for cases where the local state table lookup traverses all the way to the shared last level cache. In order to take advantage of this phenomenon in a natural way, we implemented the lookup table as an LRU-evicting cache. The LRU cache size is configurable at startup time and acts as a tunable parameter to strike a balance between computation and lookup. We picked these values empirically to optimize for pps. In addition, Katran provides a runtime “compute only” switch to ignore the LRU cache altogether in the event of catastrophic memory pressure on the host.

RSS-friendly encapsulation: Received Side Scaling (RSS) is an important optimization in NICs that aims to spread load across CPUs uniformly by steering packets from each flow to a separate CPU. Katran crafts its encapsulation to work in conjunction with RSS. Instead of using the same outer source for every IP-in-IP packet, packets in different flows (e.g., with different 5-tuples) are encapsulated using a different outer source IP, but packets in the same flow are always assigned the same outer source IP.

Figure 4: Katran enables a fast path for processing packets at high speed without resorting to a full-fledged kernel bypass. Note that the packets cross the kernel/user-space boundary only once. This allows us to colocate the L4LB and backend application without sacrificing performance.

These features dramatically enhance performance, flexibility and scalability of the L4LB. Katran’s design also gets rid of busy loops on receive path barely consuming any CPU if there are no incoming packets. In contrast to a full-fledged Kernel Bypass solution (such as DPDK), using XDP allows us to run Katran alongside any application without any performance penalties on the same host. Katran today runs alongside the backend servers in our PoPs with an improved L4LB-to-backend ratio. This increases resilience to load spikes, host failures, and maintenance, as well. The reengineered forwarding plane was central to this shift. We believe other systems can benefit by using our forwarding plane, so we are open-sourcing our code and including several examples of how to use it to craft an L4LB.

Additional Considerations

Katran operates under certain assumptions and constraints that enable the performance improvements. In practice, we found these constraints to be fairly reasonable, and they did not block our deployment. We believe that most users of our library will find them easy to satisfy. We’ve listed them below:

  • Katran works only in direct service return (DSR) mode.
  • Katran is the component that decides the final destination of a packet addressed to a VIP so the network needs to route packets to Katran first. This requires the network topology to be L3 based, e.g., packets are routed by IP rather than by MAC addresses.
  • Katran cannot forward fragmented packets, nor can it fragment them by itself. This could be mitigated either by increasing the maximal transmission unit (MTU) inside the network or by changing advertised TCP MSS from the backends. (The latter step is recommended even if you have increased the MTU.)
  • Katran doesn’t support packets with IP options set. The maximum packet size cannot exceed 3.5 KB.
  • Katran was built with the assumption that it’s going to be used in a «load balancer on a stick» scenario, where a single interface would be used for traffic both «from user to L4LB (ingress)» and «from L4LB to L7LB (egress).”

Despite these limitations, we believe that Katran offers an excellent forwarding plane to users and organizations who intend to leverage the exciting combination of XDP and eBPF to build efficient load balancers. We look forward to answering any questions from prospective adopters on our GitHub repository — and pull requests are always welcome!

PE-sieve — Hook Finder is open source tool based on libpeconv.

PE-sieve (previously known as Hook Finder) is open source tool based on libpeconv.
It scans a given process, searching for manually loaded or modified modules. When found, it dumps the modified/suspicious PE along with a report in JSON format, detailing about the found indicators.
Currently it detects inline hooks, hollowed processes, Process Doppelgänging, injected PE files etc. In case if the PE file was patched in the memory, it gives a detailed report about where are the changed bytes (and few other properties).

The tool is under rapid development, so expect frequent updates.

PE-sieve is available in 2 versions – as standalone executable, and as a DLL. The DLL version became a base of my other project: HollowsHunter – that makes an automated scan of all the running processes. More about it in the further part of the post.

Where to get it?

The tool is open-source, available on my github:

  https://github.com/hasherezade/pe-sieve

Usage

It has a simple, commandline interface. When run without parameters, it displays info about the version and required arguments:

When you run it giving a PID of the running process, it scans all the PE modules in its memory (the main executable, but also all the loaded DLLs). At the end, you can see the summary of how many anomalies have been detected of which type.

In case if some modified modules has been detected, they are dumped to a folder of a given process, for example:

Short history & features

Detecting inline hooks and patches

I started creating it for the purpose of searching and examining inline hooks. You can see it in action here (old version):

It not only detects that there IS an anomaly/patch, but also WHERE exactly it is. For each dumped PE where the patches were found, it creates a file with tags, that can be loaded by PE-bear.

Thanks to this, we can easily browse the found hooks and check the code that was overwritten.

For example – in the application presented above, the Entry Point was patched and the execution was redirected to the added, malicious section:

Detecting hollowed processes

Later, I extended it to detect process hollowing etc – and it turned out to be pretty convenient unpacker:

Detecting Process Doppelgänging

In a similar manner, it can detects some other methods of impersonating a processes, for example Process Doppelgänging. The malicious payload is directly dumped and ready to be analyzed:

Recovering erased imports

PE-sieve has an ability to recover erased imports. In order to enable it, deploy it with appropriate option. Example – unpacking manually loaded payloads with imports erased (Emotet):

Future development

The project is still not finished and I have many ideas how to make it better. I am planning to detect not only code modifications, but also other types of hooking, such as IAT and EAT patching.

Some in-memory patches are done by legitimate applications, so, in the future version I will provide capability of whitelisting defined patches.

I am also planning to extend its dumping capabilities against the malicious processes that are trying to defend themselves against dumpers etc.

PE-sieve as a DLL

During the development process I got an idea to make a DLL version of the PE-sieve, so that it can be incorporated in other projects.

Building PE-sieve from sources as a DLL is very easy – you just need to set one CMake option: PE_SIEVE_AS_DLL:

build_as_dll

The PE-sieve DLL exposes a minimalistic API. Two functions are exported:

pe-sieve_func

  1. PESieve_help – displays a short info and the version of the DLL.
  2. PESieve_scan – a typical scan with a given parameters, like in the PE-Sieve.exe

The necessary headers needs to be included from the folder “pe-sieve\include“:

https://github.com/hasherezade/pe-sieve/tree/master/include

I have plans to enrich the API in the future. For now, you can see the PE-sieve DLL in action in the HollowsHunter project.

Ideas? Bugs?

If you noticed bug or have an idea for a useful feature, don’t hesitate to mail me or create a Github issue – I check them regularly:

https://github.com/hasherezade/pe-sieve/issues