It’s common, when analysing a kernel crash dump, to look at kernel tasks’ stack backtraces in order to see what the tasks are doing, e.g. what nested function calls led to the current position; this is easily displayed by the crash utility. We often also want to know the arguments to those function calls; unfortunately, these are not so easily displayed.
This blog will illustrate some techniques for extracting kernel function call arguments, where possible, from the crash dump. Several worked examples are given. The examples are from the Oracle UEK kernel, but the techniques are applicable to any Linux kernel.
Note: The Python-Crash API toolkit pykdump includes the command fregs, which automates some of this process. However, it is useful to study how to do it manually, in order to understand what’s going on, and to be able to do it when pykdump may not be available, or if fregs fails to produce the desired result.
Basics
This section gives the minimum detail needed to use the techniques. Background explanatory detail will be given in a subsequent section.
You need to know a little bit of the x86 instruction set, but not much. You can get started knowing just that mov %r12,%rdi places the contents of cpu register %r12 into register %rdi, and that mov 0x8(%r14),%rcx takes the contents of register %r14, adds 8 to it, takes the result as the address of a memory location, reads the value at that memory location and puts it into register %rcx.
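The two mov forms just described can be sketched with a tiny emulation in Python, using one dictionary for registers and one for memory; all register contents and addresses here are made-up examples, purely for illustration.

```python
# Emulating the two mov forms with a register dict and a memory dict.
# All register contents and addresses here are made-up examples.
regs = {"%r12": 0x1234, "%r14": 0xffff000000000000, "%rdi": 0, "%rcx": 0}
mem = {0xffff000000000008: 0xcafe}

# mov %r12,%rdi : copy register to register
regs["%rdi"] = regs["%r12"]

# mov 0x8(%r14),%rcx : load from memory at address %r14 + 8
regs["%rcx"] = mem[regs["%r14"] + 0x8]

print(hex(regs["%rdi"]), hex(regs["%rcx"]))  # 0x1234 0xcafe
```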
For the Linux kernel running on the x86-64 architecture, kernel function arguments are normally passed using 64-bit cpu registers. The first six arguments to a function are passed in the following cpu registers, respectively: %rdi, %rsi, %rdx, %rcx, %r8, %r9.
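The argument-to-register mapping above can be captured in a small lookup table; the helper below is purely illustrative.

```python
# Lookup for the x86-64 argument-passing order described above
# (integer/pointer arguments only; the 7th and later go on the stack).
ARG_REGISTERS = ["%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"]

def arg_register(position):
    """Return the 64-bit register carrying the 1-based argument position."""
    if not 1 <= position <= 6:
        raise ValueError("arguments beyond the sixth are passed on the stack")
    return ARG_REGISTERS[position - 1]

print(arg_register(1), arg_register(4))  # %rdi %rcx
```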
To slightly complicate matters, these 64-bit register contents may be accessed via shorter subsets under different names (for compatibility with previous 32/16/8-bit instruction sets), as shown in the following table:

64-bit    32-bit    16-bit    8-bit
%rdi      %edi      %di       %dil
%rsi      %esi      %si       %sil
%rdx      %edx      %dx       %dl
%rcx      %ecx      %cx       %cl
%r8       %r8d      %r8w      %r8b
%r9       %r9d      %r9w      %r9b
The contents of these registers are not preserved (for every kernel function call) in the crash dump. However, it is often possible to extract the values that were placed into these registers, from the kernel stack, or from memory.
In order to find out the arguments supplied to a given function, we need to look at the disassembled code for either the calling function, the called function, or both.
If we disassemble the caller’s code, we can see where it obtained the values to place into the registers used to pass the arguments to the called function. We can also disassemble the called function, to see if it stored the values it was passed in those registers, to its stack or to memory.
We may then be able to extract those values from the same place, if that is either on the kernel stack, or in kernel memory (if it has not been subsequently changed), both of which are normally included in the crash dump.
The techniques shown cover different ways that the compiler might have chosen to store the values that are passed in the registers.

The calling function might retrieve the values from:

- Memory
- The calling function's stack, via an offset from the (fixed) base of the stack
- Another register, which got its value from one of the above in turn

The called function might save the values it received in the registers to:

- Memory
- The called function's stack, via push
- Another register, which might itself then be saved in turn

Therefore the technique we use is to:

- Disassemble either or both of:
  - The calling function, leading up to the callq instruction, to see from where it obtains the values it passes to the called function in the registers
  - The called function, to see whether it puts the values from the registers onto its stack, or into memory
- Inspect those areas (caller/callee stack and/or memory) to see if the values may be extracted
In some cases, it might not be possible to use any of the above methods to find the arguments passed. If this is the case, consider looking at another level of the stack: it’s quite common for the same value to be passed down from one function to the next. Thus although you might be unsuccessful trying to recover that argument’s value using the above methods for your function of interest, that same value might be passed down to further functions (or itself been passed down from earlier functions), and you might have more luck finding it looking at one of those other functions, using the methods above. The value might also be itself contained in another structure, which may be passed in a function argument. A knowledge of the code in question obviously helps in this case.
Finding a function’s stack frame base pointer
For some of the methods noted above, we will need to know how to find the (fixed) base of a kernel function’s stack. Whilst the function is executing, this is stored in the %rbp register, the function’s stack frame base pointer. We may find it in two places in the kernel task’s stack backtrace.
To show a kernel task’s stack backtrace, use bt:
Note: -sx tells bt to try to show addresses as a symbol name plus a hex offset.
The above lists the function calls found in the stack; we also want to see the actual stack content, i.e. the content of the stack frames, for each function call. Let's say we are interested in the arguments passed to the mutex_lock call; we may therefore need to look at its caller, which is do_last, so let's concentrate on its stack frame:
Note: -FF tells bt to show all the data for a stack frame, symbolically, with its slab cache name, if appropriate.
The stack frame of the calling function do_last is shown above. Its stack frame base pointer 0xffff88180dc37d58 appears in two locations, shown highlighted with ***.
The stack frame base pointer, for a function, may be found:
As the second-last value in the stack frame above the function (i.e. above in the bt output)
As the location of the second-last value in the stack frame for the function
For now, just use the above to find the value of the stack frame base pointer, for a function, if you need it. The structure of the stack frame will be explained in the following section.
Summary of steps
1. Note which registers you need, corresponding to the positions of the called function's arguments you need.
   - Refer to the register-naming table above, in case the quantities passed are smaller than 64 bits, e.g. integers and other non-pointer types. The 1st argument will be passed in %rdi, %edi, %di or %dil. Note that all the names contain "di".
2. Disassemble the calling function, and inspect the instructions leading up to where it calls the function you're interested in. Note from where the compiler gets the values it places in those registers:
   2.1 If from the stack, find the caller's stack frame base pointer, and from there find the value in the stack frame
   2.2 If from memory, can you calculate the memory address used? If so, read the value from memory
   2.3 If from another register, from where was that register's contents obtained? And see case 3.3 below.
3. Disassemble the first part of the called function. Note where it stores the values passed in the registers you need:
   3.1 If onto the stack, find the called function's stack frame base pointer, and find the value in the stack frame
   3.2 If into memory, can you calculate the memory address used? If so, read the value from memory
   3.3 If the calling function obtained the value from another register (case 2.3 above), does the called function save that register to stack/memory?
4. If none of the above gave a usable result, see if the values you need are passed to another function call further up or down the stack, or may be derived from a different value.
   - For example, the structure you want is referenced from another structure that is passed to a function elsewhere in the stack trace.
5. Once you've obtained answers, perform a sanity check:
   - Is the value obtained on a slab cache? If so, is the cache of the expected type?
   - Is the value, or what it points to, of the expected type?
   - If the value is a pointer to a structure, does the structure content look correct? e.g. pointers where pointers are expected, function op pointers pointing to real functions, etc.
6. Read the Caveats section, to understand whether you can rely on the answer you've found.
At this point, you may either skip directly to the Worked Examples, or read on for more detail.
In more depth
This section gives more background. If you’re in a hurry, skip directly to the Worked Examples, and come back and read this later; it may help in understanding what’s going on, and in identifying edge cases and other apparently odd behaviour.
In Linux on x86-64, the kernel stack grows down, from larger towards smaller memory addresses. That is, a particular function's stack frame grows downwards from its fixed stack frame base pointer %rbp. New elements are added via the push instruction, which first decrements the current stack pointer %rsp (which points to the item on the "top", i.e. lowest memory address, of the stack) by the size of the element being pushed, then copies its argument to the stack location now pointed at by %rsp. This leaves %rsp pointing at the new item on top of the stack.
However, the bt command shows the stack in ascending memory order, as you read down the page. Therefore we may imagine the bt display of a stack frame as being like a pile of magazines stacked up on a table. The top line shown is the top of the stack, whose address is stored in the stack pointer register %rsp, and is where new items are pushed onto the stack (magazines added to the pile). The bottom line is the stack frame base (the table), which is fixed, stored in the stack frame base pointer %rbp (yes, I'm neglecting the function's return address here).
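The push behaviour described above can be modelled with a toy stack in Python; the starting address below is illustrative, not taken from a real dump.

```python
# Toy model of the downward-growing stack: push first decrements %rsp by
# the element size (8 bytes here), then stores at the new top. Addresses
# are illustrative, not from a real dump.
class Stack:
    def __init__(self, rsp):
        self.rsp = rsp          # current top of stack (lowest address)
        self.memory = {}        # address -> 8-byte value

    def push(self, value):
        self.rsp -= 8
        self.memory[self.rsp] = value

stack = Stack(rsp=0xffff88180dc37d60)
stack.push(0xdeadbeef)
print(hex(stack.rsp))  # 0xffff88180dc37d58: the top moved DOWN by 8
```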
Kernel function stack frame layout
The stack consists of multiple frames, one per function. The layout of a kernel function’s stack frame is as follows (in the ascending memory location order as shown by bt):
This is built up in stages, as follows. When a function is called, the callq instruction does two things:
Pushes the address of the instruction following the callq instruction onto the stack (still the caller’s stack frame, at this point). This will be the return address for the called function.
Jumps to the first instruction in the called function.
At this point, what will become the stack frame for the called function now looks like this:
The compiler has inserted the following preamble instructions before the start of the code for most called functions:
The push puts the caller’s stack frame base pointer on top of the stack frame, which now looks like this:
The mov changes the stack frame base pointer register %rbp to also point to the top of the stack, which now looks like this:
From now on, %rsp gets decremented as we add things to the top of the stack, but %rbp remains fixed, denoting the base of the stack. Below that, at the very bottom, is the return address, which records the location of the instruction after the callq instruction, in the calling function (this is the instruction to which this called function will return, via the retq instruction). From now on, the called function’s stack frame looks like this:
Caveats
The description here applies to arguments of simple type, e.g. integer, long, pointer, etc. Things are not quite the same for more complex types, e.g. struct (as a struct, not a pointer to a struct), float, etc. For more detail, refer to the References.
Remember that the crash dump contains data from when the system crashed, not from the point of execution of the instruction you may be looking at. For example:
Memory content will be that of when the system crashed, which may be many function calls deeper in the stack below where you are looking, some of which may have overwritten that area of memory
Stack frame content will be that of when the function (whose stack frame you’re looking at) called the next-deeper function. If the function you’re looking at went on to modify that stack location, before calling the next-deeper function, that is what you will see when you look at the stack frame
Remember that there may be more than one code path branch within a function, leading to a callq instruction. The different paths may populate the function-call registers from different sources, and/or with different values. Your linear reading of the instructions leading up to the callq may not be the path that the code took in every instance.
FAQ
What about other architectures? 32-bit?
This blog refers exclusively to x86-64; it does not apply to 32-bit x86, which passes arguments on the stack
Worked examples

Example 1

In this example, we have a hanging task, stuck trying to fsync a file. We want to obtain the struct file pointer, and the fl_owner, for the file in question.
Let’s start by trying to find it via the arguments to filp_close.
Note:bt -l shows file and line number of each stack trace function call.
int filp_close(struct file *filp, fl_owner_t id)

typedef void *fl_owner_t;
The compiler will use registers %rdi & %rsi, respectively, to pass the two arguments to filp_close.
Let’s look at the full stack frame for filp_close:
Let’s disassemble the calling function put_files_struct, to see where the compiler obtains the values it will pass in registers to filp_close:
The first argument, the struct file pointer will be passed in register %rdi. The compiler fills that register in this way:
We can’t easily retrieve the first argument using this method, since we don’t know the values of %rcx or %rax.
So how about the second argument? That is passed in register %rsi, which is populated from another register %r13:
Notice that filp_close pushes %r13 onto its stack. This is the first push instruction done by filp_close after its initial push of %rbp. Let's look again at the stack frame for filp_close:
#13 [ffff8807b1f1fc20] filp_close+0x36 at ffffffff8120b1e6
Referring back to the Basics section, we can identify the stack frame base pointer %rbp for filp_close as 0xffff8807b1f1fc48.
Referring back to the stack frame layout section, we can see that the stack for filp_close starts at the very bottom with its return address put_files_struct+145. The next address "up" (in the bt display) is location 0xffff8807b1f1fc48, which is filp_close's stack frame base pointer %rbp. It contains a pointer to the parent (put_files_struct) stack frame base pointer 0xffff8807b1f1fc98. From then on "up" are the normal stack pushes done by filp_close. Since the push of %r13 is the first push (following the preamble push of %rbp), we find it next: 0xffff8800c0b21b80, which is the value of fl_owner_t id.
To find on the stack the content of push number n (following the preamble push of %rbp), we calculate the address: %rbp - (n * 8). In this case, n == 1, the first push, so:
crash7latest> px (0xffff8807b1f1fc48 - 1*8)
$1 = 0xffff8807b1f1fc40
Note: px prints an expression in hex. We can then read the contents at that address with rd:
crash7latest> rd 0xffff8807b1f1fc40
ffff8807b1f1fc40: ffff8800c0b21b80
Thus we find the value of fl_owner_t id == 0xffff8800c0b21b80.
We could, of course, simply have walked "up" the stack frame visually, counting pushes, rather than manually calculating the address.
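The address arithmetic from this example can be captured in a one-line helper; the values below are the ones from the filp_close frame above.

```python
# The arithmetic from the crash session above: the value saved by push
# number n (counting after the preamble push of %rbp) lives at rbp - n*8.
def push_address(rbp, n):
    return rbp - n * 8

rbp = 0xffff8807b1f1fc48          # filp_close's stack frame base pointer
print(hex(push_address(rbp, 1)))  # 0xffff8807b1f1fc40, as px computed
```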
We still need to find the first argument, the struct file, but we may find that elsewhere, in another function on the stack… it is also the first argument of:
int vfs_fsync(struct file *file, int datasync)
and so will be passed in register %rdi.
Here’s the relevant extract from the stack backtrace:
#11 [ffff8807b1f1fbf0] vfs_fsync+0x1c at ffffffff8124123c
#12 [ffff8807b1f1fc00] nfs_file_flush+0x80 at ffffffffc02d2630 [nfs]
Let’s disassemble the caller, leading up to the call:
On line 8 we see that register %rdi — the first argument to vfs_fsync — is populated from register %rbx. (Whilst we’re here, note that the second argument is passed in register %esi, which is the 32-bit subset of the 64-bit register %rsi, since the second argument is an integer: int datasync)
Now disassemble the called function:
We see that vfs_fsync neither saves %rbx on its stack nor alters it before calling vfs_fsync_range. Now disassemble the latter:
Example 2

#5 [ffff88180dc37cb0] do_last+0x385 at ffffffff812140b5
%rbp is ffff88180dc37ca8, and the value at that location — denoted by (%rbp) — is ffff88180dc37d58
Then we can emulate the effects of the mov/lea instructions, to arrive at the value that do_last put into %rdi:
We can also note that since we’ve offset from the stack frame pointer %rbp, this value is on the stack, and bt will tell us more about it, specifically whether it’s part of a slab cache and, if so, which one:
Address 0xffff88180dc37d10 contains a pointer to something from the dentry slab cache, i.e. a dentry.
At this point, we have the dentry pointer in %rax. The next instruction offsets 0x30 from the dentry:
struct -o shows member offsets when displaying structure definitions; if used with an address or symbol argument, each member will be preceded by its virtual address.
struct -x overrides default output format with hexadecimal format.
So the above is the inode.
The next instruction offsets 0xa8 from the inode:
So the above is the mutex, and this (0xffff881d4603e6a8) is what ends up in %rdi, which becomes the first arg to mutex_lock, as expected:
void __sched mutex_lock(struct mutex *lock)
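The final offset from this walkthrough can be checked with simple arithmetic. Note that the inode base address below is derived by subtracting 0xa8 from the mutex address seen in the dump, since the excerpt does not show it directly; the dentry-to-inode step needs a memory read, so only the last offset is reproduced here.

```python
# Checking the last offset from the walkthrough above. The inode address
# is derived (not read from the dump): mutex address minus 0xa8.
mutex_seen_in_rdi = 0xffff881d4603e6a8   # first arg to mutex_lock
inode = mutex_seen_in_rdi - 0xa8         # start of the struct inode

# offsetting 0xa8 from the inode must give back the mutex address
assert inode + 0xa8 == mutex_seen_in_rdi
print(hex(inode))  # 0xffff881d4603e600
```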
Having found the mutex, we would likely want to find its owner:
Example 3
In this example, we look at a UEK3 crash dump from a system where processes were spending a lot of time in 'D' state waiting for an NFS server to respond. The system crashed since hung_task_panic was set (which is not a good idea on a production system, and should never be set on an NFS client or server).
Looking at the hung task:
From the hung task's traceback, we can see that nfs_getattr is stuck waiting on a mutex. It wants to write back dirty pages before performing the getattr call, and it grabs the inode mutex to keep other tasks from writing whilst we're trying to write back. So, we need to find out who's got that inode mutex.
Let’s see how nfs_getattr calls mutex_lock:
The first arg to mutex_lock is the mutex, which is passed in %rdi.
We can see that nfs_getattr fills %rdi from %rdx, before calling mutex_lock, but it also stores %rdx at an offset from its stack frame base pointer %rbp:
So, the mutex is held by another NFS getattr task that is in the process of performing the write-back. This is likely just part of normal NFS writeback, blocked by congestion, a slow server, or some other interruption.
As mentioned already, it is in general advised never to set hung_task_panic on a production NFS system (client or server).
In this blogpost I’ll explain my recent bypass in DOMPurify – the popular HTML sanitizer library. In a nutshell, DOMPurify’s job is to take an untrusted HTML snippet, supposedly coming from an end-user, and remove all elements and attributes that can lead to Cross-Site Scripting (XSS).
Believe me that there’s not a single element in this snippet that is superfluous 🙂
To understand why this particular code worked, I need to give you a ride through some interesting features of HTML specification that I used to make the bypass work.
Usage of DOMPurify
Let's begin with the basics, and explain how DOMPurify is usually used. Assuming that we have untrusted HTML in htmlMarkup and we want to assign it to a certain div, we use the following code to sanitize it using DOMPurify and assign it to the div:

div.innerHTML = DOMPurify.sanitize(htmlMarkup)
In terms of parsing and serializing HTML, as well as operations on the DOM tree, the following operations happen in the short snippet above:

1. htmlMarkup is parsed into the DOM tree.
2. DOMPurify sanitizes the DOM tree (in a nutshell, the process is about walking through all elements and attributes in the DOM tree, and deleting all nodes that are not in the allow-list).
3. The DOM tree is serialized back into the HTML markup.
4. After assignment to innerHTML, the browser parses the HTML markup again.
5. The parsed DOM tree is appended into the DOM tree of the document.
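The allow-list walk described above can be sketched with a toy tree walker. This is a drastic simplification for illustration only; the tuple-based node representation is invented for the demo and is not DOMPurify's actual code.

```python
# Toy illustration of the sanitization step: walk a DOM-like tree and
# drop any node whose tag is not in the allow-list. The tuple-based node
# representation is invented for the demo; it is not DOMPurify's code.
ALLOWED = {"div", "b", "a", "img", "#text"}

def sanitize(node):
    tag, data, children = node        # data: attrs dict, or text for "#text"
    if tag not in ALLOWED:
        return None                   # node and its whole subtree removed
    kept = [c for c in (sanitize(ch) for ch in children) if c is not None]
    return (tag, data, kept)

tree = ("div", {}, [
    ("script", {}, []),               # not allow-listed: will be dropped
    ("b", {}, [("#text", "hi", [])]),
])
print(sanitize(tree))  # ('div', {}, [('b', {}, [('#text', 'hi', [])])])
```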
Let's see that on a simple example. Assume that our initial markup is A<img src=1 onerror=alert(1)>B. In the first step it is parsed into the following tree:
Then, DOMPurify sanitizes it, leaving the following DOM tree:
Then it is serialized to:
A<img src="1">B
And this is what DOMPurify.sanitize returns. Then the markup is parsed again by the browser on assignment to innerHTML:
The DOM tree is identical to the one that DOMPurify worked on, and it is then appended to the document.
So to put it shortly, we have the following order of operations: parsing ➡️ serialization ➡️ parsing. The intuition may be that serializing a DOM tree and parsing it again should always return the initial DOM tree. But this is not true at all. There’s even a warning in the HTML spec in a section about serializing HTML fragments:
It is possible that the output of this algorithm [serializing HTML], if parsed with an HTML parser, will not return the original tree structure. Tree structures that do not roundtrip a serialize and reparse step can also be produced by the HTML parser itself, although such cases are typically non-conforming.
The important take-away is that serialize-parse roundtrip is not guaranteed to return the original DOM tree (this is also a root cause of a type of XSS known as mutation XSS). While usually these situations are a result of some kind of parser/serializer error, there are at least two cases of spec-compliant mutations.
Nesting FORM element
One of these cases is related to the FORM element. It is quite a special element in HTML, because it cannot be nested in itself. The specification is explicit that it cannot have any descendant that is also a FORM:
This can be confirmed in any browser, with the following markup:
The nested form is completely omitted from the DOM tree, as if it were never there.
Now comes the interesting part. If we keep reading the HTML specification, it actually gives an example showing that, with slightly broken markup containing mis-nested tags, it is possible to create nested forms. Here it comes (taken directly from the spec):
It yields the following DOM tree, which contains a nested form element:
This is not a bug in any particular browser; it results directly from the HTML spec, and is described in the algorithm of parsing HTML. Here’s the general idea:
When you open a <form> tag, the parser needs to keep a record of the fact that it was opened, using a form element pointer (that's what it's called in the spec). If the pointer is not null, any further <form> start tag is simply ignored by the parser.
Note that this markup no longer has any mis-nested tags. And when the markup is parsed again, the following DOM tree is created:
So this is a proof that serialize-reparse roundtrip is not guaranteed to return the original DOM tree. And even more interestingly, this is basically a spec-compliant mutation.
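The form element pointer behaviour can be sketched with a toy token-level "parser". This is a drastic simplification, reduced to a flat token stream with no real tree building, meant only to show why the inner form disappears on reparse.

```python
# Toy model of the form element pointer rule: while a form is open, any
# further <form> start tag is ignored. Tag handling is reduced to a flat
# token stream purely to illustrate the reparse mutation.
def reparse(tags):
    form_open = False
    kept = []
    for tag in tags:
        if tag == "form":
            if form_open:
                continue          # nested <form> start tag is ignored
            form_open = True
        elif tag == "/form":
            form_open = False
        kept.append(tag)
    return kept

# Tokens as serialized from a DOM tree that DID contain nested forms
# (built via mis-nested tags, as in the spec example):
serialized = ["form", "div", "form", "input", "/form"]
print(reparse(serialized))  # ['form', 'div', 'input', '/form']
```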
Since the very moment I was made aware of this quirk, I’ve been pretty sure that it must be possible to somehow abuse it to bypass HTML sanitizers. And after a long time of not getting any ideas of how to make use of it, I finally stumbled upon another quirk in HTML specification. But before going into the specific quirk itself, let’s talk about my favorite Pandora’s box of the HTML specification: foreign content.
The HTML parser can create a DOM tree with elements of three namespaces:

- HTML namespace (http://www.w3.org/1999/xhtml)
- SVG namespace (http://www.w3.org/2000/svg)
- MathML namespace (http://www.w3.org/1998/Math/MathML)

By default, all elements are in the HTML namespace; however, if the parser encounters an <svg> or <math> element, then it "switches" to the SVG or MathML namespace respectively. And both these namespaces make foreign content.
In foreign content, markup is parsed differently than in ordinary HTML. This can be most clearly shown with the parsing of a <style> element. In the HTML namespace, <style> can only contain text; it has no descendants, and HTML entities are not decoded. The same is not true in foreign content: foreign content's <style> can have child elements, and entities are decoded.

Consider the following markup:

<style><a>ABC</style><svg><style><a>ABC
It is parsed into the following DOM tree:

Note: from now on, all elements in the DOM trees in this blogpost will contain a namespace. So html style means that it is a <style> element in the HTML namespace, while svg style means that it is a <style> element in the SVG namespace.

The resulting DOM tree proves my point: html style has only text content, while svg style is parsed just like an ordinary element.
Moving on, it may be tempting to make a certain observation: if we are inside <svg> or <math>, then all elements are also in a non-HTML namespace. But this is not true. There are certain elements in the HTML specification called MathML text integration points and HTML integration points, and the children of these elements are in the HTML namespace (with certain exceptions I'm listing below).

Consider the following example:

<math><style></style><mtext><style></style>

It is parsed into the following DOM tree:
Note how the style element that is a direct child of math is in the MathML namespace, while the style element in mtext is in the HTML namespace. This is because mtext is a MathML text integration point and makes the parser switch namespaces.
MathML text integration points are:

- math mi
- math mo
- math mn
- math ms
- math mtext

HTML integration points are:

- math annotation-xml, if it has an attribute called encoding whose value is equal to either text/html or application/xhtml+xml
- svg foreignObject
- svg desc
- svg title
I always assumed that all children of MathML text integration points or HTML integration points are in the HTML namespace by default. How wrong was I! The HTML specification says that children of MathML text integration points are by default in the HTML namespace, with two exceptions: mglyph and malignmark. And this only happens if they are a direct child of a MathML text integration point.
Let's check that with the following markup:

<math><mtext><mglyph></mglyph><a><mglyph>

Notice that the mglyph that is a direct child of mtext is in the MathML namespace, while the one that is a child of the html a element is in the HTML namespace.
Assume that we have a "current element", and we'd like to determine its namespace. I've compiled some rules of thumb:

1. The current element is in the namespace of its parent, unless the conditions from the points below are met.
2. If the current element is <svg> or <math> and its parent is in the HTML namespace, then the current element is in the SVG or MathML namespace respectively.
3. If the parent of the current element is an HTML integration point, then the current element is in the HTML namespace, unless it's <svg> or <math>.
4. If the parent of the current element is a MathML text integration point, then the current element is in the HTML namespace, unless it's <svg>, <math>, <mglyph> or <malignmark>.
5. If the current element is one of a list of HTML-specific tags (for instance a <font> element with any of the color, face or size attributes defined), then all elements on the stack are closed until a MathML text integration point, HTML integration point or element in the HTML namespace is seen. Then, the current element is also in the HTML namespace.
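The rules of thumb above (rules 1 to 4) can be sketched as a small lookup function. The integration-point sets are taken from the lists earlier in the post; annotation-xml's encoding condition and rule 5 are omitted for brevity, so this is an approximation of the spec, not a full implementation.

```python
# Sketch of namespace rules 1-4. Integration-point sets are taken from
# the lists earlier in the post; annotation-xml's encoding condition is
# omitted for brevity, so this is an approximation of the spec.
MATHML_TEXT_IP = {"mi", "mo", "mn", "ms", "mtext"}
SVG_HTML_IP = {"foreignObject", "desc", "title"}

def child_namespace(parent_ns, parent_tag, tag):
    at_integration_point = (
        parent_ns == "html"
        or (parent_ns == "mathml" and parent_tag in MATHML_TEXT_IP)
        or (parent_ns == "svg" and parent_tag in SVG_HTML_IP)
    )
    if at_integration_point:
        if tag == "svg":
            return "svg"                      # rule 2
        if tag == "math":
            return "mathml"                   # rule 2
        if (parent_ns == "mathml" and parent_tag in MATHML_TEXT_IP
                and tag in ("mglyph", "malignmark")):
            return "mathml"                   # rule 4 exception
        return "html"                         # rules 3 and 4
    return parent_ns                          # rule 1

print(child_namespace("mathml", "mtext", "mglyph"))  # mathml
print(child_namespace("mathml", "mtext", "a"))       # html
print(child_namespace("svg", "desc", "div"))         # html
```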
When I found this gem about mglyph in the HTML spec, I immediately knew that it was what I'd been looking for in terms of abusing the html form mutation to bypass the sanitizer.
DOMPurify bypass
So let's get back to the payload that bypasses DOMPurify. On the second parsing, the nested form element is dropped, so mglyph becomes a direct child of the MathML text integration point, meaning it is in the MathML namespace. Because of that, style is also in the MathML namespace, hence its content is not treated as text. Then </math> closes the <math> element, and now img is created in the HTML namespace, leading to XSS.
Summary
To summarize, this bypass was possible because of a few factors:

- The typical usage of DOMPurify makes the HTML markup be parsed twice.
- The HTML specification has a quirk making it possible to create nested form elements. However, on reparsing, the second form will be gone.
- mglyph and malignmark are special elements in the HTML spec, in that they are in the MathML namespace if they are a direct child of a MathML text integration point, even though all other tags are in the HTML namespace by default.
- Using all of the above, we can create markup that has two form elements and an mglyph element that is initially in the HTML namespace, but on reparsing is in the MathML namespace, making the subsequent style tag be parsed differently and leading to XSS.
After Cure53 pushed an update fixing my bypass, another one was found and shared on Twitter.
I leave it as an exercise for the reader to figure it out why this payload worked. Hint: the root cause is the same as in the bug I found.
The bypass also made me realize that the pattern of

div.innerHTML = DOMPurify.sanitize(html)

is prone to mutation XSS-es by design, and it's just a matter of time to find other instances. I strongly suggest that you pass the RETURN_DOM or RETURN_DOM_FRAGMENT options to DOMPurify, so that the serialize-parse roundtrip is not executed.
As a final note, I found the DOMPurify bypass when preparing materials for my upcoming remote training called XSS Academy. While it hasn’t been officially announced yet, details (including agenda) will be published within two weeks. I will teach about interesting XSS tricks with lots of emphasis on breaking parsers and sanitizers. If you already know that you’re interested, please contact us on training@securitum.com and we’ll have your seat booked!
We have arrived at the last post about our Fault Injection research on the ESP32. Please read our previous posts, as they provide context for the results described in this post.
During our Fault Injection research on the ESP32, we gradually took steps forward in order to identify the vulnerabilities required to bypass Secure Boot and Flash Encryption with a single EM glitch. Moreover, we not only achieved code execution, we also extracted the plain-text flash data from the chip.
Espressif requested a CVE for the attack described in this post: CVE-2020-13629. Please note that the attack as described in this post is only applicable to ESP32 silicon revisions 0 and 1. The newer ESP32 V3 silicon supports functionality to disable the UART bootloader that we leveraged for the attack.
UART bootloader
The ESP32 implements a UART bootloader in its ROM code. This feature allows, among other functionality, programming the external flash. It's not uncommon for such functionality to be implemented in ROM code, as it's quite robust: the code cannot easily become corrupted. If this functionality were implemented by code stored in the external flash, any corruption of the flash might result in a bricked device.
Typically, this type of functionality is accessed by booting the chip in a special boot mode. The boot mode selection is often done using one or more external strap pin(s), which are set before resetting the chip. On the ESP32 it works exactly like this: the boot mode is selected using pin G0, which is exposed externally.
The UART bootloader supports many interesting commands that can be used to read/write memory, read/write registers and even execute a stub from SRAM.
Executing arbitrary code
The UART bootloader supports loading and executing arbitrary code using the load_ram command. The ESP32's SDK includes all the tooling required to compile code that can be executed from SRAM. For example, the following code snippet will print SRAM CODE\n on the serial interface.
void __attribute__((noreturn)) call_start_cpu0()
{
ets_printf("SRAM CODE\n");
while (1);
}
The esptool.py tool, which is part of the ESP32's SDK, can be used to load the compiled binary into SRAM, after which it will be executed.
Interestingly, the UART bootloader cannot be disabled, and is therefore always accessible, even when Secure Boot and Flash Encryption are enabled.
Additional measures
Obviously, if no additional security measures would be taken, leaving the UART bootloader always accessible would render Secure Boot and Flash Encryption likely useless. Therefore, Espressif implemented additional security measures which are enabled using dedicated eFuses.
These are security configuration bits implemented in special memory, often referred to as OTP memory, in which bits can typically only change from 0 to 1. This guarantees that, once a feature is enabled, it is enabled forever. The following OTP memory bits are used to disable specific functionality when the ESP32 is in the UART bootloader boot mode.
DISABLE_DL_DECRYPT: disables the transparent decryption of flash data
DISABLE_DL_CACHE: disables the entire MMU flash cache
The most relevant OTP memory bit is DISABLE_DL_DECRYPT, as it disables the transparent decryption of the flash data.
If not set, it would be possible to simply access the plain-text flash data while the ESP32 is in its UART bootloader boot mode.
If set, any access to the flash when the chip is in UART bootloader boot mode will yield just the encrypted data. The Flash Encryption feature, which is fully implemented in hardware and transparent to the processor, is only enabled when the ESP32 is in Normal boot mode.
For the attacks described in this post, all these bits are set to 1.
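The one-way nature of these eFuse bits can be sketched with a small model (ours, not Espressif code): writes can only set bits, never clear them, so a security feature enabled via an eFuse stays enabled forever.

```python
# Sketch of an OTP (one-time-programmable) word: bits can only go 0 -> 1.
class OtpWord:
    def __init__(self):
        self.value = 0

    def burn(self, mask):
        # OR-only write: bits already set can never be cleared again
        self.value |= mask
        return self.value

fuse = OtpWord()
fuse.burn(0b100)   # enable one security bit
fuse.burn(0b001)   # enable another; the first stays set
```

Even a later `burn(0)` leaves every previously set bit in place, which is exactly the guarantee the security configuration relies on.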
Persistent data in SRAM
The SRAM memory used by the ESP32 is typical technology found in many chips. It's commonly used for the ROM's stack and for executing the first bootloader from flash. It's convenient to use at early boot, as it typically requires no configuration before it can be used.
We know from previous experience that data stored in SRAM persists until it's overwritten or the power is removed from the physical cells. After a cold reset (i.e. power-cycle) of the chip, the SRAM is reset to its default state. This state is often semi-random and unique per chip, as the default value of each bit (i.e. 0 or 1) differs.
However, after a warm reset, where the entire chip is reset without removing the power, it may happen that the data stored in SRAM remains unaffected. This persistence of the data is visualized in the picture below.
We decided to figure out whether this behavior holds for the ESP32 as well. We identified that the hardware watchdog can be used to issue a warm reset from software. The watchdog can also be triggered when the chip is in UART bootloader boot mode, and therefore we can use it to reset the ESP32 back into Normal boot mode.
Using some test code, loaded and executed in SRAM using the UART bootloader, we determined that the data in SRAM is indeed persistent after issuing a warm reset using the watchdog. Effectively this means we can boot the ESP32 in Normal boot mode with the SRAM filled with controlled data.
But… how can we (ab)use this?
Road to failure
We envisioned that we may be able to leverage the persistence of data in SRAM across warm resets for an attack. The first attack we came up with is to fill the SRAM with code using the UART bootloader and issue a warm reset using the watchdog. Then, we inject a glitch while the ROM code is overwriting this code with the flash bootloader during a normal boot.
We got this idea during our previous experiments, where we turned data transfers into code execution: we noticed that in some experiments the chip started executing from the entry address before the bootloader had finished copying.
Sometimes you just need to try it…
Attack code
The code that we load into the SRAM using the UART bootloader is shown below.
#define a "addi a6, a6, 1;"
#define t a a a a a a a a a a
#define h t t t t t t t t t t
#define d h h h h h h h h h h
"movi a10, 0x52; callx8 a6;" // R
"movi a10, 0x61; callx8 a6;" // a
"movi a10, 0x65; callx8 a6;" // e
"movi a10, 0x6C; callx8 a6;" // l
"movi a10, 0x69; callx8 a6;" // i
"movi a10, 0x7A; callx8 a6;" // z
"movi a10, 0x65; callx8 a6;" // e
"movi a10, 0x21; callx8 a6;" // !
"movi a10, 0x0a; callx8 a6;" // \n
while(1);
}
To summarize, the above code implements the following:
Command handler with a single command to perform a watchdog reset
NOP-like padding using addi instructions
Assembly for printing Raelize! on the serial interface
Please note that the listing's numbers match the numbers in the code.
Timing
We target a reasonably small attack window at the start of F, which is shown in the picture below. We know from previous experiments that this is the moment the flash bootloader is copied.
The glitch must be injected before our code in SRAM is entirely overwritten by the valid flash bootloader.
Attack cycle
We took the following steps for each experiment to determine if the attack idea actually works. A successful glitch will print Raelize! on the serial interface.
Set pin G0 to low and perform a cold reset to enter UART bootloader boot mode
Use the load_ram command to execute our attack code from SRAM
Send an A to the program to issue a warm reset into Normal boot mode
Inject a glitch while the flash bootloader is being copied by the ROM code
Results
After running these experiments for more than a day, resulting in more than 1 million experiments, we did not observe any successful glitch…
An unexpected result
Nonetheless, while analyzing the results, we noticed something unexpected.
The serial interface output for one of the experiments, which is shown below, indicated that the glitch caused an illegal instruction exception.
These types of exceptions happen quite often when glitches are injected into a chip, and the ESP32 was no different. For most of the exceptions, the PC register is set to an expected value (i.e. a valid address). It does not often happen that the PC register is set to such an interesting value.
The Illegal Instruction exception is raised because there is no valid instruction stored at the address 0x661b661b. We concluded that this value must come from somewhere and that it cannot magically end up in the PC register.
We analyzed the code that we load into the SRAM in order to find an explanation. The binary code, of which a snippet is shown below, quickly gave us the answer we were looking for. The value 0x661b661b is easily identified in the binary image. It actually represents two addi a6, a6, 1 instructions, of which we implemented 1000 in our test code in order to create a landing zone, similar to the NOP-sleds often used in software exploits. We did not anticipate that these instructions would end up in the PC register.
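Since the padding is the same 2-byte instruction repeated over and over, any aligned 32-bit word read from the padded region decodes to the same value. A small sketch (ours; the 0x661b per-instruction encoding is inferred from the 0x661b661b word above):

```python
import struct

# One 2-byte padding instruction, repeated 1000 times as in the test code.
ADDI_BYTES = struct.pack("<H", 0x661b)   # assumed encoding of addi a6, a6, 1
sram = ADDI_BYTES * 1000

# Any aligned 32-bit little-endian word inside the padding is identical:
word = struct.unpack_from("<I", sram, 0)[0]   # 0x661b661b
```

This is why a corrupted program counter landing anywhere in the region always shows the same "interesting" value.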
Of course, we did not mind either. We concluded that we are able to load data from SRAM into the PC register when we inject a glitch while the flash bootloader is being copied by the ROM code. We quickly realized we now had all the ingredients to cook up an attack that bypasses Secure Boot and Flash Encryption using a single glitch. We reused some of the knowledge obtained during a previously described attack in which we take control of the PC register.
Road to success
We reused most of the code that we previously loaded into SRAM using the UART bootloader. Only the payload (i.e. the printing) that we intended to execute is removed, as our strategy is now to set the PC register to an arbitrary value in order to take control.
#define a "addi a6, a6, 1;"
#define t a a a a a a a a a a
#define h t t t t t t t t t t
#define d h h h h h h h h h h
if(cmd == 'A') {
    /* write the watchdog configuration registers to trigger a warm reset */
    *(unsigned int *)(0x3ff4808c) = 0x4001f880;
    *(unsigned int *)(0x3ff48090) = 0x00003a98;
    *(unsigned int *)(0x3ff4808c) = 0xc001f880;
}
}
asm volatile ( d );
while(1);
}
After compiling the above code, we directly overwrite the addi instructions in the binary with the address pointer 0x4005a980. This address points to a function in the ROM code that prints something on the serial interface, allowing us to identify when we are successful.
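A binary patch along these lines can be sketched as follows (a sketch of ours, not the original tooling; the per-instruction encoding 0x661b and a contiguous padding region are assumptions):

```python
import struct

# Replace the 2-byte addi padding with repeated copies of a 4-byte ROM
# function pointer (0x4005a980 prints on the serial interface).
PADDING = struct.pack("<H", 0x661b) * 2    # two addi instructions = 4 bytes
POINTER = struct.pack("<I", 0x4005a980)    # ROM print routine address

def patch(image: bytes) -> bytes:
    # swap every 4-byte padding unit for the pointer; sizes match, so the
    # image layout is unchanged
    return image.replace(PADDING, POINTER)

image = PADDING * 500                      # stand-in for the padded region
patched = patch(image)
```

Because the replacement is the same length as the padding it overwrites, the rest of the SRAM image stays at its original offsets.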
We fixed the glitch parameters to those of the experiment that caused the Illegal Instruction exception. After a short while, we identified several successful experiments during which the address pointer was loaded into the PC register. Effectively, this gives us control of the PC register, and we can likely achieve arbitrary code execution.
Why does this work?
Good question. Not so easy to answer.
Unfortunately, we do not have a sound answer for you. We definitely did not anticipate that controlling the data at the copy destination could yield control of the PC register. We came up with a few possibilities, but we cannot say with full confidence whether any of them is actually correct.
One explanation is that the glitch corrupts both operands of the load instruction, so that a value from the copy destination is loaded into the PC register. Moreover, it's possible that the ROM code implements functionality that facilitates this attack. In other words, our glitch may cause valid ROM code to execute that loads the value from SRAM into the PC register.
A more thorough investigation is required to determine what exactly allows us to perform this attack. However, from an attacker's perspective, it's sufficient to know how to get control of PC in order to build the exploit.
Extracting plain-text data
Even though we have control of the PC register, we are not yet able to extract the plain-text data from the flash. We decided to leverage the UART bootloader functionality to do so.
We decided to jump directly to the UART bootloader while the chip is in Normal boot mode. For this attack, we overwrite the addi instructions in the code that we load into SRAM with address pointers to the start of the UART bootloader (0x40007a19).
The UART bootloader prints a string on the serial interface, shown below. We can use this to identify whether we are successful.
waiting for download\n
Once we observe a successful experiment, we can simply use esptool.py to issue a read_mem command in order to access plain-text flash data. The command below reads 4 bytes from the address where the external flash is mapped.
Unfortunately, by jumping directly to the handler, the waiting for download\n string is not printed anymore, so we cannot easily identify successful experiments. We therefore decided to simply always send the command, regardless of whether we were successful or not. We used a very short serial interface timeout in order to minimize the overhead of almost always hitting the timeout.
After a short while, we observed the first successful experiments!
Conclusion
In this post we described an attack on the ESP32 where we bypass its Secure Boot and Flash Encryption features using a single EM glitch. Moreover, we leveraged the vulnerability exploited by this attack to extract the plain-text data from the encrypted flash.
We can use FIRM to break down the attack into multiple comprehensible stages.
Interestingly, two weaknesses of the ESP32 facilitated this attack. First, the UART bootloader cannot be disabled and is always accessible. Second, the data loaded into SRAM is persistent across warm resets and can therefore be filled with arbitrary data using the UART bootloader.
Espressif indicated in their advisory related to this attack that newer versions of the ESP32 include functionality to completely disable this feature.
Final thoughts
All standard embedded technologies are vulnerable to Fault Injection attacks. Therefore, it's not surprising at all that the ESP32 is vulnerable as well. These types of chips are simply not made to be resilient against such attacks. However, and this is important, this does not mean that these attacks do not pose a risk.
Our research has shown that leveraging chip-level weaknesses for Fault Injection attacks is very effective. We have not seen many public examples yet, as most attacks still follow traditional approaches, where the focus is mostly on bypassing just a check.
We believe the full potential of Fault Injection attacks is still unexplored. Until recently, most research focused on the injection method itself (i.e. Activate, Inject and Glitch) rather than on what can be accomplished with a vulnerable chip (i.e. Fault, Exploit and Goal).
We are confident that creative usage of new and undefined fault models will give rise to unforeseen attacks, with exciting exploitation strategies, for a wide variety of goals.
Recently I discovered an ACE (arbitrary code execution) vulnerability in Facebook for Android that can be triggered by downloading a file from a group's Files Tab, without opening the file.
Background
I was digging into the method that Facebook uses to download files from groups, and I found that Facebook uses two different mechanisms. If the user downloads the file from the post itself, it is downloaded via the built-in Android service DownloadManager, which, as far as I know, is a safe method to download files. If the user decides to download the file from the Files Tab, it is downloaded through a different method: in a nutshell, the application fetches the file and saves it to the Download directory without any filtering.
Notice: the selected code is the fix that Facebook pushed; the vulnerable code did not include it.
Path traversal
The vulnerability was in the second method. Security measures were implemented on the server side when uploading the files, but they were easy to bypass. The application simply fetches the downloaded file and saves it, for example, to /sdcard/Downloads/FILE_NAME without filtering FILE_NAME to protect against path traversal attacks. The first idea that came to my mind was to use path traversal to overwrite native libraries, which leads to arbitrary code execution.
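A minimal sketch of why an unfiltered, attacker-controlled filename escapes the download directory (the exact directory name is an assumption; `posixpath` is used to model Android's path semantics):

```python
import posixpath

# The app joins the attacker-controlled filename onto its download directory
# without filtering, so "../" components walk out of the intended folder.
download_dir = "/sdcard/Downloads"
filename = "../../../sdcard/PoC"        # attacker-controlled, unfiltered

path = posixpath.normpath(posixpath.join(download_dir, filename))
# path resolves to /sdcard/PoC - outside the Downloads directory
```

A safe implementation would reject any resolved path that does not remain under `download_dir`.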
I set up my Burp Suite proxy, intercepted the upload file request, modified the filename to ../../../sdcard/PoC, then forwarded the request.
Web upload endpoint
Unfortunately, it wasn't enough, due to the security measures on the server side: my path traversal payload was removed. I decided to play with the payload, but unfortunately no payload worked.
Bypass security measures. (Bypass?)
After many payloads, I wasn't able to bypass that filter. I went back to browsing the application, hoping to find something useful, and it came!
For the first time, I noticed that I could upload files via the Facebook mobile application. I set up the Burp Suite proxy on my phone, enabled the white-hat settings in the application to bypass SSL pinning, intercepted the upload file request, and modified the filename to ../../../sdcard/PoC. The file uploaded successfully, and my payload was in the filename!
I tried to download the file from the post, but the DownloadManager service is safe, as I said, so the attack didn't work. I navigated to the Files Tab and downloaded the file. And here is our attack: my file was written to /sdcard/PoC!
As I was able to perform path traversal, I could now overwrite the native libraries and perform an ACE attack.
Exploit
To exploit this, I started a new Android NDK project to create a native library, putting my evil code in the JNI_OnLoad function to make sure it executes when the library is loaded.
April 29, 2020 at 5:57 AM: Submitted the report to Facebook.
April 29, 2020 at 11:20 AM: Facebook was able to reproduce it.
April 29, 2020 at 12:17 PM: Triaged.
June 16, 2020 at 12:54 PM: Vulnerability fixed.
July 15, 2020 at 5:11 PM: Facebook rewarded me $10,000!
Bounty
I noticed people commenting on the amount of the bounty when I tweeted about the bug, asking whether it was small. I was shocked, objected, and tried to discuss it with Facebook, but to no avail: they say the amount is fair and they won't revisit the decision. As Neal told me: Spencer provided you with insight into how we determined the bounty for this issue. We believe the amount awarded is reasonable and will not be revisiting this decision.
It’s up to you to decide before you report your vulnerabilities! Vendor or?
The world’s most popular torrent client, uTorrent, contained a security vulnerability — later to be called CVE-2020-8437— that could be exploited by a remote attacker to crash and corrupt any uTorrent instance connected to the internet. As white-hat hackers, my friend (who wishes to remain anonymous) and I reported this vulnerability as soon as we found it and it was quickly fixed. Now, after ample time has been given for users to update, it’s safe to disclose an overview of the vulnerability and how to exploit it.
Torrent Protocol — What You Need To Know
Torrent downloads utilize simultaneous connections to multiple peers (other people downloading the same file), creating a decentralized download network that benefits the collective peer group. Each peer can upload and download data to and from any other peer, eliminating any single point of failure or bandwidth bottleneck, resulting in a faster and more stable download for all peers. Peers communicate with each other using the BitTorrent protocol, which is initiated with a handshake. We’re going to focus on this handshake and the packet following it because that’s all that’s needed for exploiting the CVE-2020-8437 uTorrent vulnerability. Surprisingly convenient. 😊
BitTorrent Handshake
The handshake packet is the first packet the initiating peer sends to another peer. It has 5 fields in a strictly structured format:
Handshake Packet Format
Name Length — 1 byte unsigned int — The length of the string that follows.
Protocol Name — variable length string — The protocol the initiating peer supports. This field is for future compatibility, but is set to “BitTorrent protocol” in all major implementations.
Reserved Bytes — 8 byte bitfield — Each bit represents a protocol extension (functionality) that was not part of the original BitTorrent specification. Modern torrent clients utilize this field to communicate their advanced capabilities, which are then used for an optimized download. Today, the grand majority of torrent clients support the “Extension Protocol” extension (confusing name, I know), the 20th bit in this bit field, that provides a foundation for exchanging information about other extensions. Yes, you understood that correctly: there is an extension bit that allows for even more extensions. I wonder what such a complicated protocol can lead to 😉.
Info Hash — 20 byte SHA1 — Used to identify the torrent the initiating peer wants to download, this is the hash of all the information needed to download the torrent (torrent name, hashes of file sections, file section size, file section count, etc…).
Peer ID — 20 byte buffer — A self-designated random ID the initiating peer gives itself.
Figure 1. BitTorrent handshake packet #1 as seen in Wireshark
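The five fields above can be packed into the 68-byte handshake with a short sketch (the info hash and peer ID here are dummy placeholder values):

```python
import struct

# Build a BitTorrent handshake: length-prefixed protocol name, 8 reserved
# bytes, 20-byte info hash, 20-byte peer ID.
protocol = b"BitTorrent protocol"
reserved = bytearray(8)
reserved[5] |= 0x10                      # set the Extension Protocol bit (20th bit)
info_hash = b"\x11" * 20                 # SHA1 of the torrent's info (dummy)
peer_id = b"-XX0001-" + b"\x00" * 12     # self-designated ID (dummy)

handshake = struct.pack("B", len(protocol)) + protocol + bytes(reserved) \
            + info_hash + peer_id        # 1 + 19 + 8 + 20 + 20 = 68 bytes
```

The reserved-bytes bit set here is what tells the remote peer that an Extended Message handshake will follow.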
After a peer receives a handshake packet, it replies with its own handshake packet in the exact same format.
If both peers set the Extension Protocol bit in the Reserved Bytes field, the peers then exchange further information about extensions, using an “Extended” message handshake.
BitTorrent Extended Message Handshake
The Extended Message Handshake is used by peers to share the exact additional extensions they support and other supplemental information. Unlike the BitTorrent handshake packet we previously examined, which was (practically) statically sized, the Extended Message’s Handshake packet can dynamically grow, allowing the packet to transport a multitude of extension data.
Extended Message Handshake Packet Format
Length — 4 bytes unsigned int — the length of the entire message that follows
BitTorrent Message Type — 1 byte — The BitTorrent message ID of this packet. This is set to 20 (0x14) for Extended Messages
BitTorrent Extended Message Type — 1 byte — The Extended Message ID of this extended message. This is set to 0 for an extension exchange.
M — dynamically sized — a bencoded dictionary of the supplemental extensions supported.
Figure 2. Extended Message Extension Exchange
Bencoded Dictionaries
The M field is a bencoded dictionary, which is a format similar to a python dictionary: string-type keys are associated with values. However, in contrast to python dictionaries, bencoded dictionaries include the length of each string before its value, and “d” and “e” are used instead of “{“ and “}” respectively. Below is an example of a python dictionary and its corresponding bencoded dictionary encoding (newlines and spaces inserted in both formats for clarity).
Figure 3. A Bencoded Dictionary Is Very Similar To A Python Dictionary
Additionally, just as a python dictionary can contain a separate dictionary inside itself (and another dictionary inside that one, etc…), so too can a bencoded dictionary.
Figure 4. Both Formats (But We Only Care About Bencoded Dictionaries) can contain more dictionaries inside themselves
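The encoding rules described above fit in a few lines. This is a minimal bencoder sketch of ours (dicts, byte strings, and ints only), not uTorrent's implementation:

```python
# Minimal bencoder: strings are length-prefixed, ints wrapped in i...e,
# dicts wrapped in d...e with sorted keys.
def bencode(obj) -> bytes:
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, dict):
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError(obj)

# a nested dictionary, like the M field of an extended handshake
encoded = bencode({b"m": {b"ut_pex": 1}})   # b'd1:md6:ut_pexi1eee'
```

Note how a nested dictionary simply opens another `d...e` inside the value position, which is exactly the recursion the vulnerable parser has to track.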
The CVE-2020-8437 Vulnerability
The CVE-2020-8437 vulnerability is in how uTorrent parses bencoded dictionaries — specifically, nested dictionaries. Before the patch (uTorrent 3.5.5 and earlier), uTorrent would use an integer (32 bits) as a bit field to keep track of which layer in the bencoded dictionary it was currently parsing. For example, when uTorrent would parse the first layer, the bit field would hold ‘00000000 00000000 00000000 00000001’, and when uTorrent would parse the second layer, the bit field would hold ‘00000000 00000000 00000000 00000011’. “But what happens if uTorrent parses a bencoded dictionary with more than 32 layers of nested dictionaries?”, my friend and I curiously asked one Thursday night. So we quickly created such a dictionary and fed it into uTorrent’s bencoding dictionary parser. The result:
Figure 5. uTorrent crash message 🥳
Awesome Possum! uTorrent crashed! Further inspection of the crash revealed its source: a null pointer dereference.
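Building such an over-nested dictionary, and modeling the 32-bit depth bit field described above, takes only a few lines (our sketch; the exact bit-field update in uTorrent may differ):

```python
# Build a bencoded dictionary nested deeper than the 32 layers the bit
# field can represent: "d1:a" opens a dict with key "a", "de" is the
# innermost empty dict, and the trailing "e"s close each layer.
DEPTH = 40
payload = b"d1:a" * DEPTH + b"de" + b"e" * DEPTH   # 41 nested dictionaries

# Model the parser's 32-bit depth bit field: one bit shifted in per layer.
depth_field = 0
for _ in range(DEPTH):
    depth_field = ((depth_field << 1) | 1) & 0xFFFFFFFF
# after 32 layers every bit is set, so deeper layers are indistinguishable
```

Once all 32 bits are saturated, the parser can no longer tell which layer it is in, which is the state that led to the null pointer dereference.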
Exploiting CVE-2020-8437
There are two easy exploit vectors for CVE-2020-8437: The first is a remote peer sending an Extended Message packet with a malicious bencoded dictionary, and the second is a .torrent file that contains a malicious bencoded dictionary.
Remote Peer Exploit
As described earlier, when two peers that support Extended Messages start communicating with each other, they each send a packet enumerating the various extensions they support. That information about supported extensions is sent as a bencoded dictionary, and since that bencoded dictionary gets parsed by the client, if that dictionary is malicious (having more than 32 nested dictionaries layers), it will trigger CVE-2020-8437. 😊
Torrent File Exploit
.torrent files encapsulate the most basic information a client needs to start downloading torrents. These files are openly and commonly shared on torrent websites, downloaded, and then opened by torrent clients, effectively making these files a possible vehicle for triggering vulnerabilities in those torrent clients. Let me take you on a behind-the-scenes-sneak-peek-never-before-seen-on-live-tv look at the internals of a .torrent file, exposing how simple it is to use it to trigger CVE-2020-8437: a .torrent file is simply a bencoded dictionary saved as a file. So to exploit CVE-2020-8437 from a .torrent file, you just need to save a malicious bencoded dictionary to a file and give that file the .torrent extension. Check out my .torrent file exploit. Please enjoy this proof of concept video of crashing uTorrent with the malicious .torrent file linked above 😉
Instagram, with over 100 million photos uploaded every day, is one of the most popular social media platforms. For that reason, we decided to audit the security of the Instagram app for both Android and iOS operating systems. We found a critical vulnerability that can be used to perform remote code execution on a victim’s phone.
Our modus operandi for this research was to examine the 3rd party projects used by Instagram.
Many software developers, regardless of their size, utilize open-source projects in their software. We found a vulnerability in the way that Instagram utilizes Mozjpeg, the open source project used as their JPEG format decoder.
In the attack scenario we describe below, an attacker simply sends an image to the victim via email, WhatsApp or other media exchange platforms. When the victim opens the Instagram app, the exploitation takes place.
Tell me who your friends are and I’ll tell you your vulnerabilities
We all know that even the biggest companies rely on public open-source projects and that those projects are integrated into their apps with little to no modifications.
Most companies using 3rd party open-source projects declare it, but not all libraries appear in the app’s About page. The best way to be sure you see all the libraries is to go to the lib-superpack-zstd folder of the Instagram app:
Figure 1. Shared objects used by Instagram.
In the image below, you can see that when you upload an image using Instagram, three shared objects are loaded: libfb_mozjpeg.so, libjpegutils_moz.so, and libcj_moz.so.
Figure 2. Mozjpeg’s shared objects.
The “moz” suffix is short for “mozjpeg,” which is short for the Mozilla JPEG Encoder Project. But what do these modules do?
What is Mozjpeg?
Let’s start with a brief history of the JPEG format. JPEG is an image file format that’s been around since the early 1990s, and is based on the concept of lossy compression, meaning that some information is lost in the compression process, but this information loss is negligible to the human eye. Libjpeg is the baseline JPEG encoder built into the Windows, Mac and Linux operating systems and is maintained by an informal independent group. This library tries to balance encoding speed and quality with file size.
In contrast, Libjpeg-turbo is a higher performance replacement for libjpeg, and is the default library for most Linux distributions. This library was designed to use less CPU time during encoding and decoding.
On March 5, 2014, Mozilla announced the “Mozjpeg” project, a JPEG encoder built on top of libjpeg-turbo, to provide better compression for web images, at the expense of performance.
The open-source project is specifically for images on the web. Mozilla forked libjpeg-turbo in 2014 so they could focus on reducing file size to lower bandwidth and load web images more quickly.
Instagram decided to split the mozjpeg library into 3 different shared objects:
libfb_mozjpeg.so – Responsible for the Mozilla-specific decompression exported API.
libcj_moz.so – The libjpeg-turbo library that parses the image data.
libjpegutils_moz.so – The connector between the two shared objects. It holds the exported API that the JNI calls to trigger the decompression from the Java application side.
Fuzzing
Our team at CPR built a multi-processor fuzzing lab that gave us amazing results with our Adobe Research, so we decided to expand our fuzzing efforts to Mozjpeg as well.
The primary addition made by Mozilla on top of libjpeg-turbo was the compression algorithm, so that is where we set our sights.
AFL was our weapon of choice, so naturally we had to write a harness for it.
To write the harness, we had to understand how to instrument the Mozjpeg decompression function.
Fortunately, Mozjpeg comes with a code sample explaining how to use the library:
METHODDEF(int)
do_read_JPEG_file(struct jpeg_decompress_struct *cinfo, char *filename)
{
struct my_error_mgr jerr;
/* More stuff */
FILE *infile; /* source file */
JSAMPARRAY buffer; /* Output row buffer */
int row_stride; /* physical row width in output buffer */
if ((infile = fopen(filename, "rb")) == NULL) {
fprintf(stderr, "can't open %s\n", filename);
return 0;
}
/* Step 1: allocate and initialize JPEG decompression object */
/* We set up the normal JPEG error routines, then override error_exit. */
cinfo->err = jpeg_std_error(&jerr.pub);
jerr.pub.error_exit = my_error_exit;
/* Establish the setjmp return context for my_error_exit to use. */
if (setjmp(jerr.setjmp_buffer)) {
jpeg_destroy_decompress(cinfo);
fclose(infile);
return 0;
}
/* Now we can initialize the JPEG decompression object. */
jpeg_create_decompress(cinfo);
/* Step 2: specify data source (eg, a file) */
jpeg_stdio_src(cinfo, infile);
/* Step 3: read file parameters with jpeg_read_header() */
(void)jpeg_read_header(cinfo, TRUE);
/* Step 4: set parameters for decompression */
/* In this example, we don't need to change any of the defaults set by
* jpeg_read_header(), so we do nothing here.
*/
/* Step 5: Start decompressor */
(void)jpeg_start_decompress(cinfo);
/* JSAMPLEs per row in output buffer */
row_stride = cinfo->output_width * cinfo->output_components;
/* Make a one-row-high sample array that will go away when done with image */
buffer = (*cinfo->mem->alloc_sarray)
((j_common_ptr)cinfo, JPOOL_IMAGE, row_stride, 1);
/* Step 6: while (scan lines remain to be read) */
/* jpeg_read_scanlines(...); */
while (cinfo->output_scanline < cinfo->output_height) {
(void)jpeg_read_scanlines(cinfo, buffer, 1);
/* Assume put_scanline_someplace wants a pointer and sample count. */
put_scanline_someplace(buffer[0], row_stride);
}
/* Step 7: Finish decompression */
(void)jpeg_finish_decompress(cinfo);
/* Step 8: Release JPEG decompression object */
jpeg_destroy_decompress(cinfo);
fclose(infile);
return 1;
}
However, to make sure any crash we found in Mozjpeg impacts Instagram itself, we need to see how Instagram integrated Mozjpeg to their code.
Luckily, below you can see that Instagram copy-pasted the best practice for using the library:
Figure 3. Instagram’s implementation for using Mozjpeg.
As you can see, the only thing they really changed was to replace the put_scanline_someplace dummy function from the example code with read_jpg_copy_loop which utilizes memcpy.
Our harness receives generated image files from AFL and sends them to the wrapped Mozjpeg decompression function.
We ran the fuzzer for only a single day with 30 CPU cores, and AFL notified us of 447 “unique” crashes.
After triaging the results, we found an interesting crash related to the parsing of the image dimensions of JPEG. The crash was an out-of-bounds write and we decided to focus on it.
CVE-2020-1895
The vulnerable function is read_jpg_copy_loop which leads to an integer overflow during the decompression process.
Figure 4. Read_jpg_copy_loop code snippet from IDA.
The vulnerable function handles the image dimensions when parsing JPEG image files. Here’s a pseudo code from the original vulnerable code:
width = rect->right - rect->bottom;
height = rect->top - rect->left;
allocated_address = __wrap_malloc(width*height*cinfo->output_components);// <---Integer overflow
bytes_copied = 0;
while ( 1 ){
output_scanline = cinfo->output_scanline;
if ( (unsigned int)output_scanline >= cinfo->output_height )
break;
//reads one line from the file into the cinfo buffer
jpeg_read_scanlines(cinfo, line_buffer, 1);
if ( output_scanline >= Rect->left && output_scanline < Rect->top )
{
memcpy(allocated_address + bytes_copied , line_buffer, width*output_component);// <--Oops
bytes_copied += width * output_component;
}
}
First, let’s understand what this code does.
The __wrap_malloc function allocates a memory chunk based on 3 parameters, which are the image dimensions. Both width and height are 16-bit integers (uint16_t) parsed from the file.
cinfo->output_component tells us how many bytes represent each pixel.
This variable can vary from 1 for Greyscale, 3 for RGB, and 4 for RGB + Alpha\CMYK\etc.
In addition to height and width, the output_component is also completely controlled by the attacker. It is parsed from the file and is not validated with regards to the remaining data available in the file.
__wrap_malloc expects its parameters to be passed in 32-bit registers! That means if we can cause the allocation size to exceed 2^32 bytes, we have an integer overflow that leads to a much smaller allocation than expected.
The allocated size is calculated by multiplying the image’s width, height and output_components. Those sizes are unchecked and in our control. When abused, they lead to an integer overflow.
Data of size (width*output_component) is copied (height) times.
It’s a promising-looking bug from an exploitation perspective: a linear heap-overflow gives the attacker control over the size of the allocation, the amount of overflow, and the contents of the overflowed memory region.
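The overflow arithmetic can be checked directly. The concrete width/height values below are ones we picked for illustration; any combination whose product barely exceeds 2^32 behaves the same way:

```python
# CVE-2020-1895 arithmetic: 16-bit width/height and output_components
# multiply into a 32-bit allocation size, which wraps around.
width, height, components = 32768, 32769, 4     # all attacker-controlled

requested = width * height * components          # what the copy loop writes
allocated = requested & 0xFFFFFFFF               # what the 32-bit malloc sees

# requested is about 4 GB (0x100020000), while the truncated allocation
# is only 0x20000 bytes (128 KB): a massive linear heap overflow.
```

This is precisely the gap the brute-force script in the next section searches for: parameter triples whose truncated product lands in a convenient allocation-size range.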
Wild Copy Exploitation
To cause the memory corruption, we need to overflow the integer determining the allocation size; our calculation must exceed 32 bits. We are dealing with a wildcopy, which means we are trying to copy data larger than 2^32 bytes (4GB). Therefore, there is an extremely high probability the program will crash when the loop reaches an unmapped page:
Figure 5. Segfault caused by our wildcopy.
So how can we exploit this?
Before we dive into wildcopy exploitation techniques, we need to differentiate our case from the classic wildcopy, such as the Stagefright bug. The classic case usually involves a single memcpy that writes 4GB of data.
However, in our case there is a for loop that tries to copy X bytes Y times while X * Y is 4GB.
When we try to exploit such a memory corruption vulnerability, we need to ask ourselves a few important questions:
Can we control (even partially) the content of the data we are corrupting with?
Can we control the length of the data we are corrupting with?
Can we control the size of the allocated chunk we overflow?
This last question is especially important because in Jemalloc/LFH (or any bucket-based allocator), if we can’t control the size of the chunk we are corrupting from, it might be difficult to shape the heap so that we corrupt a specific target structure of a significantly different size.
At first glance, it seems clear that the answer to the first question, about our ability to control the content, is “yes”, because we control the content of the image data.
Now, moving on to the second question – controlling the length of the data we corrupt with. The answer here is also clearly “yes” because the memcpy loop copies the file line by line and the size of each line copied is a multiplication of the width argument and output_component that are controlled by the attacker.
The answer to the third question, about the size of the buffer we corrupt from, is also “yes”.
Since that size is controlled by `width * height * cinfo->output_components`, we wrote a small Python script that computes what these 3 parameters should be for a desired chunk size, taking the effect of the integer overflow into account:
import sys

def main(low=None, high=None):
    res = []
    print("brute forcing...")
    for a in range(0xffff):
        for b in range(0xffff):
            x = 4 * (a + 1) * (b + 1) - 2**32
            if 0 < x <= 0x100000:  # our limit
                if (not low or (x > low)) and (not high or x <= high):
                    res.append((x, a + 1, b + 1))
    for s, x, y in sorted(res, key=lambda i: i[0]):
        print("0x%06x, 0x%08x, 0x%08x" % (s, x, y))

if __name__ == '__main__':
    high = None
    low = None
    if len(sys.argv) == 2:
        high = int(sys.argv[1], 16)
    elif len(sys.argv) == 3:
        high = int(sys.argv[2], 16)
        low = int(sys.argv[1], 16)
    main(low, high)
Now that we have our prerequisites for exploiting a wildcopy, let’s see how we can utilize them.
To trigger the vulnerability, we must make the total copy size exceed 2^32 bytes. In practice, we also need to stop the wildcopy before it reaches unmapped memory.
We have a number of options:
Rely on a race condition – While the wildcopy corrupts some useful target structures or memory, we can race a different thread to use that now corrupted data to do something before the wildcopy crashes (e.g., construct other primitives, terminate the wildcopy, etc.).
If the wildcopy loop has some logic that can stop the loop under certain conditions, we can mess with these checks and stop after it corrupts enough data.
If the wildcopy loop has a call to a virtual function on every iteration, and that pointer to a function is in a structure in heap memory (or at another memory address we can corrupt during the wildcopy), the exploit can use the loop to overwrite and divert execution during the wildcopy.
Sadly, the first option isn’t applicable here: since we are attacking through an image, we have no control over threads, so the race condition option does not help.
To use the second approach, we looked for a kill-switch to stop the wildcopy. We tried cutting the file in half while keeping the same size in the image header. However, we found out that if the library reaches an EOF marker, it just adds another EOF marker, so we end up in an infinite loop of EOF markers.
We also tried looking for an ERREXIT function that could stop the decompression process at runtime, but we learned that no matter what we do, we can never reach a path that leads to ERREXIT in this code. Therefore, the second option isn’t applicable either.
To use the third option, we need to look for a virtual function that gets called on every iteration of our wildcopy loop.
Let’s go back to the loop logic where the memcpy occurs:
process_data points to another function called process_data_simple_main:
process_data_simple_main(j_decompress_ptr cinfo, JSAMPARRAY output_buf,
JDIMENSION *out_row_ctr, JDIMENSION out_rows_avail)
{
my_main_ptr main_ptr = (my_main_ptr)cinfo->main;
JDIMENSION rowgroups_avail;
/* Read input data if we haven't filled the main buffer yet */
if (!main_ptr->buffer_full) {
if (!(*cinfo->coef->decompress_data) (cinfo, main_ptr->buffer))
return;
main_ptr->buffer_full = TRUE;
}
rowgroups_avail = (JDIMENSION)cinfo->_min_DCT_scaled_size;
/* Feed the postprocessor */
(*cinfo->post->post_process_data) (cinfo, main_ptr->buffer,
&main_ptr->rowgroup_ctr, rowgroups_avail,
output_buf, out_row_ctr, out_rows_avail);
/* Has postprocessor consumed all the data yet? If so, mark buffer empty */
if (main_ptr->rowgroup_ctr >= rowgroups_avail) {
main_ptr->buffer_full = FALSE;
main_ptr->rowgroup_ctr = 0;
}
}
From process_data_simple_main, we can identify 2 more virtual functions (decompress_data and post_process_data) that get called on every iteration. They all have the cinfo struct as a common denominator.
What is this cinfo?
cinfo is a struct that is passed around throughout Mozjpeg’s various functionality. It holds crucial members, function pointers and image metadata.
Let’s look at the cinfo struct from jpeglib.h:
struct jpeg_decompress_struct {
struct jpeg_error_mgr *err;
struct jpeg_memory_mgr *mem;
struct jpeg_progress_mgr *progress;
void *client_data;
boolean is_decompressor;
int global_state;
struct jpeg_source_mgr *src;
JDIMENSION image_width;
JDIMENSION image_height;
int num_components;
...
J_COLOR_SPACE out_color_space;
unsigned int scale_num;
...
JDIMENSION output_width;
JDIMENSION output_height;
int out_color_components;
int output_components;
int rec_outbuf_height;
int actual_number_of_colors;
...
boolean saw_JFIF_marker;
UINT8 JFIF_major_version;
UINT8 JFIF_minor_version;
UINT8 density_unit;
UINT16 X_density;
UINT16 Y_density;
...
...
int unread_marker;
struct jpeg_decomp_master *master;
struct jpeg_d_main_controller *main; <<-- there’s a function pointer here
struct jpeg_d_coef_controller *coef; <<-- there’s a function pointer here
struct jpeg_d_post_controller *post; <<-- there’s a function pointer here
struct jpeg_input_controller *inputctl;
struct jpeg_marker_reader *marker;
struct jpeg_entropy_decoder *entropy;
. . .
struct jpeg_upsampler *upsample;
struct jpeg_color_deconverter *cconvert;
. . .
};
In the cinfo struct, we can see 3 pointers to functions that we can try to overwrite during the wildcopy loop to divert the execution flow.
It turns out that the third option is applicable in our case!
Jemalloc 101
Before we dive into the Jemalloc exploitation concepts, we need to understand how Android’s heap allocator works, as well as the terms we focus on in the next chapter: chunks, runs, and regions.
Jemalloc is a bucket-based allocator that divides memory into chunks, always of the same size, and uses these chunks to store all of its other data structures (and user-requested memory as well). Chunks are further divided into ‘runs’ that are responsible for requests/allocations up to certain sizes. A run keeps track of free and used ‘regions’ of these sizes. Regions are the heap items returned on user allocations (malloc calls). Finally, each run is associated with a ‘bin.’ Bins are responsible for storing structures (trees) of free regions.
Figure 6. Jemalloc basic design.
Controlling the PC register
We found 3 good function pointers that we can use to divert execution during the wildcopy and control the PC register.
Mozjpeg has its own memory manager. The JPEG library’s memory manager controls allocating and freeing memory, and it manages large “virtual” data arrays. All memory and temporary file allocation within the library is done via the memory manager. This approach helps prevent storage-leak bugs, and it speeds up operations whenever malloc/free are slow.
The memory manager creates “pools” of free storage, and a whole pool can be freed at once.
Some data is allocated “permanently” and is not freed until the JPEG object is destroyed.
Most of the data is allocated “per image” and is freed by jpeg_finish_decompress or jpeg_abort functions.
For example, let’s look at one of the allocations that Mozjpeg did as part of the image decoding process. When Mozjpeg asks to allocate 0x108 bytes, in reality malloc is called with the size 0x777. As you can see, the requested size and the actual size allocated are different.
Let’s analyze this behavior.
Mozjpeg uses wrapper functions for small and big allocations alloc_small and alloc_large.
The allocated “pools” are managed by alloc_small and the other wrapper functions which maintain a set of members that help them monitor the state of the “pools.” Therefore, whenever there is an allocation request, the wrapper functions check if there is enough space left in the “pool.”
If there is space available, the alloc_small function returns an address from the current “pool” and advances the pointer that points to the free space.
When the “pool” runs out of space, it allocates another “pool” using predefined sizes that it reads from the first_pool_slop array, which in our case are 1600 and 16000.
static const size_t first_pool_slop[JPOOL_NUMPOOLS] = {
1600, /* first PERMANENT pool */
16000 /* first IMAGE pool */
};
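The bump-pointer behavior described above can be sketched in Python. This is a simplified model, not the real alloc_small: it ignores the separate PERMANENT/IMAGE pool lists, pool header overhead, and alignment.

```python
FIRST_POOL_SLOP = [1600, 16000]  # sizes of the first PERMANENT and IMAGE pools

class Pool:
    def __init__(self, size):
        self.size = size
        self.used = 0

class SimpleMemMgr:
    def __init__(self):
        self.pools = []
        self.next_slop = 0

    def alloc_small(self, nbytes):
        # Reuse the current pool if it has room; otherwise grab a new pool
        # whose size comes from the predefined slop table.
        if not self.pools or self.pools[-1].used + nbytes > self.pools[-1].size:
            slop = FIRST_POOL_SLOP[min(self.next_slop, len(FIRST_POOL_SLOP) - 1)]
            self.next_slop += 1
            self.pools.append(Pool(max(slop, nbytes)))
        pool = self.pools[-1]
        offset = pool.used
        pool.used += nbytes
        return (len(self.pools) - 1, offset)  # (pool index, offset) stands in for an address
```

Repeated small requests land back-to-back in the same pool until it fills, at which point the next, larger pool is created, matching the behavior observed during decompression.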
Now that we understand how Mozjpeg’s memory manager works, we need to figure out which “pool” of memory holds our targeted virtual function pointers.
As part of the decompression process, there are two major functions that decode the image metadata and prepare the environment for later processing. The two major functions jpeg_read_header and jpeg_start_decompress are the only functions that allocate memory until we reach our wild copy loop.
jpeg_read_header parses the different markers from the file.
While parsing those markers, the second and largest “pool” of size 16000 (0x3e80) gets allocated by the Mozjpeg memory manager. The sizes of the “pools” are const values from the first_pool_slop array (from the code snippet above), which means that the Mozjpeg’s internal allocator already used all of the space of the first pool.
We know that our targeted main, coef and post structures get allocated from within the jpeg_start_decompress function. We can therefore safely assume that the rest of the allocations (until we reach our wildcopy loop) will end up being in the second big “pool” including the main, coef and post structures that we want to override!
Now let’s have a closer look on how Jemalloc deals with this type of size class allocation.
Using Shadow to shed some light
Allocations returned by Jemalloc are divided into three size classes: small, large, and huge.
Small/medium: These regions are smaller than the page size (typically 4KB).
Large: These regions are between small/medium and huge (between page size to chunk size).
Huge: These are bigger than the chunk size. They are dealt with separately and not managed by arenas; they have a global allocator tree.
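These three categories can be captured in a rough classifier, using the page and chunk sizes mentioned in this post. This is only a sketch: real jemalloc also rounds each request up to a bin size.

```python
PAGE_SIZE = 0x1000      # typical 4 KB page
CHUNK_SIZE = 0x200000   # 2 MB chunk on the tested Android version

def je_size_category(request):
    if request < PAGE_SIZE:
        return "small"      # served from a multi-region run
    elif request < CHUNK_SIZE:
        return "large"      # gets a dedicated run
    return "huge"           # handled separately by the global allocator tree

print(je_size_category(0x3fc7))  # the 16 KB "pool" request -> large
```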
Memory returned by the OS is divided into chunks, the highest abstraction used in Jemalloc’s design. In Android, those chunks have different sizes for different versions. They are usually around 2MB/4MB. Each chunk is associated with an arena.
A run can be used to host either one large allocation or multiple small allocations.
Large regions have their own runs, i.e. each large allocation has a dedicated run.
We know that our targeted “pool” size (0x3e80 = 16,000 decimal) is bigger than the page size (4KB) and smaller than the Android chunk size. Therefore, Jemalloc allocates a large run of size 0x5000 each time!
Let’s take a closer look.
(gdb)info registers X0
X0 0x3fc7
(gdb)bt
#0 0x0000007e6a0cbd44 in malloc () from target:/system/lib64/libc.so
#1 0x0000007e488b3e3c in alloc_small () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
#2 0x0000007e488ab1e8 in get_sof () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
#3 0x0000007e488aa9b8 in read_markers () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
#4 0x0000007e488a92bc in consume_markers () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
#5 0x0000007e488a354c in jpeg_consume_input () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
#6 0x0000007e488a349c in jpeg_read_header () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
We can see that the actual value sent to malloc is indeed 0x3fc7. This matches the large “pool” size of 16,000 (0x3e80), plus the size of Mozjpeg’s large_pool_hdr, plus the actual size of the object that was supposed to be allocated, plus ALIGN_SIZE (16/32) – 1.
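The arithmetic is easy to sanity-check; the residual 0x147 bytes are the combined header, object and alignment overhead described above:

```python
POOL_SLOP = 16000      # 0x3e80, second entry of first_pool_slop
malloc_arg = 0x3fc7    # value observed in register X0 at the malloc breakpoint

overhead = malloc_arg - POOL_SLOP
print(hex(overhead))   # 0x147: large_pool_hdr + triggering object + ALIGN_SIZE - 1
```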
One thing that can make a huge difference when implementing heap shaping for an exploit is a way to visualize the heap: to see the various allocations in the context of the heap.
For this we used “shadow,” a tool that argp and vats wrote for visualizing the Jemalloc heap.
We ran a debugging session with shadow over gdb to verify our assumptions about the large run we wish to override.
Our goal is to exploit an integer overflow that leads to a heap buffer overflow.
Exploiting these kinds of bugs is all about precise positioning of heap objects. We want to force certain objects to be allocated in specific locations in the heap, so we can form useful adjacencies for memory corruption.
To achieve this adjacency, we need to shape the heap so our exploitable object is allocated just before our targeted object.
Unfortunately, we have no control over free operations. According to the Mozjpeg documentation, most of the data is allocated “per image” and is freed by jpeg_finish_decompress or jpeg_abort. This means that all free operations occur at the end of the decompression process, which is only after we have finished overwriting memory with our wildcopy loop.
However, in our case we don’t need any free operations because we have control over a function which performs a raw malloc with a size that we control. This gives us the power to choose where we want to place our overflowed buffer on the heap.
We want to position the object containing our overflowed buffer just before the large (0x5000) object containing the main/post/coef data structures that performs a call to function pointers.
Figure 7. Visualizing Jemalloc objects on the heap.
Therefore, the simplest way for us to exploit this is to shape the heap so that the overflowed buffer is allocated right before our targeted large (0x5000) object, and then (use the bug to) overwrite the main/post/coef virtual functions address to our own. This gives us full control of the virtual table that redirects any method to any code address.
We know that the targeted object is always at the same (0x5000) large size, and because Jemalloc allocates large sizes from top to bottom, the only thing we need is to place our overflow objects in the bottom of the same chunk where the large target object is located.
Jemalloc’s chunk size is 2MB in our tested Android version.
The distance (in bytes) between the objects doesn’t matter because we have a wildcopy loop that can copy enormous amounts of data line by line (we control the size of the line). The data that is copied is ultimately larger than 2MB, so we know for sure that we will end up corrupting every object on the chunk that is located after our overflow object.
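A quick back-of-the-envelope check confirms that the copy sweeps past the end of the 2 MB chunk within a handful of iterations; the per-line size below is a hypothetical example, not a value from the exploit.

```python
CHUNK_SIZE = 0x200000     # 2 MB jemalloc chunk
line_size = 0x40000       # hypothetical width * output_components per copied line
total = 2**32 + 0x100000  # total bytes the wildcopy will attempt to write

lines_to_cross_chunk = -(-CHUNK_SIZE // line_size)  # ceiling division
print(lines_to_cross_chunk)    # iterations needed to sweep an entire chunk
assert total > CHUNK_SIZE      # the wildcopy is guaranteed to get there
```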
As we don’t have any control over free operations, we cannot create holes that our object will fall to. (A hole is one or more free places in a run.) Instead, we tried looking for holes that happen anyways as part of the image decompression flow, looking for sizes that repeat every time during debugging.
Let’s use the shadow tool to examine our chunk’s layout in memory:
(gdb) jechunk 0x72a6200000
This chunk belongs to the arena at 0x72c808fc00.
addr info size usage
------------------------------------------------------------
0x72a6200000 headers 0xd000 -
0x72a620d000 large run 0x1b000 -
0x72a6227000 large run 0x1b000 -
0x72a6228000 small run (0x180) 0x3000 10/32
0x72a622b000 small run (0x200) 0x1000 8/8
...
...
0x72a638f000 small run (0x80) 0x1000 6/32
0x72a6390000 small run (0x60) 0x3000 12/128
0x72a6393000 small run (0xc00) 0x3000 4/4
0x72a6396000 small run (0xc00) 0x3000 4/4
0x72a6399000 small run (0x200) 0x1000 2/8
0x72a639a000 small run (0xe0) 0x7000 6/128 <===== The run we want to hit!!!
0x72a63a1000 small run (0x1000) 0x1000 1/1
0x72a63a2000 small run (0x1000) 0x1000 1/1
0x72a63a3000 small run (0x1000) 0x1000 1/1
0x72a63a4000 small run (0x1000) 0x1000 1/1
0x72a63a5000 large run 0x5000 - <===== Large targeted object!!!
We are looking for runs with holes, and those runs must be before the large targeted buffer we want to override. A run can be used to host either one large allocation, or multiple small/medium allocations.
Runs that host small allocations are divided into regions. A region is synonymous with a small allocation. Each small run hosts regions of just one size; in other words, a small run is associated with exactly one region size class.
Runs that host medium allocations are also divided into regions, but as the name indicates, they are bigger than the small allocations. Therefore, the runs that host medium allocations are divided into bigger size class regions that take up more space.
For example, a small run of size class 0xe0 is divided into 128 regions:
0x72a639a000 small run (0xe0) 0x7000 6/128
Medium runs of size class 0x200 are divided into 8 regions:
0x72a6399000 small run (0x200) 0x1000 2/8
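The region counts shown in the jechunk dump follow directly from dividing the run size by the region size:

```python
# Values taken from the jechunk output above
assert 0x7000 // 0xe0 == 128   # the 0xe0 run: 28672 / 224 = 128 regions
assert 0x1000 // 0x200 == 8    # the 0x200 run: 4096 / 512 = 8 regions
print("region math checks out")
```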
Small allocations are the most common allocations, and most likely the ones you need to manipulate/control/overflow. As small allocations are divided into more regions, they are easier to control as it is less likely that other threads will allocate all of the remaining regions.
Therefore, to cause the overflowable object to be allocated before the large targeted object, we use our Python script from the Wild Copy Exploitation section above. The script generates the dimensions that cause malloc to place our overflowable object in the targeted small size class.
We constructed a new JPEG image with dimensions that trigger an allocation in the small size class of 0xe0 objects, and set a breakpoint on libjepgutils_moz.so+0x918.
The address we got back from malloc is the address of our overflowable object (0x72a639ac40). Let’s examine its location on the heap using the jeinfo method from the shadow framework.
(gdb) jeinfo 0x72a639ac40
parent address size
--------------------------------------
arena 0x72c808fc00 -
chunk 0x72a6200000 0x200000
run 0x72a639a000 0x7000
region 0x72a639ac40 0xe0
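The jeinfo output can be cross-checked with simple arithmetic: the returned pointer sits exactly on a region boundary within the 0xe0 run.

```python
run_base  = 0x72a639a000   # run address from jeinfo
region    = 0x72a639ac40   # address malloc returned for our object
region_sz = 0xe0

offset = region - run_base
assert offset % region_sz == 0   # the address is region-aligned within the run
print(offset // region_sz)       # region index within the 0xe0 run -> 14
```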
We are at the same chunk (0x72a6200000) as our targeted large object! Let’s look at the chunk’s layout again to make sure that our overflowable buffer is at the small size class (0xe0) that we aimed to hit.
(gdb) jechunk 0x72a6200000
This chunk belongs to the arena at 0x72c808fc00.
…
...
0x72a639a000 small run (0xe0) 0x7000 7/128 <-----hit!!!
0x72a63a1000 small run (0x1000) 0x1000 1/1
0x72a63a2000 small run (0x1000) 0x1000 1/1
0x72a63a3000 small run (0x1000) 0x1000 1/1
0x72a63a4000 small run (0x1000) 0x1000 1/1
0x72a63a5000 large run 0x5000 - <------Large targeted object!!!
Yesss! Now let’s continue the execution and see what happens when we overwrite the large targeted object.
(gdb) c
Continuing.
[New Thread 29767.30462]
Thread 93 "IgExecutor #19" received signal SIGBUS, Bus error.
0xff9d9588ff989083 in ?? ()
BOOM! Exactly what we were aiming for: the crash occurred while trying to load a function address through the function pointer that our overflow corrupted. We got a bus error (SIGBUS, signal 7 on Linux), which occurs when a process tries to access memory that the CPU cannot physically address. In other words, the address the program tried to jump through is not a valid memory address, because it contains the data from our image that replaced the real function pointer and led to this crash!
Putting everything together
We have a controlled function call. All that is missing for a reliable exploit is to redirect execution to a convenient gadget to stack pivot, and then build a ROP chain.
Now we need to put everything together and (1) construct an image with malformed dimensions that (2) triggers the bug, which then (3) leads to a copy of our controlled payload that (4) diverts execution to an address we control.
We need to generate a corrupted JPEG with our controlled data. Our next step, therefore, was to determine exactly which image formats are supported by the Mozjpeg platform. We can figure that out from the piece of code below: out_color_components, the number of color components per pixel, is set according to out_color_space, which is determined by the image format.
switch (cinfo->out_color_space) {
case JCS_GRAYSCALE:
  cinfo->out_color_components = 1;
  break;
case JCS_RGB:
case JCS_EXT_RGB:
case JCS_EXT_RGBX:
case JCS_EXT_BGR:
case JCS_EXT_BGRX:
case JCS_EXT_XBGR:
case JCS_EXT_XRGB:
case JCS_EXT_RGBA:
case JCS_EXT_BGRA:
case JCS_EXT_ABGR:
case JCS_EXT_ARGB:
  cinfo->out_color_components = rgb_pixelsize[cinfo->out_color_space];
  break;
case JCS_YCbCr:
case JCS_RGB565:
  cinfo->out_color_components = 3;
  break;
case JCS_CMYK:
case JCS_YCCK:
  cinfo->out_color_components = 4;
  break;
default:
  cinfo->out_color_components = cinfo->num_components;
  break;
}
We used a simple Python library called PIL to construct an RGB BMP file. We chose the RGB format because it is familiar to us, and we filled it with “AAA” as the payload. This file is the base image we use to create our malicious compressed JPEG.
from PIL import Image

img = Image.new('RGB', (100, 100))
pixels = img.load()
for i in range(img.size[0]):
    for j in range(img.size[1]):
        pixels[i, j] = (0x41, 0x41, 0x41)
img.save('rgb100.bmp')
We then used the cjpeg tool from the Mozjpeg project to compress our bmp file into a JPEG file.
Next, we tested the compressed output file to verify our assumptions. We know that the RGB format is 3 bytes per pixel.
We verified that the code does set cinfo->out_color_space = 0x2 (JCS_RGB) correctly. However, when we checked our controlled allocation, we saw that in the integer overflow the height and width are still multiplied by an out_color_components of 4, even though we started with an RGB format of 3×8 bits per pixel. It seems that Mozjpeg prefers to convert our image to a 4×8-bits-per-pixel format.
We then turned to a 4×8-bits-per-pixel format that is supported by the Mozjpeg platform, and the CMYK format met the criteria. We used CMYK as the base image to give us full control over all 4 bytes, and filled the image with “AAAA” as the payload.
We compressed it to a JPEG format and added the dimensions that trigger the bug. To our delight, we got the following crash!
Thread 93 "IgExecutor #19" received signal SIGBUS, Bus error.
0xff414141ff414141 in ?? ()
However, we got weird 0xFF bytes as part of our controlled address, even though we constructed a 4×8-bits-per-pixel image and 0xFF is not part of our payload.
Bitmap file formats that support transparency include GIF, PNG, BMP, TIFF, and JPEG 2000, through either a transparent color or an alpha channel.
Bitmap-based images are technically characterized by the width and height of the image in pixels and by the number of bits per pixel.
Therefore, we decided to construct an RGBA BMP file with a controlled alpha channel (0x61) using the PIL library.
from PIL import Image

img = Image.new('RGBA', (100, 100))
pixels = img.load()
for i in range(img.size[0]):
    for j in range(img.size[1]):
        pixels[i, j] = (0x41, 0x41, 0x41, 0x61)
img.save('rgba100.bmp')
Surprisingly, we got the same results as with the CMYK malicious JPEG. We still got an alpha channel of 0xFF as part of our controlled address, even though we used an RGBA base image with our own alpha value (0x61). How did this happen? Let’s go back to the code to understand the reason for this odd behavior.
We found the answer in this little piece of code below:
Figure 8. Setting cinfo->out_color_space to RGBA(0xC) as seen in the IDA disassembly snippet.
We found that Instagram decided to add their own const value after jpeg_read_header finished and before calling jpeg_start_decompress.
We used the RGB format from the first test and we saw that Mozjpeg does correctly set cinfo->out_color_space = 0x2 (JCS_RGB). However, from Instagram’s code (see Figure 3) we can see that this value is overwritten by a const value of 0xc which represents the (JCS_EXT_RGBA) format.
This also explains the weird 0xFF alpha channel that we got even though we used a 3×8-bits per pixel RGB object.
After diving further into the code, we saw that value of the alpha channel (0xFF) is hard coded as a const value. When Instagram sets the cinfo->out_color_space = 0xc to point to the (JCS_EXT_RGBA) format, the code copies 3 bytes from our input base file, and then the 4th byte copied is always the hardcoded alpha channel value.
#ifdef RGB_ALPHA
outptr[RGB_ALPHA] = 0xFF;
#endif
Putting everything together, we came to the conclusion that no matter what image format is used as the base of the compressed JPEG, Instagram always converts the output to an RGBA format.
Little-endian systems store the least significant byte of a word at the smallest memory address. Because we’re dealing with a little-endian system, the hardcoded alpha value, written as the fourth byte of every pixel, always lands in the MSB (Most Significant Byte) of our controlled address; only on a big-endian machine could we have achieved our goal this way. As we’re trying to exploit the bug in user mode, and addresses starting with 0xFF belong to the kernel address space, this foils our plans.
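The little-endian byte placement can be seen directly by interpreting one output pixel, three controlled bytes followed by the hardcoded alpha, as a 32-bit word:

```python
import struct

# One output pixel as copied: our three controlled 0x41 bytes, then the hardcoded 0xFF alpha
pixel = bytes([0x41, 0x41, 0x41, 0xFF])
value = struct.unpack('<I', pixel)[0]   # little-endian interpretation
print(hex(value))   # 0xff414141: the alpha lands in the most significant byte
```

This matches the crash address 0xff414141ff414141 seen in gdb: two such pixels back to back.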
Is exploitation possible?
We lost our quick win. One lesson we can learn from this is that real life is not a CTF game, and sometimes one crucial const value set by a developer can ruin everything from an exploitation perspective.
Let’s recall the content from the main website of the Mozilla foundation about Mozjpeg:
“Mozjpeg’s sole purpose is to reduce the size of JPEG files that are served up on the web.”
From what we saw, Instagram increases memory usage by 25% for every image uploaded, and that’s about 100 million images per day!
To quote one sentence from a lecture that Halvar Flake gave at the last OffensiveCon:
“The only person in computing that is paid to actually understand the system from top to bottom is the attacker! Everybody else usually gets paid to do their parts.”
At this point, Facebook already patched the vulnerability so we stopped our exploitation effort even though we weren’t quite finished with it.
We still have a 3-byte overwrite, and in theory we could invest more time to find more useful primitives that could help us exploit this bug. However, we decided we had done enough and had made the important point we wanted to convey.
The Mozjpeg project in Instagram is just the tip of the iceberg. The Mozilla-based project is still widely used in many other projects over the web, in particular Firefox, and it is also part of popular open-source projects such as sharp and libvips (on GitHub alone, they have more than 20k stars combined).
Conclusion & Recommendations
Our blog post describes how image parsing code, as a third party library, ends up being the weakest point of Instagram’s large system. Fuzzing the exposed code turned up some new vulnerabilities which have since been fixed. It is likely that, given enough effort, one of these vulnerabilities can be exploited for RCE in a zero-click attack scenario. Unfortunately, it is also likely that other bugs remain or will be introduced in the future. As such, continuous fuzz-testing of this and similar media format parsing code, both in operating system libraries and third party libraries, is absolutely necessary. We also recommend reducing the attack surface by restricting the receiver to a small number of supported image formats.
This field has been researched extensively, both by respected independent security researchers and by nation-state-sponsored ones. Media format parsing remains an important issue. See also other researcher and vendor advisories:
Facebook’s advisory described this vulnerability as an “Integer Overflow leading to Heap Buffer Overflow – large heap overflow could occur in Instagram for Android when attempting to upload an image with specially crafted dimensions. This affects versions prior to 128.0.0.26.128.”
We at Check Point responsibly disclosed the vulnerability to Facebook, who acknowledged it, assigned it CVE-2020-1895, and released a patch on February 10, 2020. The bug was tested on both 32-bit and 64-bit versions of the Instagram app.
Many thanks to my colleagues Eyal Itkin (@EyalItkin), Oleg Ilushin, Omri Herscovici (@omriher) for their help in this research.