Exploiting CVE-2021-3490 for Container Escapes

Original text by Karsten König

Today, containers are the preferred approach to  deploy software or create build environments in CI/CD lifecycles. However, since the emergence of container solutions and environments like Docker and Kubernetes, security researchers have consistently found ways to escape from containers once they are compromised. Most attacks are based on configuration errors. But it is also possible to escalate privileges and escape to the container’s host system by exploiting vulnerabilities in the host’s operating system.

This blog shows how to modify an existing Linux kernel exploit in order to use it for container escapes and how the CrowdStrike Falcon® platform can help to prevent and hunt for similar threats.

Original Technique

Before we outline the modifications required to turn the exploit into a container escape, we first look at what the original exploit achieved.

Valentina Palmiotti published a full exploit for CVE-2021-3490 that can be used to locally escalate privileges to root on affected systems. The vulnerability was rooted in the eBPF subsystem of the Linux kernel and fixed in version 5.10.37. eBPF allows user space processes to load custom programs into the kernel and attach them to so-called events, thus giving user space the ability to observe kernel internals and, in specifically supported cases, to implement custom logic for networking, access control and other tasks. These eBPF programs have to pass a verifier before being loaded, which is supposed to guarantee that the code does not contain loops and does not write to memory outside of its dedicated area. This step should ensure that eBPF programs terminate and are not able to manipulate kernel memory, which would potentially allow attackers to escalate privileges. However, this verifier contained several vulnerabilities in the past. CVE-2021-3490  is one of them and can ultimately be used to achieve a kernel read and write primitive.

Building on the kernel read primitive, it is possible to leak a kernel pointer. eBPF programs can communicate with processes running in user space using so-called “eBPF maps.”  Every eBPF map is described by a 

struct bpf_map
 object, which contains a  field 
ops
 pointing to a 
struct bpf_map_ops
. That struct contains several function pointers for working with the eBPF map. eBPF maps come in different kinds with different definitions for 
ops
 stored at known offsets. For array maps, 
ops
 will be set to point to the kernel symbol 
array_map_ops
. The exploit will leak that address and then use it as a starting point to further scan the kernel’s memory space and read pointers from the kernel’s symbol table.

The kernel exports pointers to certain variables, objects and functions in a symbol table to make them accessible by kernel modules. This table is called 

ksymtab
. In order to look up the actual name of a stored symbol address, a second table, called 
kstrtab
, is utilized. A pointer to the string in 
kstrtab
 that contains the name is stored as part of every 
ksymtab
 entry, right after the pointer to the symbol itself. To find the address of a kernel symbol, the exploit first reads memory from kernel space starting at the leaked address of 
array_map_ops
 using the arbitrary read primitive. This is done until the string containing the symbol name of interest is found in 
kstrtab
.

Because 

kstrtab
 is mapped after 
ksymtab
, the previously read memory region should contain the pointer to the string in 
kstrtab
. Therefore, the exploit then proceeds to search for that pointer, and the pointer to the actual symbol is stored right before it.

One clarification has to be made about the above code excerpt: There are two different formats of 

ksymtab
. Which one is used is decided when the Linux kernel is built from source by the configuration parameter 
HAVE_ARCH_PREL32_RELOCATIONS
. In one case, the actual addresses of the symbols and 
kstrtab
 entries are stored. However, in many kernel builds it does not actually contain pointers but offsets such that the address where the offset is stored plus the offset itself is the symbol’s address or the string’s address in 
kstrtab
.

Nevertheless, using this technique, the exploit identifies the address of 

init_pid_ns
. This object is a 
struct pid_namespace
 and describes the default process ID namespace new processes are started in.

Namespaces have become a  fundamental feature of Linux and are crucial to the idea of container environments. They allow separating system resources between different processes such that one process can observe a completely different set than others. For example, mount namespaces control observable filesystem mount points such that two processes can have different views of the filesystem. This allows a container’s filesystem to have a different root directory than the host. Process ID namespaces on the other side give processes a completely unique process tree. The first process in a process ID namespace always has the identifier (PID) 1. It is considered as the 

init
 process that initializes the operating system and from which new processes originate. Therefore, if this process is stopped, all other processes in the particular process ID namespace are stopped as well.

By identifying 

init_pid_ns
, it is possible to enumerate all 
struct task
 objects of the processes running in that namespace as those are stored in a traversable radix tree in the field 
idr
. The exploit identifies the correct 
task
 object by its PID. Those 
task
 objects store a pointer to a 
struct cred
 object that contains the UID and GID (user and group identifier) associated with the process and therefore holds the granted permissions. By overwriting the 
cred
 object of a process, it is possible to escalate privileges by setting the UID and GID to 0, which is associated with the 
root
 user.

However, this approach does not work if a container was compromised and the attacker’s goal is to escape into the container’s host environment.

Why This Doesn’t Work in Containers

Linux kernel exploits are an alternative method to escape container environments to the host in case no mistakes in the container configuration were made. They can be used because containers share the host’s kernel and therefore its vulnerabilities, regardless of the Linux distribution the container is based on. However, exploit developers have to pay attention to some obstacles compared to privilege escalation outside of container environments.

First, container solutions are able to restrict the capabilities of processes running inside a container. For example, the capability 

SYS_ADMIN
 is normally not granted to processes running in containers, which can therefore not mount file systems or execute various other privileged actions. Moreover, it is possible to restrict the set of syscalls a userland process can call by utilizing 
seccomp
. For example, in the default configuration of Docker, an exploit would not be able to use eBPF at all. Nevertheless, in the default Kubernetes configuration, 
seccomp
 does not restrict the available syscalls at all. For the remainder of this post, though, we will assume that the container is configured such that eBPF could be used by userland processes.

Second, on a more practical note, the techniques of the original exploit described above will not work out-of-the-box. As already described, containers rely heavily on namespaces. Because containers typically have their own associated process ID namespace, it is not as straightforward to identify the exploit process running in the container by its PID, because, for example, the exploit may have PID 42 from the container’s perspective but PID 1337 from the host’s perspective. However, the parent namespace can still observe all processes running in child namespaces. Therefore, those processes have a PID in both parent and child namespace. Ultimately, the initial process ID namespace described by 

init_pid_ns
 can observe any process running on a particular system. Nevertheless, even if we identify the 
task
 structure of our exploit process within a container, overwriting its 
cred
 object as described previously will simply elevate privileges within the container but not allow container escape.

Changes for Container Escapes

It is possible to modify the exploit so that a container escape is conducted and privileges are escalated to 

root
 on the host. To easily find the exploit process in a container, an exploit can search for the symbol 
current_task
 and 
pcpu_base_addr
 symbols in 
ksymtab
current_task
 stores the offset to the running process’s 
task
 object based on the address stored in 
pcpu_base_addr
. Because 
pcpu_base_addr
 is unique per CPU core, the process must be pinned before on one core using the 
sched_setaffinity()
 system call
.

Using this technique it is possible to identify the correct 

task
 object without traversing the radix tree of all processes stored in 
init_pid_ns
.

This allows the attacker to overwrite the correct 

cred
 object and therefore obtain 
root
privileges. Due to the usage of namespaces, the observable file system is still that of the container, though. Nevertheless, it is possible to overcome that obstacle as well. The 
task
object contains a pointer to a 
struct fs_struct
 object. This object contains information about the observable file system, i.e.,which directory is considered as the processes’ file system root. Using the leaked pointer to 
init_pid_ns
, it is possible to traverse the process radix tree and identify the host’s 
init
 process, which has PID 1. Next, it is possible to retrieve the 
fs
pointer from this process’s 
task
 object. Lastly, while overwriting the 
cred
 object of the exploit process, the 
fs
 pointer must be overwritten as well using the 
init
 process’s 
fs
 pointer. The exploit process can then observe the complete host file system.

One last addition must be made. As stated above, containers normally have limited capabilities. Capabilities are used to restrict the permissions of processes running in containers. To obtain full privileges, the exploit also has to overwrite the capabilities mask of the exploit’s process in the 

task
 object. How exactly the values must be set to obtain full capabilities without any restrictions can be investigated in the definition of the 
init
 process’ credentials
.

The technique described in this blog to identify the 

task
 object of the exploit’s process only works on Linux kernel version before 5.15, as 
pcpu_base_addr
 is no longer exported as a symbol to 
ksymtab
. Nevertheless, alternative methods exist to find the correct 
task
 object, e.g., by traversing the radix tree of all processes from 
init_pid_ns
 and matching on features of the exploit process other than the PID, such as the 
comm
 member of 
struct task
 that contains the executable name.

Container Escape Mitigations

Detecting this and similar exploits is very hard as they are data-only and misuse only legitimate system calls. The CrowdStrike Falcon platform can assist in preventing attacks using similar techniques for privilege escalation. As a defense-in-depth strategy, the following steps can be taken to harden Linux hosts and container environments to prevent exploitation of CVE-2021-3490 and future attacks.

  1. Upgrade the kernel version. With a critical kernel vulnerability like CVE-2021-3490, it is paramount that available fixes are applied by upgrading the kernel version.
  2. Provide only required capabilities to the container. By limiting the capabilities of the container, the root account of the container becomes limited in its capabilities, which significantly reduces the chances of container escape and exploitation of kernel vulnerabilities. For example, to exploit the CVE-2021-3490 using the described technique, the attacker needs CAP_BPF or CAP_SYS_ADMIN granted. Note that privileged containers have those capabilities. Therefore, you should monitor your environment for such containers with CrowdStrike Falcon® Cloud Workload Protection (CWP), as discussed in point 4 below.
  3. Use a seccomp profile. While Kubernetes does not apply a seccomp profile without configuration, Docker’s default seccomp profiles protect against a number of dangerous system calls that can help attackers to break out of the container environment. Correct Seccomp profiles can help significantly reduce the container attack surface. CVE-2021-3490 requires the 
    bpf
     system call to exploit the vulnerability, which is blocked in Docker’s default seccomp profile. Hence, exploitation of CVE-2021-3490 in a container environment using a strong seccomp profile would fail.
  4. Monitor host and containerized environment for a breach. In case a privileged workload or a host is compromised by attackers, the organization needs state-of-the-art monitoring and detection capabilities to prevent and detect advanced persistent threats (APTs), eCrime and nation-state actors. CrowdStrike can help with this. Falcon Cloud Workload Protection identifies any indicators of misconfiguration (IOMs) in your containerized environment to uncover a weakness. Falcon Cloud Workload Protection prevents and detects malicious activity on your host and containers to prevent and detect — in real time — breaches by eCrime and nation-state adversaries. For example, if a privileged container or a container without a seccomp profile is executed, the following notifications would appear:

Also, Falcon CWP helps to hunt for threats using the eBPF subsystem to escalate privileges by logging if the 

bpf
 system call was used by a process.

Conclusion

Container technology is a good solution to separate and fine-tune resources to different processes. However, while existing solutions add another layer of security due to the restriction of capabilities and available syscalls, the available attack surface inside a container still contains the host’s kernel. Every eased restriction — for example, allowing the use of eBPF — will increase the attack surface. If a threat actor is able to take advantage of a vulnerability inside the host’s kernel and an exploit is available, the host can be compromised, regardless of other security layers and restrictions such as namespaces.

This blog showed exactly that: Not much effort is needed to turn a full exploit chain for a local privilege escalation into one that is able to escape containers as well. The basic rules of network hygiene (patch early and often) not only apply to containers but to the hosts that deploy those in a cloud environment as well. Moreover, solutions such as Docker and Kubernetes can reduce the attack surface drastically if configured properly. CrowdStrike Falcon Cloud Workload Protectioncan assist in identifying and hunting for weaknesses in the deployed configuration that could lead to a compromise.

Additional Resources