How a CPU works: Bare metal C on my RISC-V toy CPU

How a CPU works: Bare metal C on my RISC-V toy CPU


I always wanted to understand how a CPU works, how it transitions from one instruction to the next and makes a computer work. So after reading Ken Shirrif’s blog about a bug fix in the 8086 processor I thought: Well, let’s try to write one in a hardware description language. This post is a write up of my learning experiment.

I’ll walk through my steps of creating an emulator, compiling and linking C for bare metal, CPU design and finally the implementation of my toy RISC-V CPU.


  • implement a CPU in a hardware description language (HDL),
  • code must be synthesizable (except memory),
  • simulate it,
  • and run a bare metal C program on it.

While I had plenty research time, I had only about 30 hours of development time. Without prior hardware design experience the goals had to be simple enough:

  • RISC V Integer instruction set only (minus system and break calls),
  • no interrupt handling or other complexities,
  • no or only minimal optimizations.

My test program written in C is Conway’s game of life, try the interactive web version.

In a future iteration I plan to run my project on a real FPGA and implement a memory controller. Let’s say I made it to about 85% of what I wanted to achieve.

Write an emulator 

Writing an emulator, i.e. a program that can execute the instructions of a CPU, is an excellent stepping stone towards a hardware implementation. For me — without a hardware background — it is much easier to reason about and learn the instruction set.

So my first step was to understand the RISC V instruction set. The RISC V specification is quite long, but I only needed chapters 2 (integer instruction set), 19 (RV32 / 64G instruction set listings) and 20 (assembly programmer’s handbook). These give detailed definitions of how each instruction must be executed, what kind of registers must be implemented, etc.

Let’s look at an example: Arithmetic / logical instructions operating on a register and an immediate value. For 

 (add immediate) the following is done: 
rd <- rs1 + imm
: The register identified by rd is set to the sum of the value stored in register rs1 and the value given in the instruction itself.

The emulator is implemented in C++ with a C-style interface in a shared library. This makes it easy to hook it into Python via cffi. This saved me quite a bit of time for file system and user interactions, which were all done in Python.

truct RV32EmuState;

extern "C" RV32EmuState* rv32emu_init();
extern "C" void rv32emu_free(RV32EmuState *state);

extern "C" void rv32emu_set_rom(RV32EmuState *state, void *data, uint32_t size);
extern "C" void rv32emu_read(RV32EmuState *state, uint32_t addr, uint32_t size, void *p);

extern "C" int32_t rv32emu_step(RV32EmuState *state);
extern "C" int32_t rv32emu_run(RV32EmuState *state, uint32_t *breakpoints, uint32_t num_breakpoints);

extern "C" void rv32emu_print(RV32EmuState *state);

The core of the emulator is the 

 function, which executes exactly one instruction and then returns. In RISC V each instruction is exactly 32 bits. It first decodes the op code (what kind of operation) and then the specific operation (e.g. ADD immediate). It’s a large, nested switch.

int32_t rv32emu_step(RV32EmuState *state) {
    uint32_t *instr_p = (uint32_t*)(_get_pointer(state, state->pc));
    uint32_t instr = *instr_p;

    // decode
    uint32_t opcode = instr & 0x7F; // instr[6..0]

    switch(opcode) {
        // ... all opcode types ...
        case OPCODE_OP_IMM: {
            uint32_t funct3 = (instr >> 12) & 0x07; // instr[14..12]
            uint32_t rs1 = ((instr >> 15) & 0x1F);

            uint32_t imm = (instr >> 20) & 0x0FFF; // 12 bits
            uint32_t imm_sign = instr & (1ul << 31);
            if(imm_sign) {
                imm |= 0xFFFFF000;

            uint32_t data1 = state->reg[rs1];
            uint32_t data2 = imm;
            uint32_t reg_dest = (instr >> 7) & 0x1F;
            if(reg_dest != 0) { // register 0 is always zero and never written too
                switch(funct3) {
                    // ... all OP_IMM instructions ...
                    case FUNCT3_OP_ADD: {
                        state->reg[reg_dest] = data1 + data2;

    // ...

    state->pc += 4; // go to next instruction for non-jump / non-branch
    return 0;

The mathematical and logical operations are simplest to implement, so I started with them. Iteratively I’ve added branches, jumps and the remaining logic until I had covered all instructions other than 

. These two were not necessary for my bare metal experiment.

For testing I relied on simple hand-written assembly code. Of course this did not exercise my emulator thoroughly. So as a next step I wanted to finally run my Conway’s game of life simulation.

Cross compiling, linking and ELF to bin 

Going from C to a bare metal CPU takes a few steps: cross compile, ensure proper memory layout and converting the ELF file to a binary blob. Also instead of having a 

 function my code has a 
function defined as follows:

void __attribute__((section (".text.boot"))) _start() {
    run(); // call the actual "entrypoint"

I’ll explain the details later.

My CPU only supports the RISC-V 32 bit integer instruction set, but my host system is running on x86-64. So I needed a cross compiler and used the Ubuntu package 

. Then I could compile my code using the following command:

riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 \
    -nostdlib -ffreestanding -Tprograms/memory_map.ld \
    -o life.rv32.elf life.c

Let’s take this apart:

  1. execute the RISC-V cross-compiler
  2. set it’s architecture to rv32i, which is RISC-V 32bit integer instruction set
  3. define the application binary interface, i.e. conventions how to emit assembly. This makes it so that integers, longs and pointers are 32 bit
    • these three are needed to emit code compatible with my emulator and later CPU
  4. compile without a standard library
    • Standard libraries like the system libc assume operating system support, but my toy CPU will be running bare metal. So we have to switch that off. This means we won’t have access to 
       etc. Instead we’ll need to implement this ourselves, if we want to use it. This means we have no startup code either.
  5. compile freestanding, that is, do not assume presence of a operating system or library and switch of libc specific optimizations and defaults
    • for example we won’t have a main function, which is otherwise required
  6. use a memory map
    • we need to tell the compiler and linker where instructions and global variables will be placed in memory. We do not have a loader to do this at application startup

Even though we do not yet have any hardware, we must make a few decisions for item 6: how should our address space look like?

  • program execution starts at 0x1000, which below I’ll call rom for read-only-memory
  • memory for globals, stack variables and heap will be located at 0x10000000

These values are kind of arbitrary. I wanted to avoid having code at address zero to avoid issues with NULL pointers. This script also ensures that our program entry point, the function 

 is placed at 0x1000, so that the emulator will execute that code first. Here’s my linker script for my address space setup:


    rom (rx ): ORIGIN = 0x00001000, LENGTH = 16M
    ram (rw): ORIGIN = 0x10000000, LENGTH = 32M

    .text : {
            entry point is expected to be the first function here
            --> we are assuming there's only a single function in the .text.boot segment and by convention that is "_start"

            KEEP ensures that "_start" is kept here, even if there are no references to it

            all other code follows
    } > rom

    .rodata : { *(.rodata*) } > rom

    .bss : { *(.bss*) } > ram

After compilation we can check that 

 is actually at 0x1000:

riscv64-unknown-elf-readelf -s life.rv32.elf | grep '_start$$'

Now the “problem” is that gcc generates an ELF and not just a stream of instructions. The Executable and Linkable Format is simplified a container to store executable code, data and metadata in a way that makes it easy to later load into memory. As specified by the memory map like the one above. Since my program is fairly it simple does not need memory initialization. So we can simply dump the RISC-V instructions from the .text segment into a binary file.

riscv64-unknown-elf-objdump -O binary life.rv32.elf life.rv32.bin

So now the C snippet from above should make more sense:

void __attribute__((section (".text.boot"))) _start() {

We are defining a function 

 which should go into the segment 
. The linker script instructs the toolchain to make sure this code is placed at 0x1000, even when no other code references it. By having exactly one function in 
 this is guaranteed to happen.

Turns out this is still not enough to make the code work. The startup code above does not initialize the stack pointer, i.e. where local variables live in memory. I decided to simplify things and hard-code the initial stack pointer value in my emulator and CPU. This means simply setting register 

 also known as 
 to the end of the memory, here 0x12000000.

A couple other registers defined in the ABI with special purpose are not used by my program, so I did not implement support: global pointer 

 and thread pointer 

No standard library 

When the program is running on my host I rely on the standard library for memory allocation like 

 for output. But when running bare metal these functions are not available.

I’ve replaced dynamic memory allocation with static memory assignments. Since my program is the only one running on CPU, I can use all resources how I see fit. If the flag 

 is set, when the program is compiled for my RISC-V emulator / CPU. Without it, the program can run as-is on my host system like any other program.

void run() {
        map0 = (unsigned char*)0x10080000;                              // gamestate
        map1 = map0 + 0x80000;                                          // new gamestate
        leds = map1 + 0x80000;                                          // output for emulator
        map0 = (unsigned char*)malloc((WIDTH + 2) * (HEIGHT + 2));
        map1 = (unsigned char*)malloc((WIDTH + 2) * (HEIGHT + 2));

    // ...

Instead of relying on 

 for output to the console, my program assumes that the address of the variable 
 is memory-mapped to an external LED array. In case of the emulator, it will simply read this memory area and display it on console. When running in the simulator (or FPGA in the next iteration), the memory controller will set output pins accordingly.

Emulator in action 

Here’s the result of all of that work: First setting a breakpoint for each game of life cycle, and then manually stepping through the program on the emulated CPU.

Best viewed in full-screen mode due to web-unfriendly layout.

CPU overview 

With the emulator completed I now have a working reference system to debug my CPU. And so I started working implementing it.

A simple CPU consists of the following components:

  • Arithmetic Logic Unit (ALU): the compute part, for operations like “add” or “xor”
  • Register File: provides and stores register values
  • Decoder: transform instruction to a set of control signals, controlling the CPU operation
  • Program Counter: manages the address where the next instruction is found
  • Load Store Unit (LSU): connects the CPU to its memory
  • Control Unit: tieing all the parts together to form a CPU

These are the basic elements of a CPU and sufficient for my toy RISC-V implementation.

Describing hardware in code 

Hardware is designed with special programming languages, called hardware description languages (HDL). The most common ones are Verilog and VHDL. For my project I decided to use Amaranth HDL, because it’s higher level and easier to use — plus it’s written in my favorite language Python. Simplified it enables an engineer to describe a program in Python that generates a hardware description, instead of directly describing it directly in Verilog or VHDL. A nice property of Amaranth HDL is that by design the resulting programs are synthesizable, i.e. they can be “compiled” into a description executable in FPGAs or built as an ASIC.

A key difference between software and hardware is concurrency: In software code is executed line by line, in order and we need special constructs like threads to achieve parallelism. In hardware it’s different: Everything is happening at the same time. We are not describing high-level operations, but rather how logic gates are connected to each other.

Combinational logic 

There are two key concepts in hardware: combinational logic (sometimes also called combinatorial logic) and synchronous logic. Simplified combinational logic executes all the time and all at the same time. In the following example green are input signals, yellow are internal signals (output of logic and input to the next logic), blue is logic and orange is the final output signal:

Combinational logic always updates its output immediately when any input changes. There are a couple physical limitations here, but we’ll simplify this for now. This means changing any signal will immediately change the output signal 


In Amaranth we can implement this as

# to run tests: python3 -m pytest

import pytest

from amaranth import Signal, Module, Elaboratable
from import Platform
from amaranth.sim import Simulator, Settle

class Add3Comb(Elaboratable):
    def __init__(self):
        self.count_1 = Signal(32)
        self.count_2 = Signal(32)
        self.count_3 = Signal(32)
        self.result = Signal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # technically this is not needed: a second `+` below would do
        #     but let's build the circuit exactly as shown above
        temp_sum = Signal(32)

        # define how our logic works
        m.d.comb += temp_sum.eq(self.count_1 + self.count_2)
        m.d.comb += self.result.eq(self.count_3 + temp_sum)

        return m

def test_add3comb():
    # set up our device under test
    dut = Add3Comb()

    def bench():
        # set inputs to defined values
        yield dut.count_1.eq(7)
        yield dut.count_2.eq(14)
        yield dut.count_3.eq(21)

        # let the simulation settle down, i.e. arrive at a defined state
        yield Settle()

        # check that the sum is the expected value
        assert (yield dut.result) == 42

    sim = Simulator(dut)

The key takeaway here is that the lines

 m.d.comb += temp_sum.eq(self.count_1 + self.count_2)
 m.d.comb += self.result.eq(self.count_3 + temp_sum)

are executed at the same time and also whenever the inputs change.

Synchronous Logic 

There’s a second kind of commonly used logic: synchronous logic. The difference to combinational logic is that outputs only change on a clock edge. I.e. when the clock signal goes from low to high (positive edge) or vice versa (negative edge). Let’s use the adder example again. Colors as before, but we’ll use turquoise for synchronous logic.

We use positive edge triggered logic here. So unless the clock goes from low to high, both 

temp sum
 will never change. The following table shows how values change. Let’s furthermore assume the logic was just resetted, so outputs start at 0.

Changes highlighted in bold. This circuit takes on the expected value only after two full clock cycles. Even if the input signals are not defined in the time period after a positive edge and the next positive edge, this will not change the output in any way

Physically things are more complex (“delta time”) and this results in interesting tradeoffs between the length of combinational logic paths (number of gates, circuit length) and the attainable clock speed. Luckily this does not matter for my toy CPU.

In Amaranth we can implement this as

class Add3Sync(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")
        self.count_1 = Signal(32)
        self.count_2 = Signal(32)
        self.count_3 = Signal(32)
        self.result = Signal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        temp_sum = Signal(32)

        # define how our logic works
        m.d.sync += temp_sum.eq(self.count_1 + self.count_2)
        m.d.sync += self.result.eq(self.count_3 + temp_sum)

        return m

def test_add3sync():
    # set up our device under test
    dut = Add3Sync()

    def bench():
        # set inputs to defined values
        yield dut.count_1.eq(7)
        yield dut.count_2.eq(14)
        yield dut.count_3.eq(21)

        # let the simulation settle down, i.e. arrive at a defined state
        yield Settle()

        # no positive edge yet, so still at reset value
        assert (yield dut.result) == 0

        # trigger a positive edge on the clock and wait for things to settle down
        yield Tick()
        yield Settle()

        # count3 is reflect in output, since temp sum is still zero
        assert (yield dut.result) == 21

        yield Tick()
        yield Settle()

        # now both count3 and temp sum will be reflected in the output
        assert (yield dut.result) == 42

    sim = Simulator(dut)

CPU design 

Armed with this knowledge I figured out which things needed to happen in parallel and which things in sequence.

So if we have an ALU related instruction it would work like this:

  1. in parallel
    • read instruction from ROM at the instruction address,
    • decode the instruction,
    • read register values and if present immediate value,
    • compute the result in the ALU,
    • assign ALU result to destination register (not yet visible!)
    • increment instruction address by 4 bytes (not yet visible!)
  2. wait for positive clock edge, giving step 1 time to settle, and in the following instant
    • update instruction address, making the new value visible
    • update destination register value, making the new value visible
  3. repeat, starting at 1.

Iteratively creating a diagrams of how things should work was immensely helpful. Below is a simplified version of my CPU design, though it lacks many of the control signals and special cases for operations related to jumping and branching. Please view in full screen mode, where you can also toggle ALU or LSU layers to make it easier to read. Colors here are just to help with readability of the diagram.

Now let’s talk about the CPU components in detail. Designing them reminded me a lot about functional programming, where the parameters of a function and types naturally guide the implementation. All necessary details about the RISC-V instruction set are specified in detail in the spec.



data1 $OPERATION data2
. I decided to merge the branch unit into the ALU, so there’s a branch for 

class ALU(Elaboratable):
    def __init__(self):
        # if set to 0, then normal ALU operation,
        #     otherwise treat funct3 as branch condition operator
        self.i_is_branch = Signal(1)

        # operation, e.g. "add" or "xor", from decoder
        self.i_funct3 = Signal(3)

        # sub-operation, e.g. "sub" for "add", from decodert
        self.i_funct7 = Signal(7)

        # value of register 1
        self.i_data1 = SignedSignal(32)

        # value of register 2 or immediate
        self.i_data2 = SignedSignal(32)

        # computation result
        self.o_result = SignedSignal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # this ALU also implements branch logic
        with m.If(self.i_is_branch == 0):
            # normal ALU
            with m.Switch(self.i_funct3):
                with m.Case(FUNCT3_OP_XOR):
                    m.d.comb += self.o_result.eq(self.i_data1 ^ self.i_data2)
                with m.Case(FUNCT3_OP_SLL):
                    shift_amount = self.i_data2[0:5]
                    m.d.comb += self.o_result.eq(
                        self.i_data1.as_unsigned() << shift_amount)
                # ...

In this snippet you can see how Amaranth is really a code generator: Instead of using the normal 

 statements, which control Python flow, you have to use the 
 etc. methods on the module.


Provide all necessary signals to control execution.

class InstructionDecoder(Elaboratable):
    def __init__(self):
        self.i_instruction = Signal(32)
        self.i_instruction_address = Signal(32)

        # select signals for register file
        self.o_rs1 = Signal(5)
        self.o_rs2 = Signal(5)
        self.o_rd = Signal(5)
        self.o_rd_we = Signal(1)

        # ALU / LSU operations
        self.o_funct3 = Signal(3)
        self.o_funct7 = Signal(7)

        # immediate value
        self.o_imm = SignedSignal(32)

        # control signals
        self.o_invalid = Signal(1)
        self.o_has_imm = Signal(1)
        self.o_is_branch = Signal(1)
        self.o_is_memory = Signal(2)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        m.d.comb += self.o_invalid.eq(0)
        m.d.comb += self.o_is_branch.eq(0)

        opcode = self.i_instruction[0:7]

        with m.Switch(opcode):
            with m.Case(OPCODE_OP_IMM):
                # rd = rs1 $OP imm
                # use ALU with immediate
                m.d.comb += [
            # ...

Register File 

Implement 32 registers. Special case: register 

 is hard-wired to zero, by not allowing writes to it.

I’m not sure if this is the best implementation, but it works well in simulation so far.

class RegisterFile(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")

        self.i_select_rs1 = Signal(5)
        self.i_select_rs2 = Signal(5)
        self.i_select_rd = Signal(5)

        self.i_we = Signal(1)
        self.i_data = SignedSignal(32)

        self.o_rs1_value = SignedSignal(32)
        self.o_rs2_value = SignedSignal(32)

        self.registers = Signal(32 * 32)

        self.ports = [self.sync, self.i_select_rs1, self.i_select_rs2, self.i_select_rd, self.i_data, self.i_we]

    def elaborate(self, _: Platform) -> Module:
        on clock edge if i_we is set: stores i_data at reg[i_select_rd]
        combinationally returns register values

        m = Module()

        m.d.comb += [
            self.o_rs1_value.eq(self.registers.word_select(self.i_select_rs1, 32)),
            self.o_rs2_value.eq(self.registers.word_select(self.i_select_rs2, 32)),

        with m.If((self.i_we == 1) & (self.i_select_rd != 0)):
            m.d.sync += self.registers.word_select(self.i_select_rd, 32).eq(self.i_data)

        return m

Program Counter 

The simplest component: we start executing programs at 0x1000 and then go to the next instruction. The decoder computes offset based on the instruction to allow both absolute and relative jumps.

class ProgramCounter(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")
        self.i_offset = SignedSignal(32)
        self.o_instruction_address = Signal(32, reset=0x1000)

        self.ports = [self.o_instruction_address]

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        m.d.sync += self.o_instruction_address.eq(self.o_instruction_address + self.i_offset)

        return m

Load Store Unit 

I’m co-simulating this part, so there is no implementation. Also simulating even small amounts of memory turned out to be way to slow. I hope to find more time in the future to complete this part of the project. And then also run it with real memory on an FPGA instead of just in simulation.

But let’s at least discuss the most interesting aspect of memory: Memory is usually very slow compared to the CPU. So the CPU has to be stalled, i.e. wait, while we are waiting for the memory to execute the read or write. In my design I have defined an 

 signal. This signal tells the control unit to not advance the program counter until the result is available. Not sure if this is the best approach, but it works for now.

class LoadStoreUnit(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")

        # from decoder, not in spec, internal control signals
        self.i_lsu_mode = Signal(2)

        # from decoder, memory operation
        self.i_funct3 = Signal(3)

        # address
        self.i_address_base = Signal(32)
        self.i_address_offset = SignedSignal(12)

        # reading / writing
        self.i_data = Signal(32)
        self.o_data = Signal(32)

        self.o_done = Signal(1)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # empty by design: this is co-simulated

        return m

Tieing it all together and testing it 

The control unit connects all modules, as described in the simplified diagram above. And uses the control logic from the decoder to correctly advance the program counter.

Instead of showing the boring glue code, here’s how I’m testing the CPU via simulation. The assembly program is designed to set registers to certain values, that can be checked afterwards. It does not follow any ABI constraints.

def test_cpu():
    dut = CPU()
    sim = Simulator(dut)

    rom = [
        0x02A00093,  # addi x1 x0 42   --> x1 = 42
        0x00100133,  # add x2 x0 x1    --> x2 = 42
        0x123451B7,  # lui x3 0x12345  --> x3 = 0x12345
        0x00208463,  # beq x1 x2 8     --> skip the next instruction
        0x00700193,  # addi x3 x0 7    [skipped]
        0x00424233,  # xor x4 x4 x4    --> x4 = 0
        0x00A00293,  # addi x5 x0 10   --> x5 = 10
        0x00120213,  # addi x4 x4 1    --> x4 = x4 + 1
        0x00520463,  # beq x4 x5 8     --> skip the next instruction
        0xFF9FF36F,  # jal x6 -8       --> jump up; effectively setting x4 = 10
                     #                     also setting x6 = pc + 4
        0x000013B7,  # lui x7 0x1      --> x7 = 0x1000
        0x03438467,  # jalr x8 x7 52   --> skip the next instruction
        0x00634333,  # xor x6 x6 x6    [skipped]
        0x100004B7,  # lui x9 0x10000  --> x9 = 0x1000_0000
        0x0324A503,  # lw x10 50 x9    --> x10 = *((int32*)(mem_u8_ptr[x9 + 0x32]))
        0x00000013,  # nop

    ram = [0 for x in range(128)]
    ram[0x32 + 3], ram[0x32 + 2], ram[0x32 + 1], ram[0x32] = 0xC0, 0xFF, 0xEE, 0x42

    done = [0]

    def bench():
        assert (yield dut.o_tmp_pc) == 0x1000

        while True:
            instr_addr = yield dut.o_tmp_pc
            print("instr addr: ", hex(instr_addr))
            rom_addr = (instr_addr - 0x1000) // 4

            if rom[rom_addr] == 0:
                done[0] = 1
                print("bench: done.")

            print("instr: ", hex(rom[rom_addr]))

            yield dut.i_tmp_instruction.eq(rom[rom_addr])
            yield Settle()

            assert (yield dut.decoder.o_invalid) == False

            yield Tick()
            yield Settle()

        read_reg = lambda x: dut.registers.registers.word_select(x, 32)

        assert (yield read_reg(1)) == 42
        assert (yield read_reg(2)) == 42
        assert (yield read_reg(3)) == 0x12345000
        assert (yield read_reg(5)) == 10
        assert (yield read_reg(4)) == 10
        assert (yield read_reg(6)) == 0x1000 + 4 * rom.index(0xFF9FF36F) + 4
        assert (yield read_reg(7)) == 0x1000
        assert (yield read_reg(8)) == 0x1000 + 4 * rom.index(0x03438467) + 4
        assert (yield read_reg(9)) == 0x1000_0000
        assert (yield read_reg(10)) == 0xC0FFEE42

        yield Passive()

    def memory_cosim():
        lsu = dut.lsu

        was_busy = False

        while not done[0]:
            lsu_mode = yield lsu.i_lsu_mode
            if lsu_mode == INTERNAL_LSU_MODE_DISABLED:
                was_busy = False
                yield lsu.o_data.eq(0)
                yield lsu.o_done.eq(0)
            elif lsu_mode == INTERNAL_LSU_MODE_LOAD and was_busy is False:
                was_busy = True
                base = yield lsu.i_address_base
                offset = yield lsu.i_address_offset
                addr = base + offset
                funct3 = yield lsu.i_funct3
                print(f"memory read request: addr={hex(addr)}")

                yield Tick()  # a read takes a while
                yield Tick()
                yield Tick()

                ram_offset = addr - 0x10000000
                if funct3 == FUNCT3_LOAD_W:
                    value = (ram[ram_offset + 3] << 24) | (ram[ram_offset + 2] << 16) | (ram[ram_offset + 1] << 8) | ram[ram_offset]
                # ...

                yield lsu.o_data.eq(value)
                yield lsu.o_done.eq(1)
            # ...

            yield Tick()
        print("memory_cosim: done.")

        yield Passive()



My development time ran out before I completed the project, so no game of life on my toy CPU for now. So what’s missing?

  • memory mapped I/O, so that instead of keeping the LEDs in memory, signals / pins of the CPU are used,
  • adding support for a few missing read / write operations to the memory controller (read byte, write byte),
  • integrating the emulator and simulator, re-using the existing debugger user interface,
  • and then likely spending some time on debugging,
  • maybe porting the simulator to Verilator or another framework to make it fast enough.

But I thought having a blog post is much better than completing this experiment now. I hope to find time in the future to again work on this, finally run game of life on my CPU and actually run it in an FPGA. That would be fun.

But the best part is really: I’ve learned so much as you’ve read. Try it yourself. Thank you for reading 🙂

Full source code 

You can find my source code at . It’s not as clean as the snippets above, but I hope it provides additional context if you’d like to dive deeper.

Additional Material 

If you want to learn more I’ve collected some links that helped me below:

FHRP Nightmare. Pentesting redundancy systems like a devil.

FHRP Nightmare. Pentesting redundancy systems like a devil

original text by Magama Bazarov

In order to improve the resiliency of their network at the routing level, network administrators use the FHRP family of protocols in most cases. However, in most cases, the configuration of FHRP protocols is left by default, which opens the way for exploitation.

My name is Magama Bazarov and I am an expert in network security. And in my research you will find out what kind of nightmare is going on in the network at the routing level.


This article is intended for security professionals who conduct pentests under an agreed-upon, legitimate contract. Destroying and hacking into other people’s computer networks can be prosecuted. Be careful and try not to test your fate. The skills you learn from my article are only your area of responsibility.

1. Why we need FHRP protocols?

FHRP (First Hop Redundancy Protocol) — is a family of network protocols that allows multiple physical routers to share/maintain a single virtual IP address, in order to increase the fault tolerance of the local network. This virtual address will be assigned as the default gateway address for the end hosts. The most common FHRP class protocols are HSRP, VRRP and GLBP, the security of which I will discuss in this article.

2. HSRP (Hot Standby Redundancy Protocol)

HSRP (Hot Standby Router/Redundancy Protocol) — is a Cisco proprietary protocol that allows for network gateway redundancy. The general idea is to combine several physical routers into one logical router with a common IP address. This address of the virtual router will be assigned to the interface of the router with the master role, and the latter, in its turn, will take care of traffic forwarding.

2.1 HSRP Roles & Terminology

HSRP Active Router — a device that acts as a virtual router and provides forwarding of traffic from source networks to destination networks.
HSRP Standby Router — a device that acts as a standby router, waiting for the active router to fail. When the primary Active router fails, the Standby router will take over the primary role and take over the duties of the Active router.
HSRP Group — a group of devices that ensures the operation and fault tolerance of a logical router.
HSRP MAC Address — the virtual MAC address of the logical router in the HSRP domain.
HSRP Virtual IP Address — This is a special virtual IP address in the HSRP group. This IP address will be the default gateway for the end hosts, used on the logical router itself.

2.2 HSRP Mechanics

HSRP is implemented on top of TCP/IP protocol stack, so UDP transport layer protocol under port number 1985 is used for service information transmission. HSRP-routers within the same logical group exchange special “hello” packets every 3 seconds, but if within 10 seconds an HSRP-router within the same group has not received a hello packet from its HSRP neighbor, it recognizes it as “dead”

This protocol has two versions (HSRPv1 x HSRPv2) and they differ in the following characteristics:

  • Number of groups. (HSRPv1 offers up to 255 groups, when HSRPv2 can up to 4096)
  • Virtual MAC addresses. (HSRPv1 — 
     / HSRPv2 — 
    ) (XX is HSRP group number)
  • MCAST IP addresses. (HSRPv1 —
     / HSRPv2 —

3. VRRP (Virtual Router Redundancy Protocol)

This protocol was developed based on the phenomena of the HSRP protocol, so it has a lot of trouble with patents. Therefore, it cannot be called “free” and “open”. But at least it is supported by all network equipment vendors, i.e. using VRRP in your network allows you to be independent of any vendor ecosystem.

3.1 VRRP Mechanics

De facto, if you know how HSRP works, you know how VRRP works. HSRP and VRRP have a lot in common, but I’ll tell you the distinguishing characteristics:

  • VRRP is not implemented on top of the TCP/IP protocol stack. This protocol works exclusively on the network layer
  • Its MCAST IP address is
  • Its identifier is 
  • The second version of VRRPv2 features IPv4 support and support for authentication. The type of authentication depends on the vendor. For example, Cisco offers VRRP protection using MD5 authentication, while MikroTik (RouterOS) offers AH-authentication (AH is a protocol from the IPSec opera)
  • The third version, VRRPv3, has support for IPv6 but lacks authentication.

VRRP neighbors within one redundancy domain exchange special hello packets every second (a kind of hello time). But there is also a kind of “dead timer” — if there is no hello packet within 10 seconds, the router from which this “hello” was expected — will drop out of the failover domain.

3.2 VRRP Roles & Terminology

  • VRRP Master Router is an active router that is responsible for transferring legitimate traffic on the network.
  • VRRP Backup Router is a standby router. As soon as the current Master Router goes down, it will take over its role and perform its functions
  • VRRP MAC Address — The virtual MAC address in the VRRP group (
    ). Instead of XX, it is the number of the VRRP group.
  • VRRP VRID — The identifier of the VRRP group within which the physical routers are located.
  • VRRP Virtual IP Address — A special virtual IP address in the VRRP domain. This IP address is used as the default gateway for the end hosts.

4. Emergence of pseudo-balancing

The problem is that the HSRP and VRRP protocols have no load balancing mechanism. When they are used, there is a pseudo-balancing, where by default only one device is actually working, while the others are resting and working in standby mode. However, you can simply spread your VLANs over logical HSRP/VRRP processes at the distribution switch level (L3 switches) or at the router level when logical VLANs are created (802.1Q Encapsulation Moment)

Below will be examples of settings for HSRP and VRRP with respect to VLAN 10 and VLAN 30 networks. Dist-SW1 will plow on VLAN 10 and sleep on VLAN 30. Dist-SW2 will plow for VLAN 30 and sleep for VLAN 10.

4.1 HSRP for VLANs (Cisco IOS)

Dist-SW1(config)# interface vlan 10
Dist-SW1(config-if)# standby 10 ip
Dist-SW1(config-if)# standby 10 priority 150
Dist-SW1(config-if)# standby 10 preempt
Dist-SW1(config-if)# standby 10 authentication md5 key-string my_heartbeats
Dist-SW1(config-if)# end

Dist-SW1(config)# interface vlan 30
Dist-SW1(config-if)# standby 30 ip
Dist-SW1(config-if)# standby 30 priority 90
Dist-SW1(config-if)# standby 30 authentication md5 key-string my_heartbeats
Dist-SW1(config-if)# end

Dist-SW2(config)# interface vlan 10
Dist-SW2(config-if)# standby 10 ip
Dist-SW2(config-if)# standby 10 priority 90
Dist-SW2(config-if)# standby 10 authentication md5 key-string my_heartbeats
Dist-SW2(config-if)# end

Dist-SW2(config)# interface vlan 30
Dist-SW2(config-if)# standby 30 ip
Dist-SW2(config-if)# standby 30 priority 150
Dist-SW2(config-if)# standby 30 preempt
Dist-SW2(config-if)# standby 30 authentication md5 key-string my_heartbeats
Dist-SW2(config-if)# end

4.2 VRRP for VLANs (Cisco IOS)

Dist-SW1(config)# interface vlan 10
Dist-SW1(config-if)# vrrp 10 ip
Dist-SW1(config-if)# vrrp 10 priority 150
Dist-SW1(config-if)# vrrp 10 preempt
Dist-SW1(config-if)# vrrp 10 authentication md5 key-string my_heartbeats
Dist-SW1(config-if)# end

Dist-SW1(config)# interface vlan 30
Dist-SW1(config-if)# vrrp 30 ip
Dist-SW1(config-if)# vrrp 30 priority 90
Dist-SW1(config-if)# vrrp 30 authentication md5 key-string my_heartbeats
Dist-SW1(config-if)# end

Dist-SW2(config)# interface vlan 10
Dist-SW2(config-if)# vrrp 10 ip
Dist-SW2(config-if)# vrrp 10 priority 90
Dist-SW2(config-if)# vrrp 10 authentication md5 key-string my_heartbeats
Dist-SW2(config-if)# end

Dist-SW2(config)# interface vlan 30
Dist-SW2(config-if)# vrrp 30 ip
Dist-SW2(config-if)# vrrp 30 priority 150
Dist-SW2(config-if)# vrrp 30 preempt
Dist-SW2(config-if)# vrrp 30 authentication md5 key-string my_heartbeats
Dist-SW2(config-if)# end

5. GLBP (Gateway Load Balancing Protocol)

Developed by Cisco Systems engineers. Like HSRP, this protocol is implemented on top of TCP/IP protocol stack, that’s why UDP transport layer protocol under port number 3222 is used for translation of service information. GLBP routers within the same logical group exchange special “hello” packets every 3 seconds, but if within 10 seconds a GLBP router within the same group has not received a hello packet from its GLBP neighbor, it recognizes it as “dead”. However, the timer values can be configured depending on the administrator’s needs.

5.1 GLBP Mechanics

GLBP provides load sharing to multiple routers (gateways) using one virtual IP address and multiple virtual MAC addresses. Each host is configured with the same virtual IP address and all routers in the virtual group participate in packet transmission. Works much differently with respect to HSRP and VRRP protocols, as it uses the mechanisms of real load balancing, I will denote below:

  • The Host-Dependent. Type of load balancing used on a network where there is NAT. Host-Dependent ensures that the host will return the same MAC address of the AVG device that was used at an earlier point in time, thus the NAT configured to the host will not be broken.
  • Round-Robin. In this mode, the AVG device distributes MAC addresses to AVF members alternately. This is the mechanism used by default.
  • Weight-based round-robin. Load balancing based on special “Weight” metric

5.2 GLBP Roles & Terminology

  • AVG (Active Virtual Gateway) — a device that is essentially the father of the entire GLBP logical domain. “Father” tells the other routers how to handle legitimate traffic. Gives out MAC addresses and is responsible for answering ARP requests. By the way, within a single GLBP group, AVG members can be only one router.
  • AVF (Active Virtual Forwarder) — the device in the GLBP domain that handles traffic. There can be several of them.
  • GLBP Group — A logical GLBP group that includes physical routers. Each GLBP logical group has its own unique numeric identifier
  • GLBP MAC — The virtual MAC address of the AVF members distributed by the existing AVG router.
  • GLBP Virtual IP Address — The IP address the AVG router is responsible for
  • GLBP Preempt Mode — an option that allows the resurrected AVG device to regain its role after being replaced by AVF based on its priority values. By default, preempt mode is disabled for AVG members when preempt mode is enabled for AVF members (with a delay of up to 30 seconds, but this value can be configured manually)
  • GLBP Weight — metric indicating the degree of load on the device interface. The greater this metric is, the higher the load on the router interface.

6. FHRP Selector

For FHRP protocols, the default priority value of routers is 100. If administrator did not set priorities manually, in case of HSRP (ACTIVE), in case of VRRP (MASTER), in case of GLBP (AVG) will be the router with the highest address. Of course, these priorities are configured manually, depending on the needs of the network administrator and what fault tolerance infrastructure he needs.

7. FHRP Timings

  • HSRP (Hello time: 3 sec / Hold time: 10 sec)
  • VRRP (Hello time: 1 sec / Hold time: 3 sec)
  • GLBP (Hello time: 3 sec / Hold time: 10 sec)

8. FHRP Hijacking

FHRP domains are vulnerable to a Hijack attack if the ACTIVE/MASTER/AVGleader does not have a maximum priority value with respect to its configuration. If an attacker injects an HSRP/VRRP/GLBP packet with maximum priority values, he will be able to intercept traffic within the network.

8.1 Stages

  • Information Gathering. find out priority values, search for authentication, virtual IP address used, MAC addresses
  • Authentication Bypassing.The authentication bypassing stage. If there is, of course. Save traffic dump, exfiltrate hashes and brute-force key from the domain.
  • Injection.Preparing network interface, writing MAC address, routing permission, generating and sending malicious FHRP injection.
  • Routing Management.Creating a secondary IP address, configuring a new default route, (NAT) MASQUERADE

8.2 Weaponize

  • Wireshark. With this network traffic analyzer we will perform Information Gathering process, enumerating packet headers
  • John & *2john-exfiltrators. John is a hash bruteforcer, *2john scripts will help reproduce hash exfiltration from the traffic dump
  • Loki. It is a batch injector, a framework for performing security analysis of various L2/L3 protocols, even DRP, FHRP, etc.

8.3 Vectors

  • MITM. A man-in-the-middle attack to intercept legitimate traffic. Executed by malicious FHRP injection with maximum priority value.
  • Blackhole. An attack designed to wrap traffic from a legitimate subnet into ANYTHING. Blackhole means “black hole”. Traffic goes into black hole and that’s it
  • Kicking router via UDP Flood. DoS-attack, the mechanism of which consists of mass mailing of UDP datagrams with the aim to put the destination router out of action. Works regarding HSRP & GLBP, because when processes of these protocols are launched on routers, they automatically start listening to UDP-ports 1985 and 3222 (HSRP and GLBP respectively), no transport layer for VRRP.

8.4 Limitations

  • CPU power dependency. After MITM attack, traffic of the whole network or VLAN segment (depending on existing infrastructure) will run through your device and it must be ready to process, routing traffic of the whole legitimate segment. Otherwise, a DoS will occur, and the customer will not appreciate such a scenario and hot tears will flow.
  • Network interface performance. The network card used at the moment of attack should be powerful enough to handle a large amount of traffic.
  • Dependence on possible network VLAN segmentation. If you’re on VLAN 10, for example, and you can MITM through an inject, you’ll be intercepting traffic from the VLAN you’re on. You won’t get traffic from other VLANs.

9. Nightmare Realm

As part of a practical attack, I built a three-layer network which I provided Internet access, OSPF, and HSRP/VRRP/GLBP fault tolerance domains. As part of my attack, I will be targeting the Distribution Layer and intercepting traffic on the VLAN 10 network.

Papercut — FTP server under IP address

Dustup — Windows 10 machine with IP address

Attacker (Mercy) — attacker system with Kali Linux at IP address

Boundless — Linux Mint client machine with IP address

The left side of the Distribution level switches are Cisco vIOS switches: Dist-SW1 and Dist-SW2 at
, respectively

Edge Router — the edge router of this network, provides the entire network with the Internet by the mechanism NAT (PAT Mechanism Moment)

As a dynamic routing acts as a BACKBONE zone OSPF and identifier 


And the HSRP/VRRP/GLBP processes are implemented on virtual SVI interfaces.

10. HSRP Hijacking

10.1 Stage 1: Information Gathering

From the Wireshark traffic dump, we see HSRP announcements from two switches at and

HSRP Ad from first router

HSRP Ad from second router

Based on the analysis of HSRP packet headers we have the following picture (within my network I have studied 2 HSRP announcements):

  • ACTIVE device is a switch with address
    , its priority is 
  • STANDBY device is a switch at
    , its priority is 
  • Cryptographic authentication (MD5) is used
  • Virtual IP address
  • Virtual MAC: 
  • HSRP group number 

Since the ACTIVE switch has a priority of 

 out of 
— the Hijack attack vector looms. But also, there is the use of authentication. Authentication with respect to FHRP prevents unauthorized devices from entering the fault tolerance process. But since we are the attacker, we need to find out what kind of key is used as authentication to the HSRP domain.

10.2 Stage 2: Authentication Bypassing

You need to save the traffic dump in 

format, then use 
(in my case the script is called
 to perform hash exfiltration. This script will spit out MD5 hashes and convert them to John format at the same time, which is convenient. Then these hashes can be dumped into a separate file and fed to John, i.e. this file will act as input for the brute force. And with key 
— wordlist
 we will set path to dictionary

in9uz@Mercy:~$ python2 hsrpwithauth.pcap
in9uz@Mercy:~$ cat >> hsrpv1md5hashes
in9uz@Mercy:~$ john hsrpv1md5hashes --wordlist=/usr/share/wordlists/rockyou.txt

In the end we managed to reset the authentication key to the HSRP domain — endgame

10.3 Stage 3: Injection

First of all, we need to change the MAC address to the virtual MAC address of the ACTIVE switch. This is done in order not to unnecessarily trigger the system DAI (Dynamic ARP Inspection), because 

 may well be surprised that with respect to
 now has a different MAC-address.

Next, switch our network interface to promiscious mode and allow traffic routing.

in9uz@Mercy:~$ sudo macchanger --mac=00:00:0C:07:AC:01 eth0
in9uz@Mercy:~$ sudo ifconfig eth0
in9uz@Mercy:~$ sudo sysctl -w net.ipv4.ip_forward=1

Now you have to start Loki, select the network interface

in9uz@Mercy:~$ sudo
Select interface

After launching the tool, Loki itself will detect HSRP announcements sent out by routers. Enter the MD5 authentication key (endgame), select the 

Do gratuitous ARP option.
This is a special modification of ARP-frame, which is in fact a stub, which notifies all broadcast domain about new binding of IP-address with MAC-address. Next, click on the two devices (in Seen state) and select the Get IP option.

After that, Loki will generate a custom injection with a maximum priority value of 

, which, in turn, will allow the attacker to intercept the ACTIVE role and thus stand in the Man-in-the-Middle moment

10.4 Stage 4: Routing Management

After the injections are done, we need to do a little routing management.

Firstly we have to remove the old default route, going through,
since we became the new ACTIVE router, we own this virtual address 
, but with the old route all traffic is looped back to our OS, which without any extra work causes DoS against legitimate hosts :))

We also need to create a secondary address on our interface with HSRP Virtual IP Address value
, again, after the attack we are the owner of this address.

in9uz@Mercy:~$ sudo ifconfig eth0:1 netmask

Let’s write a new default route through
 (this is the former ACTIVE router), but even though we took away the ACTIVE role from it, it is still able to route traffic where we need it. You can also try to write the second static route through
 with the special metric, i.e. here with the second route we already realize the floating static routing.

in9uz@Mercy:~$ sudo route del default
in9uz@Mercy:~$ sudo route add -net netmask gw


in9uz@Mercy:~$ sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

In the end we succeeded in conducting a MITM attack. To clearly show the impact of the attack, I will connect to the FTP server Papercut

Intercepted FTP creds

Intercepted FTP creds — 


!!! However, ICMP REDIRECT messages are generated during the attack itself. The IPS/IDS security system may trigger. Be careful and disable ICMP redirect if possible !!!

11. VRRP Hijacking

11.1 Stage 1: Information Gathering

Within VRRP, these hello packets are only visible from the MASTER device.

Wireshark VRRP dump

Based on the analysis of the HSRP packet headers we denote the following information:

  • MASTER device is a router at
    , its priority is 
  • No authentication
  • Virtual IP-address
  • Virtual MAC: 
  • VRRP group number 

11.2 Stage 2: Injection

Change the MAC-address to the one belonging to MASTER, promisc mode and allow routing:

in9uz@Mercy:~$ sudo macchanger --mac 00:00:5e:00:01:01 eth0
in9uz@Mercy:~$ sudo ifconfig eth0 promisc
in9uz@Mercy:~$ sudo sysctl -w net.ipv4.ip_forward=1

Run Loki:

in9uz@Mercy:~$ sudo

Loki detected the VRRP announcement. All the same, generate a (Gratuitous ARP) and inject

Loki performed an inject, you can examine his body 

(Inject has a priority of 255)

11.3 Stage 3: Routing Management

All the same routing management. Writing routes, creating a secondary address, and setting up NAT (MASQUERADE)

in9uz@Mercy:~$ sudo ifconfig eth0:1 netmask
in9uz@Mercy:~$ sudo route del default
in9uz@Mercy:~$ sudo route add -net netmask gw
in9uz@Mercy:~$ sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

To illustrate, I will connect to that FTP-server again (Papercut), but with different codes 


Intercepted FTP creds


12. GLBP Hijacking

12.1 Stage 1: Information Gathering

We see GLBP announcements from two devices. Based on the analysis of the GLBP packets we have the following picture:

  • AVG is
  • AVF is
  • No authentication
  • GLBP Group Number 
  • Virtual IP Address
  • Virtual MAC: 


12.2 Stage 2: Injection

Change the MAC-address to the one of the AVG, promiscious mode and routing permission

in9uz@Mercy:~$ sudo macchanger --mac=00:07.b4:00.01:01 eth0
in9uz@Mercy:~$ sudo ifconfig eth0 promisc
in9uz@Mercy:~$ sudo sysctl -w net.ipv4.ip_forward=1

Run Loki:

in9uz@Mercy:~$ sudo

Loki found the ads. Performing Inject x Gratuitous ARP

Loki generated a GLBP injection with priority and weight values (Priority 255 x Weight 255)

12.3 Stage 3: Routing Management

Again: Secondary address, routing, NAT

in9uz@Mercy:~$ sudo ifconfig eth0:1 netmask
in9uz@Mercy:~$ sudo route del default
in9uz@Mercy:~$ sudo route add -net netmask gw
in9uz@Mercy:~$ sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

Now, we are the MITM. Run 

<a rel="noreferrer noopener" href="" target="_blank"></a> 
This tool will pull sensitive data during traffic analysis (i.e. unencrypted HTTP/FTP traffic, NetNTLM hashes)

in9uz@Mercy:~/FHRPNightmare/net-creds$ sudo python2 -i eth0

To proof, reading the SMB share on behalf of the 


Intercepted NetNTLMv2-SSP hash of user distance
Why not? Bruteforce the hash

13. Prevention

13.1 Using ACL

An ACL allows you to filter traffic by various parameters, ranging from source IP address to TCP/UDP ports. (depending on which ACL you use — standard or extended?)

  • HSRP: ACL against
  • VRRP: ACL against
  • GLBP: ACL against

13.2 Authentication

Authentication prevents unauthorized routers from entering failover domains. If you’re going to protect a domain with authentication, use persistent keys so they’re not so easy to break.

Here Cisco IOS boasts Key-Chain authentication, where multiple keys can be used and you can configure the time intervals within which the keys will be accepted and sent. RouterOS even has a wild AH-authentication for VRRP, such a salt is used there — you can’t brute force it even with a dictionary password. By the way, Cisco IOS uses MD5 authentication for FHRP, and RouterOS uses AH (the IPSec opera protocol)

14. Outro

FHRP protocols help to organize a system of hot redundant routing gateways. Such systems are widespread in the case I reviewed. But now you know what can happen to a network if an engineer has not taken care of the design and security configuration of the FHRP protocols themselves.

Speaking of which, this FHRP Hijacking can serve as an alternative to ARP spoofing. AD networks open up all possibilities for Relay attacks and information gathering, you can also implement phishing attacks and much more. I very much hope that this research of mine will give pentesters new attack vectors for pentesters, and network administrators will get new cases to improve their network security.

My Telegram:

Telegram Channel:



You can see, what you wanna see. You can feel, what you wanna feel.