Remapping Python Opcodes

Original text by Chris Lyne

In my previous blog post, I talked about compiled Python (.pyc) files that I couldn’t manage to decompile. Because of this, my ability to audit the source code of Druva inSync was limited, and I felt compelled to dig deeper. What I found was that all of the operation codes (opcodes) had been shuffled around. This is a known technique for obfuscating Python bytecode. I’ll take you step by step through how I fixed these opcodes to successfully recover the source code. This technique is not limited to Druva and can generally be used to remap any Python opcodes.

Let’s first look at the problem.

It won’t decompile

In the installation directory of Druva inSync, you’ll notice that there is a Python27.dll alongside a zip archive of Python modules. When inSync.exe boots up, these files are immediately read. I’ve filtered the procmon output for brevity. This behavior is indicative of an application built with py2exe.

When you unzip the archive, you’ll find that it contains a bunch of compiled Python modules, both custom and standard. Shown below is a subset of the whole.

.pyc files can often be decompiled pretty quickly by tools like uncompyle6. For example, struct.pyc is part of the Python standard library, and it decompiles just fine. The result is only a few imports.

Decompiling normal struct.pyc

Now, doing the same against the struct.pyc packaged with Druva inSync, here’s the output:

Decompiling Druva struct.pyc

Unknown magic number 62216.

But why?

If you were to look at a .pyc file in a hex editor, the magic number is in the first 2 bytes of the file, and it will be different depending on the Python version that compiled it. For example, here is 62216:

Druva magic number

62216 is not a documented magic number. But 62211 is close to it, and that corresponds to version 2.7.

What’s strange is that the python27.dll distributed in the Druva inSync installation is version 2.7, hence the DLL name. Yet, its magic number is different.

Druva python27.dll

This looks nearly identical to the normal 2.7.15150.1013 DLL.

Normal python27.dll

The size and modified date are a little different, but the version is the same. If the version is the same, why is the magic number not 62211?

Digging into the OpCodes

My next idea was to load up a Python interpreter using the Druva inSync libraries. I did this by first dropping the Druva python27.dll into c:\Python27. I also had to ensure that the search path pointed to the Python modules distributed with Druva inSync.

Python interpreter with Druva libraries loaded

At this point, I could load the ‘opcode’ module to view the map of opcodes.

Druva opcode map

Below is the normal Python 2.7 opcode map:

Normal opcode map

Notice that the opcodes are completely different. For example, ‘CALL_FUNCTION’ normally maps to 131, but its opcode is 111 in the Druva distribution. As far as I can tell, this is true for every operation.
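Once both maps are available, the two can be joined on the operation name to get a number-to-number translation table. Here is a minimal sketch of that idea. The standard numbers are the usual Python 2.7 values, but in the shuffled map only CALL_FUNCTION = 111 comes from the Druva output above; the other shuffled numbers and the helper name build_translation are made up for illustration.

```python
# Standard CPython 2.7 opcode numbers for a handful of operations.
NORMAL_OPMAP = {
    "PRINT_ITEM": 71,
    "PRINT_NEWLINE": 72,
    "RETURN_VALUE": 83,
    "LOAD_CONST": 100,
    "CALL_FUNCTION": 131,
}

# Shuffled map. Only CALL_FUNCTION -> 111 is taken from the article;
# every other number here is a hypothetical placeholder.
SHUFFLED_OPMAP = {
    "PRINT_ITEM": 20,
    "PRINT_NEWLINE": 21,
    "RETURN_VALUE": 30,
    "LOAD_CONST": 95,
    "CALL_FUNCTION": 111,
}


def build_translation(shuffled, normal):
    """Map each shuffled opcode number to the standard one, joined by name."""
    missing = set(shuffled) - set(normal)
    if missing:
        raise KeyError("no standard opcode for: %s" % sorted(missing))
    return {shuffled[name]: normal[name] for name in shuffled}


TRANSLATION = build_translation(SHUFFLED_OPMAP, NORMAL_OPMAP)
print(TRANSLATION[111])  # 131
```

In practice the two dictionaries would come from the opmap attribute of each interpreter’s opcode module rather than being typed in by hand.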

Remapping the OpCodes

In order to decompile these obfuscated .pyc files, the opcodes need to be remapped back to the original Python 2.7 mapping. Easy enough, right? It’s slightly more complicated than it appears on the surface. In order to accomplish this, one needs to understand the .pyc file format. Let’s take a look at that.

Structure of a .pyc

Let’s turn to the code to make sense of the .pyc file structure. We are looking at the py_compile module’s compile function. This function will convert a .py file into a .pyc.

Starting at line 106, timestamp is first populated with the .py file’s last modification time (st_mtime). And on line 111, the source code is read into codestring.

Next, the source code is compiled into a code object using the builtin compile function. Since the .py likely contains a sequence of statements, the ‘exec’ mode is specified.

Assuming no errors occur, the new filename is created (cfile). If basic optimizations were turned on via the -O flag (__debug__ tells us this), the extension will be ‘.pyo’; otherwise it will be ‘.pyc’.

Finally, the file is written to. The first 4 bytes will be the magic string value, returned by imp.get_magic(). Next, the timestamp is written. And finally, the code object is serialized using the marshal module.
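To make the layout concrete, here is a rough sketch that assembles the same three pieces in memory: a 4-byte magic string, a 4-byte timestamp, and the marshalled code object. The function name build_pyc_bytes is hypothetical, and this is not the py_compile source. The marshal output depends on the interpreter that runs the snippet, and newer Python versions use a longer header, so treat the resulting bytes as an illustration of the 2.7 layout only.

```python
import marshal
import struct
import time

# Standard Python 2.7 magic string: the number 62211 as a little-endian
# unsigned short, followed by "\r\n".
MAGIC_27 = struct.pack("<H", 62211) + b"\r\n"


def build_pyc_bytes(code_obj, mtime, magic=MAGIC_27):
    """Return magic (4 bytes) + timestamp (4 bytes) + marshalled code."""
    header = magic + struct.pack("<I", int(mtime) & 0xFFFFFFFF)
    return header + marshal.dumps(code_obj)


# 'exec' mode, because a module is a sequence of statements.
code_obj = compile("x = 1 + 1", "<example>", "exec")
blob = build_pyc_bytes(code_obj, time.time())
print(len(MAGIC_27), repr(blob[:4]))
```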

Let’s look at an example by compiling our own module.

Example: Hello world

Here’s our friend, hello.py. It’s just a print statement.

If we compile it, it spits out a hello.pyc file.

Here is a hexdump of hello.pyc

If we were to load this up, we can actually parse out the individual components. First we read the file and store the contents in bytes:

The magic string is 03f30d0a; however, the magic number is 03f3. It’s always followed by 0d0a.

If we unpack this unsigned short, the magic number is 62211. We now know the .pyc was compiled by version 2.7. Let’s look at the timestamp now. It is 4 bytes long, starting at offset 4.
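The unpacking step can be reproduced with the struct module. This small sketch hard-codes the four magic-string bytes quoted above instead of reading them from hello.pyc.

```python
import struct

magic_string = b"\x03\xf3\x0d\x0a"  # the magic string quoted above

# "<H" means little-endian unsigned short, so bytes 03 f3 are read as 0xf303.
(magic_number,) = struct.unpack("<H", magic_string[:2])

print(magic_number)                 # 62211
print(magic_string[2:] == b"\r\n")  # True
```

Running the same two lines against the first bytes of a Druva .pyc should yield 62216, the value uncompyle6 complained about.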


This makes sense because I created the .py file at 2:26 PM on April 30th.

And finally, the serialized code object remains to be read. It can be deserialized with the marshal module, and the object is executable. Hello world!
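Putting the three reads together, a parser for this layout might look like the sketch below. The helper name split_pyc and the sample timestamp are made up for illustration. The sample data is built in memory by whatever interpreter runs the snippet, because marshal.loads generally needs data produced by a matching Python version.

```python
import marshal
import struct
from datetime import datetime


def split_pyc(data):
    """Split 2.7-style .pyc bytes into (magic number, mtime, code object)."""
    if len(data) < 8:
        raise ValueError("too short to hold a magic string and a timestamp")
    (magic_number,) = struct.unpack("<H", data[:2])  # bytes 0-1
    (mtime,) = struct.unpack("<I", data[4:8])        # bytes 4-7
    code_obj = marshal.loads(data[8:])               # everything after the header
    return magic_number, datetime.fromtimestamp(mtime), code_obj


# Stand-in for reading hello.pyc from disk; 1500000000 is an arbitrary timestamp.
sample = (
    b"\x03\xf3\r\n"
    + struct.pack("<I", 1500000000)
    + marshal.dumps(compile("result = 6 * 7", "<sample>", "exec"))
)

magic_number, modified, code_obj = split_pyc(sample)
namespace = {}
exec(code_obj, namespace)  # the deserialized code object is executable
print(magic_number, modified, namespace["result"])
```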

Let’s frame up the problem to be solved. The main goal is to fix a .pyc file’s opcodes so that it can be decompiled. During decompilation, an intermediate step is to disassemble the .pyc code objects into opcodes.

Disassembling Code Objects

Let’s use the dis module to disassemble the code object in hello.pyc.

All of these instructions are required to print ‘Hello world!’. In the first instruction, we can see “0 LOAD_CONST 0 (‘Hello world!’)”. “0 LOAD_CONST” means a LOAD_CONST operation starts at offset 0 in the bytecode. And “0 (‘Hello world!’)” means that the constant at index 0 is loaded (the string is just shown in the disassembly output for clarity). Technically speaking, LOAD_CONST pushes a constant onto the stack.

Looking at the code object, the bytecode (co_code) and constants (co_consts) are accessible (and variables, etc).

Here is the raw bytecode:

Here the opcode at offset 0 is ‘d’, which is the ASCII character for decimal 100. This can be looked up in the opname sequence.

The next two bytes, “\x00\x00” represent the index of the ‘Hello world!’ constant (operand).
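The same walk over the raw bytes can be done by hand. The sketch below relies on the Python 2.7 rule that opcodes at or above HAVE_ARGUMENT (90) are followed by a two-byte little-endian operand, while the others stand alone. The byte string is what I would expect for a module containing a single print statement; it is an assumption, not a copy of the bytes shown above.

```python
HAVE_ARGUMENT = 90  # in CPython 2.7, opcodes >= 90 carry a 2-byte operand

# Standard 2.7 names for the few opcodes used in this example.
OPNAME = {71: "PRINT_ITEM", 72: "PRINT_NEWLINE", 83: "RETURN_VALUE", 100: "LOAD_CONST"}


def decode(co_code):
    """Yield (offset, opcode, operand) tuples for 2.7-style bytecode."""
    raw = bytearray(co_code)
    offset = 0
    while offset < len(raw):
        op = raw[offset]
        if op >= HAVE_ARGUMENT:
            # Operand is stored low byte first.
            operand = raw[offset + 1] | (raw[offset + 2] << 8)
            yield offset, op, operand
            offset += 3
        else:
            yield offset, op, None
            offset += 1


# Assumed bytecode for: print 'Hello world!'
co_code = b"d\x00\x00GHd\x01\x00S"
for offset, op, operand in decode(co_code):
    print(offset, OPNAME[op], "" if operand is None else operand)
```

The first tuple is (0, 100, 0): LOAD_CONST at offset 0 with operand 0, matching the first line of the disassembly.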

We’ve now established that code objects can be disassembled with the dis module. The disassembly displays instructions consisting of operation names and operands. We can also inspect the raw bytecode (co_code) and constants (co_consts) stored in code objects (other stuff as well). It gets tricky when code objects contain nested code objects.

Since we have the opcode mappings for both Druva and the normal Python 2.7, we can develop a basic strategy for opcode conversion. My strategy was to disassemble the code object in a .pyc file, convert each operation, and stuff all of this into a new code object. No need to remap operands. However, it’s just a bit more complicated than that. Let’s look at nested code objects.
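For a single, flat code object, the conversion could be sketched as below. This is not the script from the article, which disassembles with the Druva modules; it simply steps over the bytes directly. It assumes the shuffled interpreter keeps the standard HAVE_ARGUMENT threshold of 90, and apart from the 111 to 131 pair, the translation numbers are hypothetical placeholders.

```python
HAVE_ARGUMENT = 90  # assumed to be unchanged in the shuffled interpreter

# shuffled opcode -> standard 2.7 opcode. Only 111 -> 131 comes from the
# article; the other shuffled numbers are made up for this example.
TRANSLATION = {95: 100, 20: 71, 21: 72, 30: 83, 111: 131}


def remap_bytecode(co_code, translation):
    """Rewrite opcode bytes only, leaving operand bytes untouched."""
    raw = bytearray(co_code)
    offset = 0
    while offset < len(raw):
        op = raw[offset]
        raw[offset] = translation[op]
        # Skip the two operand bytes so they are never mistaken for opcodes.
        offset += 3 if op >= HAVE_ARGUMENT else 1
    return bytes(raw)


# Hypothetical shuffled form of: LOAD_CONST 0, PRINT_ITEM, PRINT_NEWLINE,
# LOAD_CONST 1, RETURN_VALUE
shuffled = b"\x5f\x00\x00\x14\x15\x5f\x01\x00\x1e"
print(repr(remap_bytecode(shuffled, TRANSLATION)))
```

Stepping instruction by instruction matters: a blind byte-for-byte substitution would also rewrite operand bytes that happen to equal an opcode number.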

Nested Code Objects

Most of the modules you encounter will be more complex than the hello world. They will contain functions and classes as well.

Here is a more advanced hello world example:

Breaking it down, we have a class named “HelloClass”. The class contains functions named “__init__” and “sayHello.” Let’s disassemble the code object after reading the .pyc.

Notice the LOAD_CONST instruction at offset 9. A HelloClass code object is loaded. This HelloClass code object is stored at index 1 in co_consts.

Let’s disassemble that too.

More code objects? Yep. The __init__ and sayHello functions are code objects as well. A code object can have many layers of nested code objects. This requires the opcode remapping algorithm to be recursive!
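The nesting can be explored with a short recursive walk over co_consts. The source string below is a guess at what such a module might look like, reusing the names HelloClass, __init__ and sayHello; the helper walk_code is hypothetical.

```python
import types

# Assumed module source; only the class and method names come from the article.
SOURCE = '''
class HelloClass(object):
    def __init__(self):
        self.greeting = "Hello world!"

    def sayHello(self):
        return self.greeting
'''


def walk_code(code_obj, depth=0):
    """Yield (depth, name) for a code object and every code object nested in it."""
    yield depth, code_obj.co_name
    for const in code_obj.co_consts:
        if isinstance(const, types.CodeType):
            for item in walk_code(const, depth + 1):
                yield item


module_code = compile(SOURCE, "<hello>", "exec")
for depth, name in walk_code(module_code):
    print("  " * depth + name)
```

The walk visits the module, then the class body, then the two methods one level deeper, which is the same shape the remapping has to follow.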

The Algorithm

For reference, here are the opcode mappings again:

Druva opcode mapping
Normal Python 2.7 opcode mapping

Here’s my general algorithm.

Starting with the outer code object in the .pyc file (code_obj_in), convert all of the opcodes using the mappings above and store into new_co_code. For example, if a CALL_FUNCTION is encountered, the opcode will be converted from 111 to 131. We will then inspect the co_consts sequence and recursively remap any code objects found in there. new_co_consts will be added into the output code object.
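The recursion described above can be sketched as follows. To keep the example independent of any particular interpreter, it uses a small namedtuple as a stand-in for real code objects, holding only co_code and co_consts. The stand-in, the placeholder translation numbers (everything except 111 to 131), and the assumption that HAVE_ARGUMENT stays at 90 are mine, not the article’s.

```python
from collections import namedtuple

# Stand-in for the two code-object attributes the algorithm touches.
Code = namedtuple("Code", ["co_code", "co_consts"])

HAVE_ARGUMENT = 90  # assumed unchanged in the shuffled interpreter
# Only 111 -> 131 is from the article; the rest are hypothetical.
TRANSLATION = {95: 100, 20: 71, 21: 72, 30: 83, 111: 131}


def remap_code(code_obj_in):
    """Return a copy with remapped opcodes, recursing into nested code objects."""
    raw = bytearray(code_obj_in.co_code)
    offset = 0
    while offset < len(raw):
        op = raw[offset]
        raw[offset] = TRANSLATION[op]
        offset += 3 if op >= HAVE_ARGUMENT else 1
    new_co_code = bytes(raw)

    # Constants that are themselves code objects get the same treatment.
    new_co_consts = tuple(
        remap_code(const) if isinstance(const, Code) else const
        for const in code_obj_in.co_consts
    )
    return Code(new_co_code, new_co_consts)


# Outer object: LOAD_CONST 0, CALL_FUNCTION 0, RETURN_VALUE (shuffled numbers).
inner = Code(b"\x5f\x00\x00\x1e", (None,))
outer = Code(b"\x5f\x00\x00\x6f\x00\x00\x1e", (inner, "a string constant"))

fixed = remap_code(outer)
print(repr(fixed.co_code), repr(fixed.co_consts[0].co_code))
```

With real code objects, the remaining fields (names, flags, line table, and so on) would need to be copied across unchanged when the new object is constructed.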

When the new .pyc file is created (not shown), it will have a magic number of 62211, and all code objects will be populated with remapped opcodes. Let’s see the script in action.

Running the process converts a total of 1773 .pyc files. Notice I copied the Druva python27.dll into C:\Python27. Bytecode was disassembled using the Druva opcode mappings, and then converted.

Converting opcodes

And after conversion, we can successfully decompile the .pyc’s in the inSyncClient folder! Prior to opcode conversion, this was not possible.

Decompilation is successful

Closing Thoughts

I hope this serves as a useful introduction to how Python opcodes might be obfuscated. There are other tools (e.g. pyREtic) out there that do the same kind of remapping process we’ve discussed here. In fact, after writing this code, I found out that this logic had already been implemented specifically for Druva inSync in the dedrop repository.

I’m sure there are more elegant approaches to opcode conversion, but the script definitely got the job done. If you’re interested in checking out the full source code, I’ve dropped it on our GitHub. Thanks for reading, and check out the Tenable TechBlog for more technical blogs and vulnerability write-ups. Give me a shout on Twitter as well!

-Chris Lyne (@lynerc)
