ReverseAPK — Quickly Analyze And Reverse Engineer Android Packages

Quickly analyze and reverse engineer Android applications.

FEATURES:

  • Displays all extracted files for easy reference
  • Automatically decompile APK files to Java and Smali format
  • Analyze AndroidManifest.xml for common vulnerabilities and behavior
  • Static source code analysis for common vulnerabilities and behavior
    • Device info
    • Intents
    • Command execution
    • SQLite references
    • Logging references
    • Content providers
    • Broadcast receivers
    • Service references
    • File references
    • Crypto references
    • Hardcoded secrets
    • URLs
    • Network connections
    • SSL references
    • WebView references

INSTALL:

./install

USAGE:

reverse-apk <apk_name>

 


Retargetable Machine-Code Decompiler: RetDec

RetDec is a retargetable machine-code decompiler based on LLVM. The decompiler is not limited to any particular target architecture, operating system, or executable file format:

  • Supported file formats: ELF, PE, Mach-O, COFF, AR (archive), Intel HEX, and raw machine code.
  • Supported architectures (32-bit only): Intel x86, ARM, MIPS, PIC32, and PowerPC.

 

Features:

  • Static analysis of executable files with detailed information.
  • Compiler and packer detection.
  • Loading and instruction decoding.
  • Signature-based removal of statically linked library code.
  • Extraction and utilization of debugging information (DWARF, PDB).
  • Reconstruction of instruction idioms.
  • Detection and reconstruction of C++ class hierarchies (RTTI, vtables).
  • Demangling of symbols from C++ binaries (GCC, MSVC, Borland).
  • Reconstruction of functions, types, and high-level constructs.
  • Integrated disassembler.
  • Output in two high-level languages: C and a Python-like language.
  • Generation of call graphs, control-flow graphs, and various statistics.

 

After seven years of development, Avast open-sources its machine-code decompiler for platform-independent analysis of executable files. Avast released its analytical tool, RetDec, to help the cybersecurity community fight malicious software. The tool allows anyone to study the code of applications to see what the applications do, without running them. The goal behind open sourcing RetDec is to provide a generic tool to transform platform-specific code, such as x86/PE executable files, into a higher form of representation, such as C source code. By generic, we mean that the tool should not be limited to a single platform, but rather support a variety of platforms, including different architectures, file formats, and compilers. At Avast, RetDec is actively used for analysis of malicious samples for various platforms, such as x86/PE and ARM/ELF.

 

What is a decompiler?

A decompiler is a program that takes an executable file as its input and attempts to transform it into a high-level representation while preserving its functionality. For example, the input file may be application.exe, and the output can be source code in a higher-level programming language, such as C. A decompiler is, therefore, the exact opposite of a compiler, which compiles source files into executable files; this is why decompilers are sometimes also called reverse compilers.

By preserving a program’s functionality, we want the source code to reflect what the input program does as accurately as possible; otherwise, we risk assuming the program does one thing, when it really does another.

Generally, decompilers are unable to perfectly reconstruct original source code, due to the fact that a lot of information is lost during the compilation process. Furthermore, malware authors often use various obfuscation and anti-decompilation tricks to make the decompilation of their software as difficult as possible.

RetDec addresses the above-mentioned issues by using a large set of supported architectures and file formats, as well as in-house heuristics and algorithms to decode and reconstruct applications. RetDec is also the only decompiler of its scale that uses the proven LLVM infrastructure and is provided for free, under the MIT license.

Decompilers can be used in a variety of situations. The most obvious is reverse engineering: searching for bugs and vulnerabilities, or analyzing malicious software. Decompilation can also be used to retrieve lost source code, to compare two executables, or to verify that a compiled program does exactly what is written in its source code.

There are several important differences between a decompiler and a disassembler. The former tries to reconstruct an executable file into platform-agnostic, high-level source code, while the latter gives you low-level, platform-specific assembly instructions. The assembly output is non-portable, error-prone when modified, and requires specific knowledge about the instruction set of the target processor. Another positive aspect of decompilers is the high-level source code they produce, such as C source code, which can be read by people who know nothing about the assembly language for the particular processor being analyzed.

 

Installation and Use

Currently, RetDec supports only Windows (7 or later) and Linux.

 

Windows

  1. Either download and unpack a pre-built package from the following list, or build and install the decompiler by yourself (the process is described below):
  2. Install Microsoft Visual C++ Redistributable for Visual Studio 2015.
  3. Install MSYS2 and other needed applications by following RetDec’s Windows environment setup guide.
  4. Now, you are all set to run the decompiler. To decompile a binary file named test.exe, go into $RETDEC_INSTALLED_DIR/bin and run:
    bash decompile.sh test.exe
    

    For more information, run bash decompile.sh --help.

 

Linux

  1. There are currently no pre-built packages for Linux. You will have to build and install the decompiler by yourself. The process is described below.
  2. After you have built the decompiler, you will need to install the following packages via your distribution’s package manager:
  3. Now, you are all set to run the decompiler. To decompile a binary file named test.exe, go into $RETDEC_INSTALLED_DIR/bin and run:
    ./decompile.sh test.exe
    

    For more information, run ./decompile.sh --help.

 

Build and Installation


Requirements

Linux

On Debian-based distributions (e.g. Ubuntu), the required packages can be installed with apt-get:

sudo apt-get install build-essential cmake git perl python bash coreutils wget bc graphviz upx flex bison zlib1g-dev libtinfo-dev autoconf pkg-config m4 libtool

 

Windows

  • Microsoft Visual C++ (version >= Visual Studio 2015 Update 2)
  • Git
  • MSYS2 and some other applications. Follow RetDec’s Windows environment setup guide to get everything you need on Windows.
  • Active Perl. It needs to be the first Perl in PATH, or it has to be provided to CMake using CMAKE_PROGRAM_PATH variable, e.g. -DCMAKE_PROGRAM_PATH=/c/perl/bin.
  • Python (version >= 3.4)

 

Process

Warning: Currently, RetDec has to be installed into a clean, dedicated directory. Do NOT install it into /usr, /usr/local, etc., because our build system is not yet ready for system-wide installations. So, when running cmake, always set -DCMAKE_INSTALL_PREFIX=<path> to a directory that will be used just by RetDec.

  • Recursively clone the repository (it contains submodules):
    • git clone --recursive https://github.com/avast-tl/retdec
  • Linux:
    • cd retdec
    • mkdir build && cd build
    • cmake .. -DCMAKE_INSTALL_PREFIX=<path>
    • make && make install
  • Windows:
    • Open MSBuild command prompt, or any terminal that is configured to run the msbuild command.
    • cd retdec
    • mkdir build && cd build
    • cmake .. -DCMAKE_INSTALL_PREFIX=<path> -G<generator>
    • msbuild /m /p:Configuration=Release retdec.sln
    • msbuild /m /p:Configuration=Release INSTALL.vcxproj
    • Alternatively, you can open retdec.sln generated by cmake in Visual Studio IDE.

You have to pass the following parameters to cmake:

  • -DCMAKE_INSTALL_PREFIX=<path> to set the installation path to <path>.
  • (Windows only) -G<generator> is -G"Visual Studio 14 2015" for 32-bit build using Visual Studio 2015, or -G"Visual Studio 14 2015 Win64" for 64-bit build using Visual Studio 2015. Later versions of Visual Studio may be used.

You can pass the following additional parameters to cmake:

  • -DRETDEC_DOC=ON to build with API documentation (requires Doxygen and Graphviz, disabled by default).
  • -DRETDEC_TESTS=ON to build with tests, including all the tests in dependency submodules (disabled by default).
  • -DCMAKE_BUILD_TYPE=Debug to build with debugging information, which is useful during development. By default, the project is built in the Release mode. This has no effect on Windows, but the same thing can be achieved by running msbuild with the /p:Configuration=Debug parameter.
  • -DCMAKE_PROGRAM_PATH=<path> to use Perl at <path> (probably useful only on Windows).

Conditional instructions in the ARM1 processor, reverse engineered

By carefully examining the layout of the ARM1 processor, it can be reverse engineered. This article describes the interesting circuit used for conditional instructions: this circuit is marked in red on the die photo below. Unlike most processors, the ARM executes every instruction conditionally. Each instruction specifies a condition and is only executed if the condition is satisfied. For every instruction, the condition circuit reads the condition from the instruction register (blue), evaluates the condition flags (purple), and informs the control logic (yellow) if the instruction should be executed or skipped.

The ARM1 processor chip showing the condition evaluation circuit (red) and the main components it interacts with. Original photo courtesy of Computer History Museum.


Why care about the ARM1 chip? It is the highly influential ancestor of the extremely popular ARM processor. The ARM1 processor got off to a slow start in 1985, but ARM processors are now sold by the tens of billions; your smartphone probably runs on ARM. This article is part of my series on reverse engineering the ARM1; start with my first article for an overview of the chip.

What are conditional instructions?

A key part of any computer is the ability of a program to change what it is doing based on various conditions. Most computers provide conditional branch instructions, which cause execution to jump to a different part of the program based on various condition flags. For example, consider the code if (x == 0) { do_something }. Compiled to assembly code, this first tests the value of variable x and sets the Zero flag if x is 0. Next, a conditional branch instruction jumps over the do_something code if the Zero flag is not set.

The ARM processor takes conditionals much further than other processors: every instruction becomes a conditional instruction. Every instruction includes one of 16 conditions and the instruction is only executed if the condition is true; otherwise the instruction is skipped. (This is also known as predication.) The motivation is to avoid inefficient jumping around in the code.

The ARM manual excerpt below shows how four bits in each 32-bit instruction specify one of 16 conditions. Most of the conditions are straightforward, checking if values are equal, negative, higher, and so forth. Most instructions will use the «always» condition, which simply means the instruction always executes. The opposite «never» condition is not highly useful — an instruction with that condition never executes — but it can be used for a NOP, patching code, or adjusting timing of an instruction sequence.

Every instruction in the ARM processor has one of 16 conditions specified. The instruction is executed only if the condition is satisfied.


Studying the different conditions reveals much of how the condition circuit works. It is based on four condition flags. The zero (Z) flag is set if a value is zero. The negative (N) flag is set if a value is negative. The carry (C) flag is set if there is a carry or borrow from addition or subtraction. The overflow (V) flag is set if there is an overflow during signed arithmetic.

The top three bits of the instruction select one of eight conditions, as highlighted in yellow. The fourth bit selects the condition or its opposite (blue). If the fourth bit is 0, the condition must be true; if the fourth bit is 1, the condition must be false.

Implementation of the circuit

The implementation of the conditional logic circuit matches the above description. First, the eight conditions are generated from the four flags. One of the conditions is selected based on the three instruction bits. If the fourth instruction bit is set, the condition is flipped. The result is 1 if the condition is satisfied, and 0 if the condition is not satisfied. One unexpected part of the circuit is that an undefined instruction or an interrupt causes the condition to be cleared, preventing execution of the instruction. The resulting condition signal output is connected to a control part of the chip, where it causes the instruction to be executed or not, as desired.
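The logic described above can be sketched in a few lines of Python. This is only a behavioral model, not a description of the actual gates: the function and parameter names are made up for illustration, and the table of eight base conditions follows the standard ARM condition encoding (the condition field occupies instruction bits 31 to 28, with bit 28 selecting the opposite condition).

```python
# Behavioral sketch of the ARM1 condition check (names are hypothetical).
# The top three bits of the condition field (instruction bits 31..29)
# select one of eight base conditions computed from the N, Z, C, V flags.
BASE_CONDITIONS = [
    lambda n, z, c, v: z,                    # 000: EQ (inverted: NE)
    lambda n, z, c, v: c,                    # 001: CS (inverted: CC)
    lambda n, z, c, v: n,                    # 010: MI (inverted: PL)
    lambda n, z, c, v: v,                    # 011: VS (inverted: VC)
    lambda n, z, c, v: c and not z,          # 100: HI (inverted: LS)
    lambda n, z, c, v: n == v,               # 101: GE (inverted: LT)
    lambda n, z, c, v: (not z) and n == v,   # 110: GT (inverted: LE)
    lambda n, z, c, v: True,                 # 111: AL (inverted: NV)
]


def condition_ok(instruction, n, z, c, v, undefined=False, interrupt=False):
    """Return True if the 32-bit instruction should execute."""
    cond_field = (instruction >> 28) & 0xF   # bits 31..28
    select = cond_field >> 1                 # three bits pick the base condition
    invert = bool(cond_field & 1)            # bit 28 flips the result
    cond = bool(BASE_CONDITIONS[select](n, z, c, v))
    ok = cond != invert                      # XOR: flip when bit 28 is set
    # An undefined instruction or an interrupt clears the result.
    return ok and not (undefined or interrupt)
```

For example, an instruction whose top hex digit is `e` (the "always" condition) passes regardless of the flags, while the same instruction is blocked if `interrupt=True` is passed.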

The condition code evaluation circuit from the ARM1 processor.


The diagram above shows the condition code circuit of the chip as it appears in the simulator; this is a zoomed-in version of the red rectangle indicated on the die earlier. The chip consists of multiple layers, indicated by different colors. Transistors appear as red or blue regions. NMOS transistors are red; they turn on with a 1 input and can pull their output low. PMOS transistors (blue) are complementary; they turn on with a 0 input and can pull their output high. Physically above the transistors is the polysilicon wiring layer (green). When polysilicon crosses a transistor it forms the gate (yellow) that controls the transistor. Finally, two layers of metal wiring (gray) are above the polysilicon.

The circuit is arranged in columns. The first column of transistors forms the logic gates to generate the conditions from the flag values. The next column is the multiplexer, a circuit that takes the eight input conditions and selects one. The rightmost column contains 8 NAND gates that decode the three instruction bits into 8 control lines. Each line is fed into the multiplexer to select the corresponding condition. At the right is the wiring for the 3 instruction bits and their complements. A few miscellaneous gates are at the bottom of the multiplexer and decoder columns. These include inverters to complement the instruction bits.

The condition generation gates

The diagram below zooms in on the left third of the circuit above. This part of the circuit uses standard CMOS logic gates to compute the conditions from the flags. Each gate is built from NMOS (red) and PMOS (blue) transistors in a horizontal strip. Comparing the text description of conditions from the manual with the logic shows how the conditions are generated. For instance, the HI (unsigned higher) condition requires flags «C set and Z clear». The top three gates generate this condition. The GE (greater than or equal) condition is more complex, requiring flags «N set and V set, or N clear and V clear». The next two gates compute this value. (Due to the way CMOS gates are constructed, an OR-NAND gate is constructed as a single gate.) The other conditions are generated in the same way. The AL (always) condition is simply a 1, and doesn’t require any circuitry. The conditions are fed into the multiplexer, which will be discussed below.

The output coming back from the multiplexer is the selected condition, labeled «cond» below. The NAND and OR-NAND gates flip the condition if instruction register bit 28 (ireg28) is set. This implements the eight opposite conditions. The result is labeled «ok», indicating the overall condition is satisfied. The final three gates block instruction execution for an interrupt or undefined instruction.

Gates in the ARM1 processor generate the various conditionals from the flag values.


One thing I’d like to emphasize about the ARM1 is that its layout is very orderly and non-optimized. While it may appear chaotic, the gates are arranged by combining relatively fixed blocks («standard cells») and wiring them together. Each gate forms a strip and the gates are stacked together in columns. The polysilicon and metal layers connect the gates as necessary.

The layout of the ARM1 chip is a consequence of the VLSI Technology chip design software used to create it. The resulting layout is simple, but doesn’t use space very efficiently. Since the ARM1 uses very few transistors for its time, the designers weren’t worried about optimizing the layout. In contrast, earlier chips such as the Z-80 were hand-drawn, with each transistor and wire carefully shaped to use the minimum space possible. The diagram below shows a small part of the Z-80 processor layout, showing the extremely irregular but dense arrangement of the chip. The transistors are not arranged in rows as in the ARM1 above, but fit together to use all the available space.

A detail of the Z-80 processor layout, showing the complex hand-drawn layout. Each transistor and wire is carefully shaped to minimize the chip's size.


The multiplexer and decoders

Selecting the desired condition out of the eight possibilities is the job of a circuit called the multiplexer. The multiplexer takes 8 inputs (the conditions) and 8 control signals (based on the instruction) and selects the desired condition. To the right of the multiplexer, 8 NAND gates generate the 8 control signals by decoding the three instruction bits. Each gate simply looks at three bit values and outputs a 0 if the bits select that condition. For instance, if the first two bits are 0 and the third is 1, the gate for condition 1 outputs a 0, selecting that condition in the multiplexer. The animation below shows the circuit as the instruction bits cycle through the eight conditions. You can see the activated condition moving downwards through the circuit.

Animation of the multiplexer in the ARM1 condition code evaluation circuit.


While a multiplexer can be built from standard logic gates, the ARM1 multiplexer is built from a different type of circuitry called transmission gates (which the ARM1 also uses in its bit counter). A multiplexer built from transmission gates is more compact and faster than one built from standard logic (NAND gates). One feature of CMOS is that by combining an NMOS transistor and a PMOS transistor in parallel, a transmission gate switch can be built. Feeding 1 into the NMOS gate and 0 into the PMOS gate turns on both transistors and they pass their input through. With the opposite gate values, both transistors turn off and the switch opens. The multiplexer is built from 8 of these CMOS switches. Each condition input feeds into one switch, and the switch outputs are connected together. One switch is turned on at a time, selecting the corresponding input as the output value.

The diagram below shows the schematic of the multiplexer as well as its physical layout on the chip. Only the first three segments of the eight are shown; the remainder are similar. Each input is connected to two transistors forming a CMOS switch. Because the NMOS and PMOS gates require opposite signals, the multiplexer has an inverter for each control signal. Each inverter also consists of two transistors, but wired differently from the switch.


Schematic and diagram of the multiplexer inside the ARM1 processor’s condition code evaluation circuit.

Working together, the decode circuit, inverters, and CMOS switches form the multiplexer that selects the desired condition from the eight choices. The logic described earlier allows this condition to be flipped, for a total of 16 possible conditions.
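As a rough illustration of how the decoder, inverters, and switches fit together, here is a small Python model. It is a logical sketch only: the function names are invented for this example, an open switch is represented by `None`, and the ordering of the three select bits (most significant first) is an assumption.

```python
# Logical sketch of the decoder and transmission-gate multiplexer.
# All names are hypothetical; this models behavior, not the chip layout.

def nand(*inputs):
    """NAND gate: output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1


def decode(select):
    """Decode a 3-bit value into 8 active-low control lines.

    Each NAND gate sees either a bit or its complement, so exactly one
    gate has all inputs at 1 and therefore outputs 0.
    """
    bits = [(select >> 2) & 1, (select >> 1) & 1, select & 1]
    complements = [1 - b for b in bits]
    lines = []
    for k in range(8):
        wanted = [(k >> 2) & 1, (k >> 1) & 1, k & 1]
        gate_inputs = [bits[i] if wanted[i] else complements[i] for i in range(3)]
        lines.append(nand(*gate_inputs))
    return lines


def transmission_gate(value, nmos_gate, pmos_gate):
    """Pass the input when NMOS gate is 1 and PMOS gate is 0, else stay open."""
    if nmos_gate == 1 and pmos_gate == 0:
        return value
    return None  # switch open: nothing driven onto the output


def multiplexer(conditions, select):
    """Select one of eight condition values using eight CMOS switches."""
    driven = []
    for cond, line in zip(conditions, decode(select)):
        inverted = 1 - line  # per-line inverter supplies the opposite signal
        out = transmission_gate(cond, nmos_gate=inverted, pmos_gate=line)
        if out is not None:
            driven.append(out)
    assert len(driven) == 1, "exactly one switch should be closed"
    return driven[0]
```

With select bits 0, 0, 1, `decode(1)` pulls only line 1 low, so `multiplexer(conditions, 1)` returns `conditions[1]`.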

Conclusion

One unusual feature of the ARM instruction set is that every instruction has a condition associated with it and is only executed if the condition is true. The ARM1 chip is simple enough that the condition circuitry on the chip can be examined and understood at the transistor and gate level. Now that you’ve seen the internals of the condition logic, you can use the Visual ARM1 simulator to see the circuit in action. While the ARM1 may seem like a historical artifact of the 1980s, ARM processors power most smartphones, so there’s probably a similar circuit controlling your phone right now.

Reverse engineering the ARM1, ancestor of the iPhone’s processor

Almost every smartphone uses a processor based on the ARM1 chip created in 1985. The Visual ARM1 simulator shows what happens inside the ARM1 chip as it runs; the result (below) is fascinating but mysterious.[1] In this article, I reverse engineer key parts of the chip and explain how they work, bridging the gap between the puzzling flashing lines in the simulator and what the chip is actually doing. I describe the overall structure of the chip and then descend to the individual transistors, showing how they are built out of silicon and work together to store and process data. After reading this article, you can look at the chip’s circuits and understand the data they store.


Screenshot of the Visual ARM1 simulator, showing the activity inside the ARM1 chip as it executes a program.

Overview of the ARM1 chip

The ARM1 chip is built from functional blocks, each with a different purpose. Registers store data, the ALU (arithmetic-logic unit) performs simple arithmetic, instruction decoders determine how to handle each instruction, and so forth. Compared to most processors, the layout of the chip is simple, with each functional block clearly visible. (In comparison, the layout of chips such as the 6502 or Z-80 is highly hand-optimized to avoid any wasted space. In these chips, the functional blocks are squished together, making it harder to pick out the pieces.)

The diagram below shows the most important functional blocks of the ARM chip.[2] The actual processing happens in the bottom half of the chip, which implements the data path. The chip operates on 32 bits at a time so it is structured as 32 horizontal layers: bit 31 at the top, down to bit 0 at the bottom. Several data buses run horizontally to connect different sections of the chip. The large register file, with 25 registers, stands out in the image. The Program Counter (register 15) is on the left of the register file and register 0 is on the right.[3]

The main components of the ARM1 chip. Most of the pins are used for address and data lines; unlabeled pins are various control signals.


Computation takes place in the ALU (arithmetic-logic unit), which is to the right of the registers. The ALU performs 16 different operations (add, add with carry, subtract, logical AND, logical OR, etc.). It takes two 32-bit inputs and produces a 32-bit output. The ALU is described in detail here.[4] To the right of the ALU is the 32-bit barrel shifter. This large component performs a binary shift or rotate operation on its input, and is described in more detail below. At the left is the address circuitry, which provides an address to memory through the address pins. At the right, data circuitry reads and writes data values to memory.

Above the datapath circuitry is the control circuitry. The control lines run vertically from the control section to the data path circuits below. These signals select registers, tell the ALU what operation to perform, and so forth. The instruction decode circuitry processes each instruction and generates the necessary control signals. The register decode block processes the register select bits in an instruction and generates the control signals to select the desired registers.[5]

The pins

The squares around the outside of the image above are the pads that connect the processor to the outside world. The photo below shows the 84-pin package for the ARM1 processor chip. The gold-plated pins are wired to the pads on the silicon chip inside the package.

The ARM1 processor chip installed in the Acorn ARM Evaluation System. Original photo by Flibble, https://commons.wikimedia.org/wiki/File:Acorn-ARM-Evaluation-System.jpg, CC BY-SA 3.0.


Most of the pads are used for the address and data lines to memory. The chip has 26 address lines, allowing it to access 64MB of memory, and has 32 data lines, allowing it to read or write 32 bits at a time. The address lines are in the lower left and the data lines are in the lower right. As the simulator runs, you can see the address pins step through memory and the data pins read data from memory. The right hand side of the simulator shows the address and data values in hex, e.g. «A:00000020 D:e1a00271». If you know hex, you can easily match these values to the pin states.
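To make the link between the hex readout and the pins concrete, the following Python sketch converts a readout in the format quoted above into individual pin states. The helper names are invented for this example, and the sketch assumes the readout always has the form `A:<hex> D:<hex>`.

```python
# Sketch: turn the simulator's hex readout into pin states.
# Helper names are hypothetical; the readout format is taken from the
# example above and assumed to be "A:<hex> D:<hex>".

ADDRESS_LINES = 26   # 2**26 bytes = 64MB of addressable memory
DATA_LINES = 32      # 32 bits read or written at a time


def parse_bus_line(text):
    """Split 'A:00000020 D:e1a00271' into (address, data) integers."""
    fields = dict(part.split(":") for part in text.split())
    return int(fields["A"], 16), int(fields["D"], 16)


def pin_states(value, width):
    """Return the pin levels as a bit string, most significant pin first."""
    return format(value, "0{}b".format(width))


def addressable_megabytes(lines):
    """Number of megabytes reachable with the given number of address lines."""
    return (1 << lines) // (1024 * 1024)


address, data = parse_bus_line("A:00000020 D:e1a00271")
print("address pins:", pin_states(address, ADDRESS_LINES))
print("data pins:   ", pin_states(data, DATA_LINES))
print("addressable: ", addressable_megabytes(ADDRESS_LINES), "MB")
```

Each character in the printed strings corresponds to one pin: a 1 for a pin that is high and a 0 for a pin that is low.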

Each corner of the chip has a power pin (+) and a ground pin (-), providing 5 volts to run the chip. Various control signals are at the top of the chip. In the simulator, it is easy to spot the two clock signals that step the chip through its operations (below). The phase 1 and phase 2 clocks alternate, providing a tick-tock rhythm to the chip. In the simulator, the clock runs at a couple of cycles per second, while the real chip has an 8MHz clock, more than a million times faster. Finally, note below the manufacturer’s name «ACORN» on the chip in place of pin 82.

The two clock signals for the ARM1 processor chip.

History of the ARM chip

The ARM1 was designed in 1985 by engineers Sophie Wilson (formerly Roger Wilson) and Steve Furber of Acorn Computers. The chip was originally named the Acorn RISC Machine and intended as a coprocessor for the BBC Micro home/educational computer to improve its performance. Only a few hundred ARM1 processors were fabricated, so you might expect ARM to be a forgotten microprocessor, a historical footnote of the 1980s. However, the original ARM1 chip led to the amazingly successful ARM architecture with more than 50 billion ARM chips produced. What happened?

In the early 1980s, academic research suggested that instead of making processor instruction sets more complex, designers would get better performance from a processor that was simple but fast: the Reduced Instruction Set Computer or RISC.[6] The Berkeley and Stanford research papers on RISC inspired the ARM designers to choose a RISC design. In addition, given the small size of the design team at Acorn, a simple RISC chip was a practical choice.[7]

The simplicity of a RISC design is clear when comparing the ARM1 and Intel’s 80386, which came out the same year: the ARM1 had about 25,000 transistors versus 275,000 in the 386.[8] The photos below show the two chips at the same scale; the ARM1 is 50 mm² compared to 104 mm² for the 386. (Twenty years later, an ARM7TDMI core was 0.1 mm²; magnified at the same scale, it would be a tiny square, vividly illustrating Moore’s law.)


Die photos of the ARM1 processor and the Intel 386 processor to the same scale. The ARM1 is much smaller and contained 25,000 transistors compared to 275,000 in the 386. The 386 was higher density, with a 1.5 micron process compared to 3 micron for the ARM1. ARM1 photo courtesy of Computer History Museum. Intel A80386DX-20 by Pdesousa359, CC BY-SA 3.0.

Because of the ARM1’s small transistor count, the chip used very little power: about 1/10 Watt, compared to nearly 2 Watts for the 386. The combination of high performance and low power consumption made later versions of the ARM chip very popular for embedded systems. Apple chose the ARM processor for its ill-fated Newton handheld system, and in 1990, Acorn Computers, Apple, and chip manufacturer VLSI Technology formed the company Advanced RISC Machines to continue ARM development.[9]

In the years since then, ARM has become the world’s most-used instruction set with more than 50 billion ARM processors manufactured. The majority of mobile devices use an ARM processor; for instance, the Apple A8 processor inside iPhone 6 uses the 64-bit ARMv8-A. Despite its humble beginnings, the ARM1 made IEEE Spectrum’s list of 25 microchips that shook the world and PC World’s 11 most influential microprocessors of all time.

Looking at the low-level construction of the ARM1 chip

Getting back to the chip itself, the ARM1 chip is constructed from five layers. If you zoom in on the chip in the simulator, you can see the components of the chip, built from these layers. As seen below, the simulator uses a different color for each layer, and highlights circuits that are turned on. The bottom layer is the silicon that makes up the transistors of the chip. During manufacturing, regions of the silicon are modified (doped) by applying different impurities. Silicon can be doped positive to form a PMOS transistor (blue) or doped negative for an NMOS transistor (red). Undoped silicon is basically an insulator (black).

The ARM1 simulator uses different colors to represent the different layers of the chip.


Polysilicon wires (green) are deposited on top of the silicon. When polysilicon crosses doped silicon, it forms the gate of a transistor (yellow). Finally, two layers of metal (gray) are on top of the polysilicon and provide wiring.[10] Black squares are contacts that form connections between the different layers.

For our purposes, a MOS transistor can be thought of as a switch, controlled by the gate. When it is on (closed), the source and drain silicon regions are connected. When it is off (open), the source and drain are disconnected. The diagram below shows the three-dimensional structure of a MOS transistor.

Structure of a MOS transistor.


Like most modern processors, the ARM1 was built using CMOS technology, which uses two types of transistors: NMOS and PMOS. NMOS transistors turn on when the gate is high, and pull their output towards ground. PMOS transistors turn on when the gate is low, and pull their output towards +5 volts.

Understanding the register file

The register file is a key component of the ARM1, storing information inside the chip. (As a RISC chip, the ARM1 makes heavy use of its registers.) The register file consists of 25 registers, each holding 32 bits. This section describes step-by-step how the register file is built out of individual transistors.

The diagram below shows two transistors forming an inverter. If the input is high (as below), the NMOS transistor (red) turns on, connecting ground to the output so the output is low. If the input is low, the PMOS transistor (blue) turns on, connecting power to the output so the output is high. Thus, the output is the opposite of the input, making an inverter.

An inverter in the ARM1 chip, as displayed by the simulator.
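
The inverter's behavior can be sketched in a few lines of Python by treating each transistor as a switch. This is only a logic-level model; the function names are my own, not anything from the simulator.

```python
def nmos_conducts(gate: int) -> bool:
    """An NMOS transistor turns on when its gate is high."""
    return gate == 1


def pmos_conducts(gate: int) -> bool:
    """A PMOS transistor turns on when its gate is low."""
    return gate == 0


def inverter(value: int) -> int:
    """CMOS inverter: the input drives the gates of both transistors."""
    pulled_low = nmos_conducts(value)    # NMOS connects the output to ground
    pulled_high = pmos_conducts(value)   # PMOS connects the output to power
    # For a valid input, exactly one of the two transistors conducts.
    assert pulled_low != pulled_high
    return 1 if pulled_high else 0


print([inverter(0), inverter(1)])  # prints [1, 0]
```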

Combining two inverters into a loop forms a simple storage circuit. If the first inverter outputs 1, the second inverter outputs 0, causing the first inverter to output 1, and the circuit is stable. Likewise, if the first inverter outputs 0, the second outputs 1, and the circuit is again stable. Thus, the circuit will remain in either state indefinitely, "remembering" one bit until forced into a different state.

Two inverters in the ARM1 chip form one bit of register storage.
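
To see why both states are stable, here is a rough sketch that passes a value around the two-inverter loop several times. The `settle` helper and its step count are illustrative assumptions; a real circuit settles continuously rather than in discrete steps.

```python
def inverter(value: int) -> int:
    """Logic-level inverter: the output is the opposite of the input."""
    return 1 - value


def settle(first_out: int, steps: int = 4) -> tuple:
    """Start with a value on the first inverter's output and go around the loop."""
    second_out = inverter(first_out)
    for _ in range(steps):
        first_out = inverter(second_out)   # second inverter drives the first
        second_out = inverter(first_out)   # first inverter drives the second
    return first_out, second_out


print(settle(1), settle(0))  # prints (1, 0) (0, 1)
```

Whichever value the loop starts with is the value it still holds after any number of trips around, which is the "memory" property described above.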

To make this circuit into a useful register cell, read and write bus lines are added, along with select lines to connect the cell to the bus lines. When the write select line is activated, a pass transistor connects the write bus to the inverter, allowing a new value to overwrite the current bit. Likewise, pass transistors connect the bit to a read bus when activated by the corresponding select line, allowing the stored value to be read out.

Schematic of one bit in the ARM1 processor's register file.
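
As a minimal sketch, one register cell can be modeled as a stored bit guarded by select lines. The class and method names are hypothetical, and an unselected read returns `None` to stand for a bus that this cell is not driving.

```python
class RegisterCell:
    """One bit of storage with pass-transistor style select lines."""

    def __init__(self) -> None:
        self.bit = 0  # state held by the two-inverter loop

    def write(self, write_select: bool, write_bus: int) -> None:
        # The write bus only reaches the inverters when the select line is active.
        if write_select:
            self.bit = write_bus

    def read(self, read_select: bool):
        # The stored bit only reaches the read bus when the select line is active.
        return self.bit if read_select else None


cell = RegisterCell()
cell.write(write_select=False, write_bus=1)  # ignored: cell not selected
cell.write(write_select=True, write_bus=1)   # overwrites the stored bit
print(cell.read(read_select=True))           # prints 1
```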

To create the register file, the register cell above is repeated 32 times vertically for each bit, and 25 times horizontally to form each register. Each bit has three horizontal bus lines — the write bus and the two read buses — so there are 32 triples of bus lines. Each register has three vertical control lines — the write select line and two read select lines — so there are 25 triples of control lines. By activating the desired control lines, two registers can be read and one register can be written at a time.[11] When the simulator is running, you can see the vertical control lines activated to select registers, and you can see the data bits flowing on the horizontal bus lines.
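
The triple-ported arrangement can be sketched as a small Python class. The names are my own, and I am assuming that a register read and written in the same cycle returns its old value; the article does not say how the chip orders the two.

```python
NUM_REGISTERS = 25       # physical registers in the ARM1 register file
WORD_MASK = 0xFFFFFFFF   # each register holds 32 bits


class RegisterFile:
    def __init__(self) -> None:
        self.words = [0] * NUM_REGISTERS

    def cycle(self, read_a: int, read_b: int, write=None, write_value: int = 0):
        """Read two registers and optionally write a third in one step."""
        a = self.words[read_a]   # first read bus
        b = self.words[read_b]   # second read bus
        if write is not None:    # write bus, only if a write select line is active
            self.words[write] = write_value & WORD_MASK
        return a, b


regs = RegisterFile()
regs.cycle(read_a=0, read_b=1, write=2, write_value=0xF)
print(regs.cycle(read_a=2, read_b=2))  # prints (15, 15)
```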

By looking at a memory cell in the simulator, you can see which inverter is on and determine if the bit is a 0 or a 1. The diagram below shows a few register bits. If the upper inverter input is active, the bit is 0; if the lower inverter input is active, the bit is 1. (Look at the green lines above or below the bit values.) Thus, you can read register values right out of the simulator if you look closely.

By looking at the ARM1 register file, you can determine the value of each bit. For a 0 bit, the input to the top inverter is active (green/yellow); for a 1 bit, the input to the bottom inverter is active.

The barrel shifter

The barrel shifter, which performs binary shifts, is another interesting component of the ARM1. Most instructions use the barrel shifter, allowing a binary argument to be shifted left, shifted right, or rotated by any amount (0 to 31 bits). While running the simulator, you can see diagonal lines jumping back and forth in the barrel shifter.

The diagram below shows the structure of the barrel shifter. Bits flow into the shifter vertically with bit 0 on the left and bit 31 on the right. Output bits leave the shifter horizontally with bit 0 on the bottom and bit 31 on top. The diagonal lines visible in the barrel shifter show where the vertical lines are connected to the horizontal lines, generating a shifted output. Different positions of the diagonals result in different shifts. The upper diagonal line shifts bits to the left, and the lower diagonal line shifts bits to the right. For a rotation, both diagonals are active; it may not be immediately obvious, but in a rotation part of the word is shifted left and part is shifted right.

Structure of the barrel shifter in the ARM1 chip.
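
The point about rotation is easy to check numerically. This sketch (helper names are mine) builds a 32-bit rotate right out of one right shift and one left shift, mirroring the two active diagonals.

```python
MASK = 0xFFFFFFFF  # keep results to 32 bits


def shift_left(value: int, amount: int) -> int:
    return (value << amount) & MASK


def shift_right(value: int, amount: int) -> int:
    return (value & MASK) >> amount


def rotate_right(value: int, amount: int) -> int:
    amount %= 32
    if amount == 0:
        return value & MASK
    # Low bits move right; the bits that fall off reappear at the top,
    # which is the same as shifting them left by (32 - amount).
    return shift_right(value, amount) | shift_left(value, 32 - amount)


print(f"{rotate_right(0xF, 1):08X}")  # prints 80000007
```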

Zooming in on the barrel shifter shows exactly how it works. It contains a 32 by 32 crossbar grid of transistors, each connecting one vertical line to one horizontal line. The transistor gates are connected by diagonal control lines; transistors along the active diagonal connect the appropriate vertical and horizontal lines. Thus, by activating the appropriate diagonals, the output lines are connected to the input lines, shifted by the desired amount. Since the chip’s input lines all run horizontally, there are 32 connections between input lines and the corresponding vertical bit lines.

Details of the barrel shifter in the ARM1 chip. Transistors along a specific diagonal are activated to connect the vertical bit lines and output lines. Each input line is connected to a vertical bit line through the indicated connections.
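
Here is a rough model of the crossbar itself, assuming a transistor at every (input, output) crossing and numbering each diagonal by the difference between the input and output bit positions. That numbering is my own convention for illustration, not how the chip labels its control lines.

```python
WIDTH = 32


def crossbar(value: int, diagonals: set) -> int:
    """Connect input line i to output line j wherever (i - j) is an active diagonal."""
    out = 0
    for j in range(WIDTH):          # horizontal output lines
        for i in range(WIDTH):      # vertical input lines
            if (i - j) in diagonals and (value >> i) & 1:
                out |= 1 << j
    return out


def lsr(value: int, n: int) -> int:
    return crossbar(value, {n})             # one diagonal: shift right


def lsl(value: int, n: int) -> int:
    return crossbar(value, {-n})            # one diagonal: shift left


def ror(value: int, n: int) -> int:
    return crossbar(value, {n, n - WIDTH})  # two diagonals: rotate right


print(f"{ror(0xF, 1):08X}")  # prints 80000007
```

A shift activates a single diagonal, while a rotation activates two: one carries the bits that move right, the other wraps the remaining bits around to the top.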

The demonstration program

When you run the simulator, it executes a short hardcoded program that performs shifts of increasing amounts. You don’t need to understand the code, but if you’re curious it is:

0000  E1A0100F mov     r1, pc        @ Some setup
0004  E3A0200C mov     r2, #12
0008  E1B0F002 movs    pc, r2
000C  E1A00000 nop
0010  E1A00000 nop
0014  E3A02001 mov     r2, #1        @ Load register r2 with 1
0018  E3A0100F mov     r1, #15       @ Load r1 with value to shift
001C  E59F300C ldr     r3, pointer
    loop:
0020  E1A00271 ror     r0, r1, r2    @ Rotate r1 by r2 bits, store in r0
0024  E2822001 add     r2, r2, #1    @ Add 1 to r2
0028  E4830004 str     r0, [r3], #4  @ Write result to memory
002C  EAFFFFFB b       loop          @ Branch to loop

Inside the loop, register r1 (0x000f) is rotated to the right by r2 bit positions and the result is stored in register r0. Then r2 is incremented and the rotated result is written to memory. As the simulator runs, watch as r2 is incremented and as r0 steps through the successive rotations of the four set bits. The A and D values show the address and data pins as instructions are read from memory.
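
If you'd like to know what values to expect in r0, the loop can be mimicked in Python. This is a sketch of the loop body only: memory is modeled as a plain list rather than the address held in r3, and I'm assuming the rotate amount wraps modulo 32.

```python
MASK = 0xFFFFFFFF


def ror(value: int, amount: int) -> int:
    """32-bit rotate right."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK


def run_loop(iterations: int) -> list:
    r1 = 15        # mov r1, #15
    r2 = 1         # mov r2, #1
    memory = []    # stands in for the words written through r3
    for _ in range(iterations):
        r0 = ror(r1, r2)   # ror r0, r1, r2
        r2 += 1            # add r2, r2, #1
        memory.append(r0)  # str r0, [r3], #4
    return memory


for word in run_loop(5):
    print(f"{word:08X}")
# prints 80000007, C0000003, E0000001, F0000000, 78000000 (one per line)
```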

The changing shift values are clearly visible in the barrel shifter, as the diagonal line shifts position. If you zoom in on the register file, you can read out the values of the registers, as described earlier.

Conclusion

The ARM1 processor led to the amazingly successful ARM processor architecture that powers your smart phone. The simple RISC architecture of the ARM1 makes the circuitry of the processor easy to understand, at least compared to a chip such as the 386.[12] The ARM1 simulator provides a fascinating look at what happens inside a processor, and hopefully this article has helped explain what you see in the simulator.

P.S. If you want to read more about ARM1 internals, see Dave Mugridge’s series of posts:
Inside the armv1 Register Bank
Inside the armv1 Register Bank — register selection
Inside the armv1 Read Bus
Inside the ALU of the armv1 — the first ARM microprocessor

Notes and references

[1] I should make it clear that I am not part of the Visual 6502 team that built the ARM1 simulator. More information on the simulator is in the Visual 6502 team’s blog post The Visual ARM1.

[2] The block diagram below shows the components of the chip in more detail. See the ARM Evaluation System manual for an explanation of each part.

Floorplan of the ARM1 chip, from ARM Evaluation System manual. (Bus labels are corrected from original.)

[3] You may have noticed that the ARM architecture describes 16 registers, but the chip has 25 physical registers. There are 9 "extra" registers because there are extra copies of some registers for use while handling interrupts.

Another interesting thing about the register file is the PC register is missing a few bits. Since the ARM1 uses 26-bit addresses, the top 6 bits are not used. Because all instructions are aligned on a 32-bit boundary, the bottom two address bits in the PC are always zero. These 8 bits are not only unused, they are omitted from the chip entirely.
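
The arithmetic works out to 24 stored bits. As a small illustration (the helper names are hypothetical), a word-aligned 26-bit address can be squeezed into 24 bits and recovered without loss:

```python
ADDRESS_BITS = 26     # the ARM1 uses 26-bit addresses
ALIGNMENT_BITS = 2    # instructions are 32-bit aligned, so the low 2 bits are zero
STORED_PC_BITS = ADDRESS_BITS - ALIGNMENT_BITS


def store_pc(address: int) -> int:
    """Drop the two always-zero low bits of a word-aligned address."""
    assert 0 <= address < (1 << ADDRESS_BITS) and address % 4 == 0
    return address >> ALIGNMENT_BITS


def load_pc(stored: int) -> int:
    """Rebuild the full address from the stored bits."""
    return stored << ALIGNMENT_BITS


print(STORED_PC_BITS, hex(load_pc(store_pc(0x20))))  # prints 24 0x20
```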

[4] The ALU doesn’t support multiplication (added in ARM 2) or division (added in ARMv7).

[5] A bit more detail on the decode circuitry. Instruction decoding is done through three separate PLAs. The ALU decode PLA generates control signals for the ALU based on the four operation bits in the instruction. The shift decode PLA generates control signals for the barrel shifter. The instruction decode PLA performs the overall decoding of the instruction. The register decode block consists of three layers. Each layer takes a 4-bit register id and activates the corresponding register. There are three layers because ARM operations use two registers for inputs and a third register for output.
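
To make the bit fields concrete, this sketch pulls them out of the `ror r0, r1, r2` word (E1A00271) from the demonstration program. The field positions follow the standard ARM data-processing encoding, which this article does not spell out, and the one-hot decoder is a simplification that ignores the extra interrupt-mode registers.

```python
def decode_data_processing(word: int) -> dict:
    """Extract fields of a register-shifted data-processing instruction."""
    return {
        "cond": (word >> 28) & 0xF,       # condition code
        "opcode": (word >> 21) & 0xF,     # the four ALU operation bits
        "rn": (word >> 16) & 0xF,         # first operand register
        "rd": (word >> 12) & 0xF,         # destination register
        "rs": (word >> 8) & 0xF,          # register holding the shift amount
        "shift_type": (word >> 5) & 0x3,  # 3 means rotate right
        "rm": word & 0xF,                 # register fed to the barrel shifter
    }


def one_hot(register_id: int) -> int:
    """4-bit register id to a single active select line (16 registers only)."""
    return 1 << register_id


fields = decode_data_processing(0xE1A00271)
print(fields["opcode"], fields["rm"], fields["rs"])  # prints 13 1 2
```

Opcode 13 is MOV, which is how the assembler encodes `ror`: a move whose second operand passes through the barrel shifter.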

[6] In a RISC computer, the instruction set is restricted to the most-used instructions, which are optimized for high performance and can typically execute in a single clock cycle. Instructions are a fixed size, simplifying the instruction decoding logic. A RISC processor requires much less circuitry for control and instruction decoding, leaving more space on the chip for registers. Most instructions operate on registers, and only load and store instructions access memory. For more information on RISC vs CISC, see RISC architecture.

[7] For details on the history of the ARM1, see Conversation with Steve Furber: The designer of the ARM chip shares lessons on energy-efficient computing.

[8] The 386 and the ARM1 instruction sets are different in many interesting ways. The 386 has instructions from 1 byte to 15 bytes, while all ARM1 instructions are 32 bits long. The 386 has 15 registers, all with special purposes, while the ARM1 has 25 registers, mostly general-purpose. 386 instructions can usually operate on memory, while ARM1 instructions operate on registers except for load and store. The 386 has about 140 different instructions, compared to a couple dozen in the ARM1 (depending on how you count). Take a look at the 386 opcode map to see how complex decoding a 386 instruction is. ARM1 instructions fall into 5 categories and can be simply decoded. (I’m not criticizing the 386’s architecture, just pointing out the major architectural differences.)

See the Intel 80386 Programmer’s Reference Manual and 80386 Hardware Reference Manual for more details on the 386 architecture.

[9] Interestingly, the ARM company doesn’t manufacture chips. Instead, the ARM intellectual property is licensed to hundreds of different companies that build chips that use the ARM architecture. See The ARM Diaries: How ARM’s business model works for information on how ARM makes money from licensing the chip to other companies.

[10] The first metal layer in the chip runs largely top-to-bottom, while the second metal layer runs predominantly horizontally. Having two layers of metal makes the layout much simpler than single-layer processors such as the 6502 or Z-80.

[11] In the register file, alternating bits are mirrored to simplify the layout. This allows neighboring bits to share power and ground lines. The ARM1’s register file is triple-ported, so two registers can be read and one register written at the same time. This is in contrast to chips such as the 6502 or Z-80, which can only access registers one at a time.

[12] For more information on the ARM1 internals, see the book VLSI RISC Architecture and Organization by ARM chip designer Steve Furber, which has a hundred pages on the ARM chip internals. An interesting slide deck is A Brief History of ARM by Lee Smith, ARM Fellow.