Researchers discover and abuse new undocumented feature in Intel chipsets

Original text by Catalin Cimpanu

Researchers find new Intel VISA (Visualization of Internal Signals Architecture) debugging technology.

At the Black Hat Asia 2019 security conference, security researchers from Positive Technologies disclosed the existence of a previously unknown and undocumented feature in Intel chipsets.

Called Intel Visualization of Internal Signals Architecture (Intel VISA), Positive Technologies researchers Maxim Goryachy and Mark Ermolov said this is a new utility included in modern Intel chipsets to help with testing and debugging on manufacturing lines.

VISA is included in the Platform Controller Hub (PCH) chipsets that accompany modern Intel CPUs and works like a full-fledged logic signal analyzer.

PCH
Image: Wikimedia Commons

According to the two researchers, VISA intercepts electronic signals sent from internal buses and peripherals (display, keyboard, and webcam) to the PCH, and later to the main CPU.

VISA EXPOSES A COMPUTER’S ENTIRE DATA

Unauthorized access to the VISA feature would allow a threat actor to intercept data from the computer memory and create spyware that works at the lowest possible level.

But despite its extremely intrusive nature, very little is known about this new technology. Goryachy and Ermolov said VISA’s documentation is subject to a non-disclosure agreement, and not available to the general public.

Normally, this combination of secrecy and a secure default should keep Intel users safe from possible attacks and abuse.

However, the two researchers said they found several methods of enabling VISA and abusing it to sniff data that passes through the CPU, and even through the secretive Intel Management Engine (ME), which has been housed in the PCH since the release of the Nehalem processors and 5-Series chipsets.

INTEL SAYS IT’S SAFE. RESEARCHERS DISAGREE.

Goryachy and Ermolov said their technique requires neither hardware modifications to a computer’s motherboard nor any specific equipment to carry out.

The simplest method consists of using the vulnerabilities detailed in Intel’s Intel-SA-00086 security advisory to take control of the Intel Management Engine and enable VISA that way.

"The Intel VISA issue, as discussed at BlackHat Asia, relies on physical access and a previously mitigated vulnerability addressed in INTEL-SA-00086 on November 20, 2017," an Intel spokesperson told ZDNet yesterday.

"Customers who have applied those mitigations are protected from known vectors," the company said.

However, in an online discussion after his Black Hat talk, Ermolov said the Intel-SA-00086 fixes are not enough, as Intel firmware can be downgraded to vulnerable versions where the attackers can take over Intel ME and later enable VISA.

Furthermore, Ermolov said that there are three other ways to enable Intel VISA, methods that will become public when Black Hat organizers publish the duo’s presentation slides in the coming days.

As Ermolov said yesterday, VISA is not a vulnerability in Intel chipsets, but just another way in which a useful feature could be abused and turned against users. The chances that VISA will be abused are low: anyone who went to the trouble of exploiting the Intel-SA-00086 vulnerabilities to take over Intel ME would likely use that component to carry out their attacks, rather than rely on VISA.

As a side note, this is the second "manufacturing mode" feature Goryachy and Ermolov found in the past year. They also found that Apple accidentally shipped some laptops with Intel CPUs that were left in "manufacturing mode."


GPU side channel attacks can enable spying on web activity, password stealing


Computer scientists at the University of California, Riverside have revealed for the first time how easily attackers can use a computer’s graphics processing unit, or GPU, to spy on web activity, steal passwords, and break into cloud-based applications.

GPU side channel attacks

Threat scenarios

Marlan and Rosemary Bourns College of Engineering computer science doctoral student Hoda Naghibijouybari and post-doctoral researcher Ajaya Neupane, along with Associate Professor Zhiyun Qian and Professor Nael Abu-Ghazaleh, reverse engineered a Nvidia GPU to demonstrate three attacks on both graphics and computational stacks, as well as across them.

All three attacks require the victim to first acquire a malicious program embedded in a downloaded app. The program is designed to spy on the victim’s computer.

Web browsers use GPUs to render graphics on desktops, laptops, and smart phones. GPUs are also used to accelerate applications on the cloud and data centers. Web graphics can expose user information and activity. Computational workloads enhanced by the GPU include applications with sensitive data or algorithms that might be exposed by the new attacks.

GPUs are usually programmed using application programming interfaces, or APIs, such as OpenGL. OpenGL is accessible by any application on a desktop with user-level privileges, making all attacks practical on a desktop. Since desktop or laptop machines by default come with the graphics libraries and drivers installed, the attack can be implemented easily using graphics APIs.

The first attack tracks user activity on the web. When the victim opens the malicious app, it uses OpenGL to create a spy to infer the behavior of the browser as it uses the GPU. Every website has a unique trace in terms of GPU memory utilization due to the different number of objects and different sizes of objects being rendered. This signal is consistent across loading the same website several times and is unaffected by caching.

The researchers monitored either GPU memory allocations over time or GPU performance counters and fed these features to a machine learning based classifier, achieving website fingerprinting with high accuracy. The spy can reliably obtain all allocation events to see what the user has been doing on the web.
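As a rough illustration of this probing idea (my sketch, not code from the paper), a user-level process with a GL context can poll the free-VRAM counter exposed by NVIDIA's GL_NVX_gpu_memory_info extension and log it over time; GLFW is used here only to obtain a context, and the loop length is arbitrary.

/* Poll available GPU memory; dips in the trace reflect allocations
 * made by other GPU clients, such as the victim's browser. */
#include <stdio.h>
#include <GLFW/glfw3.h>

#define GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX 0x9049

int main(void) {
    if (!glfwInit()) return 1;
    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);          /* no visible window */
    GLFWwindow *win = glfwCreateWindow(16, 16, "probe", NULL, NULL);
    if (!win) return 1;
    glfwMakeContextCurrent(win);

    for (int i = 0; i < 100000; i++) {
        GLint free_kib = 0;
        glGetIntegerv(GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, &free_kib);
        printf("%d\n", free_kib);                      /* free VRAM in KiB */
    }
    glfwDestroyWindow(win);
    glfwTerminate();
    return 0;
}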

In the second attack, the authors extracted user passwords. Each time the user types a character, the whole password textbox is uploaded to GPU as a texture to be rendered. Monitoring the interval time of consecutive memory allocation events leaked the number of password characters and inter-keystroke timing, well-established techniques for learning passwords.

The third attack targets a computational application in the cloud. The attacker launches a malicious computational workload on the GPU which operates alongside the victim’s application. Depending on neural network parameters, the intensity and pattern of contention on the cache, memory and functional units differ over time, creating measurable leakage. The attacker uses machine learning-based classification on performance counter traces to extract the victim’s secret neural network structure, such as number of neurons in a specific layer of a deep neural network.

The researchers reported their findings to Nvidia, who responded that they intend to publish a patch that offers system administrators the option to disable access to performance counters from user-level processes. They also shared a draft of the paper with the AMD and Intel security teams to enable them to evaluate their GPUs with respect to such vulnerabilities.

In the future the group plans to test the feasibility of GPU side channel attacks on Android phones.

CVE-2018-5407: PortSmash PoC (flaw in Intel processor execution engine sharing on SMT, e.g. Hyper-Threading, architectures)

Summary

This is a proof-of-concept exploit of the PortSmash microarchitecture attack, tracked by CVE-2018-5407.


Setup

Prerequisites

A CPU featuring SMT (e.g. Hyper-Threading) is the only requirement.

This exploit code should work out of the box on Skylake and Kaby Lake. For other SMT architectures, customizing the strategies and/or waiting times in spy is likely needed.

OpenSSL

Download and install OpenSSL 1.1.0h or lower:

cd /usr/local/src
wget https://www.openssl.org/source/openssl-1.1.0h.tar.gz
tar xzf openssl-1.1.0h.tar.gz
cd openssl-1.1.0h/
export OPENSSL_ROOT_DIR=/usr/local/ssl
./config -d shared --prefix=$OPENSSL_ROOT_DIR --openssldir=$OPENSSL_ROOT_DIR -Wl,-rpath=$OPENSSL_ROOT_DIR/lib
make -j8
make test
sudo checkinstall --strip=no --stripso=no --pkgname=openssl-1.1.0h-debug --provides=openssl-1.1.0h-debug --default make install_sw

If you use a different path, you’ll need to make changes to Makefile and sync.sh.

Tooling

freq.sh

Turns off frequency scaling and TurboBoost.

sync.sh

Synchronizes the trace through pipes. It has two victims, one of which should be active at a time:

  1. The stock openssl running dgst command to produce a P-384 signature.
  2. A harness ecc that calls scalar multiplication directly with a known key. (Useful for profiling.)

The script will generate a P-384 key pair in secp384r1.pem if it does not already exist.

The script outputs data.bin which is what openssl dgst signed, and you should be able to verify the ECDSA signature data.sig afterwards with

openssl dgst -sha512 -verify secp384r1.pem -signature data.sig data.bin

In the ecc tool case, data.bin and secp384r1.pem are meaningless and data.sig is not created.

For the taskset commands in sync.sh, the cores need to be two logical cores of the same physical core; sanity check with

$ grep '^core id' /proc/cpuinfo
core id		: 0
core id		: 1
core id		: 2
core id		: 3
core id		: 0
core id		: 1
core id		: 2
core id		: 3

So the script is currently configured for logical cores 3 and 7, which both map to physical core 3 (core id).

spy

The measurement process; it writes its measurements to timings.bin. To change the spy strategy, check the port defines in spy.h. Only one strategy should be active at build time.

Note that timings.bin is actually raw clock cycle counter values, not latencies. Look in parse_raw_simple.py to understand the data format if necessary.
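For intuition about what the spy measures, here is a simplified C probe of the underlying port-contention effect (my sketch; the real spy uses hand-tuned assembly strategies selected in spy.h): it times a serial chain of instructions that lean on a single execution port, and the recorded cycle counts rise when the sibling logical core competes for that port. Compile with gcc -O2 -mpopcnt.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

static inline uint64_t probe(void) {
    unsigned aux;
    uint64_t x = 0xdeadbeef;
    uint64_t t0 = __rdtscp(&aux);
    for (int i = 0; i < 64; i++)
        x += __builtin_popcountll(x);   /* serial POPCNT chain (port 1 on Skylake) */
    uint64_t t1 = __rdtscp(&aux);
    volatile uint64_t sink = x;         /* keep the chain from being optimized out */
    (void)sink;
    return t1 - t0;
}

int main(void) {
    for (int i = 0; i < 100000; i++)
        printf("%llu\n", (unsigned long long)probe());  /* raw timing trace */
    return 0;
}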

ecc

Victim harness for running OpenSSL scalar multiplication with known inputs. Example:

./ecc M 4 deadbeef0123456789abcdef00000000c0ff33

This will execute 4 consecutive calls to EC_POINT_mul with the given hex scalar.

parse_raw_simple.py

Quick and dirty hack to view 1D traces. The top plot is the raw trace. Everything below is a different digital filter of the raw trace for viewing purposes. Zoom and pan are your friends here.

You might have to adjust the CEIL variable if the plots are too aggressively clipped.

Python packages:

sudo apt-get install python-numpy python-matplotlib

Usage

Turn off frequency scaling:

./freq.sh

Make sure everything builds:

make clean
make

Take a measurement:

./sync.sh

View the trace:

python parse_raw_simple.py timings.bin

You can play around with one victim at a time in sync.sh. Sample output for the openssl dgst victim is in parse_raw_simple.png.

Credits

  • Alejandro Cabrera Aldaya (Universidad Tecnológica de la Habana (CUJAE), Habana, Cuba)
  • Billy Bob Brumley (Tampere University of Technology, Tampere, Finland)
  • Sohaib ul Hassan (Tampere University of Technology, Tampere, Finland)
  • Cesar Pereida García (Tampere University of Technology, Tampere, Finland)
  • Nicola Tuveri (Tampere University of Technology, Tampere, Finland)

Intel Virtualisation: How VT-x, KVM and QEMU Work Together

VT-x is the name of Intel's CPU virtualisation technology. KVM is a component of the Linux kernel that makes use of VT-x. And QEMU is a user-space application that allows users to create virtual machines. QEMU makes use of KVM to achieve efficient virtualisation. In this article we will talk about how these three technologies work together. Don't expect an in-depth exposition of all aspects here, although in the future I might follow this up with more focused posts about some specific parts.

Something About Virtualisation First

Let’s first touch upon some theory before going into the main discussion. Related to virtualisation is the concept of emulation: in simple words, faking the hardware. When you use QEMU or VMWare to create a virtual machine that has an ARM processor, but your host machine has an x86 processor, then QEMU or VMWare emulates, or fakes, the ARM processor. When we talk about virtualisation we mean hardware-assisted virtualisation, where the VM's processor matches the host computer's processor. Often conflated with virtualisation is the distinct concept of containerisation. Containerisation is mostly a software concept that builds on top of operating system abstractions like process identifiers, file systems, and memory consumption limits. We won't discuss containers any further in this post.

A typical VM set up looks like below:

vm-arch

 

At the lowest level is hardware that supports virtualisation. Above it sits the hypervisor, or virtual machine monitor (VMM). In the case of KVM, this is actually the Linux kernel with the KVM modules loaded into it. In other words, KVM is a set of kernel modules that, when loaded into the Linux kernel, turn the kernel into a hypervisor. Above the hypervisor, in user space, sit the virtualisation applications that end users directly interact with: QEMU, VMWare, etc. These applications then create virtual machines, which run their own operating systems with cooperation from the hypervisor.

Finally, there is the “full” vs. “para” virtualisation dichotomy. Full virtualisation is when the OS running inside a VM is exactly the same as would run on real hardware. Paravirtualisation is when the OS inside the VM is aware that it is being virtualised and thus runs in a slightly modified way compared to how it would on real hardware.

VT-x

VT-x is CPU virtualisation for the Intel 64 and IA-32 architectures. For Intel's Itanium, there is VT-i. For I/O virtualisation there is VT-d. AMD also has its own virtualisation technology, called AMD-V. We will only concern ourselves with VT-x.

Under VT-x a CPU operates in one of two modes: root and non-root. These modes are orthogonal to the real, protected, and long modes, and also orthogonal to the privilege rings (0-3). They form a new “plane”, so to speak. The hypervisor runs in root mode and VMs run in non-root mode. In non-root mode, CPU-bound code mostly executes the same way it would in root mode, which means that a VM's CPU-bound operations run mostly at native speed. However, it doesn't have full freedom.

Privileged instructions form a subset of all available instructions on a CPU. These are instructions that can only be executed when the CPU is in a higher-privileged state, e.g. current privilege level (CPL) 0 (where CPL 3 is least privileged). A subset of these privileged instructions are what we can call “global state-changing” instructions: those that affect the overall state of the CPU. Examples are instructions that modify the clock or interrupt registers, or write to control registers in a way that would change the operation of root mode. This smaller subset of sensitive instructions is what non-root mode cannot execute.

VMX and VMCS

Virtual Machine Extensions (VMX) are instructions that were added to facilitate VT-x. Let’s look at some of them to gain a better understanding of how VT-x works.

VMXON: Before this instruction is executed, there is no concept of root vs non-root modes. The CPU operates as if there was no virtualisation. VMXON must be executed in order to enter virtualisation. Immediately after VMXON, the CPU is in root mode.

VMXOFF: Converse of VMXON, VMXOFF exits virtualisation.

VMLAUNCH: Creates an instance of a VM and enters non-root mode. We will explain what we mean by “instance of VM” in a short while, when covering VMCS. For now think of it as a particular VM created inside QEMU or VMWare.

VMRESUME: Enters non-root mode for an existing VM instance.

When a VM attempts to execute an instruction that is prohibited in non-root mode, the CPU immediately switches to root mode in a trap-like way. This is called a VM exit.

Let’s synthesise the above information. The CPU starts in normal mode, executes VMXON to start virtualisation in root mode, and executes VMLAUNCH to create and enter non-root mode for a VM instance. The VM instance runs its own code as if running natively until it attempts something that is prohibited; that causes a VM exit and a switch to root mode. Recall that the software running in root mode is the hypervisor. The hypervisor takes action to deal with the reason for the VM exit and then executes VMRESUME to re-enter non-root mode for that VM instance, which lets the VM instance resume its operation. This interaction between root and non-root mode is the essence of hardware virtualisation support.

Of course the above description leaves some gaps. For example, how does the hypervisor know why a VM exit happened? And what makes one VM instance different from another? This is where the VMCS comes in. VMCS stands for Virtual Machine Control Structure. It is basically a 4 KiB region of physical memory that contains the information needed for the above process to work. This information includes the reason for a VM exit as well as information unique to each VM instance, so that when the CPU is in non-root mode, it is the VMCS that determines which instance of a VM it is running.

As you may know, in QEMU or VMWare we can decide how many CPUs a particular VM will have. Each such CPU is called a virtual CPU, or vCPU. For each vCPU there is one VMCS. This means that the VMCS stores information at CPU-level granularity, not VM level. To read and write a particular VMCS, the VMREAD and VMWRITE instructions are used. They require root mode, so only the hypervisor can modify a VMCS. A non-root VM can perform VMWRITE, but not to the actual VMCS: it writes to a “shadow” VMCS, something that doesn't concern us immediately.

There are also instructions that operate on a whole VMCS rather than on individual fields. These are used when switching between vCPUs, where a vCPU could belong to any VM instance. VMPTRLD is used to load the address of a VMCS and VMPTRST is used to store this address to a specified memory address. There can be many VMCS instances, but only one is marked as current and active at any point. VMPTRLD marks a particular VMCS as active. Then, when VMRESUME is executed, the non-root mode VM uses that active VMCS instance to know which particular VM and vCPU it is executing as.

Here it’s worth noting that all the VMX instructions above require CPL 0, so they can only be executed from inside the Linux kernel (or another OS kernel).

VMCS basically stores two types of information:

  1. Context info which contains things like CPU register values to save and restore during transitions between root and non-root.
  2. Control info which determines behaviour of the VM inside non-root mode.

More specifically, VMCS is divided into six parts.

  1. Guest-state stores vCPU state on VM exit. On VMRESUME, vCPU state is restored from here.
  2. Host-state stores host CPU state on VMLAUNCH and VMRESUME. On VM exit, host CPU state is restored from here.
  3. VM execution control fields determine the behaviour of VM in non-root mode. For example hypervisor can set a bit in a VM execution control field such that whenever VM attempts to execute RDTSC instruction to read timestamp counter, the VM exits back to hypervisor.
  4. VM exit control fields determine the behaviour of VM exits. For example, when a bit in VM exit control part is set then debug register DR7 is saved whenever there is a VM exit.
  5. VM entry control fields determine the behaviour of VM entries. This is counterpart of VM exit control fields. A symmetric example is that setting a bit inside this field will cause the VM to always load DR7 debug register on VM entry.
  6. VM exit information fields tell hypervisor why the exit happened and provide additional information.

There are other aspects of hardware virtualisation support that we will conveniently gloss over in this post. Virtual to physical address conversion inside VM is done using a VT-x feature called Extended Page Tables (EPT). Translation Lookaside Buffer (TLB) is used to cache virtual to physical mappings in order to save page table lookups. TLB semantics also change to accommodate virtual machines. Advanced Programmable Interrupt Controller (APIC) on a real machine is responsible for managing interrupts. In VM this too is virtualised and there are virtual interrupts which can be controlled by one of the control fields in VMCS. I/O is a major part of any machine’s operations. Virtualising I/O is not covered by VT-x and is usually emulated in user space or accelerated by VT-d.

KVM

Kernel-based Virtual Machine (KVM) is a set of Linux kernel modules that, when loaded, turn the Linux kernel into a hypervisor. Linux continues its normal operations as an OS but also provides hypervisor facilities to user space. KVM modules can be grouped into two types: the core module and machine-specific modules. kvm.ko is the core module and is always needed. Depending on the host machine's CPU, a machine-specific module like kvm-intel.ko or kvm-amd.ko will be needed. As you can guess, kvm-intel.ko uses the functionality we described above in the VT-x section. It is KVM that executes VMLAUNCH/VMRESUME, sets up the VMCS, deals with VM exits, etc. Let's also mention that AMD's virtualisation technology, AMD-V, has its own instructions, called Secure Virtual Machine (SVM). Under `arch/x86/kvm/` you will find files named `svm.c` and `vmx.c`; these contain the code that deals with the virtualisation facilities of AMD and Intel respectively.

KVM interacts with user space – in our case QEMU – in two ways: through device file `/dev/kvm` and through memory mapped pages. Memory mapped pages are used for bulk transfer of data between QEMU and KVM. More specifically, there are two memory mapped pages per vCPU and they are used for high volume data transfer between QEMU and the VM in kernel.

`/dev/kvm` is the main API exposed by KVM. It supports a set of `ioctl`s which allow QEMU to manage VMs and interact with them. The lowest unit of virtualisation in KVM is a vCPU. Everything builds on top of it. The `/dev/kvm` API is a three-level hierarchy.

  1. System Level: Calls to this API manipulate the global state of the whole KVM subsystem. This, among other things, is used to create VMs.
  2. VM Level: Calls to this API deal with a specific VM. vCPUs are created through calls to this API.
  3. vCPU Level: This is the lowest-granularity API and deals with a specific vCPU. Since QEMU dedicates one thread to each vCPU (see the QEMU section below), calls to this API are done in the same thread that was used to create the vCPU.

After creating a vCPU, QEMU continues interacting with it through these ioctls and the memory-mapped pages.
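To make the three-level hierarchy concrete, here is a minimal C sketch (my addition, not from the original article) that walks all three levels and maps the per-vCPU kvm_run page; error handling and guest memory/register setup are omitted.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

int main(void) {
    int kvm = open("/dev/kvm", O_RDWR);                  /* system level */
    printf("KVM API version: %d\n", ioctl(kvm, KVM_GET_API_VERSION, 0));

    int vm = ioctl(kvm, KVM_CREATE_VM, 0);               /* VM level */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);            /* vCPU level */

    /* The memory-mapped page(s): a struct kvm_run shared with the kernel,
     * used for the high-volume QEMU<->KVM exchange mentioned above. */
    int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);
    printf("kvm_run mapped at %p (%d bytes)\n", (void *)run, size);
    return 0;
}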

QEMU

Quick Emulator (QEMU) is the only user-space component we are considering in our VT-x/KVM/QEMU stack. With QEMU, one can run a virtual machine with an ARM or MIPS core on an Intel host. How is this possible? Basically, QEMU has two modes: emulator and virtualiser. As an emulator, it can fake the hardware, making itself look like a MIPS machine to the software running inside its VM. It does that through binary translation. QEMU comes with the Tiny Code Generator (TCG), which can be thought of as a sort of high-level language VM, like the JVM. It takes, for instance, MIPS code and converts it to an intermediate bytecode, which then gets executed on the host hardware.

The other mode of QEMU, as a virtualiser, is what achieves the type of virtualisation that we are discussing here. As a virtualiser it gets help from KVM, talking to KVM using ioctls as described above.

QEMU creates one process for every VM. For each vCPU, QEMU creates a thread. These are regular threads and they get scheduled by the OS like any other thread. As these threads get run time, QEMU creates the impression of multiple CPUs for the software running inside its VM. Given QEMU's roots in emulation, it can emulate I/O, which is something that KVM may not fully support; take the example of a VM with a particular serial port on a host that doesn't have it. Now, when software inside the VM performs I/O, the VM exits to KVM. KVM looks at the reason and passes control to QEMU along with a pointer to information about the I/O request. QEMU emulates the I/O device for that request, thus fulfilling it for the software inside the VM, and passes control back to KVM. KVM executes a VMRESUME to let that VM proceed.
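The exit-handling loop that this description implies looks roughly like the following C sketch (my addition; it assumes a vcpu fd and its mmapped struct kvm_run as in the earlier snippet, and elides the device emulation itself).

#include <sys/ioctl.h>
#include <linux/kvm.h>

void run_loop(int vcpu, struct kvm_run *run) {
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);             /* (re-)enter the guest */
        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            /* QEMU would emulate the port I/O described by run->io here,
             * then loop around to re-enter the VM */
            break;
        case KVM_EXIT_HLT:
            return;                          /* guest executed HLT */
        default:
            return;                          /* unhandled exit reason */
        }
    }
}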

In the end, let us summarise the overall picture in a diagram:

overall-diag

Retargetable Machine-Code Decompiler: RetDec

RetDec is a retargetable machine-code decompiler based on LLVM. The decompiler is not limited to any particular target architecture, operating system, or executable file format:

  • Supported file formats: ELF, PE, Mach-O, COFF, AR (archive), Intel HEX, and raw machine code.
  • Supported architectures (32b only): Intel x86, ARM, MIPS, PIC32, and PowerPC.

 

Features:

  • Static analysis of executable files with detailed information.
  • Compiler and packer detection.
  • Loading and instruction decoding.
  • Signature-based removal of statically linked library code.
  • Extraction and utilization of debugging information (DWARF, PDB).
  • Reconstruction of instruction idioms.
  • Detection and reconstruction of C++ class hierarchies (RTTI, vtables).
  • Demangling of symbols from C++ binaries (GCC, MSVC, Borland).
  • Reconstruction of functions, types, and high-level constructs.
  • Integrated disassembler.
  • Output in two high-level languages: C and a Python-like language.
  • Generation of call graphs, control-flow graphs, and various statistics.

 

After seven years of development, Avast open-sources its machine-code decompiler for platform-independent analysis of executable files. Avast released its analytical tool, RetDec, to help the cybersecurity community fight malicious software. The tool allows anyone to study the code of applications to see what the applications do, without running them. The goal behind open sourcing RetDec is to provide a generic tool to transform platform-specific code, such as x86/PE executable files, into a higher form of representation, such as C source code. By generic, we mean that the tool should not be limited to a single platform, but rather support a variety of platforms, including different architectures, file formats, and compilers. At Avast, RetDec is actively used for analysis of malicious samples for various platforms, such as x86/PE and ARM/ELF.

 

What is a decompiler?

A decompiler is a program that takes an executable file as its input and attempts to transform it into a high-level representation while preserving its functionality. For example, the input file may be application.exe, and the output can be source code in a higher-level programming language, such as C. A decompiler is, therefore, the exact opposite of a compiler, which compiles source files into executable files; this is why decompilers are sometimes also called reverse compilers.

By preserving a program’s functionality, we want the source code to reflect what the input program does as accurately as possible; otherwise, we risk assuming the program does one thing, when it really does another.

Generally, decompilers are unable to perfectly reconstruct original source code, due to the fact that a lot of information is lost during the compilation process. Furthermore, malware authors often use various obfuscation and anti-decompilation tricks to make the decompilation of their software as difficult as possible.

RetDec addresses the above-mentioned issues by supporting a large set of architectures and file formats, and by using in-house heuristics and algorithms to decode and reconstruct applications. RetDec is also the only decompiler of its scale built on the proven LLVM infrastructure and provided for free, licensed under MIT.

Decompilers can be used in a variety of situations. The most obvious is reverse engineering when searching for bugs or vulnerabilities, or when analyzing malicious software. Decompilation can also be used to retrieve lost source code, to compare two executables, or to verify that a compiled program does exactly what is written in its source code.

There are several important differences between a decompiler and a disassembler. The former tries to reconstruct an executable file into platform-agnostic, high-level source code, while the latter gives you low-level, platform-specific assembly instructions. The assembly output is non-portable, error-prone when modified, and requires specific knowledge about the instruction set of the target processor. Another positive aspect of decompilers is the high-level source code they produce, such as C source code, which can be read by people who know nothing about the assembly language of the particular processor being analyzed.

 

Installation and Use

Currently, RetDec supports only Windows (7 or later) and Linux.

 

Windows

  1. Either download and unpack a pre-built package, or build and install the decompiler by yourself (the process is described below).
  2. Install Microsoft Visual C++ Redistributable for Visual Studio 2015.
  3. Install MSYS2 and other needed applications by following RetDec’s Windows environment setup guide.
  4. Now, you are all set to run the decompiler. To decompile a binary file named test.exe, go into $RETDEC_INSTALLED_DIR/bin and run:
    bash decompile.sh test.exe
    

    For more information, run bash decompile.sh --help.

 

Linux

  1. There are currently no pre-built packages for Linux. You will have to build and install the decompiler by yourself. The process is described below.
  2. After you have built the decompiler, you will need to install the following packages via your distribution’s package manager:
  3. Now, you are all set to run the decompiler. To decompile a binary file named test.exe, go into $RETDEC_INSTALLED_DIR/bin and run:
    ./decompile.sh test.exe
    

    For more information, run ./decompile.sh --help.

 

Build and Installation


Requirements

Linux

On Debian-based distributions (e.g. Ubuntu), the required packages can be installed with apt-get:

sudo apt-get install build-essential cmake git perl python bash coreutils wget bc graphviz upx flex bison zlib1g-dev libtinfo-dev autoconf pkg-config m4 libtool

 

Windows

  • Microsoft Visual C++ (version >= Visual Studio 2015 Update 2)
  • Git
  • MSYS2 and some other applications. Follow RetDec’s Windows environment setup guide to get everything you need on Windows.
  • Active Perl. It needs to be the first Perl in PATH, or it has to be provided to CMake using CMAKE_PROGRAM_PATH variable, e.g. -DCMAKE_PROGRAM_PATH=/c/perl/bin.
  • Python (version >= 3.4)

 

Process

Warning: Currently, RetDec has to be installed into a clean, dedicated directory. Do NOT install it into /usr, /usr/local, etc., because our build system is not yet ready for system-wide installations. So, when running cmake, always set -DCMAKE_INSTALL_PREFIX=<path> to a directory that will be used only by RetDec.

  • Recursively clone the repository (it contains submodules):
    • git clone --recursive https://github.com/avast-tl/retdec
  • Linux:
    • cd retdec
    • mkdir build && cd build
    • cmake .. -DCMAKE_INSTALL_PREFIX=<path>
    • make && make install
  • Windows:
    • Open MSBuild command prompt, or any terminal that is configured to run the msbuild command.
    • cd retdec
    • mkdir build && cd build
    • cmake .. -DCMAKE_INSTALL_PREFIX=<path> -G<generator>
    • msbuild /m /p:Configuration=Release retdec.sln
    • msbuild /m /p:Configuration=Release INSTALL.vcxproj
    • Alternatively, you can open retdec.sln generated by cmake in Visual Studio IDE.

You have to pass the following parameters to cmake:

  • -DCMAKE_INSTALL_PREFIX=<path> to set the installation path to <path>.
  • (Windows only) -G<generator> is -G"Visual Studio 14 2015" for 32-bit build using Visual Studio 2015, or -G"Visual Studio 14 2015 Win64" for 64-bit build using Visual Studio 2015. Later versions of Visual Studio may be used.

You can pass the following additional parameters to cmake:

  • -DRETDEC_DOC=ON to build with API documentation (requires Doxygen and Graphviz, disabled by default).
  • -DRETDEC_TESTS=ON to build with tests, including all the tests in dependency submodules (disabled by default).
  • -DCMAKE_BUILD_TYPE=Debug to build with debugging information, which is useful during development. By default, the project is built in the Release mode. This has no effect on Windows, but the same thing can be achieved by running msbuild with the /p:Configuration=Debug parameter.
  • -DCMAKE_PROGRAM_PATH=<path> to use Perl at <path> (probably useful only on Windows).

Best SSDs: Q2 2018

 

Since the Holiday 2017 sales, SSD prices in general have declined a small amount. New models using 64-layer 3D TLC NAND flash memory are trickling in, with companies beyond the major NAND manufacturers now offering drives using the latest flash. The most interesting market shifts have been in the growing entry-level NVMe segment, where a new generation of low-cost NVMe controllers has arrived to bring prices even closer to SATA levels while still offering a clear performance advantage.

March 2018 SSD Recommendations
Market Segment         Recommendation            Price
Mainstream 2.5″ SATA   Crucial MX500 500GB       $129.99 (26¢/GB)
Entry-level NVMe       MyDigitalSSD SBX 512GB    $159.99 (31¢/GB)
High-end NVMe          Samsung 960 EVO 1TB       $449.99 (45¢/GB)
M.2 SATA               WD Blue 3D NAND 2TB       $471.41 (24¢/GB)

Above are some recommendations of good deals in each market segment. Several of these aren’t the cheapest option in their segment and instead are quality products worth paying a little extra for.

The next table is a rough summary of what constitutes a good deal on a current model in today’s market. Sales that don’t beat these prices are only worth a second glance if the drive is nicer than average for its product segment.

March 2018 SSD Recommendations: Price to Beat, ¢/GB
Market Segment         128GB     256GB     512GB     1TB       2TB
Mainstream 2.5″ SATA   50 ¢/GB   32 ¢/GB   26 ¢/GB   25 ¢/GB   25 ¢/GB
Entry-level NVMe       47 ¢/GB   37 ¢/GB   31 ¢/GB   33 ¢/GB   -
High-end NVMe          -         -         48 ¢/GB   45 ¢/GB   61 ¢/GB
M.2 SATA               37 ¢/GB   30 ¢/GB   29 ¢/GB   27 ¢/GB   24 ¢/GB

As always, the prices shown are merely a snapshot at the time of writing. We make no attempt to predict when or where the best discounts will be. Instead, this guide should be treated as a baseline against which deals can be compared. All of the drives recommended here are models we have tested in at least one capacity or form factor, but in many cases we have not tested every capacity and form factor. For drives not mentioned in this guide, our SSD Bench database can provide performance information and comparisons.

Mainstream 2.5″ SATA: Crucial MX500, WD Blue 3D/SanDisk Ultra 3D

The largest segment of the consumer SSD market is 2.5″ SATA drives intended for use as either the only storage device in the system, or as the primary drive holding the OS and most programs and data. This market segment has by far the widest range of choices, and virtually every SSD brand has at least one model for this segment.

These days, the best options for a mainstream SATA drive are all at least 240GB. This is large enough for the operating system and all your everyday applications and data, but not necessarily enough for a large library of games, movies or photos. Our recommendations in this segment now all use 3D NAND flash. Older models using planar NAND tend to be much slower if they use TLC, and either more expensive or hard to find if they use MLC.

                   240-256GB         480-525GB          1TB                2TB
Samsung 860 EVO    $94.99 (38¢/GB)   $149.99 (30¢/GB)   $289.99 (29¢/GB)   $607.91 (30¢/GB)
Samsung 850 EVO    $84.99 (34¢/GB)   $162.71 (33¢/GB)   $349.99 (35¢/GB)   $629.99 (31¢/GB)
WD Blue 3D NAND    $79.99 (32¢/GB)   $134.99 (27¢/GB)   $259.99 (26¢/GB)   $499.99 (25¢/GB)
SanDisk Ultra 3D   $74.99 (30¢/GB)   $129.99 (26¢/GB)   $249.99 (25¢/GB)   $699.99 (35¢/GB)
Crucial MX500      $79.99 (32¢/GB)   $129.99 (26¢/GB)   $249.99 (25¢/GB)   $499.99 (25¢/GB)
Crucial MX300      $89.99 (33¢/GB)   $139.89 (27¢/GB)   $264.99 (25¢/GB)   $519.99 (25¢/GB)
Crucial BX300      $87.99 (37¢/GB)   $144.99 (30¢/GB)
Intel 545s         $69.99 (27¢/GB)   $137.99 (27¢/GB)

There’s a great sale price on the 256GB Intel 545s today, but otherwise the Crucial MX500 offers the best value with great performance and some of the lowest prices from a mainstream 2.5″ SSD. Anything significantly cheaper than the MX500 is either a short-lived clearance sale or a much slower drive.

NVMe SSDs

The market for consumer NVMe SSDs has broadened enough to be split into entry-level and high-end segments. Drives with low-end PCIe 3 x2 SSD controllers are becoming more common, and some drives with four-lane controllers are competitively priced for that segment.

Almost all consumer NVMe SSDs use the M.2 2280 form factor, but a handful are PCIe add-in cards. The heatsinks on many of the add-in cards tend to increase the price while making no meaningful difference to real-world performance, so our recommendations for NVMe SSDs are all M.2 form factor drives.

Samsung’s replacements for the 960 PRO and 960 EVO have not been announced, and while Toshiba’s XG5 offers a preview of what they can offer, a retail version has not been announced. Western Digital/SanDisk have announced their first client NVMe SSDs, but they haven’t started shipping. Most of the recent action in the NVMe market has been in the low-end segment, with the launch of the Intel 760p and the arrival of drives based on the Phison E8 controller.

High-end NVMe: Samsung 960 EVO and Samsung 960 PRO

The Intel Optane SSD 900P raises the bar for high-end SSD performance, but that speed comes at a steep cost. The price per GB of the 900P is more than twice that of the fastest flash-based SSD. The Optane SSD 800P offers slightly lower performance in an M.2 SSD, but at very low capacities and with even higher price per GB. Almost everyone would be better served by a much larger Samsung 960 drive that usually feels just as fast. The 960 PRO’s performance advantage over a 960 EVO of the same capacity can look impressive in benchmark charts, but is not noticeable enough during real-world use to justify the price premium.

High-End NVMe - ATSB - The Destroyer (Data Rate)

This high-end level of performance is currently hard to obtain from a 256GB-class drive: the 250GB 960 EVO is much slower than its larger siblings, and there isn’t a 256GB 960 PRO. Most new competitors in the high-end space will face similar challenges to hitting premium performance levels at this capacity point when using 256Gb or 512Gb 3D TLC NAND.

Above 250GB, the Samsung 960 EVO is plenty fast even for a high-end system, and the extra expense of the 960 PRO is unnecessary.

                        250GB               500-512GB           1TB                 2TB
Samsung 960 EVO         $119.99 (48¢/GB)    $238.65 (48¢/GB)    $449.99 (45¢/GB)
Samsung 960 PRO                             $302.45 (59¢/GB)    $608.99 (59¢/GB)    $1253.97 (61¢/GB)
Intel Optane SSD 900P   $349.99 (125¢/GB)   $529.99 (110¢/GB)

Entry-level NVMe: Intel 760p and MyDigitalSSD SBX

Low-end NVMe controllers from Phison and Silicon Motion have arrived, but so has the Intel 760p with a much nicer SM2262 four-lane, eight-channel controller. The 760p was very competitively priced when it launched, but prices have since climbed up to the level of the Samsung 960 EVO. This may be a short-term effect of the initial supplies being mostly sold out, in which case the 760p should soon return to its launch prices. In the meantime, the MyDigitalSSD SBX uses the Phison E8 controller and Toshiba 3D TLC to hit very low prices for an NVMe SSD. We will have our full review of the SBX ready soon, but for now we can say that its performance is lower than the Intel 760p but still better than SATA drives.

                   120-128GB         240-256GB          480-512GB          1TB
MyDigitalSSD SBX   $59.99 (47¢/GB)   $94.99 (37¢/GB)    $159.99 (31¢/GB)   $339.99 (33¢/GB)
Samsung 960 EVO                      $119.99 (48¢/GB)   $238.65 (48¢/GB)   $449.99 (45¢/GB)
Intel SSD 760p     $82.74 (65¢/GB)   $122.25 (48¢/GB)   $224.10 (44¢/GB)

 

M.2 SATA: Crucial MX300 and WD Blue 3D

For notebooks, M.2 SATA has almost completely replaced mSATA. A few notebooks are using the shorter M.2 2242 or 2260 sizes, but most support up to the 80mm length. There are far fewer M.2 SATA options than 2.5″ SATA options, but most of the current top SATA SSDs come in M.2 versions. The Samsung 860 EVO is in the process of replacing the 850 EVO as the fastest drive in this category, and the Crucial MX500’s M.2 variants will be arriving soon.

                      250-275GB         500-525GB          1TB                2TB
Samsung 860 EVO M.2   $94.99 (38¢/GB)   $169.99 (34¢/GB)   $289.99 (29¢/GB)   $745.87 (37¢/GB)
Crucial MX300 M.2     $89.99 (33¢/GB)   $139.99 (27¢/GB)   $276.66 (26¢/GB)
WD Blue 3D M.2        $73.99 (30¢/GB)   $129.99 (26¢/GB)   $270.00 (27¢/GB)   $471.41 (24¢/GB)

A Tool To Bypass Windows x64 Driver Signature Enforcement

TDL (Turla Driver Loader) For Bypassing Windows x64 Signature Enforcement

Definition: The TDL driver loader allows bypassing Windows x64 Driver Signature Enforcement.

What are the system requirements and limitations?

It runs on x64 Windows 7/8/8.1/10.
TDL does not support Vista, as Vista is obsolete; it is designed only for x64 Windows.
Administrator privileges are required.
Loaded drivers MUST BE specially designed to run as "driverless".
There is no SEH support.
There is no driver unloading.
Only ntoskrnl imports are resolved automatically; everything else is up to you.
Dummy driver examples are provided.


How DSEFix and TDL differ:

Both DSEFix and TDL take advantage of a driver exploit, but they use it in entirely different ways.

Benefits of DSEFix: 

It manipulates a kernel variable called g_CiEnabled (Vista/7, ntoskrnl.exe) and/or g_CiOptions (8+, CI.DLL).
DSEFix is simple: you only need to turn DSE off and load your driver; nothing else is required.
DSEFix is a potential BSOD generator, as it is subject to PatchGuard (KPP) protection.

Advantages of TDL:

It is friendly to PatchGuard, as it doesn't patch any kernel variables.
The shellcode TDL uses can map a driver into kernel mode without the Windows loader.
A non-invasive bypass of DSE is the main advantage of TDL.

There are some disadvantages too:

Your driver must be specially written to run as "driverless".
The driver exists in kernel mode only as an executable code buffer.
You can load multiple drivers, provided they do not conflict with each other.

Build

TDL ships with full source code. You need Microsoft Visual Studio 2015 U1 or later to build it, and, as with the driver builds, Microsoft Windows Driver Kit 8.1.


AVX — Advanced Vector Extensions are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD

Advanced Vector Extensions (AVX) is an extension of the x86 instruction set for Intel and AMD microprocessors.

AVX provides various improvements, new instructions, and a new machine-code encoding scheme.

Improvements

  • A new instruction encoding scheme, VEX.
  • The width of the SIMD vector registers increases from 128 bits (XMM) to 256 bits (registers YMM0–YMM15). Existing 128-bit SSE instructions use the lower half of the new YMM registers without changing the upper half. New 256-bit AVX instructions are added for working with the YMM registers. In the future, the SIMD vector registers may be widened to 512 or 1024 bits. For example, Xeon Phi processors already had 512-bit vector registers (ZMM) in 2012 and use SIMD instructions with MVEX and VEX prefixes to work with them, yet they do not support AVX.
  • Non-destructive operations. The AVX instruction set uses a three-operand syntax. For example, instead of a = a + b one can write c = a + b, leaving register a unchanged. When the value of a is needed in further computations, this improves performance, since it removes the need to save the register holding a before the computation and restore it afterwards from another register or from memory.
  • Most of the new instructions have no alignment requirements for memory operands. However, keeping operands aligned to the operand size is recommended, to avoid a significant performance penalty.
  • The AVX instruction set includes analogues of the 128-bit SSE instructions for floating-point values. Unlike the originals, they zero the upper half of the YMM register when storing a 128-bit result. The 128-bit AVX instructions keep the other advantages of AVX, such as the new encoding scheme, the three-operand syntax, and unaligned memory access.
  • Intel recommends abandoning the legacy SSE instructions in favour of the new 128-bit AVX instructions, even where two operands suffice.

The new encoding scheme

The new VEX instruction encoding scheme uses a VEX prefix. There are currently two VEX prefixes, 2 and 3 bytes long. For the 2-byte VEX prefix the first byte is 0xC5; for the 3-byte prefix it is 0xC4.

In 64-bit mode the first byte of the VEX prefix is unique. In 32-bit mode it conflicts with the LES and LDS instructions; the conflict is resolved by the most significant bit of the second byte (which is only meaningful in 64-bit mode), via instruction forms that are unsupported for LES and LDS.

Existing AVX instructions, including the VEX prefix, are at most 11 bytes long. Longer instructions are expected in future versions.
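As a worked example (my addition, not from the original text), vaddps ymm0, ymm1, ymm2 encodes in just four bytes using the 2-byte VEX prefix:

C5          ; 2-byte VEX prefix marker
F4          ; R=1 (inverted), vvvv=1110 (inverted ymm1), L=1 (256-bit), pp=00
58          ; ADDPS opcode
C2          ; ModRM: mod=11, reg=000 (ymm0), rm=010 (ymm2)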

New instructions

Instruction                                   Description
VBROADCASTSS, VBROADCASTSD, VBROADCASTF128    Copies a 32-bit, 64-bit, or 128-bit memory operand into all elements of an XMM or YMM vector register.
VINSERTF128                                   Replaces the lower or upper half of a 256-bit YMM register with the value of a 128-bit operand; the rest of the destination register is unchanged.
VEXTRACTF128                                  Extracts the lower or upper half of a 256-bit YMM register and copies it to a 128-bit destination operand.
VMASKMOVPS, VMASKMOVPD                        Conditionally reads any number of elements from a vector memory operand into a destination register, leaving the remaining elements unread and zeroing the corresponding elements of the destination register. Can also conditionally write any number of elements from a vector register to a vector memory operand, leaving the remaining elements of the memory operand unchanged.
VPERMILPS, VPERMILPD                          Permutes the 32-bit or 64-bit elements of a vector according to a selector operand (from memory or a register).
VPERM2F128                                    Shuffles the four 128-bit elements of two 256-bit registers into a 256-bit destination operand, using an immediate constant (imm) as the selector.
VZEROALL                                      Zeroes all YMM registers and marks them as unused. Used when switching between 128-bit and 256-bit modes.
VZEROUPPER                                    Zeroes the upper halves of all YMM registers. Used when switching between 128-bit and 256-bit modes.

The AVX specification also describes the PCLMUL group of instructions (Parallel Carry-Less Multiplication, Parallel CLMUL); a usage sketch follows the list:

  • PCLMULLQLQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 00]
  • PCLMULHQLQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 01]
  • PCLMULLQHQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 02]
  • PCLMULHQHQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 03]
  • PCLMULQDQ xmmreg, xmmrm, imm [rmi: 66 0f 3a 44 /r ib]
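As a hedged illustration (my addition, not from the original text), the C intrinsic _mm_clmulepi64_si128 from wmmintrin.h exposes PCLMULQDQ; its imm8 operand selects which 64-bit halves of the operands are multiplied, mirroring the fixed encodings listed above. Compile with gcc -mpclmul.

/* Carry-less multiply of the low qwords of a and b (PCLMULLQLQDQ form). */
#include <stdio.h>
#include <wmmintrin.h>   /* _mm_clmulepi64_si128 */

int main(void) {
    __m128i a = _mm_set_epi64x(0, 0x3);   /* polynomial x + 1 */
    __m128i b = _mm_set_epi64x(0, 0x3);
    /* imm8 = 0x00 selects the low 64 bits of both operands */
    __m128i r = _mm_clmulepi64_si128(a, b, 0x00);
    /* (x + 1)*(x + 1) = x^2 + 1 over GF(2), i.e. 0x5 */
    printf("0x%llx\n", (unsigned long long)_mm_cvtsi128_si64(r));
    return 0;
}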

Applications

AVX is well suited to floating-point-intensive computation in multimedia software and scientific workloads. Where a higher degree of parallelism is available, it increases performance on floating-point data.
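To show the three-operand, non-destructive style in practice, here is a minimal C sketch (my addition, not from the original text) using the AVX intrinsics documented below; compile with gcc -mavx.

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    __m256 a = _mm256_set1_ps(1.5f);   /* eight packed single-precision floats */
    __m256 b = _mm256_set1_ps(2.5f);
    __m256 c = _mm256_add_ps(a, b);    /* vaddps: c = a + b, a stays intact */

    float out[8];
    _mm256_storeu_ps(out, c);          /* unaligned store is allowed in AVX */
    for (int i = 0; i < 8; i++)
        printf("%.1f\n", out[i]);      /* prints 4.0 eight times */
    return 0;
}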

Instructions and examples

__m256i _mm256_abs_epi16 (__m256i a)

Synopsis

__m256i _mm256_abs_epi16 (__m256i a)
#include <immintrin.h>
Instruction: vpabsw ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := ABS(a[i+15:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpabsd
__m256i _mm256_abs_epi32 (__m256i a)

Synopsis

__m256i _mm256_abs_epi32 (__m256i a)
#include <immintrin.h>
Instruction: vpabsd ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ABS(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpabsb
__m256i _mm256_abs_epi8 (__m256i a)

Synopsis

__m256i _mm256_abs_epi8 (__m256i a)
#include <immintrin.h>
Instruction: vpabsb ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := ABS(a[i+7:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddw
__m256i _mm256_add_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 16-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := a[i+15:i] + b[i+15:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddd
__m256i _mm256_add_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 32-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] + b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddq
__m256i _mm256_add_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 64-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] + b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddb
__m256i _mm256_add_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 8-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := a[i+7:i] + b[i+7:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vaddpd
__m256d _mm256_add_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_add_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vaddpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] + b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vaddps
__m256 _mm256_add_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_add_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vaddps ymm, ymm, ymm
CPUID Flags: AVX

Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] + b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpaddsw
__m256i _mm256_adds_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddsb
__m256i _mm256_adds_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddusw
__m256i _mm256_adds_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddusw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed unsigned 16-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_UnsignedInt16( a[i+15:i] + b[i+15:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddusb
__m256i _mm256_adds_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddusb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed unsigned 8-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := Saturate_To_UnsignedInt8( a[i+7:i] + b[i+7:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vaddsubpd
__m256d _mm256_addsub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_addsub_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vaddsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Alternatively add and subtract packed double-precision (64-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    IF (j is even)
        dst[i+63:i] := a[i+63:i] - b[i+63:i]
    ELSE
        dst[i+63:i] := a[i+63:i] + b[i+63:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vaddsubps
__m256 _mm256_addsub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_addsub_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vaddsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Alternatively add and subtract packed single-precision (32-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF (j is even)
        dst[i+31:i] := a[i+31:i] - b[i+31:i]
    ELSE
        dst[i+31:i] := a[i+31:i] + b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpalignr
__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int count)

Synopsis

__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int count)
#include <immintrin.h>
Instruction: vpalignr ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Concatenate pairs of 16-byte blocks in a and b into a 32-byte temporary result, shift the result right by count bytes, and store the low 16 bytes in dst.

Operation

FOR j := 0 to 1
    i := j*128
    tmp[255:0] := ((a[i+127:i] << 128) OR b[i+127:i]) >> (count[7:0]*8)
    dst[i+127:i] := tmp[127:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vandpd
__m256d _mm256_and_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_and_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vandpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := (a[i+63:i] AND b[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vandps
__m256 _mm256_and_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_and_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vandps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := (a[i+31:i] AND b[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpand
__m256i _mm256_and_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_and_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpand ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] AND b[255:0])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vandnpd
__m256d _mm256_andnot_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_andnot_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vandnpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise NOT of packed double-precision (64-bit) floating-point elements in a and then AND with b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ((NOT a[i+63:i]) AND b[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vandnps
__m256 _mm256_andnot_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_andnot_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vandnps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise NOT of packed single-precision (32-bit) floating-point elements in a and then AND with b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ((NOT a[i+31:i]) AND b[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpandn
__m256i _mm256_andnot_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_andnot_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpandn ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise NOT of 256 bits (representing integer data) in a and then AND with b, and store the result in dst.

Operation

dst[255:0] := ((NOT a[255:0]) AND b[255:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpavgw
__m256i _mm256_avg_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_avg_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpavgw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Average packed unsigned 16-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := (a[i+15:i] + b[i+15:i] + 1) >> 1 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpavgb
__m256i _mm256_avg_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_avg_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpavgb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Average packed unsigned 8-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := (a[i+7:i] + b[i+7:i] + 1) >> 1 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
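
Example

A small sketch (not from the guide) showing the rounding behavior of the unsigned average; assumes AVX2:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i a = _mm256_set1_epi8(100);
    __m256i b = _mm256_set1_epi8(101);
    __m256i r = _mm256_avg_epu8(a, b);   /* (100 + 101 + 1) >> 1 = 101: halves round up */
    unsigned char out[32];
    _mm256_storeu_si256((__m256i *)out, r);
    printf("%u\n", out[0]);              /* 101 */
    return 0;
}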
vpblendw
__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)
#include <immintrin.h>
Instruction: vpblendw ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Blend packed 16-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[j%8] dst[i+15:i] := b[i+15:i] ELSE dst[i+15:i] := a[i+15:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpblendd
__m128i _mm_blend_epi32 (__m128i a, __m128i b, const int imm8)

Synopsis

__m128i _mm_blend_epi32 (__m128i a, __m128i b, const int imm8)
#include <immintrin.h>
Instruction: vpblendd xmm, xmm, xmm, imm
CPUID Flags: AVX2

Description

Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vpblendd
__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)
#include <immintrin.h>
Instruction: vpblendd ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vblendpd
__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)
#include <immintrin.h>
Instruction: vblendpd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Blend packed double-precision (64-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF imm8[j%8] dst[i+63:i] := b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
Ivy Bridge 1 0.5
Sandy Bridge 1 0.5
vblendps
__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vblendps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Blend packed single-precision (32-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
Ivy Bridge 1 0.5
Sandy Bridge 1 0.5
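
Example

A minimal sketch (not from the guide): the immediate is a compile-time lane selector, bit j choosing b for element j. Assumes AVX:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 a = _mm256_set1_ps(1.0f);
    __m256 b = _mm256_set1_ps(2.0f);
    __m256 r = _mm256_blend_ps(a, b, 0xAA);  /* 10101010b: odd lanes from b, even from a */
    float out[8];
    _mm256_storeu_ps(out, r);
    printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]); /* 1 2 1 2 */
    return 0;
}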
vpblendvb
__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)

Synopsis

__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)
#include <immintrin.h>
Instruction: vpblendvb ymm, ymm, ymm, ymm
CPUID Flags: AVX2

Description

Blend packed 8-bit integers from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 IF mask[i+7] dst[i+7:i] := b[i+7:i] ELSE dst[i+7:i] := a[i+7:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vblendvpd
__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)

Synopsis

__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)
#include <immintrin.h>
Instruction: vblendvpd ymm, ymm, ymm, ymm
CPUID Flags: AVX

Description

Blend packed double-precision (64-bit) floating-point elements from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
Ivy Bridge 2 1
Sandy Bridge 2 1
vblendvps
__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)

Synopsis

__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)
#include <immintrin.h>
Instruction: vblendvps ymm, ymm, ymm, ymm
CPUID Flags: AVX

Description

Blend packed single-precision (32-bit) floating-point elements from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
Ivy Bridge 2 1
Sandy Bridge 2 1
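
Example

Because only the sign bit of each mask element is tested, a value can act as its own mask. A hedged sketch (not from the guide; assumes AVX) that zeroes negative lanes:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 x = _mm256_setr_ps(-1, 2, -3, 4, -5, 6, -7, 8);
    /* negative lanes have the sign bit set, so they take the zero operand */
    __m256 r = _mm256_blendv_ps(x, _mm256_setzero_ps(), x);
    float out[8];
    _mm256_storeu_ps(out, r);
    printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]); /* 0 2 0 4 */
    return 0;
}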
vbroadcastf128
__m256d _mm256_broadcast_pd (__m128d const * mem_addr)

Synopsis

__m256d _mm256_broadcast_pd (__m128d const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastf128 ymm, m128
CPUID Flags: AVX

Description

Broadcast 128 bits from memory (composed of 2 packed double-precision (64-bit) floating-point elements) to all elements of dst.

Operation

tmp[127:0] := MEM[mem_addr+127:mem_addr] dst[127:0] := tmp[127:0] dst[255:128] := tmp[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastf128
__m256 _mm256_broadcast_ps (__m128 const * mem_addr)

Synopsis

__m256 _mm256_broadcast_ps (__m128 const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastf128 ymm, m128
CPUID Flags: AVX

Description

Broadcast 128 bits from memory (composed of 4 packed single-precision (32-bit) floating-point elements) to all elements of dst.

Operation

tmp[127:0] := MEM[mem_addr+127:mem_addr] dst[127:0] := tmp[127:0] dst[255:128] := tmp[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastsd
__m256d _mm256_broadcast_sd (double const * mem_addr)

Synopsis

__m256d _mm256_broadcast_sd (double const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastsd ymm, m64
CPUID Flags: AVX

Description

Broadcast a double-precision (64-bit) floating-point element from memory to all elements of dst.

Operation

tmp[63:0] := MEM[mem_addr+63:mem_addr] FOR j := 0 to 3 i := j*64 dst[i+63:i] := tmp[63:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastss
__m128 _mm_broadcast_ss (float const * mem_addr)

Synopsis

__m128 _mm_broadcast_ss (float const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastss xmm, m32
CPUID Flags: AVX

Description

Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.

Operation

tmp[31:0] := MEM[mem_addr+31:mem_addr] FOR j := 0 to 3 i := j*32 dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:128] := 0
vbroadcastss
__m256 _mm256_broadcast_ss (float const * mem_addr)

Synopsis

__m256 _mm256_broadcast_ss (float const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastss ymm, m32
CPUID Flags: AVX

Description

Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.

Operation

tmp[31:0] := MEM[mem_addr+31:mem_addr] FOR j := 0 to 7 i := j*32 dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
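
Example

A typical use (sketch, not from the guide; assumes AVX) is splatting one scalar across a register to scale an array:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float scale = 2.5f;
    float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m256 s = _mm256_broadcast_ss(&scale);              /* scale in all 8 lanes */
    __m256 v = _mm256_mul_ps(_mm256_loadu_ps(data), s);
    _mm256_storeu_ps(data, v);
    printf("%.1f %.1f\n", data[0], data[7]);             /* 2.5 20.0 */
    return 0;
}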
vpbroadcastb
__m128i _mm_broadcastb_epi8 (__m128i a)

Synopsis

__m128i _mm_broadcastb_epi8 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastb xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 8-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 15 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastb
__m256i _mm256_broadcastb_epi8 (__m128i a)

Synopsis

__m256i _mm256_broadcastb_epi8 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastb ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 8-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastd
__m128i _mm_broadcastd_epi32 (__m128i a)

Synopsis

__m128i _mm_broadcastd_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastd xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 32-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastd
__m256i _mm256_broadcastd_epi32 (__m128i a)

Synopsis

__m256i _mm256_broadcastd_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastd ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 32-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastq
__m128i _mm_broadcastq_epi64 (__m128i a)

Synopsis

__m128i _mm_broadcastq_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastq xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 64-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastq
__m256i _mm256_broadcastq_epi64 (__m128i a)

Synopsis

__m256i _mm256_broadcastq_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastq ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 64-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
movddup
__m128d _mm_broadcastsd_pd (__m128d a)

Synopsis

__m128d _mm_broadcastsd_pd (__m128d a)
#include <immintrin.h>
Instruction: movddup xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
Westmere 1
Nehalem 1
vbroadcastsd
__m256d _mm256_broadcastsd_pd (__m128d a)

Synopsis

__m256d _mm256_broadcastsd_pd (__m128d a)
#include <immintrin.h>
Instruction: vbroadcastsd ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vbroadcasti128
__m256i _mm256_broadcastsi128_si256 (__m128i a)

Synopsis

__m256i _mm256_broadcastsi128_si256 (__m128i a)
#include <immintrin.h>
Instruction: vbroadcasti128 ymm, m128
CPUID Flags: AVX2

Description

Broadcast 128 bits of integer data from a to all 128-bit lanes in dst.

Operation

dst[127:0] := a[127:0] dst[255:128] := a[127:0] dst[MAX:256] := 0
vbroadcastss
__m128 _mm_broadcastss_ps (__m128 a)

Synopsis

__m128 _mm_broadcastss_ps (__m128 a)
#include <immintrin.h>
Instruction: vbroadcastss xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
vbroadcastss
__m256 _mm256_broadcastss_ps (__m128 a)

Synopsis

__m256 _mm256_broadcastss_ps (__m128 a)
#include <immintrin.h>
Instruction: vbroadcastss ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastw
__m128i _mm_broadcastw_epi16 (__m128i a)

Synopsis

__m128i _mm_broadcastw_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastw xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 16-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastw
__m256i _mm256_broadcastw_epi16 (__m128i a)

Synopsis

__m256i _mm256_broadcastw_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastw ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 16-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpslldq
__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpslldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] << (tmp*8) dst[255:128] := a[255:128] << (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrldq
__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpsrldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] >> (tmp*8) dst[255:128] := a[255:128] >> (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
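
Example

Note that both byte shifts operate within each 128-bit lane, not across the full register. A sketch (not from the guide; assumes AVX2) that makes the lane boundary visible:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i v = _mm256_setr_epi8(
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31);
    __m256i r = _mm256_bsrli_epi128(v, 1);   /* shift each 128-bit lane right one byte */
    unsigned char out[32];
    _mm256_storeu_si256((__m256i *)out, r);
    /* byte 16 is not pulled into the low lane: prints 1 0 17 */
    printf("%u %u %u\n", out[0], out[15], out[16]);
    return 0;
}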
__m256 _mm256_castpd_ps (__m256d a)

Synopsis

__m256 _mm256_castpd_ps (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Cast vector of type __m256d to type __m256. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castpd_si256 (__m256d a)

Synopsis

__m256i _mm256_castpd_si256 (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256d to type __m256i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castpd128_pd256 (__m128d a)

Synopsis

__m256d _mm256_castpd128_pd256 (__m128d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m128d to type __m256d; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128d _mm256_castpd256_pd128 (__m256d a)

Synopsis

__m128d _mm256_castpd256_pd128 (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256d to type __m128d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castps_pd (__m256 a)

Synopsis

__m256d _mm256_castps_pd (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Cast vector of type __m256 to type __m256d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castps_si256 (__m256 a)

Synopsis

__m256i _mm256_castps_si256 (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256 to type __m256i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256 _mm256_castps128_ps256 (__m128 a)

Synopsis

__m256 _mm256_castps128_ps256 (__m128 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m128 to type __m256; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128 _mm256_castps256_ps128 (__m256 a)

Synopsis

__m128 _mm256_castps256_ps128 (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256 to type __m128. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castsi128_si256 (__m128i a)

Synopsis

__m256i _mm256_castsi128_si256 (__m128i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m128i to type __m256i; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castsi256_pd (__m256i a)

Synopsis

__m256d _mm256_castsi256_pd (__m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m256d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256 _mm256_castsi256_ps (__m256i a)

Synopsis

__m256 _mm256_castsi256_ps (__m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m256. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128i _mm256_castsi256_si128 (__m256i a)

Synopsis

__m128i _mm256_castsi256_si128 (__m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m128i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
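
Example

Since these casts are free reinterpretations, they are the usual way to apply bit tricks to floating-point data. A sketch (not from the guide; assumes AVX) that computes |x| by masking off the sign bits:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 x = _mm256_set1_ps(-3.5f);
    /* build a 0x7FFFFFFF mask as integers, reinterpret it as floats for vandps */
    __m256 m = _mm256_castsi256_ps(_mm256_set1_epi32(0x7FFFFFFF));
    __m256 r = _mm256_and_ps(x, m);          /* clears each sign bit */
    float out[8];
    _mm256_storeu_ps(out, r);
    printf("%.1f\n", out[0]);                /* 3.5 */
    return 0;
}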
vroundpd
__m256d _mm256_ceil_pd (__m256d a)

Synopsis

__m256d _mm256_ceil_pd (__m256d a)
#include <immintrin.h>
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a up to an integer value, and store the results as packed double-precision floating-point elements in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := CEIL(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_ceil_ps (__m256 a)

Synopsis

__m256 _mm256_ceil_ps (__m256 a)
#include <immintrin.h>
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a up to an integer value, and store the results as packed single-precision floating-point elements in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := CEIL(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcmppd
__m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)

Synopsis

__m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)
#include <immintrin.h>
Instruction: vcmppd xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US
ESAC
FOR j := 0 to 1
i := j*64
dst[i+63:i] := ( a[i+63:i] OP b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmppd
__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)
#include <immintrin.h>
Instruction: vcmppd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US
ESAC
FOR j := 0 to 3
i := j*64
dst[i+63:i] := ( a[i+63:i] OP b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcmpps
__m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)

Synopsis

__m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)
#include <immintrin.h>
Instruction: vcmpps xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US
ESAC
FOR j := 0 to 3
i := j*32
dst[i+31:i] := ( a[i+31:i] OP b[i+31:i] ) ? 0xFFFFFFFF : 0
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmpps
__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vcmpps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US
ESAC
FOR j := 0 to 7
i := j*32
dst[i+31:i] := ( a[i+31:i] OP b[i+31:i] ) ? 0xFFFFFFFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
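
Example

The comparison yields all-ones/all-zero lanes, which pair naturally with blends or movemask. A sketch (not from the guide; assumes AVX):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 a = _mm256_setr_ps(1, 3, 5, 7, 2, 4, 6, 8);
    __m256 b = _mm256_set1_ps(4.5f);
    __m256 lt = _mm256_cmp_ps(a, b, _CMP_LT_OQ);  /* all-ones where a < b */
    int mask = _mm256_movemask_ps(lt);            /* one bit per lane */
    printf("0x%02X\n", mask);                     /* 0x33: lanes 0, 1, 4, 5 */
    return 0;
}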
vcmpsd
__m128d _mm_cmp_sd (__m128d a, __m128d b, const int imm8)

Synopsis

__m128d _mm_cmp_sd (__m128d a, __m128d b, const int imm8)
#include <immintrin.h>
Instruction: vcmpsd xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare the lower double-precision (64-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of dst, and copy the upper element from a to the upper element of dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US
ESAC
dst[63:0] := ( a[63:0] OP b[63:0] ) ? 0xFFFFFFFFFFFFFFFF : 0
dst[127:64] := a[127:64]
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmpss
__m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)

Synopsis

__m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)
#include <immintrin.h>
Instruction: vcmpss xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare the lower single-precision (32-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US
ESAC
dst[31:0] := ( a[31:0] OP b[31:0] ) ? 0xFFFFFFFF : 0
dst[127:32] := a[127:32]
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vpcmpeqw
__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := ( a[i+15:i] == b[i+15:i] ) ? 0xFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqd
__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] == b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqq
__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 64-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] == b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqb
__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := ( a[i+7:i] == b[i+7:i] ) ? 0xFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
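
Example

Equality masks combine with vpmovmskb for fast byte searches. A sketch (not from the guide; assumes AVX2 and the GCC/Clang __builtin_ctz builtin):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    char buf[32] = "find the needle in this block!!";
    __m256i hay    = _mm256_loadu_si256((const __m256i *)buf);
    __m256i needle = _mm256_set1_epi8('n');
    __m256i eq     = _mm256_cmpeq_epi8(hay, needle);
    unsigned mask  = (unsigned)_mm256_movemask_epi8(eq);  /* bit i set where buf[i] == 'n' */
    if (mask)
        printf("first 'n' at offset %d\n", __builtin_ctz(mask));  /* offset 2 */
    return 0;
}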
vpcmpgtw
__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := ( a[i+15:i] > b[i+15:i] ) ? 0xFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpcmpgtd
__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] > b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpcmpgtq
__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 64-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] > b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpcmpgtb
__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := ( a[i+7:i] > b[i+7:i] ) ? 0xFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmovsxwd
__m256i _mm256_cvtepi16_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepi16_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxwd ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 16-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 7 i := 32*j k := 16*j dst[i+31:i] := SignExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxwq
__m256i _mm256_cvtepi16_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi16_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxwq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 16-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 16*j dst[i+63:i] := SignExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxdq
__m256i _mm256_cvtepi32_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi32_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxdq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 32-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 32*j dst[i+63:i] := SignExtend(a[k+31:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vcvtdq2pd
__m256d _mm256_cvtepi32_pd (__m128i a)

Synopsis

__m256d _mm256_cvtepi32_pd (__m128i a)
#include <immintrin.h>
Instruction: vcvtdq2pd ymm, xmm
CPUID Flags: AVX

Description

Convert packed 32-bit integers in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 m := j*64 dst[m+63:m] := Convert_Int32_To_FP64(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtdq2ps
__m256 _mm256_cvtepi32_ps (__m256i a)

Synopsis

__m256 _mm256_cvtepi32_ps (__m256i a)
#include <immintrin.h>
Instruction: vcvtdq2ps ymm, ymm
CPUID Flags: AVX

Description

Convert packed 32-bit integers in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_Int32_To_FP32(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
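
Example

A sketch (not from the guide; assumes AVX) converting integers to floats before arithmetic:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i ints = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256 flts  = _mm256_cvtepi32_ps(ints);
    __m256 half  = _mm256_mul_ps(flts, _mm256_set1_ps(0.5f));
    float out[8];
    _mm256_storeu_ps(out, half);
    printf("%.1f %.1f\n", out[0], out[7]);  /* 0.5 4.0 */
    return 0;
}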
vpmovsxbw
__m256i _mm256_cvtepi8_epi16 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxbw ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in a to packed 16-bit integers, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*8 l := j*16 dst[l+15:l] := SignExtend(a[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxbd
__m256i _mm256_cvtepi8_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxbd ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j k := 8*j dst[i+31:i] := SignExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxbq
__m256i _mm256_cvtepi8_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxbq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in the low 8 bytes of a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j k := 8*j dst[i+63:i] := SignExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxwd
__m256i _mm256_cvtepu16_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepu16_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxwd ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 16-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j k := 16*j dst[i+31:i] := ZeroExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxwq
__m256i _mm256_cvtepu16_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu16_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxwq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 16-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 16*j dst[i+63:i] := ZeroExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxdq
__m256i _mm256_cvtepu32_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu32_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxdq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 32-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 32*j dst[i+63:i] := ZeroExtend(a[k+31:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbw
__m256i _mm256_cvtepu8_epi16 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxbw ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in a to packed 16-bit integers, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*8 l := j*16 dst[l+15:l] := ZeroExtend(a[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbd
__m256i _mm256_cvtepu8_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxbd ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j k := 8*j dst[i+31:i] := ZeroExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbq
__m256i _mm256_cvtepu8_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxbq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in the low 8 bytes of a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j k := 8*j dst[i+63:i] := ZeroExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vcvtpd2dq
__m128i _mm256_cvtpd_epi32 (__m256d a)

Synopsis

__m128i _mm256_cvtpd_epi32 (__m256d a)
#include <immintrin.h>
Instruction: vcvtpd2dq xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_Int32(a[k+63:k]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtpd2ps
__m128 _mm256_cvtpd_ps (__m256d a)

Synopsis

__m128 _mm256_cvtpd_ps (__m256d a)
#include <immintrin.h>
Instruction: vcvtpd2ps xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_FP32(a[k+63:k]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtps2dq
__m256i _mm256_cvtps_epi32 (__m256 a)

Synopsis

__m256i _mm256_cvtps_epi32 (__m256 a)
#include <immintrin.h>
Instruction: vcvtps2dq ymm, ymm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_FP32_To_Int32(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcvtps2pd
__m256d _mm256_cvtps_pd (__m128 a)

Synopsis

__m256d _mm256_cvtps_pd (__m128 a)
#include <immintrin.h>
Instruction: vcvtps2pd ymm, xmm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j k := 32*j dst[i+63:i] := Convert_FP32_To_FP64(a[k+31:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 1
Ivy Bridge 2 1
Sandy Bridge 2 1
vcvttpd2dq
__m128i _mm256_cvttpd_epi32 (__m256d a)

Synopsis

__m128i _mm256_cvttpd_epi32 (__m256d a)
#include <immintrin.h>
Instruction: vcvttpd2dq xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.

Operation

FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_Int32_Truncate(a[k+63:k]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvttps2dq
__m256i _mm256_cvttps_epi32 (__m256 a)

Synopsis

__m256i _mm256_cvttps_epi32 (__m256 a)
#include <immintrin.h>
Instruction: vcvttps2dq ymm, ymm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_FP32_To_Int32_Truncate(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
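
Example

The truncating form matches C's float-to-int cast, while vcvtps2dq rounds under the current MXCSR mode (round-to-nearest by default). A sketch (not from the guide; assumes AVX):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 x  = _mm256_set1_ps(2.7f);
    __m256i t = _mm256_cvttps_epi32(x);   /* truncate toward zero: 2 */
    __m256i r = _mm256_cvtps_epi32(x);    /* round to nearest:     3 */
    int to[8], ro[8];
    _mm256_storeu_si256((__m256i *)to, t);
    _mm256_storeu_si256((__m256i *)ro, r);
    printf("trunc=%d round=%d\n", to[0], ro[0]);  /* trunc=2 round=3 */
    return 0;
}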
vdivpd
__m256d _mm256_div_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_div_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vdivpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Divide packed double-precision (64-bit) floating-point elements in a by packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j dst[i+63:i] := a[i+63:i] / b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 35 25
Ivy Bridge 35 28
Sandy Bridge 43 44
vdivps
__m256 _mm256_div_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_div_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vdivps ymm, ymm, ymm
CPUID Flags: AVX

Description

Divide packed single-precision (32-bit) floating-point elements in a by packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := a[i+31:i] / b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 21 13
Ivy Bridge 21 14
Sandy Bridge 29 28
vdpps
__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vdpps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Conditionally multiply the packed single-precision (32-bit) floating-point elements in a and b using the high 4 bits in imm8, sum the four products, and conditionally store the sum in dst using the low 4 bits of imm8.

Operation

DP(a[127:0], b[127:0], imm8[7:0]) {
FOR j := 0 to 3
i := j*32
IF imm8[(4+j)%8]
temp[i+31:i] := a[i+31:i] * b[i+31:i]
ELSE
temp[i+31:i] := 0
FI
ENDFOR
sum[31:0] := (temp[127:96] + temp[95:64]) + (temp[63:32] + temp[31:0])
FOR j := 0 to 3
i := j*32
IF imm8[j%8]
tmpdst[i+31:i] := sum[31:0]
ELSE
tmpdst[i+31:i] := 0
FI
ENDFOR
RETURN tmpdst[127:0]
}
dst[127:0] := DP(a[127:0], b[127:0], imm8[7:0])
dst[255:128] := DP(a[255:128], b[255:128], imm8[7:0])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 14 2
Ivy Bridge 12 2
Sandy Bridge 12 2
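
Example

vdpps effectively computes two independent 4-element dot products, one per 128-bit lane. A sketch (not from the guide; assumes AVX):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 a = _mm256_setr_ps(1, 2, 3, 4,  1, 1, 1, 1);
    __m256 b = _mm256_setr_ps(1, 1, 1, 1,  2, 2, 2, 2);
    /* 0xF1: multiply all four elements per lane, store the sum in element 0 of that lane */
    __m256 dp = _mm256_dp_ps(a, b, 0xF1);
    float out[8];
    _mm256_storeu_ps(out, dp);
    printf("low: %.0f high: %.0f\n", out[0], out[4]);  /* low: 10 high: 8 */
    return 0;
}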
__int16 _mm256_extract_epi16 (__m256i a, const int index)

Synopsis

__int16 _mm256_extract_epi16 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract a 16-bit integer from a, selected with index, and store the result in dst.

Operation

dst[15:0] := (a[255:0] >> (index * 16))[15:0]
__int32 _mm256_extract_epi32 (__m256i a, const int index)

Synopsis

__int32 _mm256_extract_epi32 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract a 32-bit integer from a, selected with index, and store the result in dst.

Operation

dst[31:0] := (a[255:0] >> (index * 32))[31:0]
__int64 _mm256_extract_epi64 (__m256i a, const int index)

Synopsis

__int64 _mm256_extract_epi64 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract a 64-bit integer from a, selected with index, and store the result in dst.

Operation

dst[63:0] := (a[255:0] >> (index * 64))[63:0]
__int8 _mm256_extract_epi8 (__m256i a, const int index)

Synopsis

__int8 _mm256_extract_epi8 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract an 8-bit integer from a, selected with index, and store the result in dst.

Operation

dst[7:0] := (a[255:0] >> (index * 8))[7:0]
vextractf128
__m128d _mm256_extractf128_pd (__m256d a, const int imm8)

Synopsis

__m128d _mm256_extractf128_pd (__m256d a, const int imm8)
#include <immintrin.h>
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextractf128
__m128 _mm256_extractf128_ps (__m256 a, const int imm8)

Synopsis

__m128 _mm256_extractf128_ps (__m256 a, const int imm8)
#include <immintrin.h>
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextractf128
__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)

Synopsis

__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextracti128
__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)

Synopsis

__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vextracti128 xmm, ymm, imm
CPUID Flags: AVX2

Description

Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vroundpd
__m256d _mm256_floor_pd (__m256d a)

Synopsis

__m256d _mm256_floor_pd (__m256d a)
#include <immintrin.h>
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a down to an integer value, and store the results as packed double-precision floating-point elements in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := FLOOR(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_floor_ps (__m256 a)

Synopsis

__m256 _mm256_floor_ps (__m256 a)
#include <immintrin.h>
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a down to an integer value, and store the results as packed single-precision floating-point elements in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := FLOOR(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vphaddw
__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphaddw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := a[31:16] + a[15:0]
dst[31:16] := a[63:48] + a[47:32]
dst[47:32] := a[95:80] + a[79:64]
dst[63:48] := a[127:112] + a[111:96]
dst[79:64] := b[31:16] + b[15:0]
dst[95:80] := b[63:48] + b[47:32]
dst[111:96] := b[95:80] + b[79:64]
dst[127:112] := b[127:112] + b[111:96]
dst[143:128] := a[159:144] + a[143:128]
dst[159:144] := a[191:176] + a[175:160]
dst[175:160] := a[223:208] + a[207:192]
dst[191:176] := a[255:240] + a[239:224]
dst[207:192] := b[159:144] + b[143:128]
dst[223:208] := b[191:176] + b[175:160]
dst[239:224] := b[223:208] + b[207:192]
dst[255:240] := b[255:240] + b[239:224]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
vphaddd
__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphaddd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.

Operation

dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]
dst[159:128] := a[191:160] + a[159:128]
dst[191:160] := a[255:224] + a[223:192]
dst[223:192] := b[191:160] + b[159:128]
dst[255:224] := b[255:224] + b[223:192]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
vhaddpd
__m256d _mm256_hadd_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_hadd_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vhaddpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally add adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[63:0] := a[127:64] + a[63:0]
dst[127:64] := b[127:64] + b[63:0]
dst[191:128] := a[255:192] + a[191:128]
dst[255:192] := b[255:192] + b[191:128]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vhaddps
__m256 _mm256_hadd_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_hadd_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vhaddps ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally add adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]
dst[159:128] := a[191:160] + a[159:128]
dst[191:160] := a[255:224] + a[223:192]
dst[223:192] := b[191:160] + b[159:128]
dst[255:224] := b[255:224] + b[223:192]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
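
Example

Two rounds of vhaddps plus one cross-lane add give a full horizontal sum. A sketch (not from the guide; assumes AVX):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 v = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
    __m256 s = _mm256_hadd_ps(v, v);   /* per lane: 3 7 3 7 | 11 15 11 15 */
    s = _mm256_hadd_ps(s, s);          /* per lane: 10 10 10 10 | 26 26 26 26 */
    __m128 lo = _mm256_castps256_ps128(s);
    __m128 hi = _mm256_extractf128_ps(s, 1);
    printf("%.0f\n", _mm_cvtss_f32(_mm_add_ss(lo, hi)));  /* 36 */
    return 0;
}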
vphaddsw
__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphaddsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := Saturate_To_Int16(a[31:16] + a[15:0])
dst[31:16] := Saturate_To_Int16(a[63:48] + a[47:32])
dst[47:32] := Saturate_To_Int16(a[95:80] + a[79:64])
dst[63:48] := Saturate_To_Int16(a[127:112] + a[111:96])
dst[79:64] := Saturate_To_Int16(b[31:16] + b[15:0])
dst[95:80] := Saturate_To_Int16(b[63:48] + b[47:32])
dst[111:96] := Saturate_To_Int16(b[95:80] + b[79:64])
dst[127:112] := Saturate_To_Int16(b[127:112] + b[111:96])
dst[143:128] := Saturate_To_Int16(a[159:144] + a[143:128])
dst[159:144] := Saturate_To_Int16(a[191:176] + a[175:160])
dst[175:160] := Saturate_To_Int16(a[223:208] + a[207:192])
dst[191:176] := Saturate_To_Int16(a[255:240] + a[239:224])
dst[207:192] := Saturate_To_Int16(b[159:144] + b[143:128])
dst[223:208] := Saturate_To_Int16(b[191:176] + b[175:160])
dst[239:224] := Saturate_To_Int16(b[223:208] + b[207:192])
dst[255:240] := Saturate_To_Int16(b[255:240] + b[239:224])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
vphsubw
__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphsubw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := a[15:0] - a[31:16]
dst[31:16] := a[47:32] - a[63:48]
dst[47:32] := a[79:64] - a[95:80]
dst[63:48] := a[111:96] - a[127:112]
dst[79:64] := b[15:0] - b[31:16]
dst[95:80] := b[47:32] - b[63:48]
dst[111:96] := b[79:64] - b[95:80]
dst[127:112] := b[111:96] - b[127:112]
dst[143:128] := a[143:128] - a[159:144]
dst[159:144] := a[175:160] - a[191:176]
dst[175:160] := a[207:192] - a[223:208]
dst[191:176] := a[239:224] - a[255:240]
dst[207:192] := b[143:128] - b[159:144]
dst[223:208] := b[175:160] - b[191:176]
dst[239:224] := b[207:192] - b[223:208]
dst[255:240] := b[239:224] - b[255:240]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vphsubd
__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphsubd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.

Operation

dst[31:0] := a[31:0] - a[63:32]
dst[63:32] := a[95:64] - a[127:96]
dst[95:64] := b[31:0] - b[63:32]
dst[127:96] := b[95:64] - b[127:96]
dst[159:128] := a[159:128] - a[191:160]
dst[191:160] := a[223:192] - a[255:224]
dst[223:192] := b[159:128] - b[191:160]
dst[255:224] := b[223:192] - b[255:224]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vhsubpd
__m256d _mm256_hsub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_hsub_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vhsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally subtract adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[63:0] := a[63:0] - a[127:64]
dst[127:64] := b[63:0] - b[127:64]
dst[191:128] := a[191:128] - a[255:192]
dst[255:192] := b[191:128] - b[255:192]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vhsubps
__m256 _mm256_hsub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_hsub_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vhsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally subtract adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[31:0] := a[31:0] - a[63:32]
dst[63:32] := a[95:64] - a[127:96]
dst[95:64] := b[31:0] - b[63:32]
dst[127:96] := b[95:64] - b[127:96]
dst[159:128] := a[159:128] - a[191:160]
dst[191:160] := a[223:192] - a[255:224]
dst[223:192] := b[159:128] - b[191:160]
dst[255:224] := b[223:192] - b[255:224]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vphsubsw
__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphsubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := Saturate_To_Int16(a[15:0] - a[31:16])
dst[31:16] := Saturate_To_Int16(a[47:32] - a[63:48])
dst[47:32] := Saturate_To_Int16(a[79:64] - a[95:80])
dst[63:48] := Saturate_To_Int16(a[111:96] - a[127:112])
dst[79:64] := Saturate_To_Int16(b[15:0] - b[31:16])
dst[95:80] := Saturate_To_Int16(b[47:32] - b[63:48])
dst[111:96] := Saturate_To_Int16(b[79:64] - b[95:80])
dst[127:112] := Saturate_To_Int16(b[111:96] - b[127:112])
dst[143:128] := Saturate_To_Int16(a[143:128] - a[159:144])
dst[159:144] := Saturate_To_Int16(a[175:160] - a[191:176])
dst[175:160] := Saturate_To_Int16(a[207:192] - a[223:208])
dst[191:176] := Saturate_To_Int16(a[239:224] - a[255:240])
dst[207:192] := Saturate_To_Int16(b[143:128] - b[159:144])
dst[223:208] := Saturate_To_Int16(b[175:160] - b[191:176])
dst[239:224] := Saturate_To_Int16(b[207:192] - b[223:208])
dst[255:240] := Saturate_To_Int16(b[239:224] - b[255:240])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
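The saturating variant clamps differences that overflow 16 bits, where the plain vphsubw would wrap. A small sketch (values illustrative):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* INT16_MIN - 1 would wrap to +32767 with _mm256_hsub_epi16;
       the saturating form pins it at INT16_MIN instead. */
    __m256i a = _mm256_setr_epi16(INT16_MIN, 1, 0,0,0,0,0,0, 0,0,0,0,0,0,0,0);
    __m256i d = _mm256_hsubs_epi16(a, a);
    printf("%d\n", (int16_t)_mm256_extract_epi16(d, 0));  /* -32768 */
    return 0;
}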
vpgatherdd
__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
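A table-lookup sketch (array contents illustrative; note that scale must be a compile-time constant, here 4 = sizeof(int)):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    int table[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    __m128i idx = _mm_setr_epi32(0, 2, 4, 6);          /* element indices into table */
    __m128i g   = _mm_i32gather_epi32(table, idx, 4);  /* {10, 12, 14, 16} */
    int out[4];
    _mm_storeu_si128((__m128i *)out, g);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}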
vpgatherdd
__m128i _mm_mask_i32gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i32gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdd
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd ymm, vm32y, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdd
__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd ymm, vm32y, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
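A sketch of the masked form. Lanes whose mask element has its high bit clear keep the src value and perform no memory access, so out-of-range indices in masked-off lanes are safe (values illustrative):

#include <immintrin.h>

int main(void) {
    int table[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    __m256i src  = _mm256_set1_epi32(-1);                   /* per-lane fallback */
    __m256i idx  = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i mask = _mm256_setr_epi32(-1, 0, -1, 0, -1, 0, -1, 0);
    /* result: {7, -1, 5, -1, 3, -1, 1, -1} */
    __m256i g = _mm256_mask_i32gather_epi32(src, table, idx, mask, 4);
    (void)g;
    return 0;
}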
vpgatherdq
__m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdq xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    m := j*32
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherdq xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    m := j*32
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdq ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    m := j*32
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherdq ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    m := j*32
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
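Note the asymmetry in the 256-bit form: four 64-bit results need only four 32-bit indices, so vindex is a __m128i. A sketch (long long assumed to match __int64; values illustrative):

#include <immintrin.h>

int main(void) {
    long long table[4] = {100, 200, 300, 400};
    __m128i idx = _mm_setr_epi32(3, 2, 1, 0);
    /* result: {400, 300, 200, 100} */
    __m256i g = _mm256_i32gather_epi64((const long long *)table, idx, 8);
    (void)g;
    return 0;
}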
vgatherdpd
__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherdpd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    m := j*32
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)

Synopsis

__m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
#include "immintrin.h"
Instruction: vgatherdpd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    m := j*32
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherdpd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    m := j*32
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256d mask, const int scale)

Synopsis

__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256d mask, const int scale)
#include "immintrin.h"
Instruction: vgatherdpd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    m := j*32
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherdps xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m128 _mm_mask_i32gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm_mask_i32gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
#include "immintrin.h"
Instruction: vgatherdps xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherdps ymm, vm32y, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256 mask, const int scale)

Synopsis

__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256 mask, const int scale)
#include "immintrin.h"
Instruction: vgatherdps ymm, vm32y, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqd
__m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherqd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*32
    m := j*64
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
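The mirror-image case: two 64-bit indices yield only two 32-bit values, which land in the low 64 bits of the result (the rest is zeroed, per dst[MAX:64] := 0 above). A sketch (values illustrative):

#include <immintrin.h>

int main(void) {
    int table[8] = {0, 10, 20, 30, 40, 50, 60, 70};
    __m128i idx = _mm_set_epi64x(3, 1);               /* element 0 = 1, element 1 = 3 */
    __m128i g   = _mm_i64gather_epi32(table, idx, 4); /* {10, 30, 0, 0} */
    (void)g;
    return 0;
}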
vpgatherqd
__m128i _mm_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherqd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*32
    m := j*64
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:64] := 0
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqd
__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

Synopsis

__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherqd xmm, vm64y, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    m := j*64
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:128] := 0
vpgatherqd
__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherqd xmm, vm64y, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    m := j*64
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0
vpgatherqq
__m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherqq xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherqq xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherqq ymm, vm64y, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherqq ymm, vm64y, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherqpd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m128d _mm_mask_i64gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)

Synopsis

__m128d _mm_mask_i64gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
#include "immintrin.h"
Instruction: vgatherqpd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherqpd ymm, vm64y, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256d mask, const int scale)

Synopsis

__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256d mask, const int scale)
#include "immintrin.h"
Instruction: vgatherqpd ymm, vm64y, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherqps xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*32
    m := j*64
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
#include "immintrin.h"
Instruction: vgatherqps xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*32
    m := j*64
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:64] := 0
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)

Synopsis

__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherqps xmm, vm64y, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    m := j*64
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128 mask, const int scale)
#include "immintrin.h"
Instruction: vgatherqps xmm, vm64y, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    m := j*64
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
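As with the other 64-bit-index gathers, four quadword indices yield only four floats, so this 256-bit form returns a __m128. A sketch (values illustrative):

#include <immintrin.h>

int main(void) {
    float table[8] = {0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f};
    __m256i idx = _mm256_set_epi64x(6, 4, 2, 0);       /* elements 0..3 = 0, 2, 4, 6 */
    __m128  g   = _mm256_i64gather_ps(table, idx, 4);  /* {0, 2, 4, 6} */
    (void)g;
    return 0;
}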
__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)

Synopsis

__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)
#include "immintrin.h"
CPUID Flags: AVX

Description

Copy a to dst, and insert the 16-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*16
dst[sel+15:sel] := i[15:0]
__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)

Synopsis

__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)
#include "immintrin.h"
CPUID Flags: AVX

Description

Copy a to dst, and insert the 32-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*32
dst[sel+31:sel] := i[31:0]
__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)

Synopsis

__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)
#include "immintrin.h"
CPUID Flags: AVX

Description

Copy a to dst, and insert the 64-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*64
dst[sel+63:sel] := i[63:0]
__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)

Synopsis

__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)
#include "immintrin.h"
CPUID Flags: AVX

Description

Copy a to dst, and insert the 8-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*8
dst[sel+7:sel] := i[7:0]
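These insert intrinsics are composite operations (no single instruction is listed); note that index must be a compile-time constant and the source vector is returned modified rather than changed in place. A sketch (values illustrative):

#include <immintrin.h>

int main(void) {
    __m256i v = _mm256_setzero_si256();
    v = _mm256_insert_epi32(v, 42, 3);   /* 32-bit element 3 becomes 42 */
    v = _mm256_insert_epi8(v, 7, 31);    /* highest byte becomes 7 */
    (void)v;
    return 0;
}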
vinsertf128
__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)

Synopsis

__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)
#include "immintrin.h"
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[1:0]) of
0: dst[127:0] := b[127:0]
1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)

Synopsis

__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)
#include "immintrin.h"
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[1:0]) of
0: dst[127:0] := b[127:0]
1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)

Synopsis

__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)
#include "immintrin.h"
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[1:0]) of
0: dst[127:0] := b[127:0]
1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinserti128
__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)

Synopsis

__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)
#include "immintrin.h"
Instruction: vinserti128 ymm, ymm, xmm, imm
CPUID Flags: AVX2

Description

Copy a to dst, then insert 128 bits (composed of integer data) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[1:0]) of
0: dst[127:0] := b[127:0]
1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
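A common use of vinserti128 is stitching two 128-bit values into one 256-bit register: the cast leaves the upper lane undefined, and the insert then overwrites it. A sketch:

#include <immintrin.h>

int main(void) {
    __m128i lo = _mm_set1_epi32(1);
    __m128i hi = _mm_set1_epi32(2);
    /* lower lane = lo, upper lane = hi */
    __m256i v = _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
    (void)v;
    return 0;
}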
vlddqu
__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
#include "immintrin.h"
Instruction: vlddqu ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovapd
__m256d _mm256_load_pd (double const * mem_addr)

Synopsis

__m256d _mm256_load_pd (double const * mem_addr)
#include "immintrin.h"
Instruction: vmovapd ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovaps
__m256 _mm256_load_ps (float const * mem_addr)

Synopsis

__m256 _mm256_load_ps (float const * mem_addr)
#include "immintrin.h"
Instruction: vmovaps ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovdqa
__m256i _mm256_load_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_load_si256 (__m256i const * mem_addr)
#include "immintrin.h"
Instruction: vmovdqa ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
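The aligned loads fault on misaligned addresses, so the buffer must be forced to 32-byte alignment; otherwise use the vmovdqu/loadu forms below. A C11 sketch:

#include <immintrin.h>
#include <stdalign.h>

int main(void) {
    alignas(32) int buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m256i v = _mm256_load_si256((const __m256i *)buf);   /* vmovdqa */
    (void)v;
    return 0;
}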
vmovupd
__m256d _mm256_loadu_pd (double const * mem_addr)

Synopsis

__m256d _mm256_loadu_pd (double const * mem_addr)
#include "immintrin.h"
Instruction: vmovupd ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovups
__m256 _mm256_loadu_ps (float const * mem_addr)

Synopsis

__m256 _mm256_loadu_ps (float const * mem_addr)
#include "immintrin.h"
Instruction: vmovups ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovdqu
__m256i _mm256_loadu_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_loadu_si256 (__m256i const * mem_addr)
#include "immintrin.h"
Instruction: vmovdqu ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)

Synopsis

__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)
#include "immintrin.h"
CPUID Flags: AVX

Description

Load two 128-bit values (composed of 4 packed single-precision (32-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr] dst[255:128] := MEM[hiaddr+127:hiaddr] dst[MAX:256] := 0
__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)

Synopsis

__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)
#include "immintrin.h"
CPUID Flags: AVX

Description

Load two 128-bit values (composed of 2 packed double-precision (64-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr] dst[255:128] := MEM[hiaddr+127:hiaddr] dst[MAX:256] := 0
__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)

Synopsis

__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)
#include "immintrin.h"
CPUID Flags: AVX

Description

Load two 128-bit values (composed of integer data) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr] dst[255:128] := MEM[hiaddr+127:hiaddr] dst[MAX:256] := 0
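The loadu2 forms are composite (sequence) intrinsics, roughly a 128-bit load of loaddr plus a vinsertf128 of hiaddr into the upper lane; availability depends on the compiler's header version. Note that hiaddr comes first in the argument list. A sketch:

#include <immintrin.h>

int main(void) {
    float lo[4] = {0.f, 1.f, 2.f, 3.f};
    float hi[4] = {4.f, 5.f, 6.f, 7.f};
    __m256 v = _mm256_loadu2_m128(hi, lo);   /* {0,1,2,3, 4,5,6,7} */
    (void)v;
    return 0;
}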
vpmaddwd
__m256i _mm256_madd_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_madd_epi16 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpmaddwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i+16]*b[i+31:i+16] + a[i+15:i]*b[i+15:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
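vpmaddwd is the usual building block for 16-bit dot products: it does the multiplies and the first level of the horizontal reduction in one instruction. A sketch (values illustrative):

#include <immintrin.h>

int main(void) {
    __m256i a = _mm256_set1_epi16(3);
    __m256i b = _mm256_set1_epi16(2);
    /* each 32-bit lane = 3*2 + 3*2 = 12; summing the eight lanes
       afterwards finishes a full dot product */
    __m256i partial = _mm256_madd_epi16(a, b);
    (void)partial;
    return 0;
}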
vpmaddubsw
__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpmaddubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Vertically multiply each unsigned 8-bit integer from a with the corresponding signed 8-bit integer from b, producing intermediate signed 16-bit integers. Horizontally add adjacent pairs of intermediate signed 16-bit integers, and pack the saturated results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_Int16( a[i+15:i+8]*b[i+15:i+8] + a[i+7:i]*b[i+7:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmaskmovd
__m128i _mm_maskload_epi32 (int const* mem_addr, __m128i mask)

Synopsis

__m128i _mm_maskload_epi32 (int const* mem_addr, __m128i mask)
#include "immintrin.h"
Instruction: vpmaskmovd xmm, xmm, m128
CPUID Flags: AVX2

Description

Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovd
__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)

Synopsis

__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)
#include "immintrin.h"
Instruction: vpmaskmovd ymm, ymm, m256
CPUID Flags: AVX2

Description

Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
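A standard use of maskload is reading the ragged tail of an array without touching memory past its end: build the mask by comparing lane indices against the remaining element count, since masked-off lanes perform no load. A sketch (values illustrative):

#include <immintrin.h>

int main(void) {
    int data[5] = {1, 2, 3, 4, 5};   /* only 5 valid elements */
    int n = 5;
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* all-ones (high bit set) exactly where lane < n */
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
    __m256i v = _mm256_maskload_epi32(data, mask);   /* {1,2,3,4,5,0,0,0} */
    (void)v;
    return 0;
}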
vpmaskmovq
__m128i _mm_maskload_epi64 (__int64 const* mem_addr, __m128i mask)

Synopsis

__m128i _mm_maskload_epi64 (__int64 const* mem_addr, __m128i mask)
#include "immintrin.h"
Instruction: vpmaskmovq xmm, xmm, m128
CPUID Flags: AVX2

Description

Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovq
__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)

Synopsis

__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)
#include "immintrin.h"
Instruction: vpmaskmovq ymm, ymm, m256
CPUID Flags: AVX2

Description

Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vmaskmovpd
__m128d _mm_maskload_pd (double const * mem_addr, __m128i mask)

Synopsis

__m128d _mm_maskload_pd (double const * mem_addr, __m128i mask)
#include "immintrin.h"
Instruction: vmaskmovpd xmm, xmm, m128
CPUID Flags: AVX

Description

Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovpd
__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)

Synopsis

__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)
#include "immintrin.h"
Instruction: vmaskmovpd ymm, ymm, m256
CPUID Flags: AVX

Description

Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovps
__m128 _mm_maskload_ps (float const * mem_addr, __m128i mask)

Synopsis

__m128 _mm_maskload_ps (float const * mem_addr, __m128i mask)
#include "immintrin.h"
Instruction: vmaskmovps xmm, xmm, m128
CPUID Flags: AVX

Description

Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovps
__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)

Synopsis

__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)
#include "immintrin.h"
Instruction: vmaskmovps ymm, ymm, m256
CPUID Flags: AVX

Description

Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vpmaskmovd
void _mm_maskstore_epi32 (int* mem_addr, __m128i mask, __m128i a)

Synopsis

void _mm_maskstore_epi32 (int* mem_addr, __m128i mask, __m128i a)
#include "immintrin.h"
Instruction: vpmaskmovd m128, xmm, xmm
CPUID Flags: AVX2

Description

Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovd
void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)

Synopsis

void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)
#include "immintrin.h"
Instruction: vpmaskmovd m256, ymm, ymm
CPUID Flags: AVX2

Description

Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
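The store-side counterpart writes only the selected lanes and leaves the rest of memory untouched, which pairs with the maskload idiom above for loop tails. A sketch (values illustrative):

#include <immintrin.h>

int main(void) {
    int out[8] = {0};
    int n = 5;
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
    __m256i v = _mm256_set1_epi32(9);
    _mm256_maskstore_epi32(out, mask, v);   /* out = {9,9,9,9,9,0,0,0} */
    return 0;
}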
vpmaskmovq
void _mm_maskstore_epi64 (__int64* mem_addr, __m128i mask, __m128i a)

Synopsis

void _mm_maskstore_epi64 (__int64* mem_addr, __m128i mask, __m128i a)
#include "immintrin.h"
Instruction: vpmaskmovq m128, xmm, xmm
CPUID Flags: AVX2

Description

Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovq
void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)

Synopsis

void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)
#include "immintrin.h"
Instruction: vpmaskmovq m256, ymm, ymm
CPUID Flags: AVX2

Description

Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vmaskmovpd
void _mm_maskstore_pd (double * mem_addr, __m128i mask, __m128d a)

Synopsis

void _mm_maskstore_pd (double * mem_addr, __m128i mask, __m128d a)
#include "immintrin.h"
Instruction: vmaskmovpd m128, xmm, xmm
CPUID Flags: AVX

Description

Store packed double-precision (64-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovpd
void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)

Synopsis

void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)
#include "immintrin.h"
Instruction: vmaskmovpd m256, ymm, ymm
CPUID Flags: AVX

Description

Store packed double-precision (64-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovps
void _mm_maskstore_ps (float * mem_addr, __m128i mask, __m128 a)

Synopsis

void _mm_maskstore_ps (float * mem_addr, __m128i mask, __m128 a)
#include "immintrin.h"
Instruction: vmaskmovps m128, xmm, xmm
CPUID Flags: AVX

Description

Store packed single-precision (32-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovps
void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)

Synopsis

void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)
#include "immintrin.h"
Instruction: vmaskmovps m256, ymm, ymm
CPUID Flags: AVX

Description

Store packed single-precision (32-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vpmaxsw
__m256i _mm256_max_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi16 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpmaxsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] > b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxsd
__m256i _mm256_max_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi32 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpmaxsd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] > b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
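Composing the packed max and min (see vpminsd below) gives a branchless per-lane clamp, a very common use of this pair. A sketch (values illustrative):

#include <immintrin.h>

int main(void) {
    __m256i x  = _mm256_setr_epi32(-5, 0, 100, 300, 7, -1, 255, 256);
    __m256i lo = _mm256_set1_epi32(0);
    __m256i hi = _mm256_set1_epi32(255);
    /* clamp each lane into [0, 255]: {0, 0, 100, 255, 7, 0, 255, 255} */
    __m256i clamped = _mm256_min_epi32(_mm256_max_epi32(x, lo), hi);
    (void)clamped;
    return 0;
}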
vpmaxsb
__m256i _mm256_max_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi8 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpmaxsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] > b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxuw
__m256i _mm256_max_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu16 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpmaxuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 16-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] > b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxud
__m256i _mm256_max_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu32 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpmaxud ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 32-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] > b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpmaxub
__m256i _mm256_max_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu8 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpmaxub ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 8-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] > b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vmaxpd
__m256d _mm256_max_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_max_pd (__m256d a, __m256d b)
#include "immintrin.h"
Instruction: vmaxpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := MAX(a[i+63:i], b[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vmaxps
__m256 _mm256_max_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_max_ps (__m256 a, __m256 b)
#include "immintrin.h"
Instruction: vmaxps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MAX(a[i+31:i], b[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpminsw
__m256i _mm256_min_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi16 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpminsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] < b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminsd
__m256i _mm256_min_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi32 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpminsd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] < b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminsb
__m256i _mm256_min_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi8 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpminsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] < b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminuw
__m256i _mm256_min_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu16 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpminuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 16-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] < b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminud
__m256i _mm256_min_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu32 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpminud ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 32-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] < b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminub
__m256i _mm256_min_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu8 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpminub ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 8-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] < b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vminpd
__m256d _mm256_min_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_min_pd (__m256d a, __m256d b)
#include "immintrin.h"
Instruction: vminpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := MIN(a[i+63:i], b[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vminps
__m256 _mm256_min_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_min_ps (__m256 a, __m256 b)
#include "immintrin.h"
Instruction: vminps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MIN(a[i+31:i], b[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vmovddup
__m256d _mm256_movedup_pd (__m256d a)

Synopsis

__m256d _mm256_movedup_pd (__m256d a)
#include "immintrin.h"
Instruction: vmovddup ymm, ymm
CPUID Flags: AVX

Description

Duplicate even-indexed double-precision (64-bit) floating-point elements from a, and store the results in dst.

Operation

dst[63:0] := a[63:0]
dst[127:64] := a[63:0]
dst[191:128] := a[191:128]
dst[255:192] := a[191:128]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vmovshdup
__m256 _mm256_movehdup_ps (__m256 a)

Synopsis

__m256 _mm256_movehdup_ps (__m256 a)
#include "immintrin.h"
Instruction: vmovshdup ymm, ymm
CPUID Flags: AVX

Description

Duplicate odd-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.

Operation

dst[31:0] := a[63:32]
dst[63:32] := a[63:32]
dst[95:64] := a[127:96]
dst[127:96] := a[127:96]
dst[159:128] := a[191:160]
dst[191:160] := a[191:160]
dst[223:192] := a[255:224]
dst[255:224] := a[255:224]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vmovsldup
__m256 _mm256_moveldup_ps (__m256 a)

Synopsis

__m256 _mm256_moveldup_ps (__m256 a)
#include «immintrin.h»
Instruction: vmovsldup ymm, ymm
CPUID Flags: AVX

Description

Duplicate even-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.

Operation

dst[31:0] := a[31:0] dst[63:32] := a[31:0] dst[95:64] := a[95:64] dst[127:96] := a[95:64] dst[159:128] := a[159:128] dst[191:160] := a[159:128] dst[223:192] := a[223:192] dst[255:224] := a[223:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpmovmskb
int _mm256_movemask_epi8 (__m256i a)

Synopsis

int _mm256_movemask_epi8 (__m256i a)
#include «immintrin.h»
Instruction: vpmovmskb r32, ymm
CPUID Flags: AVX2

Description

Create mask from the most significant bit of each 8-bit element in a, and store the result in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[j] := a[i+7] ENDFOR

Performance

Architecture Latency Throughput
Haswell 3
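
Example

A sketch, not from the guide, of the usual search idiom: vpcmpeqb produces a byte mask, vpmovmskb compresses it to 32 bits, and a scalar bit-scan finds the first match. find_byte32 is a hypothetical helper and __builtin_ctz is a GCC/Clang builtin.

#include <immintrin.h>

/* Index of the first byte equal to c in the 32 bytes at p, or -1 if none. */
static int find_byte32(const unsigned char *p, unsigned char c) {
    __m256i blk   = _mm256_loadu_si256((const __m256i *)p);
    __m256i eq    = _mm256_cmpeq_epi8(blk, _mm256_set1_epi8((char)c));
    unsigned mask = (unsigned)_mm256_movemask_epi8(eq); /* bit j = match at byte j */
    return mask ? __builtin_ctz(mask) : -1;
}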
vmovmskpd
int _mm256_movemask_pd (__m256d a)

Synopsis

int _mm256_movemask_pd (__m256d a)
#include «immintrin.h»
Instruction: vmovmskpd r32, ymm
CPUID Flags: AVX

Description

Set each bit of mask dst based on the most significant bit of the corresponding packed double-precision (64-bit) floating-point element in a.

Operation

FOR j := 0 to 3 i := j*64 IF a[i+63] dst[j] := 1 ELSE dst[j] := 0 FI ENDFOR dst[MAX:4] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 2
Sandy Bridge 2
vmovmskps
int _mm256_movemask_ps (__m256 a)

Synopsis

int _mm256_movemask_ps (__m256 a)
#include «immintrin.h»
Instruction: vmovmskps r32, ymm
CPUID Flags: AVX

Description

Set each bit of mask dst based on the most significant bit of the corresponding packed single-precision (32-bit) floating-point element in a.

Operation

FOR j := 0 to 7 i := j*32 IF a[i+31] dst[j] := 1 ELSE dst[j] := 0 FI ENDFOR dst[MAX:8] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 2
Sandy Bridge 2
vmpsadbw
__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)
#include «immintrin.h»
Instruction: vmpsadbw ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Compute the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in a compared to those in b, and store the 16-bit results in dst. Eight SADs are performed for each 128-bit lane using one quadruplet from b and eight quadruplets from a. One quadruplet is selected from b starting at the offset specified in imm8. Eight quadruplets are formed from sequential 8-bit integers selected from a starting at the offset specified in imm8.

Operation

MPSADBW(a[127:0], b[127:0], imm8[2:0]) {
    a_offset := imm8[2]*32
    b_offset := imm8[1:0]*32
    FOR j := 0 to 7
        i := j*8
        k := a_offset+i
        l := b_offset
        tmp[i+15:i] := ABS(a[k+7:k] - b[l+7:l]) + ABS(a[k+15:k+8] - b[l+15:l+8]) + ABS(a[k+23:k+16] - b[l+23:l+16]) + ABS(a[k+31:k+24] - b[l+31:l+24])
    ENDFOR
    RETURN tmp[127:0]
}
dst[127:0] := MPSADBW(a[127:0], b[127:0], imm8[2:0])
dst[255:128] := MPSADBW(a[255:128], b[255:128], imm8[5:3])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 2
vpmuldq
__m256i _mm256_mul_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mul_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmuldq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the low 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+31:i] * b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmuludq
__m256i _mm256_mul_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mul_epu32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmuludq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the low unsigned 32-bit integers from each packed 64-bit element in a and b, and store the unsigned 64-bit results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+31:i] * b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
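
Example

A sketch showing the point of vpmuludq: the multiply widens, so products that overflow 32 bits survive intact. Only the even-indexed 32-bit elements take part. Assumes AVX2.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* elements 1, 3, 5, 7 are ignored by the multiply */
    __m256i a = _mm256_setr_epi32(100000, 0, 3, 0, 70000, 0, 9, 0);
    __m256i b = _mm256_setr_epi32(100000, 0, 5, 0, 70000, 0, 9, 0);
    __m256i p = _mm256_mul_epu32(a, b);   /* four unsigned 64-bit products */

    uint64_t out[4];
    _mm256_storeu_si256((__m256i *)out, p);
    for (int i = 0; i < 4; i++)
        printf("%llu\n", (unsigned long long)out[i]); /* 10000000000, 15, 4900000000, 81 */
    return 0;
}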
vmulpd
__m256d _mm256_mul_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_mul_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vmulpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Multiply packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] * b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 0.5
Ivy Bridge 5 1
Sandy Bridge 5 1
vmulps
__m256 _mm256_mul_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_mul_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vmulps ymm, ymm, ymm
CPUID Flags: AVX

Description

Multiply packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] * b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 0.5
Ivy Bridge 5 1
Sandy Bridge 5 1
vpmulhw
__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulhw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15 i := j*16 tmp[31:0] := a[i+15:i] * b[i+15:i] dst[i+15:i] := tmp[31:16] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmulhuw
__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulhuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed unsigned 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15 i := j*16 tmp[31:0] := a[i+15:i] * b[i+15:i] dst[i+15:i] := tmp[31:16] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
vpmulhrsw
__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulhrsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply packed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Truncate each intermediate integer to the 18 most significant bits, round by adding 1, and store bits [16:1] to dst.

Operation

FOR j := 0 to 15 i := j*16 tmp[31:0] := ((a[i+15:i] * b[i+15:i]) >> 14) + 1 dst[i+15:i] := tmp[16:1] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
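
Example

A sketch of the Q15 fixed-point multiply this instruction targets: 0.5 * 0.5 in Q15 (16384 * 16384) comes back rounded as 8192, i.e. 0.25. Assumes AVX2.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m256i half = _mm256_set1_epi16(16384);        /* 0.5 in Q15 */
    __m256i q    = _mm256_mulhrs_epi16(half, half); /* vpmulhrsw  */
    int16_t out[16];
    _mm256_storeu_si256((__m256i *)out, q);
    printf("%d\n", out[0]);                         /* prints 8192 */
    return 0;
}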
vpmullw
__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmullw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the low 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15 i := j*16 tmp[31:0] := a[i+15:i] * b[i+15:i] dst[i+15:i] := tmp[15:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmulld
__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulld ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 7 i := j*32 tmp[63:0] := a[i+31:i] * b[i+31:i] dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 10 1
vorpd
__m256d _mm256_or_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_or_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise OR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] BITWISE OR b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vorps
__m256 _mm256_or_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_or_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise OR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] BITWISE OR b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpor
__m256i _mm256_or_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_or_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpor ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise OR of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] OR b[255:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vpacksswb
__m256i _mm256_packs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpacksswb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 16-bit integers from a and b to packed 8-bit integers using signed saturation, and store the results in dst.

Operation

dst[7:0] := Saturate_Int16_To_Int8 (a[15:0])
dst[15:8] := Saturate_Int16_To_Int8 (a[31:16])
dst[23:16] := Saturate_Int16_To_Int8 (a[47:32])
dst[31:24] := Saturate_Int16_To_Int8 (a[63:48])
dst[39:32] := Saturate_Int16_To_Int8 (a[79:64])
dst[47:40] := Saturate_Int16_To_Int8 (a[95:80])
dst[55:48] := Saturate_Int16_To_Int8 (a[111:96])
dst[63:56] := Saturate_Int16_To_Int8 (a[127:112])
dst[71:64] := Saturate_Int16_To_Int8 (b[15:0])
dst[79:72] := Saturate_Int16_To_Int8 (b[31:16])
dst[87:80] := Saturate_Int16_To_Int8 (b[47:32])
dst[95:88] := Saturate_Int16_To_Int8 (b[63:48])
dst[103:96] := Saturate_Int16_To_Int8 (b[79:64])
dst[111:104] := Saturate_Int16_To_Int8 (b[95:80])
dst[119:112] := Saturate_Int16_To_Int8 (b[111:96])
dst[127:120] := Saturate_Int16_To_Int8 (b[127:112])
dst[135:128] := Saturate_Int16_To_Int8 (a[143:128])
dst[143:136] := Saturate_Int16_To_Int8 (a[159:144])
dst[151:144] := Saturate_Int16_To_Int8 (a[175:160])
dst[159:152] := Saturate_Int16_To_Int8 (a[191:176])
dst[167:160] := Saturate_Int16_To_Int8 (a[207:192])
dst[175:168] := Saturate_Int16_To_Int8 (a[223:208])
dst[183:176] := Saturate_Int16_To_Int8 (a[239:224])
dst[191:184] := Saturate_Int16_To_Int8 (a[255:240])
dst[199:192] := Saturate_Int16_To_Int8 (b[143:128])
dst[207:200] := Saturate_Int16_To_Int8 (b[159:144])
dst[215:208] := Saturate_Int16_To_Int8 (b[175:160])
dst[223:216] := Saturate_Int16_To_Int8 (b[191:176])
dst[231:224] := Saturate_Int16_To_Int8 (b[207:192])
dst[239:232] := Saturate_Int16_To_Int8 (b[223:208])
dst[247:240] := Saturate_Int16_To_Int8 (b[239:224])
dst[255:248] := Saturate_Int16_To_Int8 (b[255:240])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
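
Example

A sketch illustrating both the saturation and the lane-interleaved output order of vpacksswb: the low lane holds a's low eight words then b's low eight, and the high lane repeats the pattern for the upper halves. Assumes AVX2.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m256i a = _mm256_set1_epi16(300);    /* clips to  127 */
    __m256i b = _mm256_set1_epi16(-300);   /* clips to -128 */
    __m256i p = _mm256_packs_epi16(a, b);
    int8_t out[32];
    _mm256_storeu_si256((__m256i *)out, p);
    printf("%d %d\n", out[0], out[8]);     /* 127 -128 */
    return 0;
}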
vpackssdw
__m256i _mm256_packs_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packs_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpackssdw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 32-bit integers from a and b to packed 16-bit integers using signed saturation, and store the results in dst.

Operation

dst[15:0] := Saturate_Int32_To_Int16 (a[31:0])
dst[31:16] := Saturate_Int32_To_Int16 (a[63:32])
dst[47:32] := Saturate_Int32_To_Int16 (a[95:64])
dst[63:48] := Saturate_Int32_To_Int16 (a[127:96])
dst[79:64] := Saturate_Int32_To_Int16 (b[31:0])
dst[95:80] := Saturate_Int32_To_Int16 (b[63:32])
dst[111:96] := Saturate_Int32_To_Int16 (b[95:64])
dst[127:112] := Saturate_Int32_To_Int16 (b[127:96])
dst[143:128] := Saturate_Int32_To_Int16 (a[159:128])
dst[159:144] := Saturate_Int32_To_Int16 (a[191:160])
dst[175:160] := Saturate_Int32_To_Int16 (a[223:192])
dst[191:176] := Saturate_Int32_To_Int16 (a[255:224])
dst[207:192] := Saturate_Int32_To_Int16 (b[159:128])
dst[223:208] := Saturate_Int32_To_Int16 (b[191:160])
dst[239:224] := Saturate_Int32_To_Int16 (b[223:192])
dst[255:240] := Saturate_Int32_To_Int16 (b[255:224])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpackuswb
__m256i _mm256_packus_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packus_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpackuswb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and store the results in dst.

Operation

dst[7:0] := Saturate_Int16_To_UnsignedInt8 (a[15:0])
dst[15:8] := Saturate_Int16_To_UnsignedInt8 (a[31:16])
dst[23:16] := Saturate_Int16_To_UnsignedInt8 (a[47:32])
dst[31:24] := Saturate_Int16_To_UnsignedInt8 (a[63:48])
dst[39:32] := Saturate_Int16_To_UnsignedInt8 (a[79:64])
dst[47:40] := Saturate_Int16_To_UnsignedInt8 (a[95:80])
dst[55:48] := Saturate_Int16_To_UnsignedInt8 (a[111:96])
dst[63:56] := Saturate_Int16_To_UnsignedInt8 (a[127:112])
dst[71:64] := Saturate_Int16_To_UnsignedInt8 (b[15:0])
dst[79:72] := Saturate_Int16_To_UnsignedInt8 (b[31:16])
dst[87:80] := Saturate_Int16_To_UnsignedInt8 (b[47:32])
dst[95:88] := Saturate_Int16_To_UnsignedInt8 (b[63:48])
dst[103:96] := Saturate_Int16_To_UnsignedInt8 (b[79:64])
dst[111:104] := Saturate_Int16_To_UnsignedInt8 (b[95:80])
dst[119:112] := Saturate_Int16_To_UnsignedInt8 (b[111:96])
dst[127:120] := Saturate_Int16_To_UnsignedInt8 (b[127:112])
dst[135:128] := Saturate_Int16_To_UnsignedInt8 (a[143:128])
dst[143:136] := Saturate_Int16_To_UnsignedInt8 (a[159:144])
dst[151:144] := Saturate_Int16_To_UnsignedInt8 (a[175:160])
dst[159:152] := Saturate_Int16_To_UnsignedInt8 (a[191:176])
dst[167:160] := Saturate_Int16_To_UnsignedInt8 (a[207:192])
dst[175:168] := Saturate_Int16_To_UnsignedInt8 (a[223:208])
dst[183:176] := Saturate_Int16_To_UnsignedInt8 (a[239:224])
dst[191:184] := Saturate_Int16_To_UnsignedInt8 (a[255:240])
dst[199:192] := Saturate_Int16_To_UnsignedInt8 (b[143:128])
dst[207:200] := Saturate_Int16_To_UnsignedInt8 (b[159:144])
dst[215:208] := Saturate_Int16_To_UnsignedInt8 (b[175:160])
dst[223:216] := Saturate_Int16_To_UnsignedInt8 (b[191:176])
dst[231:224] := Saturate_Int16_To_UnsignedInt8 (b[207:192])
dst[239:232] := Saturate_Int16_To_UnsignedInt8 (b[223:208])
dst[247:240] := Saturate_Int16_To_UnsignedInt8 (b[239:224])
dst[255:248] := Saturate_Int16_To_UnsignedInt8 (b[255:240])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpackusdw
__m256i _mm256_packus_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packus_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpackusdw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 32-bit integers from a and b to packed 16-bit integers using unsigned saturation, and store the results in dst.

Operation

dst[15:0] := Saturate_Int32_To_UnsignedInt16 (a[31:0])
dst[31:16] := Saturate_Int32_To_UnsignedInt16 (a[63:32])
dst[47:32] := Saturate_Int32_To_UnsignedInt16 (a[95:64])
dst[63:48] := Saturate_Int32_To_UnsignedInt16 (a[127:96])
dst[79:64] := Saturate_Int32_To_UnsignedInt16 (b[31:0])
dst[95:80] := Saturate_Int32_To_UnsignedInt16 (b[63:32])
dst[111:96] := Saturate_Int32_To_UnsignedInt16 (b[95:64])
dst[127:112] := Saturate_Int32_To_UnsignedInt16 (b[127:96])
dst[143:128] := Saturate_Int32_To_UnsignedInt16 (a[159:128])
dst[159:144] := Saturate_Int32_To_UnsignedInt16 (a[191:160])
dst[175:160] := Saturate_Int32_To_UnsignedInt16 (a[223:192])
dst[191:176] := Saturate_Int32_To_UnsignedInt16 (a[255:224])
dst[207:192] := Saturate_Int32_To_UnsignedInt16 (b[159:128])
dst[223:208] := Saturate_Int32_To_UnsignedInt16 (b[191:160])
dst[239:224] := Saturate_Int32_To_UnsignedInt16 (b[223:192])
dst[255:240] := Saturate_Int32_To_UnsignedInt16 (b[255:224])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpermilpd
__m128d _mm_permute_pd (__m128d a, int imm8)

Synopsis

__m128d _mm_permute_pd (__m128d a, int imm8)
#include «immintrin.h»
Instruction: vpermilpd xmm, xmm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a using the control in imm8, and store the results in dst.

Operation

IF (imm8[0] == 0) dst[63:0] := a[63:0] IF (imm8[0] == 1) dst[63:0] := a[127:64] IF (imm8[1] == 0) dst[127:64] := a[63:0] IF (imm8[1] == 1) dst[127:64] := a[127:64] dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilpd
__m256d _mm256_permute_pd (__m256d a, int imm8)

Synopsis

__m256d _mm256_permute_pd (__m256d a, int imm8)
#include «immintrin.h»
Instruction: vpermilpd ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

IF (imm8[0] == 0) dst[63:0] := a[63:0] IF (imm8[0] == 1) dst[63:0] := a[127:64] IF (imm8[1] == 0) dst[127:64] := a[63:0] IF (imm8[1] == 1) dst[127:64] := a[127:64] IF (imm8[2] == 0) dst[191:128] := a[191:128] IF (imm8[2] == 1) dst[191:128] := a[255:192] IF (imm8[3] == 0) dst[255:192] := a[191:128] IF (imm8[3] == 1) dst[255:192] := a[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m128 _mm_permute_ps (__m128 a, int imm8)

Synopsis

__m128 _mm_permute_ps (__m128 a, int imm8)
#include «immintrin.h»
Instruction: vpermilps xmm, xmm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m256 _mm256_permute_ps (__m256 a, int imm8)

Synopsis

__m256 _mm256_permute_ps (__m256 a, int imm8)
#include «immintrin.h»
Instruction: vpermilps ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[159:128] := SELECT4(a[255:128], imm8[1:0]) dst[191:160] := SELECT4(a[255:128], imm8[3:2]) dst[223:192] := SELECT4(a[255:128], imm8[5:4]) dst[255:224] := SELECT4(a[255:128], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vperm2f128
__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)

Synopsis

__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)
#include «immintrin.h»
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2f128
__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)

Synopsis

__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)
#include «immintrin.h»
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2f128
__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)

Synopsis

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)
#include «immintrin.h»
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2i128
__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)
#include «immintrin.h»
Instruction: vperm2i128 ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpermq
__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpermq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 64-bit integers in a across lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[63:0] := src[63:0] 1: tmp[63:0] := src[127:64] 2: tmp[63:0] := src[191:128] 3: tmp[63:0] := src[255:192] ESAC RETURN tmp[63:0] } dst[63:0] := SELECT4(a[255:0], imm8[1:0]) dst[127:64] := SELECT4(a[255:0], imm8[3:2]) dst[191:128] := SELECT4(a[255:0], imm8[5:4]) dst[255:192] := SELECT4(a[255:0], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
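
Example

A sketch of a cross-lane rearrangement that in-lane shuffles cannot do in one step: reversing the four 64-bit elements. _MM_SHUFFLE(0, 1, 2, 3) encodes the selector 0x1B, picking element 3 into position 0 and so on. Assumes AVX2.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m256i v = _mm256_setr_epi64x(0, 1, 2, 3);
    __m256i r = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(0, 1, 2, 3));
    int64_t out[4];
    _mm256_storeu_si256((__m256i *)out, r);
    printf("%lld %lld %lld %lld\n", (long long)out[0], (long long)out[1],
           (long long)out[2], (long long)out[3]);   /* 3 2 1 0 */
    return 0;
}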
vpermpd
__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)

Synopsis

__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)
#include «immintrin.h»
Instruction: vpermpd ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle double-precision (64-bit) floating-point elements in a across lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[63:0] := src[63:0] 1: tmp[63:0] := src[127:64] 2: tmp[63:0] := src[191:128] 3: tmp[63:0] := src[255:192] ESAC RETURN tmp[63:0] } dst[63:0] := SELECT4(a[255:0], imm8[1:0]) dst[127:64] := SELECT4(a[255:0], imm8[3:2]) dst[191:128] := SELECT4(a[255:0], imm8[5:4]) dst[255:192] := SELECT4(a[255:0], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
vpermilpd
__m128d _mm_permutevar_pd (__m128d a, __m128i b)

Synopsis

__m128d _mm_permutevar_pd (__m128d a, __m128i b)
#include «immintrin.h»
Instruction: vpermilpd xmm, xmm, xmm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a using the control in b, and store the results in dst.

Operation

IF (b[1] == 0) dst[63:0] := a[63:0] IF (b[1] == 1) dst[63:0] := a[127:64] IF (b[65] == 0) dst[127:64] := a[63:0] IF (b[65] == 1) dst[127:64] := a[127:64] dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilpd
__m256d _mm256_permutevar_pd (__m256d a, __m256i b)

Synopsis

__m256d _mm256_permutevar_pd (__m256d a, __m256i b)
#include «immintrin.h»
Instruction: vpermilpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.

Operation

IF (b[1] == 0) dst[63:0] := a[63:0] IF (b[1] == 1) dst[63:0] := a[127:64] IF (b[65] == 0) dst[127:64] := a[63:0] IF (b[65] == 1) dst[127:64] := a[127:64] IF (b[129] == 0) dst[191:128] := a[191:128] IF (b[129] == 1) dst[191:128] := a[255:192] IF (b[193] == 0) dst[255:192] := a[191:128] IF (b[193] == 1) dst[255:192] := a[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpermilps
__m128 _mm_permutevar_ps (__m128 a, __m128i b)

Synopsis

__m128 _mm_permutevar_ps (__m128 a, __m128i b)
#include «immintrin.h»
Instruction: vpermilps xmm, xmm, xmm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a using the control in b, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], b[1:0]) dst[63:32] := SELECT4(a[127:0], b[33:32]) dst[95:64] := SELECT4(a[127:0], b[65:64]) dst[127:96] := SELECT4(a[127:0], b[97:96]) dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m256 _mm256_permutevar_ps (__m256 a, __m256i b)

Synopsis

__m256 _mm256_permutevar_ps (__m256 a, __m256i b)
#include «immintrin.h»
Instruction: vpermilps ymm, ymm, ymm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], b[1:0]) dst[63:32] := SELECT4(a[127:0], b[33:32]) dst[95:64] := SELECT4(a[127:0], b[65:64]) dst[127:96] := SELECT4(a[127:0], b[97:96]) dst[159:128] := SELECT4(a[255:128], b[129:128]) dst[191:160] := SELECT4(a[255:128], b[161:160]) dst[223:192] := SELECT4(a[255:128], b[193:192]) dst[255:224] := SELECT4(a[255:128], b[225:224]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpermd
__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)

Synopsis

__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)
#include «immintrin.h»
Instruction: vpermd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle 32-bit integers in a across lanes using the corresponding index in idx, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 id := idx[i+2:i]*32 dst[i+31:i] := a[id+31:id] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpermps
__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)

Synopsis

__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)
#include «immintrin.h»
Instruction: vpermps ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle single-precision (32-bit) floating-point elements in a across lanes using the corresponding index in idx.

Operation

FOR j := 0 to 7 i := j*32 id := idx[i+2:i]*32 dst[i+31:i] := a[id+31:id] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
vrcpps
__m256 _mm256_rcp_ps (__m256 a)

Synopsis

__m256 _mm256_rcp_ps (__m256 a)
#include «immintrin.h»
Instruction: vrcpps ymm, ymm
CPUID Flags: AVX

Description

Compute the approximate reciprocal of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := APPROXIMATE(1.0/a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 1
Ivy Bridge 7 1
Sandy Bridge 7 1
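
Example

A sketch of a common numeric pairing, not something the guide itself prescribes: one Newton-Raphson step, x' = x * (2 - a*x), refines the roughly 12-bit vrcpps estimate to near full single precision while staying far cheaper than a full division.

#include <immintrin.h>

static __m256 rcp_nr_ps(__m256 a) {
    __m256 x = _mm256_rcp_ps(a);   /* ~1.5*2^-12 max relative error */
    __m256 t = _mm256_sub_ps(_mm256_set1_ps(2.0f), _mm256_mul_ps(a, x));
    return _mm256_mul_ps(x, t);    /* one refinement step */
}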
vroundpd
__m256d _mm256_round_pd (__m256d a, int rounding)

Synopsis

__m256d _mm256_round_pd (__m256d a, int rounding)
#include «immintrin.h»
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a using the rounding parameter, and store the results as packed double-precision floating-point elements in dst.
Rounding is done according to the rounding parameter, which can be one of:

(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ROUND(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_round_ps (__m256 a, int rounding)

Synopsis

__m256 _mm256_round_ps (__m256 a, int rounding)
#include «immintrin.h»
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a using the rounding parameter, and store the results as packed single-precision floating-point elements in dst.
Rounding is done according to the rounding parameter, which can be one of:

(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ROUND(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
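
Example

A sketch contrasting two rounding selectors on the same input: 1.5f rounds to 2 under round-to-nearest (ties go to even) but to 1 under truncation. Assumes AVX.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 v = _mm256_set1_ps(1.5f);
    __m256 n = _mm256_round_ps(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m256 t = _mm256_round_ps(v, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
    float fn[8], ft[8];
    _mm256_storeu_ps(fn, n);
    _mm256_storeu_ps(ft, t);
    printf("nearest=%g truncated=%g\n", fn[0], ft[0]);  /* nearest=2 truncated=1 */
    return 0;
}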
vrsqrtps
__m256 _mm256_rsqrt_ps (__m256 a)

Synopsis

__m256 _mm256_rsqrt_ps (__m256 a)
#include «immintrin.h»
Instruction: vrsqrtps ymm, ymm
CPUID Flags: AVX

Description

Compute the approximate reciprocal square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := APPROXIMATE(1.0 / SQRT(a[i+31:i])) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 1
Ivy Bridge 7 1
Sandy Bridge 7 1
vpsadbw
__m256i _mm256_sad_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sad_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsadbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce four unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.

Operation

FOR j := 0 to 31
    i := j*8
    tmp[i+7:i] := ABS(a[i+7:i] - b[i+7:i])
ENDFOR
FOR j := 0 to 3
    i := j*64
    dst[i+15:i] := tmp[i+7:i] + tmp[i+15:i+8] + tmp[i+23:i+16] + tmp[i+31:i+24] + tmp[i+39:i+32] + tmp[i+47:i+40] + tmp[i+55:i+48] + tmp[i+63:i+56]
    dst[i+63:i+16] := 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
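
Example

A sketch of the classic vpsadbw summing trick; sum_bytes32 is a hypothetical helper. SAD against zero turns each group of eight bytes into a 64-bit partial sum, leaving just four scalar additions.

#include <immintrin.h>
#include <stdint.h>

static uint32_t sum_bytes32(__m256i v) {
    __m256i sad = _mm256_sad_epu8(v, _mm256_setzero_si256());
    uint64_t out[4];
    _mm256_storeu_si256((__m256i *)out, sad);
    return (uint32_t)(out[0] + out[1] + out[2] + out[3]);
}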
__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)

Synopsis

__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 16-bit integers in dst with the supplied values.

Operation

dst[15:0] := e0
dst[31:16] := e1
dst[47:32] := e2
dst[63:48] := e3
dst[79:64] := e4
dst[95:80] := e5
dst[111:96] := e6
dst[127:112] := e7
dst[143:128] := e8
dst[159:144] := e9
dst[175:160] := e10
dst[191:176] := e11
dst[207:192] := e12
dst[223:208] := e13
dst[239:224] := e14
dst[255:240] := e15
dst[MAX:256] := 0
__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)

Synopsis

__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 32-bit integers in dst with the supplied values.

Operation

dst[31:0] := e0 dst[63:32] := e1 dst[95:64] := e2 dst[127:96] := e3 dst[159:128] := e4 dst[191:160] := e5 dst[223:192] := e6 dst[255:224] := e7 dst[MAX:256] := 0
__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)

Synopsis

__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 64-bit integers in dst with the supplied values.

Operation

dst[63:0] := e0 dst[127:64] := e1 dst[191:128] := e2 dst[255:192] := e3 dst[MAX:256] := 0
__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)

Synopsis

__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 8-bit integers in dst with the supplied values.

Operation

dst[7:0] := e0
dst[15:8] := e1
dst[23:16] := e2
dst[31:24] := e3
dst[39:32] := e4
dst[47:40] := e5
dst[55:48] := e6
dst[63:56] := e7
dst[71:64] := e8
dst[79:72] := e9
dst[87:80] := e10
dst[95:88] := e11
dst[103:96] := e12
dst[111:104] := e13
dst[119:112] := e14
dst[127:120] := e15
dst[135:128] := e16
dst[143:136] := e17
dst[151:144] := e18
dst[159:152] := e19
dst[167:160] := e20
dst[175:168] := e21
dst[183:176] := e22
dst[191:184] := e23
dst[199:192] := e24
dst[207:200] := e25
dst[215:208] := e26
dst[223:216] := e27
dst[231:224] := e28
dst[239:232] := e29
dst[247:240] := e30
dst[255:248] := e31
dst[MAX:256] := 0
vinsertf128
__m256 _mm256_set_m128 (__m128 hi, __m128 lo)

Synopsis

__m256 _mm256_set_m128 (__m128 hi, __m128 lo)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256 vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256d _mm256_set_m128d (__m128d hi, __m128d lo)

Synopsis

__m256d _mm256_set_m128d (__m128d hi, __m128d lo)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256d vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_set_m128i (__m128i hi, __m128i lo)

Synopsis

__m256i _mm256_set_m128i (__m128i hi, __m128i lo)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256i vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)

Synopsis

__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed double-precision (64-bit) floating-point elements in dst with the supplied values.

Operation

dst[63:0] := e0 dst[127:64] := e1 dst[191:128] := e2 dst[255:192] := e3 dst[MAX:256] := 0
__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)

Synopsis

__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed single-precision (32-bit) floating-point elements in dst with the supplied values.

Operation

dst[31:0] := e0 dst[63:32] := e1 dst[95:64] := e2 dst[127:96] := e3 dst[159:128] := e4 dst[191:160] := e5 dst[223:192] := e6 dst[255:224] := e7 dst[MAX:256] := 0
__m256i _mm256_set1_epi16 (short a)

Synopsis

__m256i _mm256_set1_epi16 (short a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast 16-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastw instruction.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_set1_epi32 (int a)

Synopsis

__m256i _mm256_set1_epi32 (int a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast 32-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastd instruction.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_set1_epi64x (long long a)

Synopsis

__m256i _mm256_set1_epi64x (long long a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast 64-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastq instruction.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_set1_epi8 (char a)

Synopsis

__m256i _mm256_set1_epi8 (char a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast 8-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastb instruction.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:256] := 0
__m256d _mm256_set1_pd (double a)

Synopsis

__m256d _mm256_set1_pd (double a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast double-precision (64-bit) floating-point value a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0
__m256 _mm256_set1_ps (float a)

Synopsis

__m256 _mm256_set1_ps (float a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Broadcast single-precision (32-bit) floating-point value a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0
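
Example

A sketch of the canonical use of _mm256_set1_ps: broadcast a scalar coefficient once, outside the loop. saxpy8 is a hypothetical helper and n is assumed to be a multiple of 8.

#include <immintrin.h>

/* y := a*x + y over n floats (n % 8 == 0, unaligned pointers allowed) */
static void saxpy8(float a, const float *x, float *y, int n) {
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_add_ps(vy, _mm256_mul_ps(va, vx)));
    }
}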
__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)

Synopsis

__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 16-bit integers in dst with the supplied values in reverse order.

Operation

dst[15:0] := e15
dst[31:16] := e14
dst[47:32] := e13
dst[63:48] := e12
dst[79:64] := e11
dst[95:80] := e10
dst[111:96] := e9
dst[127:112] := e8
dst[143:128] := e7
dst[159:144] := e6
dst[175:160] := e5
dst[191:176] := e4
dst[207:192] := e3
dst[223:208] := e2
dst[239:224] := e1
dst[255:240] := e0
dst[MAX:256] := 0
__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)

Synopsis

__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 32-bit integers in dst with the supplied values in reverse order.

Operation

dst[31:0] := e7 dst[63:32] := e6 dst[95:64] := e5 dst[127:96] := e4 dst[159:128] := e3 dst[191:160] := e2 dst[223:192] := e1 dst[255:224] := e0 dst[MAX:256] := 0
__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)

Synopsis

__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 64-bit integers in dst with the supplied values in reverse order.

Operation

dst[63:0] := e3 dst[127:64] := e2 dst[191:128] := e1 dst[255:192] := e0 dst[MAX:256] := 0
__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)

Synopsis

__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed 8-bit integers in dst with the supplied values in reverse order.

Operation

dst[7:0] := e31
dst[15:8] := e30
dst[23:16] := e29
dst[31:24] := e28
dst[39:32] := e27
dst[47:40] := e26
dst[55:48] := e25
dst[63:56] := e24
dst[71:64] := e23
dst[79:72] := e22
dst[87:80] := e21
dst[95:88] := e20
dst[103:96] := e19
dst[111:104] := e18
dst[119:112] := e17
dst[127:120] := e16
dst[135:128] := e15
dst[143:136] := e14
dst[151:144] := e13
dst[159:152] := e12
dst[167:160] := e11
dst[175:168] := e10
dst[183:176] := e9
dst[191:184] := e8
dst[199:192] := e7
dst[207:200] := e6
dst[215:208] := e5
dst[223:216] := e4
dst[231:224] := e3
dst[239:232] := e2
dst[247:240] := e1
dst[255:248] := e0
dst[MAX:256] := 0
vinsertf128
__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)

Synopsis

__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256 vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)

Synopsis

__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256d vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)

Synopsis

__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256i vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)

Synopsis

__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed double-precision (64-bit) floating-point elements in dst with the supplied values in reverse order.

Operation

dst[63:0] := e3 dst[127:64] := e2 dst[191:128] := e1 dst[255:192] := e0 dst[MAX:256] := 0
__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)

Synopsis

__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
#include «immintrin.h»
CPUID Flags: AVX

Description

Set packed single-precision (32-bit) floating-point elements in dst with the supplied values in reverse order.

Operation

dst[31:0] := e7 dst[63:32] := e6 dst[95:64] := e5 dst[127:96] := e4 dst[159:128] := e3 dst[191:160] := e2 dst[223:192] := e1 dst[255:224] := e0 dst[MAX:256] := 0
vxorpd
__m256d _mm256_setzero_pd (void)

Synopsis

__m256d _mm256_setzero_pd (void)
#include «immintrin.h»
Instruction: vxorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256d with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vxorps
__m256 _mm256_setzero_ps (void)

Synopsis

__m256 _mm256_setzero_ps (void)
#include «immintrin.h»
Instruction: vxorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256 with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpxor
__m256i _mm256_setzero_si256 (void)

Synopsis

__m256i _mm256_setzero_si256 (void)
#include «immintrin.h»
Instruction: vpxor ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256i with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpshufd
__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpshufd ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 32-bit integers in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[159:128] := SELECT4(a[255:128], imm8[1:0]) dst[191:160] := SELECT4(a[255:128], imm8[3:2]) dst[223:192] := SELECT4(a[255:128], imm8[5:4]) dst[255:224] := SELECT4(a[255:128], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpshufb
__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpshufb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle 8-bit integers in a within 128-bit lanes according to shuffle control mask in the corresponding 8-bit element of b, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i]
        dst[i+7:i] := a[index*8+7:index*8]
    FI
    IF b[128+i+7] == 1
        dst[128+i+7:128+i] := 0
    ELSE
        index[3:0] := b[128+i+3:128+i]
        dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
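
Example

A sketch of probably the best-known vpshufb idiom: per-byte population count via a 16-entry nibble lookup table replicated into both lanes. popcnt_epi8 is a hypothetical helper; assumes AVX2.

#include <immintrin.h>

static __m256i popcnt_epi8(__m256i v) {
    const __m256i lut = _mm256_setr_epi8(      /* popcount of each nibble 0..15 */
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low = _mm256_set1_epi8(0x0F);
    __m256i lo = _mm256_and_si256(v, low);                       /* low nibbles  */
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low); /* high nibbles */
    return _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                           _mm256_shuffle_epi8(lut, hi));
}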
vshufpd
__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)
#include «immintrin.h»
Instruction: vshufpd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

dst[63:0] := (imm8[0] == 0) ? a[63:0] : a[127:64] dst[127:64] := (imm8[1] == 0) ? b[63:0] : b[127:64] dst[191:128] := (imm8[2] == 0) ? a[191:128] : a[255:192] dst[255:192] := (imm8[3] == 0) ? b[191:128] : b[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vshufps
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
#include «immintrin.h»
Instruction: vshufps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(b[127:0], imm8[5:4]) dst[127:96] := SELECT4(b[127:0], imm8[7:6]) dst[159:128] := SELECT4(a[255:128], imm8[1:0]) dst[191:160] := SELECT4(a[255:128], imm8[3:2]) dst[223:192] := SELECT4(b[255:128], imm8[5:4]) dst[255:224] := SELECT4(b[255:128], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpshufhw
__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpshufhw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 16-bit integers in the high 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the high 64 bits of 128-bit lanes of dst, with the low 64 bits of 128-bit lanes being copied from a to dst.

Operation

dst[63:0] := a[63:0] dst[79:64] := (a >> (imm8[1:0] * 16))[79:64] dst[95:80] := (a >> (imm8[3:2] * 16))[79:64] dst[111:96] := (a >> (imm8[5:4] * 16))[79:64] dst[127:112] := (a >> (imm8[7:6] * 16))[79:64] dst[191:128] := a[191:128] dst[207:192] := (a >> (imm8[1:0] * 16))[207:192] dst[223:208] := (a >> (imm8[3:2] * 16))[207:192] dst[239:224] := (a >> (imm8[5:4] * 16))[207:192] dst[255:240] := (a >> (imm8[7:6] * 16))[207:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpshuflw
__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpshuflw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 16-bit integers in the low 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the low 64 bits of 128-bit lanes of dst, with the high 64 bits of 128-bit lanes being copied from a to dst.

Operation

dst[15:0] := (a >> (imm8[1:0] * 16))[15:0] dst[31:16] := (a >> (imm8[3:2] * 16))[15:0] dst[47:32] := (a >> (imm8[5:4] * 16))[15:0] dst[63:48] := (a >> (imm8[7:6] * 16))[15:0] dst[127:64] := a[127:64] dst[143:128] := (a >> (imm8[1:0] * 16))[143:128] dst[159:144] := (a >> (imm8[3:2] * 16))[143:128] dst[175:160] := (a >> (imm8[5:4] * 16))[143:128] dst[191:176] := (a >> (imm8[7:6] * 16))[143:128] dst[255:192] := a[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpsignw
__m256i _mm256_sign_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsignw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 16-bit integers in a when the corresponding signed 16-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 15 i := j*16 IF b[i+15:i] < 0 dst[i+15:i] := NEG(a[i+15:i]) ELSE IF b[i+15:i] = 0 dst[i+15:i] := 0 ELSE dst[i+15:i] := a[i+15:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpsignd
__m256i _mm256_sign_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsignd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 32-bit integers in a when the corresponding signed 32-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 7 i := j*32 IF b[i+31:i] < 0 dst[i+31:i] := NEG(a[i+31:i]) ELSE IF b[i+31:i] = 0 dst[i+31:i] := 0 ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpsignb
__m256i _mm256_sign_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsignb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 8-bit integers in a when the corresponding signed 8-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 31 i := j*8 IF b[i+7:i] < 0 dst[i+7:i] := NEG(a[i+7:i]) ELSE IF b[i+7:i] = 0 dst[i+7:i] := 0 ELSE dst[i+7:i] := a[i+7:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpsllw
__m256i _mm256_sll_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi16 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpslld
__m256i _mm256_sll_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi32 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpslld ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpsllq
__m256i _mm256_sll_epi64 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi64 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllq ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF count[63:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpsllw
__m256i _mm256_slli_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi16 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsllw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpslld
__m256i _mm256_slli_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi32 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpslld ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsllq
__m256i _mm256_slli_epi64 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi64 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsllq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF imm8[7:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpslldq
__m256i _mm256_slli_si256 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_slli_si256 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpslldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] << (tmp*8) dst[255:128] := a[255:128] << (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsllvd
__m128i _mm_sllv_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_sllv_epi32 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllvd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] << count[i+31:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vpsllvd
__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsllvd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] << count[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vpsllvq
__m128i _mm_sllv_epi64 (__m128i a, __m128i count)

Synopsis

__m128i _mm_sllv_epi64 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllvq xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] << count[i+63:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsllvq
__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsllvq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] << count[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
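
To contrast the uniform-count shifts above with these per-element variable shifts, here is a minimal sketch (assuming AVX2 support and a compiler flag such as -mavx2; all names are illustrative):

#include <immintrin.h>  /* AVX2 intrinsics */
#include <stdio.h>

int main(void) {
    /* Shift each 32-bit lane of v left by its own count: 1<<0 .. 1<<7. */
    __m256i v      = _mm256_set1_epi32(1);
    __m256i counts = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i pow2   = _mm256_sllv_epi32(v, counts);   /* vpsllvd */

    int out[8];
    _mm256_storeu_si256((__m256i *)out, pow2);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);   /* prints: 1 2 4 8 16 32 64 128 */
    return 0;
}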
vsqrtpd
__m256d _mm256_sqrt_pd (__m256d a)

Synopsis

__m256d _mm256_sqrt_pd (__m256d a)
#include «immintrin.h»
Instruction: vsqrtpd ymm, ymm
CPUID Flags: AVX

Description

Compute the square root of packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := SQRT(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 35 28
Ivy Bridge 35 28
Sandy Bridge 43 44
vsqrtps
__m256 _mm256_sqrt_ps (__m256 a)

Synopsis

__m256 _mm256_sqrt_ps (__m256 a)
#include «immintrin.h»
Instruction: vsqrtps ymm, ymm
CPUID Flags: AVX

Description

Compute the square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := SQRT(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 21 14
Ivy Bridge 21 14
Sandy Bridge 29 28
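
A minimal usage sketch for the packed square root (assuming AVX, 32-byte-aligned pointers, and n a multiple of 8; the function name is illustrative):

#include <immintrin.h>  /* AVX intrinsics */

/* Square root of n floats, eight at a time with vsqrtps. */
void sqrt_array(float *dst, const float *src, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(src + i);           /* vmovaps load   */
        _mm256_store_ps(dst + i, _mm256_sqrt_ps(v));  /* vsqrtps, store */
    }
}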
vpsraw
__m256i _mm256_sra_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sra_epi16 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsraw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := SignBit ELSE dst[i+15:i] := SignExtend(a[i+15:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrad
__m256i _mm256_sra_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sra_epi32 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrad ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := SignBit ELSE dst[i+31:i] := SignExtend(a[i+31:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsraw
__m256i _mm256_srai_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srai_epi16 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsraw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := SignBit ELSE dst[i+15:i] := SignExtend(a[i+15:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrad
__m256i _mm256_srai_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srai_epi32 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrad ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := SignBit ELSE dst[i+31:i] := SignExtend(a[i+31:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
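
To make the difference between these arithmetic shifts and the logical shifts documented below concrete, a minimal sketch (assuming AVX2; all names are illustrative):

#include <immintrin.h>  /* AVX2 intrinsics */

/* Arithmetic vs. logical right shift of the same negative lanes. */
void shift_demo(int out[16]) {
    __m256i v = _mm256_set1_epi32(-16);
    /* vpsrad shifts in sign bits: every lane becomes -4 */
    _mm256_storeu_si256((__m256i *)out, _mm256_srai_epi32(v, 2));
    /* vpsrld shifts in zeros: every lane becomes 0x3FFFFFFC */
    _mm256_storeu_si256((__m256i *)(out + 8), _mm256_srli_epi32(v, 2));
}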
vpsravd
__m128i _mm_srav_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srav_epi32 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsravd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := SignExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsravd
__m256i _mm256_srav_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srav_epi32 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsravd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := SignExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsrlw
__m256i _mm256_srl_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi16 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrld
__m256i _mm256_srl_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi32 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrld ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrlq
__m256i _mm256_srl_epi64 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi64 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlq ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF count[63:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrlw
__m256i _mm256_srli_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi16 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrlw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrld
__m256i _mm256_srli_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi32 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrld ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlq
__m256i _mm256_srli_epi64 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi64 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrlq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF imm8[7:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrldq
__m256i _mm256_srli_si256 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_srli_si256 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpsrldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] >> (tmp*8) dst[255:128] := a[255:128] >> (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlvd
__m128i _mm_srlv_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srlv_epi32 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlvd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsrlvd
__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsrlvd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsrlvq
__m128i _mm_srlv_epi64 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srlv_epi64 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlvq xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[i+63:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlvq
__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsrlvq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vmovapd
void _mm256_store_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_store_pd (double * mem_addr, __m256d a)
#include «immintrin.h»
Instruction: vmovapd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovaps
void _mm256_store_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_store_ps (float * mem_addr, __m256 a)
#include «immintrin.h»
Instruction: vmovaps m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovdqa
void _mm256_store_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
#include «immintrin.h»
Instruction: vmovdqa m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovupd
void _mm256_storeu_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_storeu_pd (double * mem_addr, __m256d a)
#include «immintrin.h»
Instruction: vmovupd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovups
void _mm256_storeu_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_storeu_ps (float * mem_addr, __m256 a)
#include «immintrin.h»
Instruction: vmovups m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovdqu
void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
#include «immintrin.h»
Instruction: vmovdqu m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
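
A minimal sketch of the aligned/unaligned distinction in the store intrinsics above (assuming AVX and a C11 aligned_alloc; all names are illustrative):

#include <immintrin.h>  /* AVX intrinsics */
#include <stdlib.h>     /* aligned_alloc, free (C11) */

void store_demo(void) {
    int *buf = aligned_alloc(32, 16 * sizeof(int));  /* 32-byte aligned */
    if (!buf) return;
    __m256i v = _mm256_set1_epi32(42);
    _mm256_store_si256((__m256i *)buf, v);         /* vmovdqa: aligned ok    */
    _mm256_storeu_si256((__m256i *)(buf + 1), v);  /* vmovdqu: misaligned ok */
    free(buf);
}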
void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)

Synopsis

void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of 4 packed single-precision (32-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)

Synopsis

void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of 2 packed double-precision (64-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)

Synopsis

void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)
#include «immintrin.h»
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of integer data) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
vmovntdqa
__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)

Synopsis

__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)
#include «immintrin.h»
Instruction: vmovntdqa ymm, m256
CPUID Flags: AVX2

Description

Load 256-bits of integer data from memory into dst using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovntpd
void _mm256_stream_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_stream_pd (double * mem_addr, __m256d a)
#include «immintrin.h»
Instruction: vmovntpd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovntps
void _mm256_stream_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_stream_ps (float * mem_addr, __m256 a)
#include «immintrin.h»
Instruction: vmovntps m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovntdq
void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
#include «immintrin.h»
Instruction: vmovntdq m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
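
A minimal sketch of a non-temporal fill using the streaming stores above (assuming AVX, dst 32-byte aligned and n a multiple of 8; all names are illustrative):

#include <immintrin.h>  /* AVX intrinsics */

/* Fill a large buffer with vmovntps so the stores bypass the cache;
 * the sfence orders the non-temporal stores against later accesses. */
void fill_nt(float *dst, float value, long n) {
    __m256 v = _mm256_set1_ps(value);
    for (long i = 0; i < n; i += 8)
        _mm256_stream_ps(dst + i, v);
    _mm_sfence();
}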
vpsubw
__m256i _mm256_sub_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 16-bit integers in b from packed 16-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[i+15:i] - b[i+15:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubd
__m256i _mm256_sub_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 32-bit integers in b from packed 32-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] - b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubq
__m256i _mm256_sub_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 64-bit integers in b from packed 64-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] - b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubb
__m256i _mm256_sub_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 8-bit integers in b from packed 8-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[i+7:i] - b[i+7:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vsubpd
__m256d _mm256_sub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_sub_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Subtract packed double-precision (64-bit) floating-point elements in b from packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] - b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vsubps
__m256 _mm256_sub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_sub_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Subtract packed single-precision (32-bit) floating-point elements in b from packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] - b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpsubsw
__m256i _mm256_subs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 16-bit integers in b from packed 16-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_Int16(a[i+15:i] - b[i+15:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubsb
__m256i _mm256_subs_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 8-bit integers in b from packed 8-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_Int8(a[i+7:i] - b[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubusw
__m256i _mm256_subs_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubusw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed unsigned 16-bit integers in b from packed unsigned 16-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_UnsignedInt16(a[i+15:i] - b[i+15:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubusb
__m256i _mm256_subs_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubusb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed unsigned 8-bit integers in b from packed unsigned 8-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_UnsignedInt8(a[i+7:i] - b[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
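
A minimal sketch of saturating subtraction in practice (assuming AVX2 and n a multiple of 32; all names are illustrative):

#include <immintrin.h>  /* AVX2 intrinsics */

/* Darken 8-bit grayscale pixels: vpsubusb clamps at 0 instead of wrapping. */
void darken(unsigned char *px, int n, unsigned char amount) {
    __m256i d = _mm256_set1_epi8((char)amount);
    for (int i = 0; i < n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(px + i));
        _mm256_storeu_si256((__m256i *)(px + i), _mm256_subs_epu8(v, d));
    }
}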
vtestpd
int _mm_testc_pd (__m128d a, __m128d b)

Synopsis

int _mm_testc_pd (__m128d a, __m128d b)
#include «immintrin.h»
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testc_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testc_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testc_ps (__m128 a, __m128 b)

Synopsis

int _mm_testc_ps (__m128 a, __m128 b)
#include «immintrin.h»
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testc_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testc_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testc_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testc_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the CF value.

Operation

IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
vtestpd
int _mm_testnzc_pd (__m128d a, __m128d b)

Synopsis

int _mm_testnzc_pd (__m128d a, __m128d b)
#include «immintrin.h»
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testnzc_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testnzc_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testnzc_ps (__m128 a, __m128 b)

Synopsis

int _mm_testnzc_ps (__m128 a, __m128 b)
#include «immintrin.h»
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testnzc_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testnzc_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testnzc_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testnzc_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
vtestpd
int _mm_testz_pd (__m128d a, __m128d b)

Synopsis

int _mm_testz_pd (__m128d a, __m128d b)
#include «immintrin.h»
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testz_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testz_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testz_ps (__m128 a, __m128 b)

Synopsis

int _mm_testz_ps (__m128 a, __m128 b)
#include «immintrin.h»
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testz_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testz_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testz_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testz_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the ZF value.

Operation

IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
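
A common use of this intrinsic is an all-lanes equality check; a minimal sketch (assuming AVX2 for the integer XOR; the function name is illustrative):

#include <immintrin.h>  /* AVX/AVX2 intrinsics */

/* Returns 1 iff a and b are bit-for-bit equal: vptest sets ZF when
 * the XOR difference is all zero. */
int all_equal(__m256i a, __m256i b) {
    __m256i diff = _mm256_xor_si256(a, b);
    return _mm256_testz_si256(diff, diff);
}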
__m256d _mm256_undefined_pd (void)

Synopsis

__m256d _mm256_undefined_pd (void)
#include «immintrin.h»
CPUID Flags: AVX

Description

Return vector of type __m256d with undefined elements.
__m256 _mm256_undefined_ps (void)

Synopsis

__m256 _mm256_undefined_ps (void)
#include «immintrin.h»
CPUID Flags: AVX

Description

Return vector of type __m256 with undefined elements.
__m256i _mm256_undefined_si256 (void)

Synopsis

__m256i _mm256_undefined_si256 (void)
#include «immintrin.h»
CPUID Flags: AVX

Description

Return vector of type __m256i with undefined elements.
vpunpckhwd
__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 16-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_WORDS(src1[127:0], src2[127:0]){ dst[15:0] := src1[79:64] dst[31:16] := src2[79:64] dst[47:32] := src1[95:80] dst[63:48] := src2[95:80] dst[79:64] := src1[111:96] dst[95:80] := src2[111:96] dst[111:96] := src1[127:112] dst[127:112] := src2[127:112] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_WORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_WORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhdq
__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 32-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[95:64] dst[63:32] := src2[95:64] dst[95:64] := src1[127:96] dst[127:96] := src2[127:96] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhqdq
__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhqdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 64-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[127:64] dst[127:64] := src2[127:64] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhbw
__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 8-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_BYTES(src1[127:0], src2[127:0]){ dst[7:0] := src1[71:64] dst[15:8] := src2[71:64] dst[23:16] := src1[79:72] dst[31:24] := src2[79:72] dst[39:32] := src1[87:80] dst[47:40] := src2[87:80] dst[55:48] := src1[95:88] dst[63:56] := src2[95:88] dst[71:64] := src1[103:96] dst[79:72] := src2[103:96] dst[87:80] := src1[111:104] dst[95:88] := src2[111:104] dst[103:96] := src1[119:112] dst[111:104] := src2[119:112] dst[119:112] := src1[127:120] dst[127:120] := src2[127:120] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_BYTES(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_BYTES(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vunpckhpd
__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vunpckhpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave double-precision (64-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[127:64] dst[127:64] := src2[127:64] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vunpckhps
__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vunpckhps ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave single-precision (32-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[95:64] dst[63:32] := src2[95:64] dst[95:64] := src1[127:96] dst[127:96] := src2[127:96] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpunpcklwd
__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpcklwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 16-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_WORDS(src1[127:0], src2[127:0]){ dst[15:0] := src1[15:0] dst[31:16] := src2[15:0] dst[47:32] := src1[31:16] dst[63:48] := src2[31:16] dst[79:64] := src1[47:32] dst[95:80] := src2[47:32] dst[111:96] := src1[63:48] dst[127:112] := src2[63:48] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_WORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_WORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckldq
__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckldq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[31:0] dst[63:32] := src2[31:0] dst[95:64] := src1[63:32] dst[127:96] := src2[63:32] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpcklqdq
__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpcklqdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 64-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[63:0] dst[127:64] := src2[63:0] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpcklbw
__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpcklbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 8-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_BYTES(src1[127:0], src2[127:0]){ dst[7:0] := src1[7:0] dst[15:8] := src2[7:0] dst[23:16] := src1[15:8] dst[31:24] := src2[15:8] dst[39:32] := src1[23:16] dst[47:40] := src2[23:16] dst[55:48] := src1[31:24] dst[63:56] := src2[31:24] dst[71:64] := src1[39:32] dst[79:72] := src2[39:32] dst[87:80] := src1[47:40] dst[95:88] := src2[47:40] dst[103:96] := src1[55:48] dst[111:104] := src2[55:48] dst[119:112] := src1[63:56] dst[127:120] := src2[63:56] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_BYTES(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_BYTES(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vunpcklpd
__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vunpcklpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave double-precision (64-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[63:0] dst[127:64] := src2[63:0] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vunpcklps
__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vunpcklps ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave single-precision (32-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[31:0] dst[63:32] := src2[31:0] dst[95:64] := src1[63:32] dst[127:96] := src2[63:32] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
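
Because these 256-bit unpacks operate per 128-bit lane, fully interleaving two 8-float vectors needs an extra cross-lane step; a minimal sketch (assuming AVX; the function name is illustrative):

#include <immintrin.h>  /* AVX intrinsics */

void interleave(float *out, __m256 x, __m256 y) {
    __m256 lo = _mm256_unpacklo_ps(x, y);  /* x0 y0 x1 y1 | x4 y4 x5 y5 */
    __m256 hi = _mm256_unpackhi_ps(x, y);  /* x2 y2 x3 y3 | x6 y6 x7 y7 */
    /* vperm2f128 reassembles the lanes into fully interleaved order */
    _mm256_storeu_ps(out,     _mm256_permute2f128_ps(lo, hi, 0x20));
    _mm256_storeu_ps(out + 8, _mm256_permute2f128_ps(lo, hi, 0x31));
}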
vxorpd
__m256d _mm256_xor_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_xor_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vxorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] XOR b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vxorps
__m256 _mm256_xor_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_xor_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vxorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise XOR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] XOR b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
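
A classic use of vxorps is flipping the sign bit of every lane; a minimal sketch (assuming AVX; the function name is illustrative):

#include <immintrin.h>  /* AVX intrinsics */

/* Negate eight floats at once: XOR with -0.0f (0x80000000 per lane)
 * toggles only the sign bits. */
__m256 negate(__m256 v) {
    return _mm256_xor_ps(v, _mm256_set1_ps(-0.0f));
}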
vpxor
__m256i _mm256_xor_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_xor_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpxor ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] XOR b[255:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vzeroall
void _mm256_zeroall (void)

Synopsis

void _mm256_zeroall (void)
#include «immintrin.h»
Instruction: vzeroall
CPUID Flags: AVX

Description

Zero the contents of all XMM or YMM registers.

Operation

YMM0[MAX:0] := 0 YMM1[MAX:0] := 0 YMM2[MAX:0] := 0 YMM3[MAX:0] := 0 YMM4[MAX:0] := 0 YMM5[MAX:0] := 0 YMM6[MAX:0] := 0 YMM7[MAX:0] := 0 IF 64-bit mode YMM8[MAX:0] := 0 YMM9[MAX:0] := 0 YMM10[MAX:0] := 0 YMM11[MAX:0] := 0 YMM12[MAX:0] := 0 YMM13[MAX:0] := 0 YMM14[MAX:0] := 0 YMM15[MAX:0] := 0 FI
vzeroupper
void _mm256_zeroupper (void)

Synopsis

void _mm256_zeroupper (void)
#include «immintrin.h»
Instruction: vzeroupper
CPUID Flags: AVX

Description

Zero the upper 128 bits of all YMM registers; the lower 128 bits of the registers are unmodified.

Operation

YMM0[MAX:128] := 0 YMM1[MAX:128] := 0 YMM2[MAX:128] := 0 YMM3[MAX:128] := 0 YMM4[MAX:128] := 0 YMM5[MAX:128] := 0 YMM6[MAX:128] := 0 YMM7[MAX:128] := 0 IF 64-bit mode YMM8[MAX:128] := 0 YMM9[MAX:128] := 0 YMM10[MAX:128] := 0 YMM11[MAX:128] := 0 YMM12[MAX:128] := 0 YMM13[MAX:128] := 0 YMM14[MAX:128] := 0 YMM15[MAX:128] := 0 FI

Performance

Architecture Latency Throughput
Haswell 0 1
Ivy Bridge 0 1
Sandy Bridge 0 1
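
A minimal sketch of the usual convention (assuming AVX; names are illustrative — compilers generally emit vzeroupper automatically at function boundaries):

#include <immintrin.h>  /* AVX intrinsics */

/* Clearing the upper YMM halves before returning to legacy SSE code
 * avoids the AVX-to-SSE transition penalty on these architectures. */
void double_in_place(float *p) {
    __m256 v = _mm256_loadu_ps(p);
    _mm256_storeu_ps(p, _mm256_add_ps(v, v));
    _mm256_zeroupper();
}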

SSE4 — CPU instruction set used in the Intel Core microarchitecture and in AMD processors

SSE4 consists of 54 instructions, 47 of which belong to SSE4.1 (present in Penryn processors). The full set (SSE4.1 plus the remaining 7 SSE4.2 instructions) is available in Intel processors based on the Nehalem microarchitecture, released in mid-November 2008, and in later designs. None of the SSE4 instructions operate on the 64-bit mmx registers (only on the 128-bit xmm0-15 registers).

Intel's C compiler generates SSE4 instructions starting with version 10 when the -QxS option is given. Sun Microsystems' Sun Studio compiler generates SSE4 instructions from version 12 update 1 via the -xarch=sse4_1 (SSE4.1) and -xarch=sse4_2 (SSE4.2) options. GCC supports SSE4.1 and SSE4.2 since version 4.3 through the -msse4.1 and -msse4.2 options, or -msse4, which enables both.

SSE4.1 instructions
Video acceleration

  • MPSADBW xmm1, xmm2/m128, imm8 — (Multiple Packed Sums of Absolute Difference)
    • Input — { A0, A1, … A14 }, { B0, B1, … B15 }, Shiftmode
    • Output — { SAD0, SAD1, SAD2, … SAD7 }

Computes eight sums of absolute differences (SAD) over shifted 4-byte unsigned groups. The operand positions for the 16-bit SADs are selected by 3 bits of the immediate argument imm8:

s1 = imm8[2]*4
s2 = imm8[1:0]*4
SAD0 = |A(s1+0)-B(s2+0)| + |A(s1+1)-B(s2+1)| + |A(s1+2)-B(s2+2)| + |A(s1+3)-B(s2+3)|
SAD1 = |A(s1+1)-B(s2+0)| + |A(s1+2)-B(s2+1)| + |A(s1+3)-B(s2+2)| + |A(s1+4)-B(s2+3)|
SAD2 = |A(s1+2)-B(s2+0)| + |A(s1+3)-B(s2+1)| + |A(s1+4)-B(s2+2)| + |A(s1+5)-B(s2+3)|
...
SAD7 = |A(s1+7)-B(s2+0)| + |A(s1+8)-B(s2+1)| + |A(s1+9)-B(s2+2)| + |A(s1+10)-B(s2+3)|
  • PHMINPOSUW xmm1, xmm2/m128 — (Packed Horizontal Word Minimum)
    • Input — { A0, A1, … A7 }
    • Output — { MinVal, MinPos, 0, 0, … }

Finds, among the unsigned 16-bit fields A0…A7, the field with the minimum value (and, if several fields share it, the lowest position). Returns the 16-bit value and its position; a sketch follows below.
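
A minimal illustrative sketch (assuming a C compiler with SSE4.1 enabled, e.g. -msse4.1; the function name is hypothetical):

#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* PHMINPOSUW via _mm_minpos_epu16: the minimum of eight unsigned
 * 16-bit values and its index, in one instruction. */
unsigned short min_of_8(const unsigned short v[8], int *pos) {
    __m128i r = _mm_minpos_epu16(_mm_loadu_si128((const __m128i *)v));
    *pos = _mm_extract_epi16(r, 1);                 /* MinPos, bits 18:16 */
    return (unsigned short)_mm_extract_epi16(r, 0); /* MinVal, bits 15:0  */
}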

  • PMOV{SX,ZX}{B,W,D} xmm1, xmm2/m{64,32,16} — (Packed Move with Sign/Zero Extend)

A group of 12 instructions for widening packed fields. Packed 8-, 16-, or 32-bit fields from the low part of the argument are extended (with or without sign) into the 16-, 32-, or 64-bit fields of the result (a usage sketch follows the table):

Result format    from 8 bits           from 16 bits          from 32 bits
16 bits          PMOVSXBW, PMOVZXBW
32 bits          PMOVSXBD, PMOVZXBD    PMOVSXWD, PMOVZXWD
64 bits          PMOVSXBQ, PMOVZXBQ    PMOVSXWQ, PMOVZXWQ    PMOVSXDQ, PMOVZXDQ
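
A minimal sketch of one of these widenings (assuming SSE4.1; the helper name is hypothetical):

#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* PMOVZXBW via _mm_cvtepu8_epi16: widen eight unsigned bytes
 * from the low 64 bits of the source into 16-bit lanes. */
__m128i widen_u8_to_u16(const unsigned char *p) {
    __m128i bytes = _mm_loadl_epi64((const __m128i *)p); /* load low 8 bytes */
    return _mm_cvtepu8_epi16(bytes);                     /* pmovzxbw */
}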

Vector primitives

  • P{MIN,MAX}{SB,UW,SD,UD} xmm1, xmm2/m128 — (Minimum/Maximum of Packed Signed/Unsigned Byte/Word/DWord Integers)

Each field of the result is the minimum/maximum of the corresponding fields of the two arguments. Byte fields are treated only as signed numbers, 16-bit fields only as unsigned. For packed 32-bit fields both signed and unsigned variants are provided.

  • PMULDQ xmm1, xmm2/m128 — (Multiply Packed Signed Dword Integers)
    • Input — { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
    • Output — { A0*B0, A2*B2 }
      Multiplies signed 32-bit fields, producing the full 64 bits of each result (two multiplications, over fields 0 and 2 of the arguments).
  • PMULLD xmm1, xmm2/m128 — (Multiply Packed Signed Dword Integers and Store Low Result)
    • Input — { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
    • Output — { low32(A0*B0), low32(A1*B1), low32(A2*B2), low32(A3*B3) }
      Multiplies signed 32-bit fields, keeping the low 32 bits of each result (four multiplications, over all fields of the arguments); see the sketch after this list.
  • PACKUSDW xmm1, xmm2/m128 — (Pack with Unsigned Saturation)
    Packs signed 32-bit fields into unsigned 16-bit fields with saturation.
  • PCMPEQQ xmm1, xmm2/m128 — (Compare Packed Qword Data for Equal)
    Compares 64-bit fields for equality, producing 64-bit masks.
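
A minimal sketch of the PMULLD form (assuming SSE4.1; the function name is hypothetical):

#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* PMULLD via _mm_mullo_epi32: element-wise signed 32-bit multiply
 * that keeps only the low 32 bits of each product. */
__m128i scale_by_10(__m128i v) {
    return _mm_mullo_epi32(v, _mm_set1_epi32(10));
}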

Insertion/extraction

  • INSERTPS xmm1, xmm2/m32, imm8 — (Insert Packed Single Precision Floating-Point Value)

Inserts a 32-bit field from xmm2 (any of the 4 fields of that register can be selected) or from a 32-bit memory location into an arbitrary field of the result. In addition, any field of the result can be cleared to +0.0.

  • EXTRACTPS r/m32, xmm, imm8 — (Extract Packed Single Precision Floating-Point Value)

Extracts a 32-bit field from an xmm register; the field number is given in the low 2 bits of imm8. If a 64-bit register is specified as the destination, its upper 32 bits are cleared (zero extension).

  • PINSR{B,D,Q} xmm, r/m*, imm8 — (Insert Byte/Dword/Qword)

Inserts an 8-, 32-, or 64-bit value into the specified field of an xmm register (the other fields are unchanged).

  • PEXTR{B,W,D,Q} r/m*, xmm, imm8 — (Extract Byte/Word/Dword/Qword)

Extracts the 8-, 16-, 32-, or 64-bit field of the xmm register selected by imm8. If a register is specified as the destination, its upper part is cleared (zero extension).

Vector dot products

  • DPPS xmm1, xmm2/m128, imm8 — (Dot Product of Packed Single Precision Floating-Point Values)
  • DPPD xmm1, xmm2/m128, imm8 — (Dot Product of Packed Double Precision Floating-Point Values)

Dot product over 32/64-bit fields. A bit mask in imm8 selects which products of fields are summed, and what is written to each field of the result: the sum of the selected products or +0.0. A sketch follows below.
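
A minimal sketch of a full 4-element dot product with DPPS (assuming SSE4.1; the function name is hypothetical):

#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* DPPS via _mm_dp_ps: mask 0xF1 multiplies and sums all four lanes
 * and writes the sum to lane 0 only. */
float dot4(__m128 a, __m128 b) {
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}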

Blends

  • BLENDV{PS,PD} xmm1, xmm2/m128, <xmm0> — (Variable Blend Packed Single/Double Precision Floating-Point Values)

Each 32/64-bit field of the result is selected from either the first or the second argument, depending on the sign of the same field in the implicit argument xmm0.

  • BLEND{PS,PD} xmm1, xmm2/m128, imm8 — (Blend Packed Single/Double Precision Floating-Point Values)

A bit mask (4 or 2 bits) in imm8 specifies which argument each 32/64-bit field of the result is taken from.

  • PBLENDVB xmm1, xmm2/m128, <xmm0> — (Variable Blend Packed Bytes)

Each byte field of the result is selected from either the first or the second argument, depending on the sign of the byte in the same field of the implicit argument xmm0.

  • PBLENDW xmm1, xmm2/m128, imm8 — (Blend Packed Words)

A bit mask (8 bits) in imm8 specifies which argument each 16-bit field of the result is taken from.
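
A small illustrative sketch of both blend forms (_mm_blendv_ps for BLENDVPS, _mm_blend_ps for BLENDPS):

    #include <smmintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a    = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
        __m128 b    = _mm_setr_ps(-1.0f, -2.0f, -3.0f, -4.0f);
        /* sign bits of the mask fields: 1, 0, 1, 0 */
        __m128 mask = _mm_setr_ps(-0.0f, 0.0f, -0.0f, 0.0f);

        /* BLENDVPS: fields whose mask sign bit is set come from b */
        __m128 v = _mm_blendv_ps(a, b, mask);

        /* BLENDPS: immediate bit i set -> field i comes from b */
        __m128 c = _mm_blend_ps(a, b, 0x3); /* fields 0 and 1 from b */

        float rv[4], rc[4];
        _mm_storeu_ps(rv, v);
        _mm_storeu_ps(rc, c);
        printf("blendv: %g %g %g %g\n", rv[0], rv[1], rv[2], rv[3]);
        printf("blend : %g %g %g %g\n", rc[0], rc[1], rc[2], rc[3]);
        return 0;
    }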

Bit tests

  • PTEST xmm1, xmm2/m128 — (Logical Compare)

Sets the ZF flag only if all bits of xmm2/m128 that are marked by the mask in xmm1 are zero. If all the unmarked bits are zero, the CF flag is set. The remaining flags (AF, OF, PF, SF) are always cleared. The instruction does not modify xmm1.
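
The two flag results map directly onto the _mm_testz_si128 (ZF) and _mm_testc_si128 (CF) intrinsics; a minimal sketch:

    #include <smmintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128i mask = _mm_setr_epi32(0xFF, 0, 0, 0);
        __m128i val  = _mm_setr_epi32(0x100, 0, 0, 0);

        /* PTEST sets ZF when (mask & val) == 0; the intrinsic returns ZF */
        if (_mm_testz_si128(mask, val))
            printf("no bits of val fall inside the mask\n");

        /* _mm_testc_si128 returns CF: set when (~mask & val) == 0,
           i.e. every set bit of val lies inside the mask            */
        printf("CF = %d\n", _mm_testc_si128(mask, val)); /* 0 here */
        return 0;
    }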

Rounding

  • ROUND{PS, PD} xmm1, xmm2/m128, imm8 — (Round Packed Single/Double Precision Floating-Point Values)

Rounds all 32/64-bit fields. The rounding mode (4 variants) is taken either from MXCSR.RC or specified directly in imm8. Generation of the precision (inexact) exception can also be suppressed.

  • ROUND{SS, SD} xmm1, xmm2/m128, imm8 — (Round Scalar Single/Double Precision Floating-Point Values)

Rounds only the lowest 32/64-bit field (the remaining bits are unchanged).
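
A sketch of packed rounding with an explicit mode and suppressed precision exceptions (_mm_round_ps maps to ROUNDPS); illustrative only:

    #include <smmintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 v = _mm_setr_ps(1.5f, -1.5f, 2.5f, -2.7f);

        /* ROUNDPS with the mode in imm8 and _MM_FROUND_NO_EXC to
           suppress the precision (inexact) exception              */
        __m128 nearest = _mm_round_ps(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        __m128 down    = _mm_round_ps(v, _MM_FROUND_TO_NEG_INF     | _MM_FROUND_NO_EXC);

        float n[4], d[4];
        _mm_storeu_ps(n, nearest);
        _mm_storeu_ps(d, down);
        printf("nearest: %g %g %g %g\n", n[0], n[1], n[2], n[3]); /* 2 -2 2 -3 */
        printf("floor  : %g %g %g %g\n", d[0], d[1], d[2], d[3]); /* 1 -2 2 -3 */
        return 0;
    }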

Reading WC memory

  • MOVNTDQA xmm1, m128 — (Load Double Quadword Non-Temporal Aligned Hint)

A load operation that can speed up (by up to 7.5×) work with write-combining memory regions.
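
A sketch of the intended pattern, assuming a hypothetical helper copy_from_wc and a 16-byte-aligned source in a WC-mapped region (_mm_stream_load_si128 maps to MOVNTDQA; on ordinary write-back memory it behaves like a normal load):

    #include <smmintrin.h>
    #include <stddef.h>

    /* Hypothetical example: src points to a 16-byte-aligned buffer in a
       write-combining (WC) region, e.g. memory mapped from a device BAR. */
    void copy_from_wc(void *dst, const void *src, size_t n)
    {
        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;
        for (size_t i = 0; i < n / 16; i++) {
            /* MOVNTDQA: streaming load, lets the CPU burst-read WC memory */
            __m128i x = _mm_stream_load_si128((__m128i *)&s[i]);
            _mm_storeu_si128(&d[i], x);
        }
        _mm_mfence(); /* order the streaming loads against later accesses */
    }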

New SSE4.2 instructions

String processing

These instructions perform arithmetic comparisons between all possible pairs of fields (64 or 256 comparisons!) from the two strings given by the contents of xmm1 and xmm2/m128. The Boolean results of the comparisons are then processed to produce the required output. The immediate argument imm8 controls the element size (byte or Unicode strings, up to 16/8 elements each), the signedness of the fields (string elements), the comparison type, and the interpretation of the results.

They can be used to search a string (memory region) for characters from a given set or within given ranges, to compare strings (memory regions), or to search for substrings.

All of them affect the processor flags: SF is set if xmm1 does not hold a complete string, ZF if xmm2/m128 does not hold a complete string, CF if the result is non-zero, and OF if the least significant bit of the result is non-zero. The AF and PF flags are cleared.

  • PCMPESTRI <ecx>, xmm1, xmm2/m128, <eax>, <edx>, imm8 — (Packed Compare Explicit Length Strings, Return Index)

String lengths are given explicitly in <eax> and <edx> (the absolute value of each register is taken, saturated to 8/16 depending on the element size). The result is placed in the ecx register.

  • PCMPESTRM <xmm0>, xmm1, xmm2/m128, <eax>, <edx>, imm8 — (Packed Compare Explicit Length Strings, Return Mask)

String lengths are given explicitly in <eax> and <edx> (the absolute value of each register is taken, saturated to 8/16 depending on the element size). The result is placed in the xmm0 register.

  • PCMPISTRI <ecx>, xmm1, xmm2/m128, imm8 — (Packed Compare Implicit Length Strings, Return Index)

String lengths are implicit (each string is scanned for a terminating zero element). The result is placed in the ecx register.

  • PCMPISTRM <xmm0>, xmm1, xmm2/m128, imm8 — (Packed Compare Implicit Length Strings, Return Mask)

String lengths are implicit (each string is scanned for a terminating zero element). The result is placed in the xmm0 register.
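
As an illustration (not from the original text), a strpbrk-style search for the first vowel in a 16-byte block using _mm_cmpistri (PCMPISTRI); the constants shown are the defaults, spelled out for readability (build with -msse4.2):

    #include <nmmintrin.h> /* SSE4.2 intrinsics */
    #include <stdio.h>

    int main(void) {
        const char vowels[16]   = "aeiou";            /* zero-padded set */
        const char text[16 + 1] = "strchr via simd!"; /* 16-byte block   */

        __m128i set = _mm_loadu_si128((const __m128i *)vowels);
        __m128i blk = _mm_loadu_si128((const __m128i *)text);

        /* imm8: unsigned bytes, «equal any» aggregation,
           index of the least significant match           */
        int idx = _mm_cmpistri(set, blk, _SIDD_UBYTE_OPS |
                                         _SIDD_CMP_EQUAL_ANY |
                                         _SIDD_LEAST_SIGNIFICANT);
        printf("first vowel at index %d\n", idx); /* 'i' at index 8 */
        return 0;
    }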

CRC32 computation

  • CRC32 r32, r/m* — (Accumulate CRC32 Value)

Accumulates a CRC-32C value (also known as CRC-32/ISCSI or CRC-32/CASTAGNOLI) over an 8-, 16-, 32-, or 64-bit argument (the polynomial 0x1EDC6F41 is used).
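
A minimal CRC-32C sketch over the _mm_crc32_u8 intrinsic; starting from 0xFFFFFFFF and inverting at the end is a common convention (not mandated by the instruction itself), and 0xE3069283 is the standard check value for «123456789»:

    #include <nmmintrin.h> /* SSE4.2 intrinsics */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    uint32_t crc32c(const void *buf, size_t len)
    {
        const uint8_t *p = (const uint8_t *)buf;
        uint32_t crc = 0xFFFFFFFFu;
        while (len--)
            crc = _mm_crc32_u8(crc, *p++); /* CRC32 r32, r/m8 */
        return ~crc;
    }

    int main(void) {
        const char *msg = "123456789";
        printf("crc32c = 0x%08X\n", crc32c(msg, strlen(msg))); /* 0xE3069283 */
        return 0;
    }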

Population count of set bits

  • POPCNT r, r/m* — (Return the Count of Number of Bits Set to 1)

Counts the number of bits set to 1. The instruction has three variants: for 16-, 32-, and 64-bit registers. It is also present in AMD's SSE4a.
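
And the corresponding intrinsic (here the 32-bit variant):

    #include <nmmintrin.h>
    #include <stdio.h>

    int main(void) {
        unsigned int x = 0xF0F0F0F0u;
        /* POPCNT r32, r/m32 */
        printf("bits set in 0x%08X: %d\n", x, _mm_popcnt_u32(x)); /* 16 */
        return 0;
    }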

Vector primitives

  • PCMPGTQ xmm1, xmm2/m128 — (Compare Packed Qword Data for Greater Than)

Compares 64-bit fields for «greater than», producing 64-bit masks.

SSE4a

The SSE4a instruction set was introduced by AMD in processors based on the Barcelona architecture. This extension is not available in Intel processors. Support is indicated by the CPUID.80000001H:ECX.SSE4A[Bit 6] flag.

Instruction         Description
LZCNT / POPCNT      Count leading zero bits / bits set to 1.
EXTRQ / INSERTQ     Combined mask-and-shift instructions.
MOVNTSD / MOVNTSS   Scalar streaming-store instructions.

Intel CPU security features

List of Intel CPU security features along with short descriptions taken from the Intel manuals.

WP (Write Protect) (PDF)

Quoting Volume 3A, 4-3, Paragraph 4.1.3:

CR0.WP allows pages to be protected from supervisor-mode writes. If CR0.WP = 0, supervisor-mode write accesses are allowed to linear addresses with read-only access rights; if CR0.WP = 1, they are not (User-mode write accesses are never allowed to linear addresses with read-only access rights, regardless of the value of CR0.WP).
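
To make the bit concrete, a ring-0-only sketch (hypothetical helpers, illustration only; user-mode code cannot read CR0) that tests CR0.WP:

    /* Supervisor-mode sketch: read CR0 and test the WP bit (bit 16).
       Only meaningful inside a kernel module or similar ring-0 code. */
    #include <stdint.h>

    static inline uint64_t my_read_cr0(void)
    {
        uint64_t cr0;
        __asm__ __volatile__("mov %%cr0, %0" : "=r"(cr0));
        return cr0;
    }

    static inline int wp_enabled(void)
    {
        return (my_read_cr0() >> 16) & 1; /* CR0.WP is bit 16 */
    }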

NXE/XD (No-Execute Enable/Execute Disable) (PDF)

Regarding IA32_EFER MSR and NXE (Volume 3A, 4-3, Paragraph 4.1.3):

IA32_EFER.NXE enables execute-disable access rights for PAE paging and IA-32e paging. If IA32_EFER.NXE = 1, instruction fetches can be prevented from specified linear addresses (even if data reads from the addresses are allowed).

IA32_EFER.NXE has no effect with 32-bit paging. Software that wants to use this feature to limit instruction fetches from readable pages must use either PAE paging or IA-32e paging.

Regarding XD (Volume 3A, 4-17, Table 4-11):

If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0).

SMAP (Supervisor Mode Access Protection) (PDF)

Quoting Volume 3A, 4-3, Paragraph 4.1.3:

CR4.SMAP allows pages to be protected from supervisor-mode data accesses. If CR4.SMAP = 1, software operating in supervisor mode cannot access data at linear addresses that are accessible in user mode. Software can override this protection by setting EFLAGS.AC.
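
In supervisor code the EFLAGS.AC override is typically a STAC/CLAC pair bracketing the user-memory access; a hypothetical sketch (kernels wrap this in helpers such as Linux's copy_from_user()):

    /* Ring-0 sketch: temporarily permit user-memory access under SMAP.
       STAC sets EFLAGS.AC, CLAC clears it; both require SMAP support. */
    static inline void smap_open(void)  { __asm__ __volatile__("stac" ::: "memory"); }
    static inline void smap_close(void) { __asm__ __volatile__("clac" ::: "memory"); }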

SMEP (Supervisor Mode Execution Protection) (PDF)

Quoting Volume 3A, 4-3, Paragraph 4.1.3:

CR4.SMEP allows pages to be protected from supervisor-mode instruction fetches. If CR4.SMEP = 1, software operating in supervisor mode cannot fetch instructions from linear addresses that are accessible in user mode.

MPX (Memory Protection Extensions) (PDF)

Intel MPX introduces new bounds registers and new instructions that operate on bounds registers. Intel MPX allows an OS to support user mode software (operating at CPL = 3) and supervisor mode software (CPL < 3) to add memory protection capability against buffer overrun. It provides controls to enable Intel MPX extensions for user mode and supervisor mode independently. Intel MPX extensions are designed to allow software to associate bounds with pointers, and allow software to check memory references against the bounds associated with the pointer to prevent out of bound memory access (thus preventing buffer overflow).

SGX (Software Guard Extensions) (PDF)

These extensions allow an application to instantiate a protected container, referred to as an enclave. An enclave is a protected area in the application’s address space (see Figure 1-1), which provides confidentiality and integrity even in the presence of privileged malware. Accesses to the enclave memory area from any software not resident in the enclave are prevented.

Protection keys (PDF)

Quoting Volume 3A, 4-31, Paragraph 4.6.2:

The protection-key feature provides an additional mechanism by which IA-32e paging controls access to user-mode addresses. When CR4.PKE = 1, every linear address is associated with the 4-bit protection key located in bits 62:59 of the paging-structure entry that mapped the page containing the linear address. The PKRU register determines, for each protection key, whether user-mode addresses with that protection key may be read or written.

The following paragraphs, taken from LWN, shed some light on the purpose of memory protection keys:

One might well wonder why this feature is needed when everything it does can be achieved with the memory-protection bits that already exist. The problem with the current bits is that they can be expensive to manipulate. A change requires invalidating translation lookaside buffer (TLB) entries across the entire system, which is bad enough, but changing the protections on a region of memory can require individually changing the page-table entries for thousands (or more) pages. Instead, once the protection keys are set, a region of memory can be enabled or disabled with a single register write. For any application that frequently changes the protections on regions of its address space, the performance improvement will be large.

There is still the question (as asked by Ingo Molnar) of just why a process would want to make this kind of frequent memory-protection change. There would appear to be a few use cases driving this development. One is the handling of sensitive cryptographic data. A network-facing daemon could use a cryptographic key to encrypt data to be sent over the wire, then disable access to the memory holding the key (and the plain-text data) before writing the data out. At that point, there is no way that the daemon can leak the key or the plain text over the wire; protecting sensitive data in this way might also make applications a bit more resistant to attack.

Another commonly mentioned use case is to protect regions of data from being corrupted by «stray» write operations. An in-memory database could prevent writes to the actual data most of the time, enabling them only briefly when an actual change needs to be made. In this way, database corruption due to bugs could be fended off, at least some of the time. Ingo was unconvinced by this use case; he suggested that a 64-bit address space should be big enough to hide data in and protect it from corruption. He also suggested that a version of mprotect() that optionally skipped TLB invalidation could address many of the performance issues, especially if huge pages were used. Alan Cox responded, though, that there is real-world demand for the ability to change protection on gigabytes of memory at a time, and that mprotect() is simply too slow.
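
A sketch of the crypto-key use case through the corresponding Linux API (pkey_alloc/pkey_mprotect/pkey_set, available with glibc 2.27+ and kernel support on PKU-capable CPUs; error handling elided):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <string.h>
    #include <stdio.h>

    int main(void) {
        size_t len = 4096;
        char *secret = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

        int pkey = pkey_alloc(0, 0);                 /* grab a free key  */
        pkey_mprotect(secret, len, PROT_READ | PROT_WRITE, pkey);

        strcpy(secret, "key material");

        /* One PKRU write disables access to every page tagged with pkey;
           no TLB shootdown, no per-page table walk.                     */
        pkey_set(pkey, PKEY_DISABLE_ACCESS);

        /* Reading or writing *secret here would now raise SIGSEGV. */

        pkey_set(pkey, 0); /* re-enable before using the data again */
        printf("%s\n", secret);
        return 0;
    }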

CET (Control-flow Enforcement Technology) (PDF)

Control-flow Enforcement Technology (CET) provides the following capabilities to defend against ROP/JOP style control-flow subversion attacks:

  • Shadow Stack – return address protection to defend against Return Oriented Programming,
  • Indirect branch tracking – free branch protection to defend against Jump/Call Oriented Programming.