Hypervisor From Scratch – Part 2: Entering VMX Operation

(Original text by Sinaei)

Hi guys,

This is the second part of a multi-part tutorial series called “Hypervisor From Scratch”. I highly recommend reading the first part (Basic Concepts & Configure Testing Environment) before this one, as it contains the basic knowledge you need in order to understand the rest of this tutorial.

In this part, we will learn how to detect hypervisor support in our processor, then configure the basics needed to enable VMX and enter VMX operation, along with a lot more about the Windows Driver Kit (WDK).

Configuring Our IRP Major Functions

Besides our kernel-mode driver (“MyHypervisorDriver“), I created a user-mode application called “MyHypervisorApp“ (the source code is available on my GitHub). First of all, I should encourage you to write most of your code in user mode rather than kernel mode: an unhandled exception in kernel mode leads to a BSOD, and running less code in kernel mode reduces the chance of introducing nasty kernel-mode bugs.

If you remember from the previous part, we created some Windows Driver Kit code; now we want to extend our project to support more IRP Major Functions.

IRP Major Functions live in a conventional Windows table that is created for every device. Once you register your device in Windows, you have to introduce the functions that handle these IRP Major Functions. In other words, every device has a table of its Major Functions, and every time a user-mode application calls one of them, Windows finds the corresponding function (if the device driver supports that MJ Function) based on the device the user requested, calls it, and passes an IRP pointer to the kernel driver.
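To make the dispatch mechanism concrete, here is a tiny user-mode model of such a table (FakeIrp, Dispatch, and the handler names are hypothetical and only for illustration; the real table lives in the driver object's MajorFunction array):

```c
#include <assert.h>
#include <stddef.h>

/* A toy model of the per-driver major-function table: one function
 * pointer per IRP_MJ_* code, defaulted to an "unsupported" handler. */
#define MODEL_IRP_MJ_MAXIMUM_FUNCTION 0x1b
#define MODEL_IRP_MJ_CREATE           0x00

typedef struct { int MajorFunction; } FakeIrp;
typedef int (*DispatchFn)(FakeIrp *Irp);

static int HandleCreate(FakeIrp *Irp)      { (void)Irp; return 1; }
static int HandleUnsupported(FakeIrp *Irp) { (void)Irp; return 0; }

static DispatchFn MajorFunction[MODEL_IRP_MJ_MAXIMUM_FUNCTION + 1];

static void SetupTable(void)
{
    /* Default every slot, then override the ones we support. */
    for (int i = 0; i <= MODEL_IRP_MJ_MAXIMUM_FUNCTION; i++)
        MajorFunction[i] = HandleUnsupported;
    MajorFunction[MODEL_IRP_MJ_CREATE] = HandleCreate;
}

/* The I/O manager picks the handler by the IRP's major-function code. */
static int Dispatch(FakeIrp *Irp)
{
    return MajorFunction[Irp->MajorFunction](Irp);
}
```

This mirrors what our DriverEntry does below: fill every slot with a default handler, then register the specific handlers we care about.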

It is then the responsibility of the device driver’s function to check privileges and so on.

The following code creates the device:

NTSTATUS NtStatus = STATUS_SUCCESS;
UINT64 uiIndex = 0;
PDEVICE_OBJECT pDeviceObject = NULL;
UNICODE_STRING usDriverName, usDosDeviceName;

DbgPrint("[*] DriverEntry Called.");

RtlInitUnicodeString(&usDriverName, L"\\Device\\MyHypervisorDevice");
RtlInitUnicodeString(&usDosDeviceName, L"\\DosDevices\\MyHypervisorDevice");

NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject);
NTSTATUS NtStatusSymLinkResult = IoCreateSymbolicLink(&usDosDeviceName, &usDriverName);

Note that our device name is “\Device\MyHypervisorDevice”.

After that, we need to introduce our Major Functions for our device.

if (NtStatus == STATUS_SUCCESS && NtStatusSymLinkResult == STATUS_SUCCESS)
{
    for (uiIndex = 0; uiIndex < IRP_MJ_MAXIMUM_FUNCTION; uiIndex++)
        pDriverObject->MajorFunction[uiIndex] = DrvUnsupported;

    DbgPrint("[*] Setting Devices major functions.");
    pDriverObject->MajorFunction[IRP_MJ_CLOSE] = DrvClose;
    pDriverObject->MajorFunction[IRP_MJ_CREATE] = DrvCreate;
    pDriverObject->MajorFunction[IRP_MJ_DEVICE_CONTROL] = DrvIOCTLDispatcher;
    pDriverObject->MajorFunction[IRP_MJ_READ] = DrvRead;
    pDriverObject->MajorFunction[IRP_MJ_WRITE] = DrvWrite;

    pDriverObject->DriverUnload = DrvUnload;
}
else
{
    DbgPrint("[*] There were some errors in creating the device.");
}

You can see that I assign “DrvUnsupported” to all the functions; this is a handler for all MJ Functions that tells the user the operation is not supported. Its main body looks like this:

NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    DbgPrint("[*] This function is not supported :( !");

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    return STATUS_SUCCESS;
}

We also introduce the other major functions that are essential for our device. We’ll complete their implementations in future parts; for now, let’s just leave them as stubs.

NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    DbgPrint("[*] Not implemented yet :( !");

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    return STATUS_SUCCESS;
}

NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    DbgPrint("[*] Not implemented yet :( !");

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    return STATUS_SUCCESS;
}

NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    DbgPrint("[*] Not implemented yet :( !");

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    return STATUS_SUCCESS;
}

NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    DbgPrint("[*] Not implemented yet :( !");

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    return STATUS_SUCCESS;
}

Now let’s look at the list of IRP MJ Functions and the other kinds of Windows Driver Kit handler routines.

IRP Major Functions List

This is a list of IRP Major Functions which we can use in order to perform different operations.

#define IRP_MJ_CREATE                   0x00
#define IRP_MJ_CREATE_NAMED_PIPE        0x01
#define IRP_MJ_CLOSE                    0x02
#define IRP_MJ_READ                     0x03
#define IRP_MJ_WRITE                    0x04
#define IRP_MJ_QUERY_INFORMATION        0x05
#define IRP_MJ_SET_INFORMATION          0x06
#define IRP_MJ_QUERY_EA                 0x07
#define IRP_MJ_SET_EA                   0x08
#define IRP_MJ_FLUSH_BUFFERS            0x09
#define IRP_MJ_QUERY_VOLUME_INFORMATION 0x0a
#define IRP_MJ_SET_VOLUME_INFORMATION   0x0b
#define IRP_MJ_DIRECTORY_CONTROL        0x0c
#define IRP_MJ_FILE_SYSTEM_CONTROL      0x0d
#define IRP_MJ_DEVICE_CONTROL           0x0e
#define IRP_MJ_INTERNAL_DEVICE_CONTROL  0x0f
#define IRP_MJ_SHUTDOWN                 0x10
#define IRP_MJ_LOCK_CONTROL             0x11
#define IRP_MJ_CLEANUP                  0x12
#define IRP_MJ_CREATE_MAILSLOT          0x13
#define IRP_MJ_QUERY_SECURITY           0x14
#define IRP_MJ_SET_SECURITY             0x15
#define IRP_MJ_POWER                    0x16
#define IRP_MJ_SYSTEM_CONTROL           0x17
#define IRP_MJ_DEVICE_CHANGE            0x18
#define IRP_MJ_QUERY_QUOTA              0x19
#define IRP_MJ_SET_QUOTA                0x1a
#define IRP_MJ_PNP                      0x1b
#define IRP_MJ_PNP_POWER                IRP_MJ_PNP      // Obsolete....
#define IRP_MJ_MAXIMUM_FUNCTION         0x1b

Every major function triggers only when we call its corresponding function from user mode. For instance, there is a user-mode function called CreateFile (and all its variants, such as CreateFileA and CreateFileW for ANSI and Unicode), so every time we call CreateFile, the function registered as IRP_MJ_CREATE is called; likewise, ReadFile triggers IRP_MJ_READ and WriteFile triggers IRP_MJ_WRITE. You can see that Windows treats its devices like files, and everything we need to pass from user mode to kernel mode is available in the PIRP Irp as a buffer when the function is called.

In this case, Windows is responsible for copying the user-mode buffer to the kernel-mode stack.

Don’t worry, we’ll use this frequently in the rest of the project, but in this part we only support IRP_MJ_CREATE and leave the others unimplemented for future parts.

IRP Minor Functions

IRP Minor Functions are mainly used by the PnP manager to notify drivers of special events. For example, the PnP manager sends IRP_MN_START_DEVICE after it has assigned hardware resources, if any, to the device, and it sends IRP_MN_STOP_DEVICE to stop a device so it can reconfigure the device’s hardware resources.

We will need these minor functions later in this series.

A list of IRP Minor Functions is available below:

IRP_MN_START_DEVICE
IRP_MN_QUERY_STOP_DEVICE
IRP_MN_STOP_DEVICE
IRP_MN_CANCEL_STOP_DEVICE
IRP_MN_QUERY_REMOVE_DEVICE
IRP_MN_REMOVE_DEVICE
IRP_MN_CANCEL_REMOVE_DEVICE
IRP_MN_SURPRISE_REMOVAL
IRP_MN_QUERY_CAPABILITIES
IRP_MN_QUERY_PNP_DEVICE_STATE
IRP_MN_FILTER_RESOURCE_REQUIREMENTS
IRP_MN_DEVICE_USAGE_NOTIFICATION
IRP_MN_QUERY_DEVICE_RELATIONS
IRP_MN_QUERY_RESOURCES
IRP_MN_QUERY_RESOURCE_REQUIREMENTS
IRP_MN_QUERY_ID
IRP_MN_QUERY_DEVICE_TEXT
IRP_MN_QUERY_BUS_INFORMATION
IRP_MN_QUERY_INTERFACE
IRP_MN_READ_CONFIG
IRP_MN_WRITE_CONFIG
IRP_MN_DEVICE_ENUMERATED
IRP_MN_SET_LOCK

Fast I/O

To optimize a VMM, you can use Fast I/O, which is a different way to initiate I/O operations and is faster than using IRPs. Fast I/O operations are always synchronous.

According to MSDN:

Fast I/O is specifically designed for rapid synchronous I/O on cached files. In fast I/O operations, data is transferred directly between user buffers and the system cache, bypassing the file system and the storage driver stack. (Storage drivers do not use fast I/O.) If all of the data to be read from a file is resident in the system cache when a fast I/O read or write request is received, the request is satisfied immediately. 

When the I/O Manager receives a request for synchronous file I/O (other than paging I/O), it invokes the fast I/O routine first. If the fast I/O routine returns TRUE, the operation was serviced by the fast I/O routine. If the fast I/O routine returns FALSE, the I/O Manager creates and sends an IRP instead.
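The fallback logic described above can be sketched as a tiny user-mode model (the names FastIoFn and ServiceRequest are hypothetical, purely for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the I/O manager's decision: try the fast I/O
 * routine first; if it returns FALSE (or doesn't exist), fall back to
 * creating and sending an IRP. */
typedef int (*FastIoFn)(void); /* nonzero means "request was serviced" */

static int FastPathHit(void)  { return 1; } /* e.g. data was in the cache */
static int FastPathMiss(void) { return 0; } /* fast path declined         */

static void ServiceRequest(FastIoFn fast_routine, int *irp_sent)
{
    if (fast_routine != NULL && fast_routine()) {
        *irp_sent = 0;  /* serviced entirely by the fast I/O routine */
    } else {
        *irp_sent = 1;  /* I/O manager creates and sends an IRP instead */
    }
}
```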

The definition of the Fast I/O dispatch table is:

typedef struct _FAST_IO_DISPATCH {
  ULONG                                  SizeOfFastIoDispatch;
  PFAST_IO_CHECK_IF_POSSIBLE             FastIoCheckIfPossible;
  PFAST_IO_READ                          FastIoRead;
  PFAST_IO_WRITE                         FastIoWrite;
  PFAST_IO_QUERY_BASIC_INFO              FastIoQueryBasicInfo;
  PFAST_IO_QUERY_STANDARD_INFO           FastIoQueryStandardInfo;
  PFAST_IO_LOCK                          FastIoLock;
  PFAST_IO_UNLOCK_SINGLE                 FastIoUnlockSingle;
  PFAST_IO_UNLOCK_ALL                    FastIoUnlockAll;
  PFAST_IO_UNLOCK_ALL_BY_KEY             FastIoUnlockAllByKey;
  PFAST_IO_DEVICE_CONTROL                FastIoDeviceControl;
  PFAST_IO_ACQUIRE_FILE                  AcquireFileForNtCreateSection;
  PFAST_IO_RELEASE_FILE                  ReleaseFileForNtCreateSection;
  PFAST_IO_DETACH_DEVICE                 FastIoDetachDevice;
  PFAST_IO_QUERY_NETWORK_OPEN_INFO       FastIoQueryNetworkOpenInfo;
  PFAST_IO_ACQUIRE_FOR_MOD_WRITE         AcquireForModWrite;
  PFAST_IO_MDL_READ                      MdlRead;
  PFAST_IO_MDL_READ_COMPLETE             MdlReadComplete;
  PFAST_IO_PREPARE_MDL_WRITE             PrepareMdlWrite;
  PFAST_IO_MDL_WRITE_COMPLETE            MdlWriteComplete;
  PFAST_IO_READ_COMPRESSED               FastIoReadCompressed;
  PFAST_IO_WRITE_COMPRESSED              FastIoWriteCompressed;
  PFAST_IO_MDL_READ_COMPLETE_COMPRESSED  MdlReadCompleteCompressed;
  PFAST_IO_MDL_WRITE_COMPLETE_COMPRESSED MdlWriteCompleteCompressed;
  PFAST_IO_QUERY_OPEN                    FastIoQueryOpen;
  PFAST_IO_RELEASE_FOR_MOD_WRITE         ReleaseForModWrite;
  PFAST_IO_ACQUIRE_FOR_CCFLUSH           AcquireForCcFlush;
  PFAST_IO_RELEASE_FOR_CCFLUSH           ReleaseForCcFlush;
} FAST_IO_DISPATCH, *PFAST_IO_DISPATCH;

Defined Headers

I created the following header (Source.h) for my driver.

#pragma once
#include <ntddk.h>
#include <wdf.h>
#include <wdm.h>

extern void inline Breakpoint(void);
extern void inline Enable_VMX_Operation(void);

NTSTATUS DriverEntry(PDRIVER_OBJECT pDriverObject, PUNICODE_STRING pRegistryPath);
VOID DrvUnload(PDRIVER_OBJECT DriverObject);
NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);
NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);
NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);
NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);
NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);
NTSTATUS DrvIOCTLDispatcher(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);

VOID PrintChars(_In_reads_(CountChars) PCHAR BufferAddress, _In_ size_t CountChars);
VOID PrintIrpInfo(PIRP Irp);

#pragma alloc_text(INIT, DriverEntry)
#pragma alloc_text(PAGE, DrvUnload)
#pragma alloc_text(PAGE, DrvCreate)
#pragma alloc_text(PAGE, DrvRead)
#pragma alloc_text(PAGE, DrvWrite)
#pragma alloc_text(PAGE, DrvClose)
#pragma alloc_text(PAGE, DrvUnsupported)
#pragma alloc_text(PAGE, DrvIOCTLDispatcher)

// IOCTL codes and their meanings
#define IOCTL_TEST 0x1 // In case of testing

Now just compile your driver.

Loading the Driver and Checking the Presence of the Device

In order to load our driver (MyHypervisorDriver), first download OSR Driver Loader, then run Sysinternals DbgView as administrator and make sure that DbgView captures kernel output (you can check by going to Capture -> Capture Kernel).

Enable Capturing Event

After that, open the OSR Driver Loader (go to OsrLoader -> kit -> WNET -> AMD64 -> FRE) and run OSRLOADER.exe (in an x64 environment). Now, if you built your driver, find the .sys file (in MyHypervisorDriver\x64\Debug\ there should be a file named “MyHypervisorDriver.sys”). In OSR Driver Loader, click Browse, select MyHypervisorDriver.sys, and click “Register Service”. After the message box confirms that your driver was registered successfully, click “Start Service”.

Please note that you should have the WDK installed for your Visual Studio in order to be able to build your project.

Load Driver in OSR Driver Loader

Now go back to DbgView; you should see that your driver loaded successfully and a message “[*] DriverEntry Called.” should appear.

If there is no problem, then you’re good to go; otherwise, if you have a problem with DbgView, check the next step.

Keep in mind that now that you’ve registered your driver, you can use Sysinternals WinObj to see whether “MyHypervisorDevice” is available or not.

WinObj

The Problem with DbgView

Unfortunately, for some unknown reason, I’m not able to see the result of DbgPrint(). If you can see the results, you can skip this step, but if you have a problem, perform the following steps:

As I mentioned in part 1:

In regedit, add a key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Debug Print Filter

Under that key, add a DWORD value named IHVDRIVER with a value of 0xFFFF.

Reboot the machine and you’ll be good to go.

It always works for me, and I’ve tested it on many computers, but my MacBook seems to have a problem.

To solve this problem, you need to find a Windows kernel global variable called nt!Kd_DEFAULT_Mask. This variable controls which messages show up in DbgView; it’s a bitmask whose exact layout I’m not aware of, so I just put 0xffffffff in it to simply make it show everything!

To do this, you need to set up local kernel debugging of Windows using WinDbg.

  1. Open a Command Prompt window as Administrator. Enter bcdedit /debug on
  2. If the computer is not already configured as the target of a debug transport, enter bcdedit /dbgsettings local
  3. Reboot the computer.

After that, open WinDbg with UAC administrator privileges, go to File > Kernel Debug > Local > press OK, and in your local WinDbg find nt!Kd_DEFAULT_Mask using the following command:

lkd> x nt!Kd_DEFAULT_Mask
fffff801`f5211808 nt!Kd_DEFAULT_Mask = <no type information>

Now change its value to 0xffffffff.

lkd> eb fffff801`f5211808 ff ff ff ff
kd_DEFAULT_Mask

After that, you should see the results, and now you’ll be good to go.

Remember, this is an essential step for the rest of the topic: if we can’t see any kernel details, we can’t debug.

DbgView

Detecting Hypervisor Support

Discovering support for VMX is the first thing you should do before enabling VT-x. This is covered in the Intel Software Developer’s Manual, Volume 3C, Section 23.6, DISCOVERING SUPPORT FOR VMX.

You can detect the presence of VMX using CPUID: if CPUID.1:ECX.VMX[bit 5] = 1, then VMX operation is supported.

First of all, we need to know whether we’re running on an Intel-based processor or not. This can be determined by executing the CPUID instruction and checking for the vendor string “GenuineIntel“.

The following function returns the vendor string from the CPUID instruction.

string GetCpuID()
{
    //Initialize used variables
    char SysType[13]; //Array consisting of 13 single bytes/characters
    string CpuID; //The string that will be used to add all the characters to

    //Starting coding in assembly language
    _asm
    {
        //Execute CPUID with EAX = 0 to get the CPU producer
        XOR EAX, EAX
        CPUID
        //MOV EBX to EAX and get the characters one by one by using shift out right bitwise operation.
        MOV EAX, EBX
        MOV SysType[0], al
        MOV SysType[1], ah
        SHR EAX, 16
        MOV SysType[2], al
        MOV SysType[3], ah
        //Get the second part the same way but these values are stored in EDX
        MOV EAX, EDX
        MOV SysType[4], al
        MOV SysType[5], ah
        SHR EAX, 16
        MOV SysType[6], al
        MOV SysType[7], ah
        //Get the third part
        MOV EAX, ECX
        MOV SysType[8], al
        MOV SysType[9], ah
        SHR EAX, 16
        MOV SysType[10], al
        MOV SysType[11], ah
        MOV SysType[12], 00
    }
    CpuID.assign(SysType, 12);
    return CpuID;
}
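As a sanity check of the byte order the assembly relies on (EBX, then EDX, then ECX, with the lowest byte of each register first), here is a small sketch using the well-known register values an Intel CPU returns for leaf 0. The helper name is mine, and the memcpy trick assumes a little-endian host like x86:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assemble the CPUID leaf-0 vendor string from EBX, EDX, ECX in that
 * order; each register holds 4 characters, lowest byte first on x86. */
static void VendorFromRegs(uint32_t ebx, uint32_t edx, uint32_t ecx,
                           char out[13])
{
    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
}
```

Feeding it EBX=0x756e6547, EDX=0x49656e69, ECX=0x6c65746e reconstructs “GenuineIntel”, which is exactly what the inline assembly above builds byte by byte.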

The last step is checking for the presence of VMX; you can do that with the following code:

bool VMX_Support_Detection()
{
    bool VMX = false;
    __asm
    {
        xor    eax, eax
        inc    eax
        cpuid
        bt     ecx, 0x5
        jc     VMXSupport
        VMXNotSupport:
        jmp    NopInstr
        VMXSupport:
        mov    VMX, 0x1
        NopInstr:
        nop
    }
    return VMX;
}

As you can see, it executes CPUID with EAX = 1, and if bit 5 of ECX is set, then VMX operation is supported. We can also do the same thing in the kernel driver.
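The bit test the assembly performs (bt ecx, 0x5) is equivalent to a pure C helper like the following (the name is mine, for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* CPUID.1:ECX.VMX is bit 5; a nonzero return means VMX is supported. */
static int VmxBitSet(uint32_t ecx)
{
    return (int)((ecx >> 5) & 1u);
}
```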

All in all, our main code should be something like this:

int main()
{
    string CpuID;
    CpuID = GetCpuID();
    cout << "[*] The CPU Vendor is : " << CpuID << endl;
    if (CpuID == "GenuineIntel")
    {
        cout << "[*] The Processor virtualization technology is VT-x. \n";
    }
    else
    {
        cout << "[*] This program is not designed to run in a non-VT-x environment !\n";
        return 1;
    }
    if (VMX_Support_Detection())
    {
        cout << "[*] VMX Operation is supported by your processor .\n";
    }
    else
    {
        cout << "[*] VMX Operation is not supported by your processor .\n";
        return 1;
    }
    _getch();
    return 0;
}

The final result:

User-mode app

Enabling VMX Operation

If our processor supports VMX operation, then it’s time to enable it. As I said above, IRP_MJ_CREATE is the first function that should be used to start the operation.

From the Intel Software Developer’s Manual (23.7 ENABLING AND ENTERING VMX OPERATION):

Before system software can enter VMX operation, it enables VMX by setting CR4.VMXE[bit 13] = 1. VMX operation is then entered by executing the VMXON instruction. VMXON causes an invalid-opcode exception (#UD) if executed with CR4.VMXE = 0. Once in VMX operation, it is not possible to clear CR4.VMXE. System software leaves VMX operation by executing the VMXOFF instruction. CR4.VMXE can be cleared outside of VMX operation after executing of VMXOFF.
VMXON is also controlled by the IA32_FEATURE_CONTROL MSR (MSR address 3AH). This MSR is cleared to zero when a logical processor is reset. The relevant bits of the MSR are:

  •  Bit 0 is the lock bit. If this bit is clear, VMXON causes a general-protection exception. If the lock bit is set, WRMSR to this MSR causes a general-protection exception; the MSR cannot be modified until a power-up reset condition. System BIOS can use this bit to provide a setup option for BIOS to disable support for VMX. To enable VMX support in a platform, BIOS must set bit 1, bit 2, or both, as well as the lock bit.
  •  Bit 1 enables VMXON in SMX operation. If this bit is clear, execution of VMXON in SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support both VMX operation and SMX operation cause general-protection exceptions.
  •  Bit 2 enables VMXON outside SMX operation. If this bit is clear, execution of VMXON outside SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support VMX operation cause general-protection exceptions.
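The three MSR bits quoted above combine into a simple predicate. Here is a pure-logic sketch (the bit masks follow the SDM text; the helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* IA32_FEATURE_CONTROL (MSR 3AH) bits, per the SDM excerpt above. */
#define FC_LOCK              (1ULL << 0) /* lock bit                   */
#define FC_VMXON_IN_SMX      (1ULL << 1) /* enables VMXON inside SMX   */
#define FC_VMXON_OUTSIDE_SMX (1ULL << 2) /* enables VMXON outside SMX  */

/* VMXON outside SMX faults (#GP) unless the lock bit and the
 * outside-SMX enable bit are both set; this helper mirrors that rule. */
static int VmxonOutsideSmxAllowed(uint64_t feature_control)
{
    return (feature_control & FC_LOCK) != 0 &&
           (feature_control & FC_VMXON_OUTSIDE_SMX) != 0;
}
```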

Setting CR4 VMXE Bit

Do you remember in the previous part where I showed you how to create inline assembly in the Windows Driver Kit for x64?

Now we create a function to perform this operation in assembly.

First, in the header file (in my case, Source.h), declare your function:

extern void inline Enable_VMX_Operation(void);

Then, in the assembly file (in my case, SourceAsm.asm), add this function (which sets bit 13 of CR4, the VMXE bit).

Enable_VMX_Operation PROC PUBLIC
    push rax            ; Save the state
    xor rax, rax        ; Clear the RAX
    mov rax, cr4
    or rax, 02000h      ; Set bit 13 (CR4.VMXE)
    mov cr4, rax
    pop rax             ; Restore the state
    ret
Enable_VMX_Operation ENDP
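A quick arithmetic check that the constant 02000h in the assembly above is exactly the VMXE bit (the C names are mine, purely a sketch of what the OR does to the CR4 value):

```c
#include <assert.h>
#include <stdint.h>

/* CR4.VMXE is bit 13, so the mask is 1 << 13 == 0x2000. */
#define CR4_VMXE (1ULL << 13)

/* Model of what the assembly does: OR the VMXE bit into CR4. */
static uint64_t SetVmxe(uint64_t cr4)
{
    return cr4 | CR4_VMXE;
}
```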

Also, declare your function at the top of SourceAsm.asm.

PUBLIC Enable_VMX_Operation

The above function should be called in DrvCreate:

NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    Enable_VMX_Operation(); // Enabling VMX Operation
    DbgPrint("[*] VMX Operation Enabled Successfully !");
    return STATUS_SUCCESS;
}

At last, you should call the following function from the user-mode:

HANDLE hWnd = CreateFile(L"\\\\.\\MyHypervisorDevice",
    GENERIC_READ | GENERIC_WRITE,
    FILE_SHARE_READ | FILE_SHARE_WRITE,
    NULL, /// lpSecurityAttributes
    OPEN_EXISTING,
    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED,
    NULL); /// lpTemplateFile

If you see the following result, then you completed the second part successfully.

Final Show

Important note: your .asm file should have a different name from your driver’s main file (the .c file). For example, if your driver file is “Source.c”, then naming your assembly file “Source.asm” causes weird linking errors in Visual Studio; change the name of your .asm file to something like “SourceAsm.asm” to avoid these kinds of linker errors.

Conclusion

In this part, you learned the basics you need to know in order to create a Windows Driver Kit program, and then we entered VMX operation, building a cornerstone for the rest of the parts.

In the third part, we’ll go deeper into Intel VT-x and make our driver even more advanced, so stay tuned; it’ll be ready soon!

The source code for this topic is available at:

[https://github.com/SinaKarvandi/Hypervisor-From-Scratch/]

References

[1] Intel® 64 and IA-32 Architectures Software Developer’s Manual, Combined Volume 3 (https://software.intel.com/en-us/articles/intel-sdm)

[2] IRP_MJ_DEVICE_CONTROL (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/irp-mj-device-control)

[3]  Windows Driver Kit Samples (https://github.com/Microsoft/Windows-driver-samples/blob/master/general/ioctl/wdm/sys/sioctl.c)

[4] Setting Up Local Kernel Debugging of a Single Computer Manually (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/setting-up-local-kernel-debugging-of-a-single-computer-manually)

[5] Obtain processor manufacturer using CPUID (https://www.daniweb.com/programming/software-development/threads/112968/obtain-processor-manufacturer-using-cpuid)

[6] Plug and Play Minor IRPs (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/plug-and-play-minor-irps)

[7] _FAST_IO_DISPATCH structure (https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/ns-wdm-_fast_io_dispatch)

[8] Filtering IRPs and Fast I/O (https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/filtering-irps-and-fast-i-o)

[9] Windows File System Filter Driver Development (https://www.apriorit.com/dev-blog/167-file-system-filter-driver)


GPU side channel attacks can enable spying on web activity, password stealing


Computer scientists at the University of California, Riverside have revealed for the first time how easily attackers can use a computer’s graphics processing unit, or GPU, to spy on web activity, steal passwords, and break into cloud-based applications.

GPU side channel attacks

Threat scenarios

Marlan and Rosemary Bourns College of Engineering computer science doctoral student Hoda Naghibijouybari and post-doctoral researcher Ajaya Neupane, along with Associate Professor Zhiyun Qian and Professor Nael Abu-Ghazaleh, reverse engineered a Nvidia GPU to demonstrate three attacks on both graphics and computational stacks, as well as across them.

All three attacks require the victim to first acquire a malicious program embedded in a downloaded app. The program is designed to spy on the victim’s computer.

Web browsers use GPUs to render graphics on desktops, laptops, and smart phones. GPUs are also used to accelerate applications on the cloud and data centers. Web graphics can expose user information and activity. Computational workloads enhanced by the GPU include applications with sensitive data or algorithms that might be exposed by the new attacks.

GPUs are usually programmed using application programming interfaces, or APIs, such as OpenGL. OpenGL is accessible by any application on a desktop with user-level privileges, making all attacks practical on a desktop. Since desktop or laptop machines by default come with the graphics libraries and drivers installed, the attack can be implemented easily using graphics APIs.

The first attack tracks user activity on the web. When the victim opens the malicious app, it uses OpenGL to create a spy to infer the behavior of the browser as it uses the GPU. Every website has a unique trace in terms of GPU memory utilization due to the different number of objects and different sizes of objects being rendered. This signal is consistent across loading the same website several times and is unaffected by caching.

The researchers monitored either GPU memory allocations over time or GPU performance counters and fed these features to a machine learning based classifier, achieving website fingerprinting with high accuracy. The spy can reliably obtain all allocation events to see what the user has been doing on the web.

In the second attack, the authors extracted user passwords. Each time the user types a character, the whole password textbox is uploaded to GPU as a texture to be rendered. Monitoring the interval time of consecutive memory allocation events leaked the number of password characters and inter-keystroke timing, well-established techniques for learning passwords.

The third attack targets a computational application in the cloud. The attacker launches a malicious computational workload on the GPU which operates alongside the victim’s application. Depending on neural network parameters, the intensity and pattern of contention on the cache, memory and functional units differ over time, creating measurable leakage. The attacker uses machine learning-based classification on performance counter traces to extract the victim’s secret neural network structure, such as number of neurons in a specific layer of a deep neural network.

The researchers reported their findings to Nvidia, who responded that they intend to publish a patch that offers system administrators the option to disable access to performance counters from user-level processes. They also shared a draft of the paper with the AMD and Intel security teams to enable them to evaluate their GPUs with respect to such vulnerabilities.

In the future the group plans to test the feasibility of GPU side channel attacks on Android phones.

Researchers Defeat AMD’s SEV Virtual Machine Encryption

Researchers defeat AMD’s Secure Encrypted Virtualization (SEV), demonstrating the #SEVered attack that could allow a malicious hypervisor to steal plain-text data from an encrypted virtual machine.

German security researchers claim to have found a new practical attack against virtual machines (VMs) protected using AMD’s Secure Encrypted Virtualization (SEV) technology that could allow attackers to recover plaintext memory data from guest VMs.

AMD’s Secure Encrypted Virtualization (SEV) technology, which comes with EPYC line of processors, is a hardware feature that encrypts the memory of each VM in a way that only the guest itself can access the data, protecting it from other VMs/containers and even from an untrusted hypervisor.

Discovered by researchers from the Fraunhofer Institute for Applied and Integrated Security in Munich, the page-fault side channel attack, dubbed SEVered, takes advantage of the lack of integrity protection in the page-wise encryption of the main memory, allowing a malicious hypervisor to extract the full content of the main memory in plaintext from SEV-encrypted VMs.

Here’s the outline of the SEVered attack, as briefed in the paper: SEVered: Subverting AMD’s Virtual Machine Encryption

"While the VM's Guest Virtual Address (GVA) to Guest Physical Address (GPA) translation is controlled by the VM itself and opaque to the HV, the HV remains responsible for the Second Level Address Translation (SLAT), meaning that it maintains the VM's GPA to Host Physical Address (HPA) mapping in main memory.

"This enables us to change the memory layout of the VM in the HV. We use this capability to trick a service in the VM, such as a web server, into returning arbitrary pages of the VM in plaintext upon the request of a resource from outside."

"We first identify the encrypted pages in memory corresponding to the resource, which the service returns as a response to a specific request. By repeatedly sending requests for the same resource to the service while re-mapping the identified memory pages, we extract all the VM's memory in plaintext."

During their tests, the team was able to extract a test server’s entire 2GB memory data, which also included data from another guest VM.

In their experimental setup, the researchers used a Linux-based system powered by an AMD EPYC 7251 processor with SEV enabled, running web services (the Apache and Nginx web servers) as well as an OpenSSH server, in separate VMs.

 

Reading Privileged Memory with a Side Channel

We have discovered that CPU data cache timing can be abused to efficiently leak information out of mis-speculated execution, leading to (at worst) arbitrary virtual memory read vulnerabilities across local security boundaries in various contexts.

 

Variants of this issue are known to affect many modern processors, including certain processors by Intel, AMD and ARM. For a few Intel and AMD CPU models, we have exploits that work against real software. We reported this issue to Intel, AMD and ARM on 2017-06-01 [1].

 

So far, there are three known variants of the issue:

 

  • Variant 1: bounds check bypass (CVE-2017-5753)
  • Variant 2: branch target injection (CVE-2017-5715)
  • Variant 3: rogue data cache load (CVE-2017-5754)

 

Before the issues described here were publicly disclosed, Daniel Gruss, Moritz Lipp, Yuval Yarom, Paul Kocher, Daniel Genkin, Michael Schwarz, Mike Hamburg, Stefan Mangard, Thomas Prescher and Werner Haas also reported them; their [writeups/blogposts/paper drafts] are at:

 

 

During the course of our research, we developed the following proofs of concept (PoCs):

 

  1. A PoC that demonstrates the basic principles behind variant 1 in userspace on the tested Intel Haswell Xeon CPU, the AMD FX CPU, the AMD PRO CPU and an ARM Cortex A57 [2]. This PoC only tests for the ability to read data inside mis-speculated execution within the same process, without crossing any privilege boundaries.
  2. A PoC for variant 1 that, when running with normal user privileges under a modern Linux kernel with a distro-standard config, can perform arbitrary reads in a 4GiB range [3] in kernel virtual memory on the Intel Haswell Xeon CPU. If the kernel’s BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time. [4]
  3. A PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific (now outdated) version of Debian’s distro kernel [5] running on the host, can read host kernel memory at a rate of around 1500 bytes/second, with room for optimization. Before the attack can be performed, some initialization has to be performed that takes roughly between 10 and 30 minutes for a machine with 64GiB of RAM; the needed time should scale roughly linearly with the amount of host RAM. (If 2MB hugepages are available to the guest, the initialization should be much faster, but that hasn’t been tested.)
  4. A PoC for variant 3 that, when running with normal user privileges, can read kernel memory on the Intel Haswell Xeon CPU under some precondition. We believe that this precondition is that the targeted kernel memory is present in the L1D cache.

 

For interesting resources around this topic, see the "Literature" section below.

 

A warning regarding explanations about processor internals in this blogpost: This blogpost contains a lot of speculation about hardware internals based on observed behavior, which might not necessarily correspond to what processors are actually doing.

 

We have some ideas on possible mitigations and provided some of those ideas to the processor vendors; however, we believe that the processor vendors are in a much better position than we are to design and evaluate mitigations, and we expect them to be the source of authoritative guidance.

 

The PoC code and the writeups that we sent to the CPU vendors are available here: https://bugs.chromium.org/p/project-zero/issues/detail?id=1272.

Tested Processors

  • Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (called "Intel Haswell Xeon CPU" in the rest of this document)
  • AMD FX(tm)-8320 Eight-Core Processor (called "AMD FX CPU" in the rest of this document)
  • AMD PRO A8-9600 R7, 10 COMPUTE CORES 4C+6G (called "AMD PRO CPU" in the rest of this document)
  • An ARM Cortex A57 core of a Google Nexus 5x phone [6] (called "ARM Cortex A57" in the rest of this document)

Glossary

retire: An instruction retires when its results, e.g. register writes and memory writes, are committed and made visible to the rest of the system. Instructions can be executed out of order, but must always retire in order.

 

logical processor core: A logical processor core is what the operating system sees as a processor core. With hyperthreading enabled, the number of logical cores is a multiple of the number of physical cores.

 

cached/uncached data: In this blogpost, "uncached" data is data that is only present in main memory, not in any of the cache levels of the CPU. Loading uncached data will typically take over 100 cycles of CPU time.

 

speculative execution: A processor can execute past a branch without knowing whether it will be taken or where its target is, therefore executing instructions before it is known whether they should be executed. If this speculation turns out to have been incorrect, the CPU can discard the resulting state without architectural effects and continue execution on the correct execution path. Instructions do not retire before it is known that they are on the correct execution path.

 

mis-speculation window: The time window during which the CPU speculatively executes the wrong code and has not yet detected that mis-speculation has occurred.

Variant 1: Bounds check bypass

This section explains the common theory behind all three variants and the theory behind our PoC for variant 1 that, when running in userspace under a Debian distro kernel, can perform arbitrary reads in a 4GiB region of kernel memory in at least the following configurations:

 

  • Intel Haswell Xeon CPU, eBPF JIT is off (default state)
  • Intel Haswell Xeon CPU, eBPF JIT is on (non-default state)
  • AMD PRO CPU, eBPF JIT is on (non-default state)

 

The state of the eBPF JIT can be toggled using the net.core.bpf_jit_enable sysctl.

Theoretical explanation

The Intel Optimization Reference Manual says the following regarding Sandy Bridge (and later microarchitectural revisions) in section 2.3.2.3 («Branch Prediction»):

 

Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch
true execution path is known.

 

In section 2.3.5.2 («L1 DCache»):

 

Loads can:
[…]
  • Be carried out speculatively, before preceding branches are resolved.
  • Take cache misses out of order and in an overlapped manner.

 

Intel’s Software Developer’s Manual [7] states in Volume 3A, section 11.7 ("Implicit Caching (Pentium 4, Intel Xeon, and P6 family processors)"):

 

Implicit caching occurs when a memory element is made potentially cacheable, although the element may never have been accessed in the normal von Neumann sequence. Implicit caching occurs on the P6 and more recent processor families due to aggressive prefetching, branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of existing Intel386, Intel486, and Pentium processor systems, since software running on these processor families also has not been able to deterministically predict the behavior of instruction prefetch.

 

Consider the code sample below. If arr1->length is uncached, the processor can speculatively load data from arr1->data[untrusted_offset_from_caller]. This is an out-of-bounds read. That should not matter because the processor will effectively roll back the execution state when the branch has executed; none of the speculatively executed instructions will retire (e.g. cause registers etc. to be affected).

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …;
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 …
}
However, in the following code sample, there’s an issue. If arr1->length, arr2->data[0x200] and arr2->data[0x300] are not cached, but all other accessed data is, and the branch conditions are predicted as true, the processor can do the following speculatively before arr1->length has been loaded and the execution is re-steered:

 

  • load value = arr1->data[untrusted_offset_from_caller]
  • start a load from a data-dependent offset in arr2->data, loading the corresponding cache line into the L1 cache

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …; /* small array */
struct array *arr2 = …; /* array of size 0x400 */
/* >0x400 (OUT OF BOUNDS!) */
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 unsigned long index2 = ((value&1)*0x100)+0x200;
 if (index2 < arr2->length) {
   unsigned char value2 = arr2->data[index2];
 }
}

 

After the execution has been returned to the non-speculative path because the processor has noticed that untrusted_offset_from_caller is bigger than arr1->length, the cache line containing arr2->data[index2] stays in the L1 cache. By measuring the time required to load arr2->data[0x200] and arr2->data[0x300], an attacker can then determine whether the value of index2 during speculative execution was 0x200 or 0x300 — which discloses whether arr1->data[untrusted_offset_from_caller]&1 is 0 or 1.
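In code, the final timing step reduces to a simple comparison. The sketch below is illustrative (the function names and cycle values are ours, not from the PoC): whichever probe load completes quickly hit a cached line, which directly yields the leaked bit, and repeating the gadget with different bit masks yields whole bytes.

```c
#include <stdint.h>

/* Decide the leaked bit from the two probe-load timings.
 * If arr2->data[0x300] was cached, index2 was 0x300, i.e. the leaked
 * bit (value & 1) was 1; if arr2->data[0x200] was cached, it was 0.
 * Cached loads finish far below ~100 cycles, uncached loads far
 * above, so a direct comparison of the two timings is enough. */
static int leaked_bit(uint64_t cycles_0x200, uint64_t cycles_0x300) {
    return cycles_0x300 < cycles_0x200;
}

/* Assemble a byte from eight runs of the gadget, one per bit
 * position (bit selection happens via the mask inside the gadget). */
static uint8_t assemble_byte(const uint64_t t200[8], const uint64_t t300[8]) {
    uint8_t value = 0;
    for (int bit = 0; bit < 8; bit++)
        value |= (uint8_t)leaked_bit(t200[bit], t300[bit]) << bit;
    return value;
}
```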

 

To be able to actually use this behavior for an attack, an attacker needs to be able to cause the execution of such a vulnerable code pattern in the targeted context with an out-of-bounds index. For this, the vulnerable code pattern must either be present in existing code, or there must be an interpreter or JIT engine that can be used to generate the vulnerable code pattern. So far, we have not actually identified any existing, exploitable instances of the vulnerable code pattern; the PoC for leaking kernel memory using variant 1 uses the eBPF interpreter or the eBPF JIT engine, which are built into the kernel and accessible to normal users.

 

A minor variant of this could be to instead use an out-of-bounds read to a function pointer to gain control of execution in the mis-speculated path. We did not investigate this variant further.

Attacking the kernel

This section describes in more detail how variant 1 can be used to leak Linux kernel memory using the eBPF bytecode interpreter and JIT engine. While there are many interesting potential targets for variant 1 attacks, we chose to attack the Linux in-kernel eBPF JIT/interpreter because it provides more control to the attacker than most other JITs.

 

The Linux kernel has supported eBPF since version 3.18. Unprivileged userspace code can supply bytecode to the kernel that is verified by the kernel and then:

 

  • either interpreted by an in-kernel bytecode interpreter
  • or translated to native machine code that also runs in kernel context using a JIT engine (which translates individual bytecode instructions without performing any further optimizations)

 

Execution of the bytecode can be triggered by attaching the eBPF bytecode to a socket as a filter and then sending data through the other end of the socket.

 

Whether the JIT engine is enabled depends on a run-time configuration setting — but at least on the tested Intel processor, the attack works independent of that setting.

 

Unlike classic BPF, eBPF has data types like data arrays and function pointer arrays into which eBPF bytecode can index. Therefore, it is possible to create the code pattern described above in the kernel using eBPF bytecode.

 

eBPF’s data arrays are less efficient than its function pointer arrays, so the attack will use the latter where possible.

 

Both machines on which this was tested have no SMAP, and the PoC relies on that (but it shouldn’t be a precondition in principle).

 

Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array’s length to be slow on other physical CPU cores (intentional false sharing).
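The layout assumption behind this intentional false sharing can be stated in a few lines of C. The struct below is a hypothetical stand-in for the kernel's array header, not the real definition; what matters is only that the two fields land in the same 64-byte cache line.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative header layout: any two fields within the same 64-byte
 * cache line are subject to false sharing under MESI — a write to
 * one field invalidates the line holding the other on every other
 * core, making subsequent reads of either field slow there. */
struct ebpf_array_header {
    uint32_t length;   /* read by the attacking cores        */
    uint32_t refcount; /* written by the "bouncing" core     */
};

#define CACHE_LINE 64

/* 1 if two field offsets can share a cache line (assuming a
 * line-aligned allocation). */
static int same_cache_line(size_t off_a, size_t off_b) {
    return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```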

 

The attack uses two eBPF programs. The first one tail-calls through a page-aligned eBPF function pointer array prog_map at a configurable index. In simplified terms, this program is used to determine the address of prog_map by guessing the offset from prog_map to a userspace address and tail-calling through prog_map at the guessed offsets. To cause the branch prediction to predict that the offset is below the length of prog_map, tail calls to an in-bounds index are performed in between. To increase the mis-speculation window, the cache line containing the length of prog_map is bounced to another core. To test whether an offset guess was successful, it can be tested whether the userspace address has been loaded into the cache.

 

Because such straightforward brute-force guessing of the address would be slow, the following optimization is used: 2^15 adjacent userspace memory mappings [9], each consisting of 2^4 pages, are created at the userspace address user_mapping_area, covering a total area of 2^31 bytes. Each mapping maps the same physical pages, and all mappings are present in the pagetables.

 

 

 

This permits the attack to be carried out in steps of 2^31 bytes. For each step, after causing an out-of-bounds access through prog_map, only one cache line each from the first 2^4 pages of user_mapping_area has to be tested for cached memory. Because the L3 cache is physically indexed, any access to a virtual address mapping a physical page will cause all other virtual addresses mapping the same physical page to become cached as well.

 

When this attack finds a hit—a cached memory location—the upper 33 bits of the kernel address are known (because they can be derived from the address guess at which the hit occurred), and the low 16 bits of the address are also known (from the offset inside user_mapping_area at which the hit was found). The remaining part of the address of user_mapping_area is the middle.

 

 

 

The remaining bits in the middle can be determined by bisecting the remaining address space: Map two physical pages to adjacent ranges of virtual addresses, each virtual address range the size of half of the remaining search space, then determine the remaining address bit-wise.
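Assembling the address from the three known pieces is plain bit arithmetic. In the sketch below (names hypothetical), in_range() stands in for the map-and-time test of each candidate half described above, and the 15 unknown middle bits (bits 16..30) are recovered one halving at a time:

```c
#include <stdint.h>

/* Simulated target address; in the real attack this is what the
 * cache-timing oracle implicitly encodes. */
static uint64_t secret_addr;

/* Stand-in oracle: "does the true address lie in [lo, hi)?" — in the
 * attack, answered by mapping the candidate half and testing whether
 * it becomes cached. */
static int in_range(uint64_t lo, uint64_t hi) {
    return secret_addr >= lo && secret_addr < hi;
}

/* known = the address with the upper 33 bits (from the 2^31-granular
 * guess that hit) and low 16 bits (offset inside user_mapping_area)
 * filled in, and bits 16..30 cleared. Bisect those 15 bits. */
static uint64_t bisect_middle(uint64_t known) {
    uint64_t lo = 0, hi = 1u << 15;
    while (hi - lo > 1) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (in_range(known | (lo << 16), known | (mid << 16)))
            hi = mid;   /* target is in the lower half */
        else
            lo = mid;   /* target is in the upper half */
    }
    return known | (lo << 16);
}
```

Fifteen probes suffice, one per unknown bit.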

 

At this point, a second eBPF program can be used to actually leak data. In pseudocode, this program looks as follows:

 

uint64_t bitmask = <runtime-configurable>;
uint64_t bitshift_selector = <runtime-configurable>;
uint64_t prog_array_base_offset = <runtime-configurable>;
uint64_t secret_data_offset = <runtime-configurable>;
// index will be bounds-checked by the runtime,
// but the bounds check will be bypassed speculatively
uint64_t secret_data = bpf_map_read(array=victim_map, index=secret_data_offset);
// select a single bit, move it to a specific position, and add the base offset
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) + prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);

 

This program reads 8-byte-aligned 64-bit values from an eBPF data array «victim_map» at a runtime-configurable offset and bitmasks and bit-shifts the value so that one bit is mapped to one of two values that are 2^7 bytes apart (sufficient to not land in the same or adjacent cache lines when used as an array index). Finally it adds a 64-bit offset, then uses the resulting value as an offset into prog_map for a tail call.
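The offset arithmetic of this program can be checked in isolation. The sketch below mirrors the pseudocode's computation and shows that the two possible tail-call indices are exactly 2^7 = 128 apart:

```c
#include <stdint.h>

/* Offset computation from the second eBPF program: select one bit of
 * the secret via bitmask/bitshift_selector, move it to bit 7, and add
 * the runtime-configurable base. The two possible results differ by
 * 2^7 = 128, far enough apart that the resulting prog_map accesses do
 * not land in the same or adjacent cache lines. */
static uint64_t progmap_index(uint64_t secret_data, uint64_t bitmask,
                              uint64_t bitshift_selector,
                              uint64_t prog_array_base_offset) {
    return (((secret_data & bitmask) >> bitshift_selector) << 7)
           + prog_array_base_offset;
}
```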

 

This program can then be used to leak memory by repeatedly calling the eBPF program with an out-of-bounds offset into victim_map that specifies the data to leak and an out-of-bounds offset into prog_map that causes prog_map + offset to point to a userspace memory area. Misleading the branch prediction and bouncing the cache lines works the same way as for the first eBPF program, except that now, the cache line holding the length of victim_map must also be bounced to another core.

Variant 2: Branch target injection

This section describes the theory behind our PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific version of Debian’s distro kernel running on the host, can read host kernel memory at a rate of around 1500 bytes/second.

Basics

Prior research (see the Literature section at the end) has shown that it is possible for code in separate security contexts to influence each other’s branch prediction. So far, this has only been used to infer information about where code is located (in other words, to create interference from the victim to the attacker); however, the basic hypothesis of this attack variant is that it can also be used to redirect execution of code in the victim context (in other words, to create interference from the attacker to the victim; the other way around).

 

 

 

The basic idea for the attack is to target victim code that contains an indirect branch whose target address is loaded from memory and flush the cache line containing the target address out to main memory. Then, when the CPU reaches the indirect branch, it won’t know the true destination of the jump, and it won’t be able to calculate the true destination until it has finished loading the cache line back into the CPU, which takes a few hundred cycles. Therefore, there is a time window of typically over 100 cycles in which the CPU will speculatively execute instructions based on branch prediction.

Haswell branch prediction internals

Some of the internals of the branch prediction implemented by Intel’s processors have already been published; however, getting this attack to work properly required significant further experimentation to determine additional details.

 

This section focuses on the branch prediction internals that were experimentally derived from the Intel Haswell Xeon CPU.

 

Haswell seems to have multiple branch prediction mechanisms that work very differently:

 

  • A generic branch predictor that can only store one target per source address; used for all kinds of jumps, like absolute jumps, relative jumps and so on.
  • A specialized indirect call predictor that can store multiple targets per source address; used for indirect calls.
  • (There is also a specialized return predictor, according to Intel’s optimization manual, but we haven’t analyzed that in detail yet. If this predictor could be used to reliably dump out some of the call stack through which a VM was entered, that would be very interesting.)

Generic predictor

The generic branch predictor, as documented in prior research, only uses the lower 31 bits of the address of the last byte of the source instruction for its prediction. If, for example, a branch target buffer (BTB) entry exists for a jump from 0x4141.0004.1000 to 0x4141.0004.5123, the generic predictor will also use it to predict a jump from 0x4242.0004.1000. When the higher bits of the source address differ like this, the higher bits of the predicted destination change together with it—in this case, the predicted destination address will be 0x4242.0004.5123—so apparently this predictor doesn’t store the full, absolute destination address.

 

Before the lower 31 bits of the source address are used to look up a BTB entry, they are folded together using XOR. Specifically, the following bits are folded together:

 

bit A          bit B
0x40.0000      0x2000
0x80.0000      0x4000
0x100.0000     0x8000
0x200.0000     0x1.0000
0x400.0000     0x2.0000
0x800.0000     0x4.0000
0x2000.0000    0x10.0000
0x4000.0000    0x20.0000

 

In other words, if a source address is XORed with both numbers in a row of this table, the branch predictor will not be able to distinguish the resulting address from the original source address when performing a lookup. For example, the branch predictor is able to distinguish source addresses 0x100.0000 and 0x180.0000, and it can also distinguish source addresses 0x100.0000 and 0x180.8000, but it can’t distinguish source addresses 0x100.0000 and 0x140.2000 or source addresses 0x100.0000 and 0x180.4000. In the following, this will be referred to as aliased source addresses.
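A folding function consistent with the table and the examples above can be written down; the exact lookup function is not known, so folding each bit A into its paired bit B is just one possibility that reproduces the observed aliasing:

```c
#include <stdint.h>

/* Bit pairs from the table above: each pair is XOR-folded together
 * before the BTB lookup. */
static const uint32_t fold_pairs[][2] = {
    {0x400000,   0x2000},
    {0x800000,   0x4000},
    {0x1000000,  0x8000},
    {0x2000000,  0x10000},
    {0x4000000,  0x20000},
    {0x8000000,  0x40000},
    {0x20000000, 0x100000},
    {0x40000000, 0x200000},
};

/* Fold a source address into a BTB lookup key. Two source addresses
 * alias for the generic predictor iff their keys are equal. */
static uint32_t btb_key(uint64_t src) {
    uint32_t key = (uint32_t)(src & 0x7fffffff); /* lower 31 bits only */
    for (unsigned i = 0; i < sizeof(fold_pairs) / sizeof(fold_pairs[0]); i++) {
        if (key & fold_pairs[i][0])
            key ^= fold_pairs[i][1]; /* fold bit A into bit B */
        key &= ~fold_pairs[i][0];
    }
    return key;
}
```

This reproduces both the truncation example (0x4141.0004.1000 and 0x4242.0004.1000 share a key) and the aliasing examples from the paragraph above.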

 

When an aliased source address is used, the branch predictor will still predict the same target as for the unaliased source address. This indicates that the branch predictor stores a truncated absolute destination address, but that hasn’t been verified.

 

Based on observed maximum forward and backward jump distances for different source addresses, the low 32-bit half of the target address could be stored as an absolute 32-bit value with an additional bit that specifies whether the jump from source to target crosses a 2^32 boundary; if the jump crosses such a boundary, bit 31 of the source address determines whether the high half of the instruction pointer should increment or decrement.
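One concrete reading of this hypothesis (entirely speculative, like the hypothesis itself) can be written down as a reconstruction function:

```c
#include <stdint.h>

/* Speculation-based sketch, not verified behavior: the BTB stores
 * the low 32 bits of the destination plus one bit saying whether the
 * jump crosses a 2^32 boundary. On a crossing jump, bit 31 of the
 * source picks the direction: a source in the upper half of its 2^32
 * region crosses forwards (+1), one in the lower half backwards (-1). */
static uint64_t predicted_target(uint64_t src, uint32_t stored_lo32,
                                 int crosses_boundary) {
    uint64_t hi = src >> 32;
    if (crosses_boundary)
        hi += (src & 0x80000000ULL) ? 1 : (uint64_t)-1;
    return (hi << 32) | stored_lo32;
}
```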

Indirect call predictor

The inputs of the BTB lookup for this mechanism seem to be:

 

  • The low 12 bits of the address of the source instruction (we are not sure whether it’s the address of the first or the last byte) or a subset of them.
  • The branch history buffer state.

 

If the indirect call predictor can’t resolve a branch, it is resolved by the generic predictor instead. Intel’s optimization manual hints at this behavior: "Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior."

 

The branch history buffer (BHB) stores information about the last 29 taken branches — basically a fingerprint of recent control flow — and is used to allow better prediction of indirect calls that can have multiple targets.

 

The update function of the BHB works as follows (in pseudocode; src is the address of the last byte of the source instruction, dst is the destination address):

 

void bhb_update(uint58_t *bhb_state, unsigned long src, unsigned long dst) {
 *bhb_state <<= 2;
 *bhb_state ^= (dst & 0x3f);
 *bhb_state ^= (src & 0xc0) >> 6;
 *bhb_state ^= (src & 0xc00) >> (10 - 2);
 *bhb_state ^= (src & 0xc000) >> (14 - 4);
 *bhb_state ^= (src & 0x30) << (6 - 4);
 *bhb_state ^= (src & 0x300) << (8 - 8);
 *bhb_state ^= (src & 0x3000) >> (12 - 10);
 *bhb_state ^= (src & 0x30000) >> (16 - 12);
 *bhb_state ^= (src & 0xc0000) >> (18 - 14);
}

 

Some of the bits of the BHB state seem to be folded together further using XOR when used for a BTB access, but the precise folding function hasn’t been understood yet.

 

The BHB is interesting for two reasons. First, knowledge about its approximate behavior is required in order to be able to accurately cause collisions in the indirect call predictor. But it also permits dumping out the BHB state at any repeatable program state at which the attacker can execute code — for example, when attacking a hypervisor, directly after a hypercall. The dumped BHB state can then be used to fingerprint the hypervisor or, if the attacker has access to the hypervisor binary, to determine the low 20 bits of the hypervisor load address (in the case of KVM: the low 20 bits of the load address of kvm-intel.ko).

Reverse-Engineering Branch Predictor Internals

This subsection describes how we reverse-engineered the internals of the Haswell branch predictor. Some of this is written down from memory, since we didn’t keep a detailed record of what we were doing.

 

We initially attempted to perform BTB injections into the kernel using the generic predictor, using the knowledge from prior research that the generic predictor only looks at the lower half of the source address and that only a partial target address is stored. This kind of worked — however, the injection success rate was very low, below 1%. (This is the method we used in our preliminary PoCs for method 2 against modified hypervisors running on Haswell.)

 

We decided to write a userspace test case to be able to more easily test branch predictor behavior in different situations.

 

Based on the assumption that branch predictor state is shared between hyperthreads [10], we wrote a program of which two instances are each pinned to one of the two logical processors running on a specific physical core, where one instance attempts to perform branch injections while the other measures how often branch injections are successful. Both instances were executed with ASLR disabled and had the same code at the same addresses. The injecting process performed indirect calls to a function that accesses a (per-process) test variable; the measuring process performed indirect calls to a function that tests, based on timing, whether the per-process test variable is cached, and then evicts it using CLFLUSH. Both indirect calls were performed through the same callsite. Before each indirect call, the function pointer stored in memory was flushed out to main memory using CLFLUSH to widen the speculation time window. Additionally, because of the reference to «recent program behavior» in Intel’s optimization manual, a bunch of conditional branches that are always taken were inserted in front of the indirect call.

 

In this test, the injection success rate was above 99%, giving us a base setup for future experiments.

 

 

 

We then tried to figure out the details of the prediction scheme. We assumed that the prediction scheme uses a global branch history buffer of some kind.

 

To determine the duration for which branch information stays in the history buffer, a conditional branch that is only taken in one of the two program instances was inserted in front of the series of always-taken conditional jumps, then the number of always-taken conditional jumps (N) was varied. The result was that for N=25, the processor was able to distinguish the branches (misprediction rate under 1%), but for N=26, it failed to do so (misprediction rate over 99%).
Therefore, the branch history buffer had to be able to store information about at least the last 26 branches.

 

The code in one of the two program instances was then moved around in memory. This revealed that only the lower 20 bits of the source and target addresses have an influence on the branch history buffer.
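This observation can be cross-checked against the bhb_update pseudocode from the "Indirect call predictor" section: compiled as plain C (with the 58-bit state modelled as a masked uint64_t), flipping source bit 19 changes the resulting state while flipping bit 20 does not.

```c
#include <stdint.h>

#define BHB_MASK ((1ULL << 58) - 1) /* "uint58_t" as a masked uint64_t */

/* Compiled version of the bhb_update pseudocode from the
 * "Indirect call predictor" section. */
static void bhb_update(uint64_t *bhb_state, unsigned long src,
                       unsigned long dst) {
    *bhb_state <<= 2;
    *bhb_state ^= (dst & 0x3f);
    *bhb_state ^= (src & 0xc0) >> 6;
    *bhb_state ^= (src & 0xc00) >> (10 - 2);
    *bhb_state ^= (src & 0xc000) >> (14 - 4);
    *bhb_state ^= (src & 0x30) << (6 - 4);
    *bhb_state ^= (src & 0x300) << (8 - 8);
    *bhb_state ^= (src & 0x3000) >> (12 - 10);
    *bhb_state ^= (src & 0x30000) >> (16 - 12);
    *bhb_state ^= (src & 0xc0000) >> (18 - 14);
    *bhb_state &= BHB_MASK;
}

/* State after a single branch from src to dst, starting from zero. */
static uint64_t bhb_after(unsigned long src, unsigned long dst) {
    uint64_t state = 0;
    bhb_update(&state, src, dst);
    return state;
}
```

The highest source mask in the function is 0xc0000 (bits 18-19), and only the low 6 bits of the destination enter at all, consistent with the sub-20-bit dependence measured here.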

 

Testing with different types of branches in the two program instances revealed that static jumps, taken conditional jumps, calls and returns influence the branch history buffer the same way; non-taken conditional jumps don’t influence it; the address of the last byte of the source instruction is the one that counts; IRETQ doesn’t influence the history buffer state (which is useful for testing because it permits creating program flow that is invisible to the history buffer).

 

Moving the last conditional branch before the indirect call around in memory multiple times revealed that the branch history buffer contents can be used to distinguish many different locations of that last conditional branch instruction. This suggests that the history buffer doesn’t store a list of small history values; instead, it seems to be a larger buffer in which history data is mixed together.

 

However, a history buffer needs to «forget» about past branches after a certain number of new branches have been taken in order to be useful for branch prediction. Therefore, when new data is mixed into the history buffer, this can not cause information in bits that are already present in the history buffer to propagate downwards — and given that, upwards combination of information probably wouldn’t be very useful either. Given that branch prediction also must be very fast, we concluded that it is likely that the update function of the history buffer left-shifts the old history buffer, then XORs in the new state (see diagram).

 

 

 

If this assumption is correct, then the history buffer contains a lot of information about the most recent branches, but only contains as many bits of information as are shifted per history buffer update about the last branch about which it contains any data. Therefore, we tested whether flipping different bits in the source and target addresses of a jump followed by 32 always-taken jumps with static source and target allows the branch prediction to disambiguate an indirect call. [11]

 

With 32 static jumps in between, no bit flips seemed to have an influence, so we decreased the number of static jumps until a difference was observable. The result with 28 always-taken jumps in between was that bits 0x1 and 0x2 of the target and bits 0x40 and 0x80 of the source had such an influence; but flipping both 0x1 in the target and 0x40 in the source or 0x2 in the target and 0x80 in the source did not permit disambiguation. This shows that the per-insertion shift of the history buffer is 2 bits and shows which data is stored in the least significant bits of the history buffer. We then repeated this with decreased amounts of fixed jumps after the bit-flipped jump to determine which information is stored in the remaining bits.

Reading host memory from a KVM guest

Locating the host kernel

Our PoC locates the host kernel in several steps. The information that is determined and necessary for the next steps of the attack consists of:

 

  • lower 20 bits of the address of kvm-intel.ko
  • full address of kvm.ko
  • full address of vmlinux

 

Looking back, this is unnecessarily complicated, but it nicely demonstrates the various techniques an attacker can use. A simpler way would be to first determine the address of vmlinux, then bisect the addresses of kvm.ko and kvm-intel.ko.

 

In the first step, the address of kvm-intel.ko is leaked. For this purpose, the branch history buffer state after guest entry is dumped out. Then, for every possible value of bits 12..19 of the load address of kvm-intel.ko, the expected lowest 16 bits of the history buffer are computed based on the load address guess and the known offsets of the last 8 branches before guest entry, and the results are compared against the lowest 16 bits of the leaked history buffer state.

 

The branch history buffer state is leaked in steps of 2 bits by measuring misprediction rates of an indirect call with two targets. One way the indirect call is reached is from a vmcall instruction followed by a series of N branches whose relevant source and target address bits are all zeroes. The second way the indirect call is reached is from a series of controlled branches in userspace that can be used to write arbitrary values into the branch history buffer.
Misprediction rates are measured as in the section "Reverse-Engineering Branch Predictor Internals", using one call target that loads a cache line and another one that checks whether the same cache line has been loaded.

 

 

 

With N=29, mispredictions will occur at a high rate if the controlled branch history buffer value is zero because all history buffer state from the hypercall has been erased. With N=28, mispredictions will occur if the controlled branch history buffer value is one of 0<<(28*2), 1<<(28*2), 2<<(28*2), 3<<(28*2) — by testing all four possibilities, it can be detected which one is right. Then, for decreasing values of N, the four possibilities are {0|1|2|3}<<(28*2) | (history_buffer_for(N+1) >> 2). By repeating this for decreasing values for N, the branch history buffer value for N=0 can be determined.
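Under a simplified model of this procedure (each of the N padding branches merely shifts the 58-bit state left by two, since its relevant source and target bits are all zeroes), the recovery loop can be sketched as follows; collides() is a stand-in for the misprediction-rate measurement:

```c
#include <stdint.h>

#define BHB_MASK ((1ULL << 58) - 1)

/* Simulation: the state observed after the hypercall branches plus N
 * zeroed padding branches is the original state shifted left by 2*N. */
static uint64_t secret_bhb; /* the state to be leaked */

static int collides(uint64_t controlled_value, int n) {
    return controlled_value == ((secret_bhb << (2 * n)) & BHB_MASK);
}

/* Recover the full history buffer state 2 bits at a time, exactly as
 * described above: at each step the four candidates are
 * {0,1,2,3} << (28*2), OR'd with the previous recovery shifted right. */
static uint64_t recover_bhb(void) {
    uint64_t prev = 0; /* N=29: state fully erased */
    for (int n = 28; n >= 0; n--) {
        for (uint64_t c = 0; c < 4; c++) {
            uint64_t cand = (c << (28 * 2)) | (prev >> 2);
            if (collides(cand, n)) { prev = cand; break; }
        }
    }
    return prev; /* equals the history buffer value for N=0 */
}
```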

 

At this point, the low 20 bits of kvm-intel.ko are known; the next step is to roughly locate kvm.ko.
For this, the generic branch predictor is used, using data inserted into the BTB by an indirect call from kvm.ko to kvm-intel.ko that happens on every hypercall; this means that the source address of the indirect call has to be leaked out of the BTB.

 

kvm.ko will probably be located somewhere in the range from 0xffffffffc0000000 to 0xffffffffc4000000, with page alignment (0x1000). This means that the first four entries in the table in the section "Generic Predictor" apply; there will be 2^4-1=15 aliasing addresses for the correct one. But that is also an advantage: It cuts down the search space from 0x4000 to 0x4000/2^4=1024.

 

To find the right address for the source or one of its aliasing addresses, code that loads data through a specific register is placed at all possible call targets (the leaked low 20 bits of kvm-intel.ko plus the in-module offset of the call target plus a multiple of 2^20) and indirect calls are placed at all possible call sources. Then, alternatingly, hypercalls are performed and indirect calls are performed through the different possible non-aliasing call sources, with randomized history buffer state that prevents the specialized prediction from working. After this step, there are 2^16 remaining possibilities for the load address of kvm.ko.

 

Next, the load address of vmlinux can be determined in a similar way, using an indirect call from vmlinux to kvm.ko. Luckily, none of the bits which are randomized in the load address of vmlinux are folded together, so unlike when locating kvm.ko, the result will directly be unique. vmlinux has an alignment of 2MiB and a randomization range of 1GiB, so there are still only 512 possible addresses.
Because (as far as we know) a simple hypercall won’t actually cause indirect calls from vmlinux to kvm.ko, we instead use port I/O from the status register of an emulated serial port, which is present in the default configuration of a virtual machine created with virt-manager.

 

The only remaining piece of information is which one of the 16 aliasing load addresses of kvm.ko is actually correct. Because the source address of an indirect call to kvm.ko is known, this can be solved using bisection: Place code at the various possible targets that, depending on which instance of the code is speculatively executed, loads one of two cache lines, and measure which one of the cache lines gets loaded.

Identifying cache sets

The PoC assumes that the VM does not have access to hugepages. To discover eviction sets for all L3 cache sets with a specific alignment relative to a 4KiB page boundary, the PoC first allocates 25600 pages of memory. Then, in a loop, it selects random subsets of all remaining unsorted pages such that the expected number of sets for which an eviction set is contained in the subset is 1, reduces each subset down to an eviction set by repeatedly accessing its cache lines and testing whether the cache lines are always cached (in which case they're probably not part of an eviction set), and attempts to use the new eviction set to evict all remaining unsorted cache lines to determine whether they are in the same cache set [12].
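The subset-reduction step can be sketched as a simulation. The associativity and set count below are illustrative, and the real PoC decides "still evicts" via cache timing, which is modeled here by a hidden set label on each page:

```python
import random

random.seed(0)
ASSOC = 12        # assumed L3 associativity (illustrative)
NUM_SETS = 64     # scaled down from a real L3 for speed

# Each "page" has a hidden cache set.  The attacker cannot see it; they can
# only test whether a candidate group of pages still evicts a probe line.
pages = [random.randrange(NUM_SETS) for _ in range(2000)]
target = pages[0]

def evicts(group):
    # Stand-in for the timing test: a group evicts the probe line iff it
    # contains at least ASSOC pages that map to the same set as the probe.
    return sum(1 for p in group if pages[p] == target) >= ASSOC

# Reduce a known-evicting subset: drop each page whose removal
# does not stop the eviction.
group = list(range(len(pages)))
assert evicts(group)
for p in list(group):
    trial = [x for x in group if x != p]
    if evicts(trial):
        group = trial

print(len(group))   # exactly ASSOC pages remain, all in the target set
```

The reduced group is a minimal eviction set for the probe's cache set; the PoC then reuses it to classify the remaining unsorted pages.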

Locating the host-virtual address of a guest page

Because this attack uses a FLUSH+RELOAD approach for leaking data, it needs to know the host-kernel-virtual address of one guest page. Alternative approaches such as PRIME+PROBE should work without that requirement.

 

The basic idea for this step of the attack is to use a branch target injection attack against the hypervisor to load an attacker-controlled address and test whether that caused the guest-owned page to be loaded. For this, a gadget that simply loads from the memory location specified by R8 can be used — R8-R11 still contain guest-controlled values when the first indirect call after a guest exit is reached on this kernel build.

 

We expected that an attacker would need to either know which eviction set has to be used at this point or brute-force it simultaneously; however, experimentally, using random eviction sets works, too. Our theory is that the observed behavior is actually the result of L1D and L2 evictions, which might be sufficient to permit a few instructions worth of speculative execution.

 

The host kernel maps (nearly?) all physical memory in the physmap area, including memory assigned to KVM guests. However, the location of the physmap is randomized (with a 1GiB alignment), in an area of size 128PiB. Therefore, directly bruteforcing the host-virtual address of a guest page would take a long time. It is not necessarily impossible; as a ballpark estimate, it should be possible within a day or so, maybe less, assuming 12000 successful injections per second and 30 guest pages that are tested in parallel; but not as impressive as doing it in a few minutes.
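As a sanity check on the size of that search space, 128 PiB with 1 GiB alignment gives 2^27 candidate physmap base addresses:

```python
PiB = 2**50
GiB = 2**30
candidates = (128 * PiB) // GiB   # 1 GiB-aligned bases in a 128 PiB region
print(candidates)                 # 134217728, i.e. 2**27
```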

 

To optimize this, the problem can be split up: First, brute-force the physical address using a gadget that can load from physical addresses, then brute-force the base address of the physmap region. Because the physical address can usually be assumed to be far below 128PiB, it can be brute-forced more efficiently, and brute-forcing the base address of the physmap region afterwards is also easier because then address guesses with 1GiB alignment can be used.

 

To brute-force the physical address, the following gadget can be used:

 

ffffffff810a9def:       4c 89 c0                mov    rax,r8
ffffffff810a9df2:       4d 63 f9                movsxd r15,r9d
ffffffff810a9df5:       4e 8b 04 fd c0 b3 a6    mov    r8,QWORD PTR [r15*8-0x7e594c40]
ffffffff810a9dfc:       81
ffffffff810a9dfd:       4a 8d 3c 00             lea    rdi,[rax+r8*1]
ffffffff810a9e01:       4d 8b a4 00 f8 00 00    mov    r12,QWORD PTR [r8+rax*1+0xf8]
ffffffff810a9e08:       00

 

This gadget permits loading an 8-byte-aligned value from the area around the kernel text section by setting R9 appropriately, which in particular permits loading page_offset_base, the start address of the physmap. Then, the value that was originally in R8 (the physical address guess minus 0xf8) is added to the result of the previous load, 0xf8 is added to it, and the result is dereferenced.
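A small Python model of the gadget's effective-address computation shows why guessing "physical address minus 0xf8" in R8 makes the gadget dereference physmap_base + phys. The physmap base, table slot, and physical-address guess below are made-up illustrative values:

```python
MASK = 0xffffffffffffffff

def gadget(r8, r9, mem):
    rax = r8                                       # mov    rax, r8
    r15 = r9 & 0xffffffff                          # movsxd r15, r9d (non-negative r9d assumed)
    r8 = mem[(r15 * 8 - 0x7e594c40) & MASK]        # mov    r8, [r15*8-0x7e594c40]
    return (r8 + rax + 0xf8) & MASK                # address of mov r12, [r8+rax+0xf8]

physmap_base = 0xffff888000000000   # hypothetical page_offset_base value
slot = 0x1234                       # hypothetical R9 value selecting page_offset_base
mem = {(slot * 8 - 0x7e594c40) & MASK: physmap_base}

phys_guess = 0x3c0de000             # physical-address guess
addr = gadget(phys_guess - 0xf8, slot, mem)
print(hex(addr))                    # physmap_base + phys_guess
```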

Cache set selection

To select the correct L3 eviction set, the attack from the following section is essentially executed with different eviction sets until it works.

Leaking data

At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel — while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel’s text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets.

 

The eBPF interpreter entry point has the following function signature:

 

static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)

 

The second parameter is a pointer to an array of statically pre-verified eBPF instructions to be executed — which means that __bpf_prog_run() will not perform any type checks or bounds checks. The first parameter is simply stored as part of the initial emulated register state, so its value doesn’t matter.

 

The eBPF interpreter provides, among other things:

 

  • multiple emulated 64-bit registers
  • 64-bit immediate writes to emulated registers
  • memory reads from addresses stored in emulated registers
  • bitwise operations (including bit shifts) and arithmetic operations
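To make "statically pre-verified eBPF instructions" concrete, here is a minimal sketch that packs instructions in the kernel's real bpf_insn encoding (one opcode byte, dst/src register nibbles, a 16-bit offset, a 32-bit immediate). The target address is a placeholder, and this only builds the byte array; it does not interact with any kernel:

```python
import struct

def insn(opcode, dst=0, src=0, off=0, imm=0):
    # struct bpf_insn: u8 opcode; u8 dst_reg:4, src_reg:4; s16 off; s32 imm
    return struct.pack("<BBhI", opcode, (src << 4) | dst, off, imm & 0xffffffff)

def lddw(dst, imm64):             # BPF_LD | BPF_IMM | BPF_DW: occupies two slots
    return insn(0x18, dst=dst, imm=imm64 & 0xffffffff) + \
           insn(0x00, imm=(imm64 >> 32) & 0xffffffff)

def ldx64(dst, src, off):         # BPF_LDX | BPF_MEM | BPF_DW
    return insn(0x79, dst=dst, src=src, off=off)

def exit_():                      # BPF_EXIT
    return insn(0x95)

# "Read 8 bytes from an arbitrary address" as unverified bytecode
# (the address is purely illustrative):
prog = lddw(1, 0xffff888012345678) + ldx64(0, 1, 0) + exit_()
print(len(prog))   # 4 slots of 8 bytes each
```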

 

To call the interpreter entry point, a gadget that gives RSI and RIP control given R8-R11 control and controlled data at a known memory location is necessary. The following gadget provides this functionality:

 

ffffffff81514edd:       4c 89 ce                mov    rsi,r9
ffffffff81514ee0:       41 ff 90 b0 00 00 00    call   QWORD PTR [r8+0xb0]

 

Now, by pointing R8 and R9 at the mapping of a guest-owned page in the physmap, it is possible to speculatively execute arbitrary unvalidated eBPF bytecode in the host kernel. Then, relatively straightforward bytecode can be used to leak data into the cache.

Variant 3: Rogue data cache load

 

In summary, an attack using this variant of the issue attempts to read kernel memory from userspace without misdirecting the control flow of kernel code. This works by using the code pattern that was used for the previous variants, but in userspace. The underlying idea is that the permission check for accessing an address might not be on the critical path for reading data from memory to a register, where the permission check could have significant performance impact. Instead, the memory read could make the result of the read available to following instructions immediately and only perform the permission check asynchronously, setting a flag in the reorder buffer that causes an exception to be raised if the permission check fails.

 

We do have a few additions to make to Anders Fogh’s blogpost:

 

«Imagine the following instruction executed in usermode
mov rax,[somekernelmodeaddress]
It will cause an interrupt when retired, […]»

 

It is also possible to already execute that instruction behind a high-latency mispredicted branch to avoid taking a page fault. This might also widen the speculation window by increasing the delay between the read from a kernel address and delivery of the associated exception.

 

«First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1.»

 

When we used prefetch instructions after doing a syscall, the attack stopped working for us, and we have no clue why. Perhaps the CPU somehow stores whether access was denied on the last access and prevents the attack from working if that is the case?

 

«Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed.»

 

That (read from kernel address returns all-zeroes) seems to happen for memory that is not sufficiently cached but for which pagetable entries are present, at least after repeated read attempts. For unmapped memory, the kernel address read does not return a result at all.

Ideas for further research

We believe that our research provides many remaining research topics that we have not yet investigated, and we encourage other public researchers to look into these.
This section contains an even higher amount of speculation than the rest of this blogpost — it contains untested ideas that might well be useless.

Leaking without data cache timing

It would be interesting to explore whether there are microarchitectural attacks other than measuring data cache timing that can be used for exfiltrating data out of speculative execution.

Other microarchitectures

Our research was relatively Haswell-centric so far. It would be interesting to see details e.g. on how the branch prediction of other modern processors works and how well it can be attacked.

Other JIT engines

We developed a successful variant 1 attack against the JIT engine built into the Linux kernel. It would be interesting to see whether attacks against more advanced JIT engines with less control over the system are also practical — in particular, JavaScript engines.

More efficient scanning for host-virtual addresses and cache sets

In variant 2, while scanning for the host-virtual address of a guest-owned page, it might make sense to attempt to determine its L3 cache set first. This could be done by performing L3 evictions using an eviction pattern through the physmap, then testing whether the eviction affected the guest-owned page.

 

The same might work for cache sets — use an L1D+L2 eviction set to evict the function pointer in the host kernel context, use a gadget in the kernel to evict an L3 set using physical addresses, then use that to identify which cache sets guest lines belong to until a guest-owned eviction set has been constructed.

Dumping the complete BTB state

Given that the generic BTB seems to only be able to distinguish 2^(31-8) or fewer source addresses, it seems feasible to dump out the complete BTB state generated by e.g. a hypercall in a timeframe around the order of a few hours. (Scan for jump sources, then for every discovered jump source, bisect the jump target.) This could potentially be used to identify the locations of functions in the host kernel even if the host kernel is custom-built.

 

The source address aliasing would reduce the usefulness somewhat, but because target addresses don’t suffer from that, it might be possible to correlate (source,target) pairs from machines with different KASLR offsets and reduce the number of candidate addresses based on KASLR being additive while aliasing is bitwise.

 

This could then potentially allow an attacker to make guesses about the host kernel version or the compiler used to build it based on jump offsets or distances between functions.

Variant 2: Leaking with more efficient gadgets

If sufficiently efficient gadgets are used for variant 2, it might not be necessary to evict host kernel function pointers from the L3 cache at all; it might be sufficient to only evict them from L1D and L2.

Various speedups

In particular the variant 2 PoC is still a bit slow. This is probably partly because:

 

  • It only leaks one bit at a time; leaking more bits at a time should be doable.
  • It heavily uses IRETQ for hiding control flow from the processor.

 

It would be interesting to see what data leak rate can be achieved using variant 2.

Leaking or injection through the return predictor

If the return predictor also doesn’t lose its state on a privilege level change, it might be useful for either locating the host kernel from inside a VM (in which case bisection could be used to very quickly discover the full address of the host kernel) or injecting return targets (in particular if the return address is stored in a cache line that can be flushed out by the attacker and isn’t reloaded before the return instruction).

 

However, we have not performed any experiments with the return predictor that yielded conclusive results so far.

Leaking data out of the indirect call predictor

We have attempted to leak target information out of the indirect call predictor, but haven’t been able to make it work.

Vendor statements

The following statements were provided to us regarding this issue by the vendors to whom Project Zero disclosed this vulnerability:

Intel

Intel is committed to improving the overall security of computer systems. The methods described here rely on common properties of modern microprocessors. Thus, susceptibility to these methods is not limited to Intel processors, nor does it mean that a processor is working outside its intended functional specification. Intel is working closely with our ecosystem partners, as well as with other silicon vendors whose processors are affected, to design and distribute both software and hardware mitigations for these methods.

For more information and links to useful resources, visit:

https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00088&languageid=en-fr
http://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf

AMD

ARM

Arm recognises that the speculation functionality of many modern high-performance processors, despite working as intended, can be used in conjunction with the timing of cache operations to leak some information as described in this blog. Correspondingly, Arm has developed software mitigations that we recommend be deployed.

 

Specific details regarding the affected processors and mitigations can be found at this website: https://developer.arm.com/support/security-update

 

Arm has included a detailed technical whitepaper as well as links to information from some of Arm’s architecture partners regarding their specific implementations and mitigations.

Literature

Note that some of these documents — in particular Intel’s documentation — change over time, so quotes from and references to it may not reflect the latest version of Intel’s documentation.

 

  • https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf: Intel’s optimization manual has many interesting pieces of optimization advice that hint at relevant microarchitectural behavior; for example:
    • «Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch predication [sic] hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.»
    • «Loads can:[…]Be carried out speculatively, before preceding branches are resolved.»
    • «Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.»
    • «if mapped as WB or WT, there is a potential for speculative processor reads to bring the data into the caches»
    • «Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch).»
  • https://software.intel.com/en-us/articles/intel-sdm: Intel’s Software Developer Manuals
  • http://www.agner.org/optimize/microarchitecture.pdf: Agner Fog’s documentation of reverse-engineered processor behavior and relevant theory was very helpful for this research.
  • http://www.cs.binghamton.edu/~dima/micro16.pdf and https://github.com/felixwilhelm/mario_baslr: Prior research by Dmitry Evtyushkin, Dmitry Ponomarev and Nael Abu-Ghazaleh on abusing branch target buffer behavior to leak addresses that we used as a starting point for analyzing the branch prediction of Haswell processors. Felix Wilhelm’s research based on this provided the basic idea behind variant 2.
  • https://arxiv.org/pdf/1507.06955.pdf: The rowhammer.js research by Daniel Gruss, Clémentine Maurice and Stefan Mangard contains information about L3 cache eviction patterns that we reused in the KVM PoC to evict a function pointer.
  • https://xania.org/201602/bpu-part-one: Matt Godbolt blogged about reverse-engineering the structure of the branch predictor on Intel processors.
  • https://www.sophia.re/thesis.pdf: Sophia D’Antoine wrote a thesis that shows that opcode scheduling can theoretically be used to transmit data between hyperthreads.
  • https://gruss.cc/files/kaiser.pdf: Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard wrote a paper on mitigating microarchitectural issues caused by pagetable sharing between userspace and the kernel.
  • https://www.jilp.org/: This journal contains many articles on branch prediction.
  • http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/: This blogpost by Henry Wong investigates the L3 cache replacement policy used by Intel’s Ivy Bridge architecture.

References

[1] This initial report did not contain any information about variant 3. We had discussed whether direct reads from kernel memory could work, but thought that it was unlikely. We later tested and reported variant 3 prior to the publication of Anders Fogh’s work at https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/.
[2] The precise model names are listed in the section «Tested Processors». The code for reproducing this is in the writeup_files.tar archive in our bugtracker, in the folders userland_test_x86 and userland_test_aarch64.
[3] The attacker-controlled offset used to perform an out-of-bounds access on an array by this PoC is a 32-bit value, limiting the accessible addresses to a 4GiB window in the kernel heap area.
[4] This PoC won’t work on CPUs with SMAP support; however, that is not a fundamental limitation.
[5] linux-image-4.9.0-3-amd64 at version 4.9.30-2+deb9u2 (available at http://snapshot.debian.org/archive/debian/20170701T224614Z/pool/main/l/linux/linux-image-4.9.0-3-amd64_4.9.30-2%2Bdeb9u2_amd64.deb, sha256 5f950b26aa7746d75ecb8508cc7dab19b3381c9451ee044cd2edfd6f5efff1f8, signed via Release.gpg, Release, Packages.xz); that was the current distro kernel version when I set up the machine. It is very unlikely that the PoC works with other kernel versions without changes; it contains a number of hardcoded addresses/offsets.
[6] The phone was running an Android build from May 2017.
[9] More than 2^15 mappings would be more efficient, but the kernel places a hard cap of 2^16 on the number of VMAs that a process can have.
[10] Intel’s optimization manual states that «In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor», so it would be plausible for predictor state to be shared. While predictor state could be tagged by logical core, that would likely reduce performance for multithreaded processes, so it doesn’t seem likely.
[11] In case the history buffer was a bit bigger than we had measured, we added some margin — in particular because we had seen slightly different history buffer lengths in different experiments, and because 26 isn’t a very round number.
[12] The basic idea comes from http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf, section IV, although the authors of that paper still used hugepages.

AMD Gaming Evolved exploiting

Background

Anyone running an AMD GPU from a few years back has probably come across a piece of software from Raptr, Inc. installed on their computer. If you don't remember installing it, that's because for several years it was installed silently alongside the AMD drivers. The software was marketed to the gaming community and labeled AMD Gaming Evolved. While I haven't ever actually used the software, I've gathered that it allowed you to tweak your GPU as well as record your gameplay using another application called playstv.

I personally discovered the software while performing a routine check of what software running on my PC was listening for inbound connections. I try to make it a point to at least give a minimal amount of attention to any software I find accepting connections from outside of my PC. However, when I originally discovered this, my free time was scarce so I just made a note of it and uninstalled the software. The following screenshot shows the plays_service.exe binary listening on all interfaces on what appears to be an ephemeral port.

Fast forward two years: I update my AMD drivers and notice that plays_service.exe has shown up on my computer again. This time I decide to give it a little more attention.

Reversing – Windows Service

Opening up plays_service.exe in IDA, we see the usual boilerplate service code and trace it down to the main entry point. From here we almost immediately recognize that this application is Python-based and has been packaged with something like py2exe. While decompiling Python bytecode is rather trivial, the trick with these types of executables is identifying and locating the Python classes. Python bytecode in a py2exe-packaged binary is typically embedded in the executable or loaded from some relative path on disk. At this point, I usually open up the strings subview in IDA to see if anything obvious jumps out.

I see at least a few interesting string references that are worth investigating. Several of them look like they may have something to do with the initialization of Python. The first string I track down is “Unable to create Python obj for executable name!”. At first glance it appears to be an error message if certain Python objects aren't created properly. Scrolling up in the function it references, I see the following code.

This function appears to be the python setup routine. Returning to my list of strings, I see several references to zip.

%s%cpython%d%d.zip
zipimport
cannot import zipimport module
zipimporter

I decided to search through the install directory and see if there were any zip files present. Success: only one zip file exists, and it is named python35.zip! Its filename also matches the format string of one of the string references above. I unzip the file and peruse its contents. The zip file contains thousands of compiled bytecode Python files, which I presume to be the application's core source code and library dependencies.

Reversing – Compiled Python

Looking through the compiled python files, I see three that may be the service’s source code.

I decompiled each of the files using uncompyle6 and opened them up in a text editor. The largest of the three, plays_service.pyc, turned out to be the main service source. The service is a basic HTTP server made up of a few simple classes. It binds to an ephemeral port on startup and writes the port to the registry to be used by the greater application. The POST request handler code is listed below.

The handler expects a JSON-formatted POST request with a couple of parameters. The first is the data parameter, which holds the command to be processed. The second is a hash value of the data provided and a secret key. Lucky for us, the secret key just so happens to be hard-coded in the class definition. If the computed hash matches the one provided, the handler calls one of two defined command functions, “extract_files” or “execute_installer”. From here I began to look at the “execute_installer” function because the name sounded quite promising.
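As a hedged sketch of what a client request to such a handler might look like: the hash construction, key, and command string below are placeholders, not the service's actual values, which must be read from the decompiled source.

```python
import hashlib
import json
import urllib.request

# Placeholder: the real key is hard-coded in the decompiled class definition.
SECRET_KEY = "HARDCODED_KEY_FROM_DECOMPILED_SOURCE"

def sign(data: str) -> str:
    # Assumed construction for illustration: hash of data concatenated with the key.
    return hashlib.sha256((data + SECRET_KEY).encode()).hexdigest()

def build_request(port: int, data: str) -> urllib.request.Request:
    # The service reads its ephemeral port from the registry at startup;
    # here the port is simply passed in.
    body = json.dumps({"data": data, "hash": sign(data)}).encode()
    return urllib.request.Request(
        f"http://127.0.0.1:{port}/", data=body,
        headers={"Content-Type": "application/json"})

req = build_request(12345, "COMMAND_PLACEHOLDER")
print(req.get_method())   # POST
```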

The function logic is pretty straightforward. It performs a couple of insignificant checks, resolves two paths passed as parameters to the POST request, and then calls CreateProcess. The most important detail of note is that while it looks like a fully controlled command injection is possible, the calls to win32api.GetShortPathName throw an exception if the parameter passed does not resolve to a file. This limits the exploitation of this vulnerability significantly but still allows for privilege escalation to SYSTEM and remote compromise using anonymous outbound SMB.

Exploit

Exploiting this “feature” for file execution didn’t take a significant amount of work. The only real requirements were properly setting up the POST request and hashing the right portion of data. A proof of concept for achieving file execution with this vulnerability (CVE-2018-6546) can be found here.

AVX (Advanced Vector Extensions) is an extension of the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides various improvements: new instructions and a new machine-code encoding scheme.

Improvements

  • A new VEX instruction-encoding scheme.
  • The width of the SIMD vector registers grows from 128 bits (XMM) to 256 bits (registers YMM0-YMM15). Existing 128-bit SSE instructions use the lower half of the new YMM registers without modifying the upper half. New 256-bit AVX instructions are added for working with the YMM registers. The SIMD vector registers may be widened to 512 or 1024 bits in the future; for example, Xeon Phi processors already had 512-bit vector registers (ZMM) in 2012 and use SIMD instructions with MVEX and VEX prefixes to work with them, but they do not support AVX.
  • Non-destructive operations. The AVX instruction set uses a three-operand syntax: for example, instead of a = a + b one can write c = a + b, leaving register a unchanged. When the value of a is used in later computations, this improves performance, because it removes the need to save the register holding a before the computation and restore it afterwards from another register or from memory.
  • Most of the new instructions have no alignment requirements for memory operands. However, aligning operands to the operand size is recommended to avoid significant performance penalties.
  • The AVX instruction set includes 128-bit counterparts of the SSE floating-point instructions. Unlike the originals, they zero the upper half of the YMM register when storing a 128-bit result. The 128-bit AVX instructions retain the other advantages of AVX, such as the new encoding scheme, the three-operand syntax, and unaligned memory access.
  • Intel recommends abandoning the old SSE instructions in favor of the new 128-bit AVX instructions, even when two operands suffice.

New encoding scheme

The new VEX encoding scheme uses a VEX prefix. Two VEX prefixes currently exist, 2 and 3 bytes long. The first byte of the 2-byte VEX prefix is 0xC5; for the 3-byte prefix it is 0xC4.

In 64-bit mode the first byte of a VEX prefix is unambiguous. In 32-bit mode it collides with the LES and LDS instructions; the collision is resolved through otherwise-invalid forms of LES and LDS, using the high bits of the second byte, which would encode a register-only mod field that LES and LDS do not support.

Existing AVX instructions, including the VEX prefix, are at most 11 bytes long. Longer instructions are expected in future versions.
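The disambiguation rule can be sketched as a small classifier. The exact bit positions are defined in Intel's SDM; this simplifies the 32-bit-mode check to "the top two bits of the byte after 0xC4/0xC5 must be 11b", which corresponds to a register-operand mod field that LES/LDS do not allow:

```python
def vex_prefix_length(code: bytes, long_mode: bool) -> int:
    """Return the VEX prefix length (2 or 3) starting at code[0], or 0 if
    the bytes do not start a VEX prefix (e.g. LES/LDS in 32-bit mode)."""
    if code[0] not in (0xC4, 0xC5):
        return 0
    # In 64-bit mode 0xC4/0xC5 always start a VEX prefix.  In 32-bit mode
    # they collide with LES/LDS, so VEX additionally requires the top two
    # bits of the next byte to be 11b (an invalid mod field for LES/LDS).
    if not long_mode and (code[1] >> 6) != 0b11:
        return 0
    return 3 if code[0] == 0xC4 else 2

print(vex_prefix_length(b"\xc5\xf8\x28", long_mode=True))   # 2 (2-byte VEX)
print(vex_prefix_length(b"\xc4\x02\x00", long_mode=False))  # 0 (LES in 32-bit mode)
```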

New instructions

VBROADCASTSS, VBROADCASTSD, VBROADCASTF128: Broadcasts a 32-, 64- or 128-bit memory operand into all elements of an XMM or YMM vector register.
VINSERTF128: Replaces the lower or upper half of a 256-bit YMM register with a 128-bit operand; the rest of the destination register is unchanged.
VEXTRACTF128: Extracts the lower or upper half of a 256-bit YMM register into a 128-bit destination operand.
VMASKMOVPS, VMASKMOVPD: Conditionally reads any number of elements from a vector memory operand into a destination register, leaving the remaining elements unread and zeroing the corresponding elements of the destination register. Can also conditionally write any number of elements from a vector register to a vector memory operand, leaving the remaining elements of the memory operand unchanged.
VPERMILPS, VPERMILPD: Permutes the 32-bit or 64-bit elements of a vector according to a selector operand (in memory or in a register).
VPERM2F128: Shuffles the four 128-bit elements of two 256-bit registers into a 256-bit destination operand, using an immediate constant (imm) as the selector.
VZEROALL: Zeroes all YMM registers and marks them as unused. Used when switching between 128-bit and 256-bit modes.
VZEROUPPER: Zeroes the upper halves of all YMM registers. Used when switching between 128-bit and 256-bit modes.

The AVX specification also describes the PCLMUL group of instructions (Parallel Carry-Less Multiplication, Parallel CLMUL):

  • PCLMULLQLQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 00]
  • PCLMULHQLQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 01]
  • PCLMULLQHQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 02]
  • PCLMULHQHQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 03]
  • PCLMULQDQ xmmreg, xmmrm, imm [rmi: 66 0f 3a 44 /r ib]

Applications

AVX is well suited to floating-point-intensive computation in multimedia and scientific applications. Where a higher degree of data parallelism is available, it increases floating-point performance.

Instructions and examples

__m256i _mm256_abs_epi16 (__m256i a)

Synopsis

__m256i _mm256_abs_epi16 (__m256i a)
#include "immintrin.h"
Instruction: vpabsw ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := ABS(a[i+15:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
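The operation above can be modeled in a few lines of Python, including the edge case that the absolute value of -32768 stays 0x8000 since the result is stored as an unsigned 16-bit value. This is a sketch of the semantics, not the intrinsic itself:

```python
import struct

def mm256_abs_epi16(a: bytes) -> bytes:
    """Pure-Python model of vpabsw ymm: per-lane absolute value of
    sixteen packed 16-bit integers in a 32-byte vector."""
    lanes = struct.unpack("<16h", a)
    # abs(-32768) == 32768 == 0x8000, matching the unsigned result lane.
    return struct.pack("<16H", *[abs(x) & 0xffff for x in lanes])

src = struct.pack("<16h", *([-5, 7, -32768, 0] * 4))
dst = struct.unpack("<16H", mm256_abs_epi16(src))
print(dst[:4])   # (5, 7, 32768, 0)
```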
vpabsd
__m256i _mm256_abs_epi32 (__m256i a)

Synopsis

__m256i _mm256_abs_epi32 (__m256i a)
#include "immintrin.h"
Instruction: vpabsd ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ABS(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpabsb
__m256i _mm256_abs_epi8 (__m256i a)

Synopsis

__m256i _mm256_abs_epi8 (__m256i a)
#include "immintrin.h"
Instruction: vpabsb ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := ABS(a[i+7:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddw
__m256i _mm256_add_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi16 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpaddw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 16-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := a[i+15:i] + b[i+15:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddd
__m256i _mm256_add_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi32 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpaddd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 32-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] + b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddq
__m256i _mm256_add_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi64 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpaddq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 64-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] + b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddb
__m256i _mm256_add_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 8-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[i+7:i] + b[i+7:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vaddpd
__m256d _mm256_add_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_add_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vaddpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] + b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vaddps
__m256 _mm256_add_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_add_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vaddps ymm, ymm, ymm
CPUID Flags: AVX

Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] + b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpaddsw
__m256i _mm256_adds_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] ) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddsb
__m256i _mm256_adds_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] ) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddusw
__m256i _mm256_adds_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddusw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed unsigned 16-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_UnsignedInt16( a[i+15:i] + b[i+15:i] ) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddusb
__m256i _mm256_adds_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddusb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed unsigned 8-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_UnsignedInt8( a[i+7:i] + b[i+7:i] ) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vaddsubpd
__m256d _mm256_addsub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_addsub_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vaddsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Alternatively add and subtract packed double-precision (64-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF (j is even) dst[i+63:i] := a[i+63:i] - b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] + b[i+63:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vaddsubps
__m256 _mm256_addsub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_addsub_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vaddsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Alternatively add and subtract packed single-precision (32-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF (j is even) dst[i+31:i] := a[i+31:i] - b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] + b[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpalignr
__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int count)

Synopsis

__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int count)
#include <immintrin.h>
Instruction: vpalignr ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Concatenate pairs of 16-byte blocks in a and b into a 32-byte temporary result, shift the result right by count bytes, and store the low 16 bytes in dst.

Operation

FOR j := 0 to 1 i := j*128 tmp[255:0] := ((a[i+127:i] << 128) OR b[i+127:i]) >> (count[7:0]*8) dst[i+127:i] := tmp[127:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vandpd
__m256d _mm256_and_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_and_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vandpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := (a[i+63:i] AND b[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vandps
__m256 _mm256_and_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_and_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vandps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := (a[i+31:i] AND b[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpand
__m256i _mm256_and_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_and_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpand ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] AND b[255:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vandnpd
__m256d _mm256_andnot_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_andnot_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vandnpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise NOT of packed double-precision (64-bit) floating-point elements in a and then AND with b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ((NOT a[i+63:i]) AND b[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vandnps
__m256 _mm256_andnot_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_andnot_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vandnps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise NOT of packed single-precision (32-bit) floating-point elements in a and then AND with b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ((NOT a[i+31:i]) AND b[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpandn
__m256i _mm256_andnot_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_andnot_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpandn ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise NOT of 256 bits (representing integer data) in a and then AND with b, and store the result in dst.

Operation

dst[255:0] := ((NOT a[255:0]) AND b[255:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpavgw
__m256i _mm256_avg_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_avg_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpavgw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Average packed unsigned 16-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := (a[i+15:i] + b[i+15:i] + 1) >> 1 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpavgb
__m256i _mm256_avg_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_avg_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpavgb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Average packed unsigned 8-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := (a[i+7:i] + b[i+7:i] + 1) >> 1 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpblendw
__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)
#include <immintrin.h>
Instruction: vpblendw ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Blend packed 16-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[j%8] dst[i+15:i] := b[i+15:i] ELSE dst[i+15:i] := a[i+15:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpblendd
__m128i _mm_blend_epi32 (__m128i a, __m128i b, const int imm8)

Synopsis

__m128i _mm_blend_epi32 (__m128i a, __m128i b, const int imm8)
#include <immintrin.h>
Instruction: vpblendd xmm, xmm, xmm, imm
CPUID Flags: AVX2

Description

Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vpblendd
__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)
#include <immintrin.h>
Instruction: vpblendd ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vblendpd
__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)
#include <immintrin.h>
Instruction: vblendpd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Blend packed double-precision (64-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF imm8[j%8] dst[i+63:i] := b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
Ivy Bridge 1 0.5
Sandy Bridge 1 0.5
vblendps
__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vblendps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Blend packed single-precision (32-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
Ivy Bridge 1 0.5
Sandy Bridge 1 0.5
vpblendvb
__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)

Synopsis

__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)
#include <immintrin.h>
Instruction: vpblendvb ymm, ymm, ymm, ymm
CPUID Flags: AVX2

Description

Blend packed 8-bit integers from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 IF mask[i+7] dst[i+7:i] := b[i+7:i] ELSE dst[i+7:i] := a[i+7:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vblendvpd
__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)

Synopsis

__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)
#include <immintrin.h>
Instruction: vblendvpd ymm, ymm, ymm, ymm
CPUID Flags: AVX

Description

Blend packed double-precision (64-bit) floating-point elements from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
Ivy Bridge 2 1
Sandy Bridge 2 1
vblendvps
__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)

Synopsis

__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)
#include <immintrin.h>
Instruction: vblendvps ymm, ymm, ymm, ymm
CPUID Flags: AVX

Description

Blend packed single-precision (32-bit) floating-point elements from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
Ivy Bridge 2 1
Sandy Bridge 2 1
vbroadcastf128
__m256d _mm256_broadcast_pd (__m128d const * mem_addr)

Synopsis

__m256d _mm256_broadcast_pd (__m128d const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastf128 ymm, m128
CPUID Flags: AVX

Description

Broadcast 128 bits from memory (composed of 2 packed double-precision (64-bit) floating-point elements) to all elements of dst.

Operation

tmp[127:0] = MEM[mem_addr+127:mem_addr] dst[127:0] := tmp[127:0] dst[255:128] := tmp[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastf128
__m256 _mm256_broadcast_ps (__m128 const * mem_addr)

Synopsis

__m256 _mm256_broadcast_ps (__m128 const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastf128 ymm, m128
CPUID Flags: AVX

Description

Broadcast 128 bits from memory (composed of 4 packed single-precision (32-bit) floating-point elements) to all elements of dst.

Operation

tmp[127:0] = MEM[mem_addr+127:mem_addr] dst[127:0] := tmp[127:0] dst[255:128] := tmp[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastsd
__m256d _mm256_broadcast_sd (double const * mem_addr)

Synopsis

__m256d _mm256_broadcast_sd (double const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastsd ymm, m64
CPUID Flags: AVX

Description

Broadcast a double-precision (64-bit) floating-point element from memory to all elements of dst.

Operation

tmp[63:0] = MEM[mem_addr+63:mem_addr] FOR j := 0 to 3 i := j*64 dst[i+63:i] := tmp[63:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastss
__m128 _mm_broadcast_ss (float const * mem_addr)

Synopsis

__m128 _mm_broadcast_ss (float const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastss xmm, m32
CPUID Flags: AVX

Description

Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.

Operation

tmp[31:0] = MEM[mem_addr+31:mem_addr] FOR j := 0 to 3 i := j*32 dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:128] := 0
vbroadcastss
__m256 _mm256_broadcast_ss (float const * mem_addr)

Synopsis

__m256 _mm256_broadcast_ss (float const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastss ymm, m32
CPUID Flags: AVX

Description

Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.

Operation

tmp[31:0] = MEM[mem_addr+31:mem_addr] FOR j := 0 to 7 i := j*32 dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vpbroadcastb
__m128i _mm_broadcastb_epi8 (__m128i a)

Synopsis

__m128i _mm_broadcastb_epi8 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastb xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 8-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 15 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastb
__m256i _mm256_broadcastb_epi8 (__m128i a)

Synopsis

__m256i _mm256_broadcastb_epi8 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastb ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 8-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastd
__m128i _mm_broadcastd_epi32 (__m128i a)

Synopsis

__m128i _mm_broadcastd_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastd xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 32-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastd
__m256i _mm256_broadcastd_epi32 (__m128i a)

Synopsis

__m256i _mm256_broadcastd_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastd ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 32-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastq
__m128i _mm_broadcastq_epi64 (__m128i a)

Synopsis

__m128i _mm_broadcastq_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastq xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 64-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastq
__m256i _mm256_broadcastq_epi64 (__m128i a)

Synopsis

__m256i _mm256_broadcastq_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastq ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 64-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
movddup
__m128d _mm_broadcastsd_pd (__m128d a)

Synopsis

__m128d _mm_broadcastsd_pd (__m128d a)
#include <immintrin.h>
Instruction: movddup xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
Westmere 1
Nehalem 1
vbroadcastsd
__m256d _mm256_broadcastsd_pd (__m128d a)

Synopsis

__m256d _mm256_broadcastsd_pd (__m128d a)
#include <immintrin.h>
Instruction: vbroadcastsd ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vbroadcasti128
__m256i _mm256_broadcastsi128_si256 (__m128i a)

Synopsis

__m256i _mm256_broadcastsi128_si256 (__m128i a)
#include <immintrin.h>
Instruction: vbroadcasti128 ymm, m128
CPUID Flags: AVX2

Description

Broadcast 128 bits of integer data from a to all 128-bit lanes in dst.

Operation

dst[127:0] := a[127:0] dst[255:128] := a[127:0] dst[MAX:256] := 0
vbroadcastss
__m128 _mm_broadcastss_ps (__m128 a)

Synopsis

__m128 _mm_broadcastss_ps (__m128 a)
#include <immintrin.h>
Instruction: vbroadcastss xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
vbroadcastss
__m256 _mm256_broadcastss_ps (__m128 a)

Synopsis

__m256 _mm256_broadcastss_ps (__m128 a)
#include <immintrin.h>
Instruction: vbroadcastss ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastw
__m128i _mm_broadcastw_epi16 (__m128i a)

Synopsis

__m128i _mm_broadcastw_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastw xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 16-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastw
__m256i _mm256_broadcastw_epi16 (__m128i a)

Synopsis

__m256i _mm256_broadcastw_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastw ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 16-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpslldq
__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpslldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] << (tmp*8) dst[255:128] := a[255:128] << (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrldq
__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpsrldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] >> (tmp*8) dst[255:128] := a[255:128] >> (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
__m256 _mm256_castpd_ps (__m256d a)

Synopsis

__m256 _mm256_castpd_ps (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Cast vector of type __m256d to type __m256. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castpd_si256 (__m256d a)

Synopsis

__m256i _mm256_castpd_si256 (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256d to type __m256i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castpd128_pd256 (__m128d a)

Synopsis

__m256d _mm256_castpd128_pd256 (__m128d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m128d to type __m256d; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128d _mm256_castpd256_pd128 (__m256d a)

Synopsis

__m128d _mm256_castpd256_pd128 (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256d to type __m128d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castps_pd (__m256 a)

Synopsis

__m256d _mm256_castps_pd (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Cast vector of type __m256 to type __m256d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castps_si256 (__m256 a)

Synopsis

__m256i _mm256_castps_si256 (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256 to type __m256i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256 _mm256_castps128_ps256 (__m128 a)

Synopsis

__m256 _mm256_castps128_ps256 (__m128 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m128 to type __m256; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128 _mm256_castps256_ps128 (__m256 a)

Synopsis

__m128 _mm256_castps256_ps128 (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256 to type __m128. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castsi128_si256 (__m128i a)

Synopsis

__m256i _mm256_castsi128_si256 (__m128i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m128i to type __m256i; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castsi256_pd (__m256i a)

Synopsis

__m256d _mm256_castsi256_pd (__m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m256d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256 _mm256_castsi256_ps (__m256i a)

Synopsis

__m256 _mm256_castsi256_ps (__m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m256. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128i _mm256_castsi256_si128 (__m256i a)

Synopsis

__m128i _mm256_castsi256_si128 (__m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m128i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
vroundpd
__m256d _mm256_ceil_pd (__m256d a)

Synopsis

__m256d _mm256_ceil_pd (__m256d a)
#include <immintrin.h>
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a up to an integer value, and store the results as packed double-precision floating-point elements in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := CEIL(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_ceil_ps (__m256 a)

Synopsis

__m256 _mm256_ceil_ps (__m256 a)
#include <immintrin.h>
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a up to an integer value, and store the results as packed single-precision floating-point elements in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := CEIL(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcmppd
__m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)

Synopsis

__m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)
#include <immintrin.h>
Instruction: vcmppd xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 1 i := j*64 dst[i+63:i] := ( a[i+63:i] OP b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmppd
__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)
#include <immintrin.h>
Instruction: vcmppd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] OP b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcmpps
__m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)

Synopsis

__m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)
#include <immintrin.h>
Instruction: vcmpps xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 3 i := j*32 dst[i+31:i] := ( a[i+31:i] OP b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmpps
__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vcmpps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] OP b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcmpsd
__m128d _mm_cmp_sd (__m128d a, __m128d b, const int imm8)

Synopsis

__m128d _mm_cmp_sd (__m128d a, __m128d b, const int imm8)
#include <immintrin.h>
Instruction: vcmpsd xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare the lower double-precision (64-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of dst, and copy the upper element from a to the upper element of dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC dst[63:0] := ( a[63:0] OP b[63:0] ) ? 0xFFFFFFFFFFFFFFFF : 0 dst[127:64] := a[127:64] dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmpss
__m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)

Synopsis

__m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)
#include <immintrin.h>
Instruction: vcmpss xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare the lower single-precision (32-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst.

Operation

CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC dst[31:0] := ( a[31:0] OP b[31:0] ) ? 0xFFFFFFFF : 0 dst[127:32] := a[127:32] dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vpcmpeqw
__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := ( a[i+15:i] == b[i+15:i] ) ? 0xFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqd
__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] == b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqq
__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 64-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] == b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqb
__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := ( a[i+7:i] == b[i+7:i] ) ? 0xFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpgtw
__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := ( a[i+15:i] > b[i+15:i] ) ? 0xFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpcmpgtd
__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] > b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpcmpgtq
__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 64-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] > b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpcmpgtb
__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := ( a[i+7:i] > b[i+7:i] ) ? 0xFF : 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmovsxwd
__m256i _mm256_cvtepi16_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepi16_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxwd ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 16-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 7 i := 32*j k := 16*j dst[i+31:i] := SignExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxwq
__m256i _mm256_cvtepi16_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi16_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxwq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 16-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 16*j dst[i+63:i] := SignExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxdq
__m256i _mm256_cvtepi32_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi32_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxdq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 32-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 32*j dst[i+63:i] := SignExtend(a[k+31:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vcvtdq2pd
__m256d _mm256_cvtepi32_pd (__m128i a)

Synopsis

__m256d _mm256_cvtepi32_pd (__m128i a)
#include <immintrin.h>
Instruction: vcvtdq2pd ymm, xmm
CPUID Flags: AVX

Description

Convert packed 32-bit integers in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 m := j*64 dst[m+63:m] := Convert_Int32_To_FP64(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtdq2ps
__m256 _mm256_cvtepi32_ps (__m256i a)

Synopsis

__m256 _mm256_cvtepi32_ps (__m256i a)
#include <immintrin.h>
Instruction: vcvtdq2ps ymm, ymm
CPUID Flags: AVX

Description

Convert packed 32-bit integers in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_Int32_To_FP32(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpmovsxbw
__m256i _mm256_cvtepi8_epi16 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxbw ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in a to packed 16-bit integers, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*8 l := j*16 dst[l+15:l] := SignExtend(a[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxbd
__m256i _mm256_cvtepi8_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxbd ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j k := 8*j dst[i+31:i] := SignExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxbq
__m256i _mm256_cvtepi8_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxbq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in the low 8 bytes of a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j k := 8*j dst[i+63:i] := SignExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxwd
__m256i _mm256_cvtepu16_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepu16_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxwd ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 16-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j k := 16*j dst[i+31:i] := ZeroExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxwq
__m256i _mm256_cvtepu16_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu16_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxwq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 16-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 16*j dst[i+63:i] := ZeroExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxdq
__m256i _mm256_cvtepu32_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu32_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxdq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 32-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j:= 0 to 3 i := 64*j k := 32*j dst[i+63:i] := ZeroExtend(a[k+31:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbw
__m256i _mm256_cvtepu8_epi16 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxbw ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in a to packed 16-bit integers, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*8 l := j*16 dst[l+15:l] := ZeroExtend(a[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbd
__m256i _mm256_cvtepu8_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxbd ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j k := 8*j dst[i+31:i] := ZeroExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbq
__m256i _mm256_cvtepu8_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxbq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in the low 8 bytes of a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j k := 8*j dst[i+63:i] := ZeroExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vcvtpd2dq
__m128i _mm256_cvtpd_epi32 (__m256d a)

Synopsis

__m128i _mm256_cvtpd_epi32 (__m256d a)
#include <immintrin.h>
Instruction: vcvtpd2dq xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_Int32(a[k+63:k]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtpd2ps
__m128 _mm256_cvtpd_ps (__m256d a)

Synopsis

__m128 _mm256_cvtpd_ps (__m256d a)
#include <immintrin.h>
Instruction: vcvtpd2ps xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_FP32(a[k+63:k]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtps2dq
__m256i _mm256_cvtps_epi32 (__m256 a)

Synopsis

__m256i _mm256_cvtps_epi32 (__m256 a)
#include <immintrin.h>
Instruction: vcvtps2dq ymm, ymm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_FP32_To_Int32(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcvtps2pd
__m256d _mm256_cvtps_pd (__m128 a)

Synopsis

__m256d _mm256_cvtps_pd (__m128 a)
#include <immintrin.h>
Instruction: vcvtps2pd ymm, xmm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j k := 32*j dst[i+63:i] := Convert_FP32_To_FP64(a[k+31:k]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 1
Ivy Bridge 2 1
Sandy Bridge 2 1
vcvttpd2dq
__m128i _mm256_cvttpd_epi32 (__m256d a)

Synopsis

__m128i _mm256_cvttpd_epi32 (__m256d a)
#include <immintrin.h>
Instruction: vcvttpd2dq xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.

Operation

FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_Int32_Truncate(a[k+63:k]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvttps2dq
__m256i _mm256_cvttps_epi32 (__m256 a)

Synopsis

__m256i _mm256_cvttps_epi32 (__m256 a)
#include <immintrin.h>
Instruction: vcvttps2dq ymm, ymm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_FP32_To_Int32_Truncate(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vdivpd
__m256d _mm256_div_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_div_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vdivpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Divide packed double-precision (64-bit) floating-point elements in a by packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 3 i := 64*j dst[i+63:i] := a[i+63:i] / b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 35 25
Ivy Bridge 35 28
Sandy Bridge 43 44
vdivps
__m256 _mm256_div_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_div_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vdivps ymm, ymm, ymm
CPUID Flags: AVX

Description

Divide packed single-precision (32-bit) floating-point elements in a by packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 7 i := 32*j dst[i+31:i] := a[i+31:i] / b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 21 13
Ivy Bridge 21 14
Sandy Bridge 29 28
vdpps
__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vdpps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Conditionally multiply the packed single-precision (32-bit) floating-point elements in a and b using the high 4 bits in imm8, sum the four products, and conditionally store the sum in dst using the low 4 bits of imm8.

Operation

DP(a[127:0], b[127:0], imm8[7:0]) { FOR j := 0 to 3 i := j*32 IF imm8[(4+j)%8] temp[i+31:i] := a[i+31:i] * b[i+31:i] ELSE temp[i+31:i] := 0 FI ENDFOR sum[31:0] := (temp[127:96] + temp[95:64]) + (temp[63:32] + temp[31:0]) FOR j := 0 to 3 i := j*32 IF imm8[j%8] tmpdst[i+31:i] := sum[31:0] ELSE tmpdst[i+31:i] := 0 FI ENDFOR RETURN tmpdst[127:0] } dst[127:0] := DP(a[127:0], b[127:0], imm8[7:0]) dst[255:128] := DP(a[255:128], b[255:128], imm8[7:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 14 2
Ivy Bridge 12 2
Sandy Bridge 12 2
__int16 _mm256_extract_epi16 (__m256i a, const int index)

Synopsis

__int16 _mm256_extract_epi16 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract a 16-bit integer from a, selected with index, and store the result in dst.

Operation

dst[15:0] := (a[255:0] >> (index * 16))[15:0]
__int32 _mm256_extract_epi32 (__m256i a, const int index)

Synopsis

__int32 _mm256_extract_epi32 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract a 32-bit integer from a, selected with index, and store the result in dst.

Operation

dst[31:0] := (a[255:0] >> (index * 32))[31:0]
__int64 _mm256_extract_epi64 (__m256i a, const int index)

Synopsis

__int64 _mm256_extract_epi64 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract a 64-bit integer from a, selected with index, and store the result in dst.

Operation

dst[63:0] := (a[255:0] >> (index * 64))[63:0]
__int8 _mm256_extract_epi8 (__m256i a, const int index)

Synopsis

__int8 _mm256_extract_epi8 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract an 8-bit integer from a, selected with index, and store the result in dst.

Operation

dst[7:0] := (a[255:0] >> (index * 8))[7:0]
vextractf128
__m128d _mm256_extractf128_pd (__m256d a, const int imm8)

Synopsis

__m128d _mm256_extractf128_pd (__m256d a, const int imm8)
#include <immintrin.h>
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextractf128
__m128 _mm256_extractf128_ps (__m256 a, const int imm8)

Synopsis

__m128 _mm256_extractf128_ps (__m256 a, const int imm8)
#include <immintrin.h>
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextractf128
__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)

Synopsis

__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextracti128
__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)

Synopsis

__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vextracti128 xmm, ymm, imm
CPUID Flags: AVX2

Description

Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vroundpd
__m256d _mm256_floor_pd (__m256d a)

Synopsis

__m256d _mm256_floor_pd (__m256d a)
#include <immintrin.h>
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a down to an integer value, and store the results as packed double-precision floating-point elements in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := FLOOR(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_floor_ps (__m256 a)

Synopsis

__m256 _mm256_floor_ps (__m256 a)
#include <immintrin.h>
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a down to an integer value, and store the results as packed single-precision floating-point elements in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := FLOOR(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vphaddw
__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphaddw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := a[31:16] + a[15:0] dst[31:16] := a[63:48] + a[47:32] dst[47:32] := a[95:80] + a[79:64] dst[63:48] := a[127:112] + a[111:96] dst[79:64] := b[31:16] + b[15:0] dst[95:80] := b[63:48] + b[47:32] dst[111:96] := b[95:80] + b[79:64] dst[127:112] := b[127:112] + b[111:96] dst[143:128] := a[159:144] + a[143:128] dst[159:144] := a[191:176] + a[175:160] dst[175:160] := a[223:208] + a[207:192] dst[191:176] := a[255:240] + a[239:224] dst[207:192] := b[159:144] + b[143:128] dst[223:208] := b[191:176] + b[175:160] dst[239:224] := b[223:208] + b[207:192] dst[255:240] := b[255:240] + b[239:224] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
vphaddd
__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphaddd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.

Operation

dst[31:0] := a[63:32] + a[31:0] dst[63:32] := a[127:96] + a[95:64] dst[95:64] := b[63:32] + b[31:0] dst[127:96] := b[127:96] + b[95:64] dst[159:128] := a[191:160] + a[159:128] dst[191:160] := a[255:224] + a[223:192] dst[223:192] := b[191:160] + b[159:128] dst[255:224] := b[255:224] + b[223:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
vhaddpd
__m256d _mm256_hadd_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_hadd_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vhaddpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally add adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[63:0] := a[127:64] + a[63:0] dst[127:64] := b[127:64] + b[63:0] dst[191:128] := a[255:192] + a[191:128] dst[255:192] := b[255:192] + b[191:128] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vhaddps
__m256 _mm256_hadd_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_hadd_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vhaddps ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally add adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[31:0] := a[63:32] + a[31:0] dst[63:32] := a[127:96] + a[95:64] dst[95:64] := b[63:32] + b[31:0] dst[127:96] := b[127:96] + b[95:64] dst[159:128] := a[191:160] + a[159:128] dst[191:160] := a[255:224] + a[223:192] dst[223:192] := b[191:160] + b[159:128] dst[255:224] := b[255:224] + b[223:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vphaddsw
__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphaddsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := Saturate_To_Int16(a[31:16] + a[15:0]) dst[31:16] := Saturate_To_Int16(a[63:48] + a[47:32]) dst[47:32] := Saturate_To_Int16(a[95:80] + a[79:64]) dst[63:48] := Saturate_To_Int16(a[127:112] + a[111:96]) dst[79:64] := Saturate_To_Int16(b[31:16] + b[15:0]) dst[95:80] := Saturate_To_Int16(b[63:48] + b[47:32]) dst[111:96] := Saturate_To_Int16(b[95:80] + b[79:64]) dst[127:112] := Saturate_To_Int16(b[127:112] + b[111:96]) dst[143:128] := Saturate_To_Int16(a[159:144] + a[143:128]) dst[159:144] := Saturate_To_Int16(a[191:176] + a[175:160]) dst[175:160] := Saturate_To_Int16(a[223:208] + a[207:192]) dst[191:176] := Saturate_To_Int16(a[255:240] + a[239:224]) dst[207:192] := Saturate_To_Int16(b[159:144] + b[143:128]) dst[223:208] := Saturate_To_Int16(b[191:176] + b[175:160]) dst[239:224] := Saturate_To_Int16(b[223:208] + b[207:192]) dst[255:240] := Saturate_To_Int16(b[255:240] + b[239:224]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
vphsubw
__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphsubw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := a[15:0] - a[31:16] dst[31:16] := a[47:32] - a[63:48] dst[47:32] := a[79:64] - a[95:80] dst[63:48] := a[111:96] - a[127:112] dst[79:64] := b[15:0] - b[31:16] dst[95:80] := b[47:32] - b[63:48] dst[111:96] := b[79:64] - b[95:80] dst[127:112] := b[111:96] - b[127:112] dst[143:128] := a[143:128] - a[159:144] dst[159:144] := a[175:160] - a[191:176] dst[175:160] := a[207:192] - a[223:208] dst[191:176] := a[239:224] - a[255:240] dst[207:192] := b[143:128] - b[159:144] dst[223:208] := b[175:160] - b[191:176] dst[239:224] := b[207:192] - b[223:208] dst[255:240] := b[239:224] - b[255:240] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vphsubd
__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphsubd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.

Operation

dst[31:0] := a[31:0] - a[63:32] dst[63:32] := a[95:64] - a[127:96] dst[95:64] := b[31:0] - b[63:32] dst[127:96] := b[95:64] - b[127:96] dst[159:128] := a[159:128] - a[191:160] dst[191:160] := a[223:192] - a[255:224] dst[223:192] := b[159:128] - b[191:160] dst[255:224] := b[223:192] - b[255:224] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vhsubpd
__m256d _mm256_hsub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_hsub_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vhsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally subtract adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[63:0] := a[63:0] - a[127:64] dst[127:64] := b[63:0] - b[127:64] dst[191:128] := a[191:128] - a[255:192] dst[255:192] := b[191:128] - b[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vhsubps
__m256 _mm256_hsub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_hsub_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vhsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally subtract adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[31:0] := a[31:0] - a[63:32] dst[63:32] := a[95:64] - a[127:96] dst[95:64] := b[31:0] - b[63:32] dst[127:96] := b[95:64] - b[127:96] dst[159:128] := a[159:128] - a[191:160] dst[191:160] := a[223:192] - a[255:224] dst[223:192] := b[159:128] - b[191:160] dst[255:224] := b[223:192] - b[255:224] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vphsubsw
__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphsubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := Saturate_To_Int16(a[15:0] - a[31:16]) dst[31:16] := Saturate_To_Int16(a[47:32] - a[63:48]) dst[47:32] := Saturate_To_Int16(a[79:64] - a[95:80]) dst[63:48] := Saturate_To_Int16(a[111:96] - a[127:112]) dst[79:64] := Saturate_To_Int16(b[15:0] - b[31:16]) dst[95:80] := Saturate_To_Int16(b[47:32] - b[63:48]) dst[111:96] := Saturate_To_Int16(b[79:64] - b[95:80]) dst[127:112] := Saturate_To_Int16(b[111:96] - b[127:112]) dst[143:128] := Saturate_To_Int16(a[143:128] - a[159:144]) dst[159:144] := Saturate_To_Int16(a[175:160] - a[191:176]) dst[175:160] := Saturate_To_Int16(a[207:192] - a[223:208]) dst[191:176] := Saturate_To_Int16(a[239:224] - a[255:240]) dst[207:192] := Saturate_To_Int16(b[143:128] - b[159:144]) dst[223:208] := Saturate_To_Int16(b[175:160] - b[191:176]) dst[239:224] := Saturate_To_Int16(b[207:192] - b[223:208]) dst[255:240] := Saturate_To_Int16(b[239:224] - b[255:240]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpgatherdd
__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vpgatherdd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
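The addressing rule in the pseudocode (sign-extend each index, multiply by scale, use as a byte offset from base_addr) can be sketched as a scalar C model. The function name is hypothetical; this is not the intrinsic, just its documented semantics.

```c
#include <stdint.h>
#include <assert.h>

// Scalar model of _mm_i32gather_epi32: four 32-bit indices, each
// sign-extended and scaled by `scale` bytes, select four 32-bit loads
// relative to base_addr.
static void gather32(const int32_t *base_addr, const int32_t vindex[4],
                     int scale, int32_t dst[4]) {
    const char *base = (const char *)base_addr;  // byte addressing, as in the pseudocode
    for (int j = 0; j < 4; j++) {
        dst[j] = *(const int32_t *)(base + (int64_t)vindex[j] * scale);
    }
}
```

With scale = 4 and an int32_t table, each index behaves like an ordinary array subscript; smaller scales address at finer granularity.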
vpgatherdd
__m128i _mm_mask_i32gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i32gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*32
  IF mask[i+31]
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
    mask[i+31] := 0
  ELSE
    dst[i+31:i] := src[i+31:i]
  FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
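The masked form adds two behaviors on top of the plain gather: an element is loaded only when the sign bit of the corresponding mask element is set (otherwise it is copied from src), and the hardware clears each consumed mask element. A scalar C sketch, with hypothetical names:

```c
#include <stdint.h>
#include <assert.h>

// Scalar model of _mm_mask_i32gather_epi32. mask is updated in place,
// mirroring the mask[i+31] := 0 step in the operation pseudocode.
static void mask_gather32(const int32_t src[4], const int32_t *base_addr,
                          const int32_t vindex[4], int32_t mask[4],
                          int scale, int32_t dst[4]) {
    const char *base = (const char *)base_addr;
    for (int j = 0; j < 4; j++) {
        if (mask[j] < 0) {   // highest bit of the mask element is set
            dst[j] = *(const int32_t *)(base + (int64_t)vindex[j] * scale);
            mask[j] = 0;     // mask element is consumed by the gather
        } else {
            dst[j] = src[j]; // masked-off lane keeps the src value
        }
    }
}
```

The mask clearing is what lets the instruction be restarted after a fault: already-gathered lanes are marked done.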
vpgatherdd
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
  i := j*32
  dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdd
__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
  i := j*32
  IF mask[i+31]
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
    mask[i+31] := 0
  ELSE
    dst[i+31:i] := src[i+31:i]
  FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdq xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*64
  m := j*32
  dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherdq xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*64
  m := j*32
  IF mask[i+63]
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
    mask[i+63] := 0
  ELSE
    dst[i+63:i] := src[i+63:i]
  FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdq ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*64
  m := j*32
  dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherdq ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*64
  m := j*32
  IF mask[i+63]
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
    mask[i+63] := 0
  ELSE
    dst[i+63:i] := src[i+63:i]
  FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
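One detail of the dq gathers worth spelling out: the indices are half the width of the gathered elements, so a 256-bit result (four 64-bit values) only consumes the low four 32-bit lanes of an __m128i index vector. A scalar C model of that shape (hypothetical function name, not the intrinsic):

```c
#include <stdint.h>
#include <assert.h>

// Scalar model of _mm256_i32gather_epi64: four 64-bit elements are
// gathered using four 32-bit indices, which is why vindex is an __m128i
// even though the result is 256 bits wide.
static void gather64_idx32(const int64_t *base_addr, const int32_t vindex[4],
                           int scale, int64_t dst[4]) {
    const char *base = (const char *)base_addr;
    for (int j = 0; j < 4; j++) {
        dst[j] = *(const int64_t *)(base + (int64_t)vindex[j] * scale);
    }
}
```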
vgatherdpd
__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherdpd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*64
  m := j*32
  dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)

Synopsis

__m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
#include "immintrin.h"
Instruction: vgatherdpd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*64
  m := j*32
  IF mask[i+63]
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
    mask[i+63] := 0
  ELSE
    dst[i+63:i] := src[i+63:i]
  FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherdpd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*64
  m := j*32
  dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256d mask, const int scale)

Synopsis

__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256d mask, const int scale)
#include "immintrin.h"
Instruction: vgatherdpd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*64
  m := j*32
  IF mask[i+63]
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
    mask[i+63] := 0
  ELSE
    dst[i+63:i] := src[i+63:i]
  FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
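Because scale is a byte multiplier, the pd gathers can pull one field out of an array of 8-byte records, with the index counting records rather than doubles. A scalar sketch of that pattern, using a hypothetical record type (this is an assumed usage, not part of the reference text):

```c
#include <stdint.h>
#include <assert.h>

// Hypothetical 8-byte record: with scale = 8, a gather index of j
// addresses recs[j].value directly.
struct rec { double value; };

// Scalar model of a vgatherdpd-style access over an array of records.
static void gather_pd_from_recs(const struct rec *recs, const int32_t vindex[4],
                                double dst[4]) {
    const char *base = (const char *)recs;
    for (int j = 0; j < 4; j++) {
        dst[j] = *(const double *)(base + (int64_t)vindex[j] * 8);
    }
}
```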
vgatherdps
__m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherdps xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*32
  dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m128 _mm_mask_i32gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm_mask_i32gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
#include "immintrin.h"
Instruction: vgatherdps xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*32
  IF mask[i+31]
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
    mask[i+31] := 0
  ELSE
    dst[i+31:i] := src[i+31:i]
  FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherdps ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
  i := j*32
  dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256 mask, const int scale)

Synopsis

__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256 mask, const int scale)
#include "immintrin.h"
Instruction: vgatherdps ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
  i := j*32
  IF mask[i+31]
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
    mask[i+31] := 0
  ELSE
    dst[i+31:i] := src[i+31:i]
  FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqd
__m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherqd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*32
  m := j*64
  dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
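The qd direction is the mirror image of dq: the indices are wider than the elements, so two 64-bit indices produce only two 32-bit results and the rest of the 128-bit destination is zeroed (the dst[MAX:64] := 0 step). A scalar C model of that narrowing, with a hypothetical function name:

```c
#include <stdint.h>
#include <assert.h>

// Scalar model of _mm_i64gather_epi32: two 64-bit indices gather two
// 32-bit elements; lanes 2 and 3 of the destination are zeroed, as the
// pseudocode's dst[MAX:64] := 0 requires.
static void gather32_idx64(const int32_t *base_addr, const int64_t vindex[2],
                           int scale, int32_t dst[4]) {
    const char *base = (const char *)base_addr;
    for (int j = 0; j < 2; j++) {
        dst[j] = *(const int32_t *)(base + vindex[j] * scale);
    }
    dst[2] = 0;
    dst[3] = 0;
}
```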
vpgatherqd
__m128i _mm_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherqd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*32
  m := j*64
  IF mask[i+31]
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
    mask[i+31] := 0
  ELSE
    dst[i+31:i] := src[i+31:i]
  FI
ENDFOR
mask[MAX:64] := 0
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqd
__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

Synopsis

__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherqd ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*32
  m := j*64
  dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:128] := 0
vpgatherqd
__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherqd ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*32
  m := j*64
  IF mask[i+31]
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
    mask[i+31] := 0
  ELSE
    dst[i+31:i] := src[i+31:i]
  FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0
vpgatherqq
__m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherqq xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*64
  m := j*64
  dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherqq xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*64
  m := j*64
  IF mask[i+63]
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
    mask[i+63] := 0
  ELSE
    dst[i+63:i] := src[i+63:i]
  FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherqq ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*64
  m := j*64
  dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherqq ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*64
  m := j*64
  IF mask[i+63]
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
    mask[i+63] := 0
  ELSE
    dst[i+63:i] := src[i+63:i]
  FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherqpd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*64
  m := j*64
  dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m128d _mm_mask_i64gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)

Synopsis

__m128d _mm_mask_i64gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
#include "immintrin.h"
Instruction: vgatherqpd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*64
  m := j*64
  IF mask[i+63]
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
    mask[i+63] := 0
  ELSE
    dst[i+63:i] := src[i+63:i]
  FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherqpd ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*64
  m := j*64
  dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256d mask, const int scale)

Synopsis

__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256d mask, const int scale)
#include "immintrin.h"
Instruction: vgatherqpd ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*64
  m := j*64
  IF mask[i+63]
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
    mask[i+63] := 0
  ELSE
    dst[i+63:i] := src[i+63:i]
  FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherqps xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*32
  m := j*64
  dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
#include "immintrin.h"
Instruction: vgatherqps xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
  i := j*32
  m := j*64
  IF mask[i+31]
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
    mask[i+31] := 0
  ELSE
    dst[i+31:i] := src[i+31:i]
  FI
ENDFOR
mask[MAX:64] := 0
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)

Synopsis

__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vgatherqps ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*32
  m := j*64
  dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128 mask, const int scale)
#include "immintrin.h"
Instruction: vgatherqps ymm, vm64x, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*32
  m := j*64
  IF mask[i+31]
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
    mask[i+31] := 0
  ELSE
    dst[i+31:i] := src[i+31:i]
  FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)

Synopsis

__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)
#include "immintrin.h"
CPUID Flags: AVX

Description

Copy a to dst, and insert the 16-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*16
dst[sel+15:sel] := i[15:0]
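The insert operation above copies the whole vector and overwrites one lane selected by index. A scalar C model of the 16-bit form (hypothetical function name; a 256-bit vector holds 16 such lanes):

```c
#include <stdint.h>
#include <assert.h>

// Scalar model of _mm256_insert_epi16: copy all 16 lanes of a, then
// overwrite the lane picked by index (sel := index*16 in bit terms).
static void insert_epi16_model(const int16_t a[16], int16_t value, int index,
                               int16_t dst[16]) {
    for (int k = 0; k < 16; k++) {
        dst[k] = a[k];
    }
    dst[index & 15] = value;  // index selects one of 16 lanes
}
```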
__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)

Synopsis

__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)
#include "immintrin.h"
CPUID Flags: AVX

Description

Copy a to dst, and insert the 32-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*32
dst[sel+31:sel] := i[31:0]
__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)

Synopsis

__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)
#include "immintrin.h"
CPUID Flags: AVX

Description

Copy a to dst, and insert the 64-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*64
dst[sel+63:sel] := i[63:0]
__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)

Synopsis

__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)
#include "immintrin.h"
CPUID Flags: AVX

Description

Copy a to dst, and insert the 8-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*8
dst[sel+7:sel] := i[7:0]
vinsertf128
__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)

Synopsis

__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)
#include "immintrin.h"
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE imm8[7:0] of
  0: dst[127:0] := b[127:0]
  1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
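The vinsertf128 family replaces one 128-bit half of a 256-bit vector, selected by the immediate. A scalar C model of the pd form, treating the vector as four doubles (hypothetical function name, not the intrinsic):

```c
#include <assert.h>

// Scalar model of _mm256_insertf128_pd: copy a, then overwrite either the
// low half (imm8 selects 0) or the high half (imm8 selects 1) with b.
static void insertf128_pd_model(const double a[4], const double b[2],
                                int imm8, double dst[4]) {
    for (int k = 0; k < 4; k++) {
        dst[k] = a[k];
    }
    int off = (imm8 & 1) ? 2 : 0;  // half selected by the immediate
    dst[off] = b[0];
    dst[off + 1] = b[1];
}
```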
vinsertf128
__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)

Synopsis

__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)
#include "immintrin.h"
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[1:0]) of
  0: dst[127:0] := b[127:0]
  1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)

Synopsis

__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)
#include "immintrin.h"
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[1:0]) of
  0: dst[127:0] := b[127:0]
  1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinserti128
__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)

Synopsis

__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)
#include "immintrin.h"
Instruction: vinserti128 ymm, ymm, xmm, imm
CPUID Flags: AVX2

Description

Copy a to dst, then insert 128 bits (composed of integer data) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[1:0]) of
  0: dst[127:0] := b[127:0]
  1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vlddqu
__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
#include <immintrin.h>
Instruction: vlddqu ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovapd
__m256d _mm256_load_pd (double const * mem_addr)

Synopsis

__m256d _mm256_load_pd (double const * mem_addr)
#include <immintrin.h>
Instruction: vmovapd ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovaps
__m256 _mm256_load_ps (float const * mem_addr)

Synopsis

__m256 _mm256_load_ps (float const * mem_addr)
#include <immintrin.h>
Instruction: vmovaps ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovdqa
__m256i _mm256_load_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_load_si256 (__m256i const * mem_addr)
#include <immintrin.h>
Instruction: vmovdqa ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovupd
__m256d _mm256_loadu_pd (double const * mem_addr)

Synopsis

__m256d _mm256_loadu_pd (double const * mem_addr)
#include <immintrin.h>
Instruction: vmovupd ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovups
__m256 _mm256_loadu_ps (float const * mem_addr)

Synopsis

__m256 _mm256_loadu_ps (float const * mem_addr)
#include <immintrin.h>
Instruction: vmovups ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovdqu
__m256i _mm256_loadu_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_loadu_si256 (__m256i const * mem_addr)
#include <immintrin.h>
Instruction: vmovdqu ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)

Synopsis

__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)
#include <immintrin.h>
CPUID Flags: AVX

Description

Load two 128-bit values (composed of 4 packed single-precision (32-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr]
dst[255:128] := MEM[hiaddr+127:hiaddr]
dst[MAX:256] := 0
__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)

Synopsis

__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)
#include <immintrin.h>
CPUID Flags: AVX

Description

Load two 128-bit values (composed of 2 packed double-precision (64-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr]
dst[255:128] := MEM[hiaddr+127:hiaddr]
dst[MAX:256] := 0
__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)

Synopsis

__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)
#include <immintrin.h>
CPUID Flags: AVX

Description

Load two 128-bit values (composed of integer data) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr]
dst[255:128] := MEM[hiaddr+127:hiaddr]
dst[MAX:256] := 0
vpmaddwd
__m256i _mm256_madd_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_madd_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaddwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i+16]*b[i+31:i+16] + a[i+15:i]*b[i+15:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmaddubsw
__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaddubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Vertically multiply each unsigned 8-bit integer from a with the corresponding signed 8-bit integer from b, producing intermediate signed 16-bit integers. Horizontally add adjacent pairs of intermediate signed 16-bit integers, and pack the saturated results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_Int16( a[i+15:i+8]*b[i+15:i+8] + a[i+7:i]*b[i+7:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmaskmovd
__m128i _mm_maskload_epi32 (int const* mem_addr, __m128i mask)

Synopsis

__m128i _mm_maskload_epi32 (int const* mem_addr, __m128i mask)
#include <immintrin.h>
Instruction: vpmaskmovd xmm, xmm, m128
CPUID Flags: AVX2

Description

Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovd
__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)

Synopsis

__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)
#include <immintrin.h>
Instruction: vpmaskmovd ymm, ymm, m256
CPUID Flags: AVX2

Description

Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovq
__m128i _mm_maskload_epi64 (__int64 const* mem_addr, __m128i mask)

Synopsis

__m128i _mm_maskload_epi64 (__int64 const* mem_addr, __m128i mask)
#include <immintrin.h>
Instruction: vpmaskmovq xmm, xmm, m128
CPUID Flags: AVX2

Description

Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovq
__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)

Synopsis

__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)
#include <immintrin.h>
Instruction: vpmaskmovq ymm, ymm, m256
CPUID Flags: AVX2

Description

Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vmaskmovpd
__m128d _mm_maskload_pd (double const * mem_addr, __m128i mask)

Synopsis

__m128d _mm_maskload_pd (double const * mem_addr, __m128i mask)
#include <immintrin.h>
Instruction: vmaskmovpd xmm, xmm, m128
CPUID Flags: AVX

Description

Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovpd
__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)

Synopsis

__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)
#include <immintrin.h>
Instruction: vmaskmovpd ymm, ymm, m256
CPUID Flags: AVX

Description

Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovps
__m128 _mm_maskload_ps (float const * mem_addr, __m128i mask)

Synopsis

__m128 _mm_maskload_ps (float const * mem_addr, __m128i mask)
#include <immintrin.h>
Instruction: vmaskmovps xmm, xmm, m128
CPUID Flags: AVX

Description

Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovps
__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)

Synopsis

__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)
#include <immintrin.h>
Instruction: vmaskmovps ymm, ymm, m256
CPUID Flags: AVX

Description

Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vpmaskmovd
void _mm_maskstore_epi32 (int* mem_addr, __m128i mask, __m128i a)

Synopsis

void _mm_maskstore_epi32 (int* mem_addr, __m128i mask, __m128i a)
#include <immintrin.h>
Instruction: vpmaskmovd m128, xmm, xmm
CPUID Flags: AVX2

Description

Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovd
void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)

Synopsis

void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)
#include <immintrin.h>
Instruction: vpmaskmovd m256, ymm, ymm
CPUID Flags: AVX2

Description

Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovq
void _mm_maskstore_epi64 (__int64* mem_addr, __m128i mask, __m128i a)

Synopsis

void _mm_maskstore_epi64 (__int64* mem_addr, __m128i mask, __m128i a)
#include <immintrin.h>
Instruction: vpmaskmovq m128, xmm, xmm
CPUID Flags: AVX2

Description

Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovq
void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)

Synopsis

void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)
#include <immintrin.h>
Instruction: vpmaskmovq m256, ymm, ymm
CPUID Flags: AVX2

Description

Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vmaskmovpd
void _mm_maskstore_pd (double * mem_addr, __m128i mask, __m128d a)

Synopsis

void _mm_maskstore_pd (double * mem_addr, __m128i mask, __m128d a)
#include <immintrin.h>
Instruction: vmaskmovpd m128, xmm, xmm
CPUID Flags: AVX

Description

Store packed double-precision (64-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovpd
void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)

Synopsis

void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)
#include <immintrin.h>
Instruction: vmaskmovpd m256, ymm, ymm
CPUID Flags: AVX

Description

Store packed double-precision (64-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovps
void _mm_maskstore_ps (float * mem_addr, __m128i mask, __m128 a)

Synopsis

void _mm_maskstore_ps (float * mem_addr, __m128i mask, __m128 a)
#include <immintrin.h>
Instruction: vmaskmovps m128, xmm, xmm
CPUID Flags: AVX

Description

Store packed single-precision (32-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovps
void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)

Synopsis

void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)
#include <immintrin.h>
Instruction: vmaskmovps m256, ymm, ymm
CPUID Flags: AVX

Description

Store packed single-precision (32-bit) floating-point elements from a into memory using mask.

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vpmaxsw
__m256i _mm256_max_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] > b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxsd
__m256i _mm256_max_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxsd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] > b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpmaxsb
__m256i _mm256_max_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] > b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxuw
__m256i _mm256_max_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 16-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] > b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxud
__m256i _mm256_max_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxud ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 32-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] > b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpmaxub
__m256i _mm256_max_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxub ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 8-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] > b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vmaxpd
__m256d _mm256_max_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_max_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vmaxpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := MAX(a[i+63:i], b[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vmaxps
__m256 _mm256_max_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_max_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vmaxps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MAX(a[i+31:i], b[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpminsw
__m256i _mm256_min_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] < b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminsd
__m256i _mm256_min_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminsd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] < b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminsb
__m256i _mm256_min_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] < b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminuw
__m256i _mm256_min_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 16-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] < b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminud
__m256i _mm256_min_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminud ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 32-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] < b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminub
__m256i _mm256_min_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminub ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 8-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] < b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vminpd
__m256d _mm256_min_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_min_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vminpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := MIN(a[i+63:i], b[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vminps
__m256 _mm256_min_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_min_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vminps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MIN(a[i+31:i], b[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vmovddup
__m256d _mm256_movedup_pd (__m256d a)

Synopsis

__m256d _mm256_movedup_pd (__m256d a)
#include <immintrin.h>
Instruction: vmovddup ymm, ymm
CPUID Flags: AVX

Description

Duplicate even-indexed double-precision (64-bit) floating-point elements from a, and store the results in dst.

Operation

dst[63:0] := a[63:0]
dst[127:64] := a[63:0]
dst[191:128] := a[191:128]
dst[255:192] := a[191:128]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vmovshdup
__m256 _mm256_movehdup_ps (__m256 a)

Synopsis

__m256 _mm256_movehdup_ps (__m256 a)
#include <immintrin.h>
Instruction: vmovshdup ymm, ymm
CPUID Flags: AVX

Description

Duplicate odd-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.

Operation

dst[31:0] := a[63:32]
dst[63:32] := a[63:32]
dst[95:64] := a[127:96]
dst[127:96] := a[127:96]
dst[159:128] := a[191:160]
dst[191:160] := a[191:160]
dst[223:192] := a[255:224]
dst[255:224] := a[255:224]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vmovsldup
__m256 _mm256_moveldup_ps (__m256 a)

Synopsis

__m256 _mm256_moveldup_ps (__m256 a)
#include <immintrin.h>
Instruction: vmovsldup ymm, ymm
CPUID Flags: AVX

Description

Duplicate even-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.

Operation

dst[31:0] := a[31:0]
dst[63:32] := a[31:0]
dst[95:64] := a[95:64]
dst[127:96] := a[95:64]
dst[159:128] := a[159:128]
dst[191:160] := a[159:128]
dst[223:192] := a[223:192]
dst[255:224] := a[223:192]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpmovmskb
int _mm256_movemask_epi8 (__m256i a)

Synopsis

int _mm256_movemask_epi8 (__m256i a)
#include <immintrin.h>
Instruction: vpmovmskb r32, ymm
CPUID Flags: AVX2

Description

Create mask from the most significant bit of each 8-bit element in a, and store the result in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[j] := a[i+7]
ENDFOR

Performance

Architecture Latency Throughput
Haswell 3
vmovmskpd
int _mm256_movemask_pd (__m256d a)

Synopsis

int _mm256_movemask_pd (__m256d a)
#include <immintrin.h>
Instruction: vmovmskpd r32, ymm
CPUID Flags: AVX

Description

Set each bit of mask dst based on the most significant bit of the corresponding packed double-precision (64-bit) floating-point element in a.

Operation

FOR j := 0 to 3
    i := j*64
    IF a[i+63]
        dst[j] := 1
    ELSE
        dst[j] := 0
    FI
ENDFOR
dst[MAX:4] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 2
Sandy Bridge 2
vmovmskps
int _mm256_movemask_ps (__m256 a)

Synopsis

int _mm256_movemask_ps (__m256 a)
#include <immintrin.h>
Instruction: vmovmskps r32, ymm
CPUID Flags: AVX

Description

Set each bit of mask dst based on the most significant bit of the corresponding packed single-precision (32-bit) floating-point element in a.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31]
        dst[j] := 1
    ELSE
        dst[j] := 0
    FI
ENDFOR
dst[MAX:8] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 2
Sandy Bridge 2
vmpsadbw
__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)
#include <immintrin.h>
Instruction: vmpsadbw ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Compute the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in a compared to those in b, and store the 16-bit results in dst. Eight SADs are performed for each 128-bit lane using one quadruplet from b and eight quadruplets from a. One quadruplet is selected from b starting at the offset specified in imm8. Eight quadruplets are formed from sequential 8-bit integers selected from a starting at the offset specified in imm8.

Operation

MPSADBW(a[127:0], b[127:0], imm8[2:0]) {
    a_offset := imm8[2]*32
    b_offset := imm8[1:0]*32
    FOR j := 0 to 7
        i := j*8
        k := a_offset+i
        l := b_offset
        tmp[i+15:i] := ABS(a[k+7:k] - b[l+7:l]) + ABS(a[k+15:k+8] - b[l+15:l+8]) + ABS(a[k+23:k+16] - b[l+23:l+16]) + ABS(a[k+31:k+24] - b[l+31:l+24])
    ENDFOR
    RETURN tmp[127:0]
}
dst[127:0] := MPSADBW(a[127:0], b[127:0], imm8[2:0])
dst[255:128] := MPSADBW(a[255:128], b[255:128], imm8[5:3])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 2
vpmuldq
__m256i _mm256_mul_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mul_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmuldq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the low 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+31:i] * b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmuludq
__m256i _mm256_mul_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mul_epu32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmuludq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the low unsigned 32-bit integers from each packed 64-bit element in a and b, and store the unsigned 64-bit results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+31:i] * b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vmulpd
__m256d _mm256_mul_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_mul_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vmulpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Multiply packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] * b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 0.5
Ivy Bridge 5 1
Sandy Bridge 5 1
vmulps
__m256 _mm256_mul_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_mul_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vmulps ymm, ymm, ymm
CPUID Flags: AVX

Description

Multiply packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] * b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 0.5
Ivy Bridge 5 1
Sandy Bridge 5 1
vpmulhw
__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmulhw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15
    i := j*16
    tmp[31:0] := a[i+15:i] * b[i+15:i]
    dst[i+15:i] := tmp[31:16]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmulhuw
__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmulhuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed unsigned 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15
    i := j*16
    tmp[31:0] := a[i+15:i] * b[i+15:i]
    dst[i+15:i] := tmp[31:16]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
vpmulhrsw
__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmulhrsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply packed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Truncate each intermediate integer to the 18 most significant bits, round by adding 1, and store bits [16:1] to dst.

Operation

FOR j := 0 to 15
    i := j*16
    tmp[31:0] := ((a[i+15:i] * b[i+15:i]) >> 14) + 1
    dst[i+15:i] := tmp[16:1]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmullw
__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmullw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the low 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15
    i := j*16
    tmp[31:0] := a[i+15:i] * b[i+15:i]
    dst[i+15:i] := tmp[15:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmulld
__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmulld ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 7 i := j*32 tmp[63:0] := a[i+31:i] * b[i+31:i] dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 10 1
vorpd
__m256d _mm256_or_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_or_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise OR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] BITWISE OR b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vorps
__m256 _mm256_or_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_or_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise OR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] BITWISE OR b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpor
__m256i _mm256_or_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_or_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpor ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise OR of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] OR b[255:0]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vpacksswb
__m256i _mm256_packs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpacksswb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 16-bit integers from a and b to packed 8-bit integers using signed saturation, and store the results in dst.

Operation

dst[7:0] := Saturate_Int16_To_Int8 (a[15:0]) dst[15:8] := Saturate_Int16_To_Int8 (a[31:16]) dst[23:16] := Saturate_Int16_To_Int8 (a[47:32]) dst[31:24] := Saturate_Int16_To_Int8 (a[63:48]) dst[39:32] := Saturate_Int16_To_Int8 (a[79:64]) dst[47:40] := Saturate_Int16_To_Int8 (a[95:80]) dst[55:48] := Saturate_Int16_To_Int8 (a[111:96]) dst[63:56] := Saturate_Int16_To_Int8 (a[127:112]) dst[71:64] := Saturate_Int16_To_Int8 (b[15:0]) dst[79:72] := Saturate_Int16_To_Int8 (b[31:16]) dst[87:80] := Saturate_Int16_To_Int8 (b[47:32]) dst[95:88] := Saturate_Int16_To_Int8 (b[63:48]) dst[103:96] := Saturate_Int16_To_Int8 (b[79:64]) dst[111:104] := Saturate_Int16_To_Int8 (b[95:80]) dst[119:112] := Saturate_Int16_To_Int8 (b[111:96]) dst[127:120] := Saturate_Int16_To_Int8 (b[127:112]) dst[135:128] := Saturate_Int16_To_Int8 (a[143:128]) dst[143:136] := Saturate_Int16_To_Int8 (a[159:144]) dst[151:144] := Saturate_Int16_To_Int8 (a[175:160]) dst[159:152] := Saturate_Int16_To_Int8 (a[191:176]) dst[167:160] := Saturate_Int16_To_Int8 (a[207:192]) dst[175:168] := Saturate_Int16_To_Int8 (a[223:208]) dst[183:176] := Saturate_Int16_To_Int8 (a[239:224]) dst[191:184] := Saturate_Int16_To_Int8 (a[255:240]) dst[199:192] := Saturate_Int16_To_Int8 (b[143:128]) dst[207:200] := Saturate_Int16_To_Int8 (b[159:144]) dst[215:208] := Saturate_Int16_To_Int8 (b[175:160]) dst[223:216] := Saturate_Int16_To_Int8 (b[191:176]) dst[231:224] := Saturate_Int16_To_Int8 (b[207:192]) dst[239:232] := Saturate_Int16_To_Int8 (b[223:208]) dst[247:240] := Saturate_Int16_To_Int8 (b[239:224]) dst[255:248] := Saturate_Int16_To_Int8 (b[255:240]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpackssdw
__m256i _mm256_packs_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packs_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpackssdw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 32-bit integers from a and b to packed 16-bit integers using signed saturation, and store the results in dst.

Operation

dst[15:0] := Saturate_Int32_To_Int16 (a[31:0]) dst[31:16] := Saturate_Int32_To_Int16 (a[63:32]) dst[47:32] := Saturate_Int32_To_Int16 (a[95:64]) dst[63:48] := Saturate_Int32_To_Int16 (a[127:96]) dst[79:64] := Saturate_Int32_To_Int16 (b[31:0]) dst[95:80] := Saturate_Int32_To_Int16 (b[63:32]) dst[111:96] := Saturate_Int32_To_Int16 (b[95:64]) dst[127:112] := Saturate_Int32_To_Int16 (b[127:96]) dst[143:128] := Saturate_Int32_To_Int16 (a[159:128]) dst[159:144] := Saturate_Int32_To_Int16 (a[191:160]) dst[175:160] := Saturate_Int32_To_Int16 (a[223:192]) dst[191:176] := Saturate_Int32_To_Int16 (a[255:224]) dst[207:192] := Saturate_Int32_To_Int16 (b[159:128]) dst[223:208] := Saturate_Int32_To_Int16 (b[191:160]) dst[239:224] := Saturate_Int32_To_Int16 (b[223:192]) dst[255:240] := Saturate_Int32_To_Int16 (b[255:224]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpackuswb
__m256i _mm256_packus_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packus_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpackuswb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and store the results in dst.

Operation

dst[7:0] := Saturate_Int16_To_UnsignedInt8 (a[15:0]) dst[15:8] := Saturate_Int16_To_UnsignedInt8 (a[31:16]) dst[23:16] := Saturate_Int16_To_UnsignedInt8 (a[47:32]) dst[31:24] := Saturate_Int16_To_UnsignedInt8 (a[63:48]) dst[39:32] := Saturate_Int16_To_UnsignedInt8 (a[79:64]) dst[47:40] := Saturate_Int16_To_UnsignedInt8 (a[95:80]) dst[55:48] := Saturate_Int16_To_UnsignedInt8 (a[111:96]) dst[63:56] := Saturate_Int16_To_UnsignedInt8 (a[127:112]) dst[71:64] := Saturate_Int16_To_UnsignedInt8 (b[15:0]) dst[79:72] := Saturate_Int16_To_UnsignedInt8 (b[31:16]) dst[87:80] := Saturate_Int16_To_UnsignedInt8 (b[47:32]) dst[95:88] := Saturate_Int16_To_UnsignedInt8 (b[63:48]) dst[103:96] := Saturate_Int16_To_UnsignedInt8 (b[79:64]) dst[111:104] := Saturate_Int16_To_UnsignedInt8 (b[95:80]) dst[119:112] := Saturate_Int16_To_UnsignedInt8 (b[111:96]) dst[127:120] := Saturate_Int16_To_UnsignedInt8 (b[127:112]) dst[135:128] := Saturate_Int16_To_UnsignedInt8 (a[143:128]) dst[143:136] := Saturate_Int16_To_UnsignedInt8 (a[159:144]) dst[151:144] := Saturate_Int16_To_UnsignedInt8 (a[175:160]) dst[159:152] := Saturate_Int16_To_UnsignedInt8 (a[191:176]) dst[167:160] := Saturate_Int16_To_UnsignedInt8 (a[207:192]) dst[175:168] := Saturate_Int16_To_UnsignedInt8 (a[223:208]) dst[183:176] := Saturate_Int16_To_UnsignedInt8 (a[239:224]) dst[191:184] := Saturate_Int16_To_UnsignedInt8 (a[255:240]) dst[199:192] := Saturate_Int16_To_UnsignedInt8 (b[143:128]) dst[207:200] := Saturate_Int16_To_UnsignedInt8 (b[159:144]) dst[215:208] := Saturate_Int16_To_UnsignedInt8 (b[175:160]) dst[223:216] := Saturate_Int16_To_UnsignedInt8 (b[191:176]) dst[231:224] := Saturate_Int16_To_UnsignedInt8 (b[207:192]) dst[239:232] := Saturate_Int16_To_UnsignedInt8 (b[223:208]) dst[247:240] := Saturate_Int16_To_UnsignedInt8 (b[239:224]) dst[255:248] := Saturate_Int16_To_UnsignedInt8 (b[255:240]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpackusdw
__m256i _mm256_packus_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packus_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpackusdw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 32-bit integers from a and b to packed 16-bit integers using unsigned saturation, and store the results in dst.

Operation

dst[15:0] := Saturate_Int32_To_UnsignedInt16 (a[31:0]) dst[31:16] := Saturate_Int32_To_UnsignedInt16 (a[63:32]) dst[47:32] := Saturate_Int32_To_UnsignedInt16 (a[95:64]) dst[63:48] := Saturate_Int32_To_UnsignedInt16 (a[127:96]) dst[79:64] := Saturate_Int32_To_UnsignedInt16 (b[31:0]) dst[95:80] := Saturate_Int32_To_UnsignedInt16 (b[63:32]) dst[111:96] := Saturate_Int32_To_UnsignedInt16 (b[95:64]) dst[127:112] := Saturate_Int32_To_UnsignedInt16 (b[127:96]) dst[143:128] := Saturate_Int32_To_UnsignedInt16 (a[159:128]) dst[159:144] := Saturate_Int32_To_UnsignedInt16 (a[191:160]) dst[175:160] := Saturate_Int32_To_UnsignedInt16 (a[223:192]) dst[191:176] := Saturate_Int32_To_UnsignedInt16 (a[255:224]) dst[207:192] := Saturate_Int32_To_UnsignedInt16 (b[159:128]) dst[223:208] := Saturate_Int32_To_UnsignedInt16 (b[191:160]) dst[239:224] := Saturate_Int32_To_UnsignedInt16 (b[223:192]) dst[255:240] := Saturate_Int32_To_UnsignedInt16 (b[255:224]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpermilpd
__m128d _mm_permute_pd (__m128d a, int imm8)

Synopsis

__m128d _mm_permute_pd (__m128d a, int imm8)
#include <immintrin.h>
Instruction: vpermilpd xmm, xmm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a using the control in imm8, and store the results in dst.

Operation

IF (imm8[0] == 0) dst[63:0] := a[63:0] IF (imm8[0] == 1) dst[63:0] := a[127:64] IF (imm8[1] == 0) dst[127:64] := a[63:0] IF (imm8[1] == 1) dst[127:64] := a[127:64] dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilpd
__m256d _mm256_permute_pd (__m256d a, int imm8)

Synopsis

__m256d _mm256_permute_pd (__m256d a, int imm8)
#include <immintrin.h>
Instruction: vpermilpd ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

IF (imm8[0] == 0) dst[63:0] := a[63:0] IF (imm8[0] == 1) dst[63:0] := a[127:64] IF (imm8[1] == 0) dst[127:64] := a[63:0] IF (imm8[1] == 1) dst[127:64] := a[127:64] IF (imm8[2] == 0) dst[191:128] := a[191:128] IF (imm8[2] == 1) dst[191:128] := a[255:192] IF (imm8[3] == 0) dst[255:192] := a[191:128] IF (imm8[3] == 1) dst[255:192] := a[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m128 _mm_permute_ps (__m128 a, int imm8)

Synopsis

__m128 _mm_permute_ps (__m128 a, int imm8)
#include <immintrin.h>
Instruction: vpermilps xmm, xmm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m256 _mm256_permute_ps (__m256 a, int imm8)

Synopsis

__m256 _mm256_permute_ps (__m256 a, int imm8)
#include <immintrin.h>
Instruction: vpermilps ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[159:128] := SELECT4(a[255:128], imm8[1:0]) dst[191:160] := SELECT4(a[255:128], imm8[3:2]) dst[223:192] := SELECT4(a[255:128], imm8[5:4]) dst[255:224] := SELECT4(a[255:128], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vperm2f128
__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)

Synopsis

__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)
#include <immintrin.h>
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2f128
__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)

Synopsis

__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)
#include <immintrin.h>
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2f128
__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)

Synopsis

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)
#include <immintrin.h>
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2i128
__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)
#include <immintrin.h>
Instruction: vperm2i128 ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpermq
__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpermq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 64-bit integers in a across lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[63:0] := src[63:0] 1: tmp[63:0] := src[127:64] 2: tmp[63:0] := src[191:128] 3: tmp[63:0] := src[255:192] ESAC RETURN tmp[63:0] } dst[63:0] := SELECT4(a[255:0], imm8[1:0]) dst[127:64] := SELECT4(a[255:0], imm8[3:2]) dst[191:128] := SELECT4(a[255:0], imm8[5:4]) dst[255:192] := SELECT4(a[255:0], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpermpd
__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)

Synopsis

__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)
#include <immintrin.h>
Instruction: vpermpd ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle double-precision (64-bit) floating-point elements in a across lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[63:0] := src[63:0] 1: tmp[63:0] := src[127:64] 2: tmp[63:0] := src[191:128] 3: tmp[63:0] := src[255:192] ESAC RETURN tmp[63:0] } dst[63:0] := SELECT4(a[255:0], imm8[1:0]) dst[127:64] := SELECT4(a[255:0], imm8[3:2]) dst[191:128] := SELECT4(a[255:0], imm8[5:4]) dst[255:192] := SELECT4(a[255:0], imm8[7:6]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
vpermilpd
__m128d _mm_permutevar_pd (__m128d a, __m128i b)

Synopsis

__m128d _mm_permutevar_pd (__m128d a, __m128i b)
#include <immintrin.h>
Instruction: vpermilpd xmm, xmm, xmm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a using the control in b, and store the results in dst.

Operation

IF (b[1] == 0) dst[63:0] := a[63:0] IF (b[1] == 1) dst[63:0] := a[127:64] IF (b[65] == 0) dst[127:64] := a[63:0] IF (b[65] == 1) dst[127:64] := a[127:64] dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilpd
__m256d _mm256_permutevar_pd (__m256d a, __m256i b)

Synopsis

__m256d _mm256_permutevar_pd (__m256d a, __m256i b)
#include <immintrin.h>
Instruction: vpermilpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.

Operation

IF (b[1] == 0) dst[63:0] := a[63:0] IF (b[1] == 1) dst[63:0] := a[127:64] IF (b[65] == 0) dst[127:64] := a[63:0] IF (b[65] == 1) dst[127:64] := a[127:64] IF (b[129] == 0) dst[191:128] := a[191:128] IF (b[129] == 1) dst[191:128] := a[255:192] IF (b[193] == 0) dst[255:192] := a[191:128] IF (b[193] == 1) dst[255:192] := a[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpermilps
__m128 _mm_permutevar_ps (__m128 a, __m128i b)

Synopsis

__m128 _mm_permutevar_ps (__m128 a, __m128i b)
#include <immintrin.h>
Instruction: vpermilps xmm, xmm, xmm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a using the control in b, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], b[1:0]) dst[63:32] := SELECT4(a[127:0], b[33:32]) dst[95:64] := SELECT4(a[127:0], b[65:64]) dst[127:96] := SELECT4(a[127:0], b[97:96]) dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m256 _mm256_permutevar_ps (__m256 a, __m256i b)

Synopsis

__m256 _mm256_permutevar_ps (__m256 a, __m256i b)
#include <immintrin.h>
Instruction: vpermilps ymm, ymm, ymm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.

Operation

SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], b[1:0]) dst[63:32] := SELECT4(a[127:0], b[33:32]) dst[95:64] := SELECT4(a[127:0], b[65:64]) dst[127:96] := SELECT4(a[127:0], b[97:96]) dst[159:128] := SELECT4(a[255:128], b[129:128]) dst[191:160] := SELECT4(a[255:128], b[161:160]) dst[223:192] := SELECT4(a[255:128], b[193:192]) dst[255:224] := SELECT4(a[255:128], b[225:224]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpermd
__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)

Synopsis

__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)
#include <immintrin.h>
Instruction: vpermd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle 32-bit integers in a across lanes using the corresponding index in idx, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 id := idx[i+2:i]*32 dst[i+31:i] := a[id+31:id] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpermps
__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)

Synopsis

__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)
#include <immintrin.h>
Instruction: vpermps ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle single-precision (32-bit) floating-point elements in a across lanes using the corresponding index in idx, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 id := idx[i+2:i]*32 dst[i+31:i] := a[id+31:id] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
vrcpps
__m256 _mm256_rcp_ps (__m256 a)

Synopsis

__m256 _mm256_rcp_ps (__m256 a)
#include <immintrin.h>
Instruction: vrcpps ymm, ymm
CPUID Flags: AVX

Description

Compute the approximate reciprocal of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := APPROXIMATE(1.0/a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 1
Ivy Bridge 7 1
Sandy Bridge 7 1
vroundpd
__m256d _mm256_round_pd (__m256d a, int rounding)

Synopsis

__m256d _mm256_round_pd (__m256d a, int rounding)
#include <immintrin.h>
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a using the rounding parameter, and store the results as packed double-precision floating-point elements in dst.
Rounding is done according to the rounding parameter, which can be one of:

(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions (_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC) // round down, and suppress exceptions (_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC) // round up, and suppress exceptions (_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC) // truncate, and suppress exceptions _MM_FROUND_CUR_DIRECTION // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ROUND(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_round_ps (__m256 a, int rounding)

Synopsis

__m256 _mm256_round_ps (__m256 a, int rounding)
#include <immintrin.h>
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a using the rounding parameter, and store the results as packed single-precision floating-point elements in dst.
Rounding is done according to the rounding parameter, which can be one of:

(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions (_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC) // round down, and suppress exceptions (_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC) // round up, and suppress exceptions (_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC) // truncate, and suppress exceptions _MM_FROUND_CUR_DIRECTION // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ROUND(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vrsqrtps
__m256 _mm256_rsqrt_ps (__m256 a)

Synopsis

__m256 _mm256_rsqrt_ps (__m256 a)
#include <immintrin.h>
Instruction: vrsqrtps ymm, ymm
CPUID Flags: AVX

Description

Compute the approximate reciprocal square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := APPROXIMATE(1.0 / SQRT(a[i+31:i])) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 1
Ivy Bridge 7 1
Sandy Bridge 7 1
vpsadbw
__m256i _mm256_sad_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sad_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsadbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce four unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.

Operation

FOR j := 0 to 31 i := j*8 tmp[i+7:i] := ABS(a[i+7:i] - b[i+7:i]) ENDFOR FOR j := 0 to 3 i := j*64 dst[i+15:i] := tmp[i+7:i] + tmp[i+15:i+8] + tmp[i+23:i+16] + tmp[i+31:i+24] + tmp[i+39:i+32] + tmp[i+47:i+40] + tmp[i+55:i+48] + tmp[i+63:i+56] dst[i+63:i+16] := 0 ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)

Synopsis

__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 16-bit integers in dst with the supplied values.

Operation

dst[15:0] := e0 dst[31:16] := e1 dst[47:32] := e2 dst[63:48] := e3 dst[79:64] := e4 dst[95:80] := e5 dst[111:96] := e6 dst[127:112] := e7 dst[143:128] := e8 dst[159:144] := e9 dst[175:160] := e10 dst[191:176] := e11 dst[207:192] := e12 dst[223:208] := e13 dst[239:224] := e14 dst[255:240] := e15 dst[MAX:256] := 0
__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)

Synopsis

__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 32-bit integers in dst with the supplied values.

Operation

dst[31:0] := e0 dst[63:32] := e1 dst[95:64] := e2 dst[127:96] := e3 dst[159:128] := e4 dst[191:160] := e5 dst[223:192] := e6 dst[255:224] := e7 dst[MAX:256] := 0
__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)

Synopsis

__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 64-bit integers in dst with the supplied values.

Operation

dst[63:0] := e0 dst[127:64] := e1 dst[191:128] := e2 dst[255:192] := e3 dst[MAX:256] := 0
__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)

Synopsis

__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 8-bit integers in dst with the supplied values.

Operation

dst[7:0] := e0 dst[15:8] := e1 dst[23:16] := e2 dst[31:24] := e3 dst[39:32] := e4 dst[47:40] := e5 dst[55:48] := e6 dst[63:56] := e7 dst[71:64] := e8 dst[79:72] := e9 dst[87:80] := e10 dst[95:88] := e11 dst[103:96] := e12 dst[111:104] := e13 dst[119:112] := e14 dst[127:120] := e15 dst[135:128] := e16 dst[143:136] := e17 dst[151:144] := e18 dst[159:152] := e19 dst[167:160] := e20 dst[175:168] := e21 dst[183:176] := e22 dst[191:184] := e23 dst[199:192] := e24 dst[207:200] := e25 dst[215:208] := e26 dst[223:216] := e27 dst[231:224] := e28 dst[239:232] := e29 dst[247:240] := e30 dst[255:248] := e31 dst[MAX:256] := 0
vinsertf128
__m256 _mm256_set_m128 (__m128 hi, __m128 lo)

Synopsis

__m256 _mm256_set_m128 (__m128 hi, __m128 lo)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256 vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256d _mm256_set_m128d (__m128d hi, __m128d lo)

Synopsis

__m256d _mm256_set_m128d (__m128d hi, __m128d lo)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256d vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_set_m128i (__m128i hi, __m128i lo)

Synopsis

__m256i _mm256_set_m128i (__m128i hi, __m128i lo)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256i vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)

Synopsis

__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed double-precision (64-bit) floating-point elements in dst with the supplied values.

Operation

dst[63:0] := e0 dst[127:64] := e1 dst[191:128] := e2 dst[255:192] := e3 dst[MAX:256] := 0
__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)

Synopsis

__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed single-precision (32-bit) floating-point elements in dst with the supplied values.

Operation

dst[31:0] := e0 dst[63:32] := e1 dst[95:64] := e2 dst[127:96] := e3 dst[159:128] := e4 dst[191:160] := e5 dst[223:192] := e6 dst[255:224] := e7 dst[MAX:256] := 0
__m256i _mm256_set1_epi16 (short a)

Synopsis

__m256i _mm256_set1_epi16 (short a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast 16-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastw instruction.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_set1_epi32 (int a)

Synopsis

__m256i _mm256_set1_epi32 (int a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast 32-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastd instruction.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_set1_epi64x (long long a)

Synopsis

__m256i _mm256_set1_epi64x (long long a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast 64-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastq instruction.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_set1_epi8 (char a)

Synopsis

__m256i _mm256_set1_epi8 (char a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast 8-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastb instruction.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:256] := 0
__m256d _mm256_set1_pd (double a)

Synopsis

__m256d _mm256_set1_pd (double a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast double-precision (64-bit) floating-point value a to all elements of dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0
__m256 _mm256_set1_ps (float a)

Synopsis

__m256 _mm256_set1_ps (float a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast single-precision (32-bit) floating-point value a to all elements of dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0
__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)

Synopsis

__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 16-bit integers in dst with the supplied values in reverse order.

Operation

dst[15:0] := e15 dst[31:16] := e14 dst[47:32] := e13 dst[63:48] := e12 dst[79:64] := e11 dst[95:80] := e10 dst[111:96] := e9 dst[127:112] := e8 dst[143:128] := e7 dst[159:144] := e6 dst[175:160] := e5 dst[191:176] := e4 dst[207:192] := e3 dst[223:208] := e2 dst[239:224] := e1 dst[255:240] := e0 dst[MAX:256] := 0
__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)

Synopsis

__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 32-bit integers in dst with the supplied values in reverse order.

Operation

dst[31:0] := e7 dst[63:32] := e6 dst[95:64] := e5 dst[127:96] := e4 dst[159:128] := e3 dst[191:160] := e2 dst[223:192] := e1 dst[255:224] := e0 dst[MAX:256] := 0
__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)

Synopsis

__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 64-bit integers in dst with the supplied values in reverse order.

Operation

dst[63:0] := e3 dst[127:64] := e2 dst[191:128] := e1 dst[255:192] := e0 dst[MAX:256] := 0
__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)

Synopsis

__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 8-bit integers in dst with the supplied values in reverse order.

Operation

dst[7:0] := e31 dst[15:8] := e30 dst[23:16] := e29 dst[31:24] := e28 dst[39:32] := e27 dst[47:40] := e26 dst[55:48] := e25 dst[63:56] := e24 dst[71:64] := e23 dst[79:72] := e22 dst[87:80] := e21 dst[95:88] := e20 dst[103:96] := e19 dst[111:104] := e18 dst[119:112] := e17 dst[127:120] := e16 dst[135:128] := e15 dst[143:136] := e14 dst[151:144] := e13 dst[159:152] := e12 dst[167:160] := e11 dst[175:168] := e10 dst[183:176] := e9 dst[191:184] := e8 dst[199:192] := e7 dst[207:200] := e6 dst[215:208] := e5 dst[223:216] := e4 dst[231:224] := e3 dst[239:232] := e2 dst[247:240] := e1 dst[255:248] := e0 dst[MAX:256] := 0
vinsertf128
__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)

Synopsis

__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256 vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)

Synopsis

__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256d vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)

Synopsis

__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256i vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)

Synopsis

__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed double-precision (64-bit) floating-point elements in dst with the supplied values in reverse order.

Operation

dst[63:0] := e3 dst[127:64] := e2 dst[191:128] := e1 dst[255:192] := e0 dst[MAX:256] := 0
__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)

Synopsis

__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed single-precision (32-bit) floating-point elements in dst with the supplied values in reverse order.

Operation

dst[31:0] := e7 dst[63:32] := e6 dst[95:64] := e5 dst[127:96] := e4 dst[159:128] := e3 dst[191:160] := e2 dst[223:192] := e1 dst[255:224] := e0 dst[MAX:256] := 0
vxorpd
__m256d _mm256_setzero_pd (void)

Synopsis

__m256d _mm256_setzero_pd (void)
#include <immintrin.h>
Instruction: vxorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256d with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vxorps
__m256 _mm256_setzero_ps (void)

Synopsis

__m256 _mm256_setzero_ps (void)
#include <immintrin.h>
Instruction: vxorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256 with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpxor
__m256i _mm256_setzero_si256 (void)

Synopsis

__m256i _mm256_setzero_si256 (void)
#include <immintrin.h>
Instruction: vpxor ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256i with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpshufd
__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpshufd ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 32-bit integers in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control) {
    CASE(control[1:0])
    0: tmp[31:0] := src[31:0]
    1: tmp[31:0] := src[63:32]
    2: tmp[31:0] := src[95:64]
    3: tmp[31:0] := src[127:96]
    ESAC
    RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm8[1:0])
dst[63:32] := SELECT4(a[127:0], imm8[3:2])
dst[95:64] := SELECT4(a[127:0], imm8[5:4])
dst[127:96] := SELECT4(a[127:0], imm8[7:6])
dst[159:128] := SELECT4(a[255:128], imm8[1:0])
dst[191:160] := SELECT4(a[255:128], imm8[3:2])
dst[223:192] := SELECT4(a[255:128], imm8[5:4])
dst[255:224] := SELECT4(a[255:128], imm8[7:6])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpshufb
__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpshufb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle 8-bit integers in a within 128-bit lanes according to shuffle control mask in the corresponding 8-bit element of b, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i]
        dst[i+7:i] := a[index*8+7:index*8]
    FI
    IF b[128+i+7] == 1
        dst[128+i+7:128+i] := 0
    ELSE
        index[3:0] := b[128+i+3:128+i]
        dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vshufpd
__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)
#include <immintrin.h>
Instruction: vshufpd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

dst[63:0] := (imm8[0] == 0) ? a[63:0] : a[127:64] dst[127:64] := (imm8[1] == 0) ? b[63:0] : b[127:64] dst[191:128] := (imm8[2] == 0) ? a[191:128] : a[255:192] dst[255:192] := (imm8[3] == 0) ? b[191:128] : b[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vshufps
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vshufps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control) {
    CASE(control[1:0])
    0: tmp[31:0] := src[31:0]
    1: tmp[31:0] := src[63:32]
    2: tmp[31:0] := src[95:64]
    3: tmp[31:0] := src[127:96]
    ESAC
    RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm8[1:0])
dst[63:32] := SELECT4(a[127:0], imm8[3:2])
dst[95:64] := SELECT4(b[127:0], imm8[5:4])
dst[127:96] := SELECT4(b[127:0], imm8[7:6])
dst[159:128] := SELECT4(a[255:128], imm8[1:0])
dst[191:160] := SELECT4(a[255:128], imm8[3:2])
dst[223:192] := SELECT4(b[255:128], imm8[5:4])
dst[255:224] := SELECT4(b[255:128], imm8[7:6])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpshufhw
__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpshufhw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 16-bit integers in the high 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the high 64 bits of 128-bit lanes of dst, with the low 64 bits of 128-bit lanes being copied from a to dst.

Operation

dst[63:0] := a[63:0] dst[79:64] := (a >> (imm8[1:0] * 16))[79:64] dst[95:80] := (a >> (imm8[3:2] * 16))[79:64] dst[111:96] := (a >> (imm8[5:4] * 16))[79:64] dst[127:112] := (a >> (imm8[7:6] * 16))[79:64] dst[191:128] := a[191:128] dst[207:192] := (a >> (imm8[1:0] * 16))[207:192] dst[223:208] := (a >> (imm8[3:2] * 16))[207:192] dst[239:224] := (a >> (imm8[5:4] * 16))[207:192] dst[255:240] := (a >> (imm8[7:6] * 16))[207:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpshuflw
__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpshuflw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 16-bit integers in the low 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the low 64 bits of 128-bit lanes of dst, with the high 64 bits of 128-bit lanes being copied from a to dst.

Operation

dst[15:0] := (a >> (imm8[1:0] * 16))[15:0] dst[31:16] := (a >> (imm8[3:2] * 16))[15:0] dst[47:32] := (a >> (imm8[5:4] * 16))[15:0] dst[63:48] := (a >> (imm8[7:6] * 16))[15:0] dst[127:64] := a[127:64] dst[143:128] := (a >> (imm8[1:0] * 16))[143:128] dst[159:144] := (a >> (imm8[3:2] * 16))[143:128] dst[175:160] := (a >> (imm8[5:4] * 16))[143:128] dst[191:176] := (a >> (imm8[7:6] * 16))[143:128] dst[255:192] := a[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpsignw
__m256i _mm256_sign_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsignw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 16-bit integers in a when the corresponding signed 16-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 15 i := j*16 IF b[i+15:i] < 0 dst[i+15:i] := NEG(a[i+15:i]) ELSE IF b[i+15:i] = 0 dst[i+15:i] := 0 ELSE dst[i+15:i] := a[i+15:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpsignd
__m256i _mm256_sign_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsignd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 32-bit integers in a when the corresponding signed 32-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 7 i := j*32 IF b[i+31:i] < 0 dst[i+31:i] := NEG(a[i+31:i]) ELSE IF b[i+31:i] = 0 dst[i+31:i] := 0 ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpsignb
__m256i _mm256_sign_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsignb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 8-bit integers in a when the corresponding signed 8-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 31 i := j*8 IF b[i+7:i] < 0 dst[i+7:i] := NEG(a[i+7:i]) ELSE IF b[i+7:i] = 0 dst[i+7:i] := 0 ELSE dst[i+7:i] := a[i+7:i] FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpsllw
__m256i _mm256_sll_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi16 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsllw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpslld
__m256i _mm256_sll_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi32 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpslld ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpsllq
__m256i _mm256_sll_epi64 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi64 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsllq ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF count[63:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpsllw
__m256i _mm256_slli_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi16 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsllw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpslld
__m256i _mm256_slli_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi32 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpslld ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsllq
__m256i _mm256_slli_epi64 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi64 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsllq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF imm8[7:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpslldq
__m256i _mm256_slli_si256 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_slli_si256 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpslldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] << (tmp*8) dst[255:128] := a[255:128] << (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsllvd
__m128i _mm_sllv_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_sllv_epi32 (__m128i a, __m128i count)
#include <immintrin.h>
Instruction: vpsllvd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] << count[i+31:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vpsllvd
__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)
#include <immintrin.h>
Instruction: vpsllvd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] << count[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vpsllvq
__m128i _mm_sllv_epi64 (__m128i a, __m128i count)

Synopsis

__m128i _mm_sllv_epi64 (__m128i a, __m128i count)
#include <immintrin.h>
Instruction: vpsllvq xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] << count[i+63:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsllvq
__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)
#include <immintrin.h>
Instruction: vpsllvq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] << count[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vsqrtpd
__m256d _mm256_sqrt_pd (__m256d a)

Synopsis

__m256d _mm256_sqrt_pd (__m256d a)
#include <immintrin.h>
Instruction: vsqrtpd ymm, ymm
CPUID Flags: AVX

Description

Compute the square root of packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := SQRT(a[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 35 28
Ivy Bridge 35 28
Sandy Bridge 43 44
vsqrtps
__m256 _mm256_sqrt_ps (__m256 a)

Synopsis

__m256 _mm256_sqrt_ps (__m256 a)
#include <immintrin.h>
Instruction: vsqrtps ymm, ymm
CPUID Flags: AVX

Description

Compute the square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := SQRT(a[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 21 14
Ivy Bridge 21 14
Sandy Bridge 29 28
vpsraw
__m256i _mm256_sra_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sra_epi16 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsraw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := SignBit ELSE dst[i+15:i] := SignExtend(a[i+15:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrad
__m256i _mm256_sra_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sra_epi32 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrad ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := SignBit ELSE dst[i+31:i] := SignExtend(a[i+31:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsraw
__m256i _mm256_srai_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srai_epi16 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsraw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := SignBit ELSE dst[i+15:i] := SignExtend(a[i+15:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrad
__m256i _mm256_srai_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srai_epi32 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsrad ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := SignBit ELSE dst[i+31:i] := SignExtend(a[i+31:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsravd
__m128i _mm_srav_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srav_epi32 (__m128i a, __m128i count)
#include <immintrin.h>
Instruction: vpsravd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := SignExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsravd
__m256i _mm256_srav_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srav_epi32 (__m256i a, __m256i count)
#include <immintrin.h>
Instruction: vpsravd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := SignExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsrlw
__m256i _mm256_srl_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi16 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrlw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrld
__m256i _mm256_srl_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi32 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrld ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrlq
__m256i _mm256_srl_epi64 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi64 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrlq ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF count[63:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrlw
__m256i _mm256_srli_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi16 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsrlw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrld
__m256i _mm256_srli_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi32 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsrld ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlq
__m256i _mm256_srli_epi64 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi64 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsrlq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 IF imm8[7:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrldq
__m256i _mm256_srli_si256 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_srli_si256 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpsrldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] >> (tmp*8) dst[255:128] := a[255:128] >> (tmp*8) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlvd
__m128i _mm_srlv_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srlv_epi32 (__m128i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrlvd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsrlvd
__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)
#include <immintrin.h>
Instruction: vpsrlvd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsrlvq
__m128i _mm_srlv_epi64 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srlv_epi64 (__m128i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrlvq xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[i+63:i]) ENDFOR dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlvq
__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)
#include <immintrin.h>
Instruction: vpsrlvq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[i+63:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vmovapd
void _mm256_store_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_store_pd (double * mem_addr, __m256d a)
#include <immintrin.h>
Instruction: vmovapd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovaps
void _mm256_store_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_store_ps (float * mem_addr, __m256 a)
#include <immintrin.h>
Instruction: vmovaps m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovdqa
void _mm256_store_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
#include <immintrin.h>
Instruction: vmovdqa m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovupd
void _mm256_storeu_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_storeu_pd (double * mem_addr, __m256d a)
#include <immintrin.h>
Instruction: vmovupd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovups
void _mm256_storeu_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_storeu_ps (float * mem_addr, __m256 a)
#include <immintrin.h>
Instruction: vmovups m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovdqu
void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
#include <immintrin.h>
Instruction: vmovdqu m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)

Synopsis

void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of 4 packed single-precision (32-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)

Synopsis

void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of 2 packed double-precision (64-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)

Synopsis

void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of integer data) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
vmovntdqa
__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)

Synopsis

__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)
#include <immintrin.h>
Instruction: vmovntdqa ymm, m256
CPUID Flags: AVX2

Description

Load 256-bits of integer data from memory into dst using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovntpd
void _mm256_stream_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_stream_pd (double * mem_addr, __m256d a)
#include <immintrin.h>
Instruction: vmovntpd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovntps
void _mm256_stream_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_stream_ps (float * mem_addr, __m256 a)
#include <immintrin.h>
Instruction: vmovntps m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovntdq
void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
#include <immintrin.h>
Instruction: vmovntdq m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vpsubw
__m256i _mm256_sub_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 16-bit integers in b from packed 16-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[i+15:i] - b[i+15:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubd
__m256i _mm256_sub_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 32-bit integers in b from packed 32-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] - b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubq
__m256i _mm256_sub_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 64-bit integers in b from packed 64-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] - b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubb
__m256i _mm256_sub_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 8-bit integers in b from packed 8-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[i+7:i] - b[i+7:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vsubpd
__m256d _mm256_sub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_sub_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Subtract packed double-precision (64-bit) floating-point elements in b from packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] - b[i+63:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vsubps
__m256 _mm256_sub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_sub_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Subtract packed single-precision (32-bit) floating-point elements in b from packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] - b[i+31:i] ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpsubsw
__m256i _mm256_subs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 16-bit integers in b from packed 16-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_Int16(a[i+15:i] - b[i+15:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubsb
__m256i _mm256_subs_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 8-bit integers in b from packed 8-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_Int8(a[i+7:i] - b[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubusw
__m256i _mm256_subs_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubusw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed unsigned 16-bit integers in b from packed unsigned 16-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_UnsignedInt16(a[i+15:i] - b[i+15:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubusb
__m256i _mm256_subs_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubusb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed unsigned 8-bit integers in b from packed unsigned 8-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_UnsignedInt8(a[i+7:i] - b[i+7:i]) ENDFOR dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vtestpd
int _mm_testc_pd (__m128d a, __m128d b)

Synopsis

int _mm_testc_pd (__m128d a, __m128d b)
#include <immintrin.h>
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testc_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testc_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testc_ps (__m128 a, __m128 b)

Synopsis

int _mm_testc_ps (__m128 a, __m128 b)
#include <immintrin.h>
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testc_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testc_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testc_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testc_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the CF value.

Operation

IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI RETURN CF

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
vtestpd
int _mm_testnzc_pd (__m128d a, __m128d b)

Synopsis

int _mm_testnzc_pd (__m128d a, __m128d b)
#include <immintrin.h>
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testnzc_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testnzc_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testnzc_ps (__m128 a, __m128 b)

Synopsis

int _mm_testnzc_ps (__m128 a, __m128 b)
#include <immintrin.h>
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testnzc_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testnzc_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testnzc_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testnzc_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
vtestpd
int _mm_testz_pd (__m128d a, __m128d b)

Synopsis

int _mm_testz_pd (__m128d a, __m128d b)
#include <immintrin.h>
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testz_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testz_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testz_ps (__m128 a, __m128 b)

Synopsis

int _mm_testz_ps (__m128 a, __m128 b)
#include <immintrin.h>
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testz_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testz_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testz_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testz_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the ZF value.

Operation

IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
__m256d _mm256_undefined_pd (void)

Synopsis

__m256d _mm256_undefined_pd (void)
#include <immintrin.h>
CPUID Flags: AVX

Description

Return vector of type __m256d with undefined elements.
__m256 _mm256_undefined_ps (void)

Synopsis

__m256 _mm256_undefined_ps (void)
#include <immintrin.h>
CPUID Flags: AVX

Description

Return vector of type __m256 with undefined elements.
__m256i _mm256_undefined_si256 (void)

Synopsis

__m256i _mm256_undefined_si256 (void)
#include <immintrin.h>
CPUID Flags: AVX

Description

Return vector of type __m256i with undefined elements.
vpunpckhwd
__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpckhwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 16-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_WORDS(src1[127:0], src2[127:0]){ dst[15:0] := src1[79:64] dst[31:16] := src2[79:64] dst[47:32] := src1[95:80] dst[63:48] := src2[95:80] dst[79:64] := src1[111:96] dst[95:80] := src2[111:96] dst[111:96] := src1[127:112] dst[127:112] := src2[127:112] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_WORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_WORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhdq
__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpckhdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 32-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[95:64] dst[63:32] := src2[95:64] dst[95:64] := src1[127:96] dst[127:96] := src2[127:96] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhqdq
__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpckhqdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 64-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[127:64] dst[127:64] := src2[127:64] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhbw
__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpckhbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 8-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_BYTES(src1[127:0], src2[127:0]){ dst[7:0] := src1[71:64] dst[15:8] := src2[71:64] dst[23:16] := src1[79:72] dst[31:24] := src2[79:72] dst[39:32] := src1[87:80] dst[47:40] := src2[87:80] dst[55:48] := src1[95:88] dst[63:56] := src2[95:88] dst[71:64] := src1[103:96] dst[79:72] := src2[103:96] dst[87:80] := src1[111:104] dst[95:88] := src2[111:104] dst[103:96] := src1[119:112] dst[111:104] := src2[119:112] dst[119:112] := src1[127:120] dst[127:120] := src2[127:120] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_BYTES(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_BYTES(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vunpckhpd
__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vunpckhpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave double-precision (64-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[127:64] dst[127:64] := src2[127:64] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vunpckhps
__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vunpckhps ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave single-precision (32-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[95:64] dst[63:32] := src2[95:64] dst[95:64] := src1[127:96] dst[127:96] := src2[127:96] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpunpcklwd
__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpcklwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 16-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_WORDS(src1[127:0], src2[127:0]){ dst[15:0] := src1[15:0] dst[31:16] := src2[15:0] dst[47:32] := src1[31:16] dst[63:48] := src2[31:16] dst[79:64] := src1[47:32] dst[95:80] := src2[47:32] dst[111:96] := src1[63:48] dst[127:112] := src2[63:48] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_WORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_WORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckldq
__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpckldq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[31:0] dst[63:32] := src2[31:0] dst[95:64] := src1[63:32] dst[127:96] := src2[63:32] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpcklqdq
__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpcklqdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 64-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[63:0] dst[127:64] := src2[63:0] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpcklbw
__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpcklbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 8-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_BYTES(src1[127:0], src2[127:0]){ dst[7:0] := src1[7:0] dst[15:8] := src2[7:0] dst[23:16] := src1[15:8] dst[31:24] := src2[15:8] dst[39:32] := src1[23:16] dst[47:40] := src2[23:16] dst[55:48] := src1[31:24] dst[63:56] := src2[31:24] dst[71:64] := src1[39:32] dst[79:72] := src2[39:32] dst[87:80] := src1[47:40] dst[95:88] := src2[47:40] dst[103:96] := src1[55:48] dst[111:104] := src2[55:48] dst[119:112] := src1[63:56] dst[127:120] := src2[63:56] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_BYTES(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_BYTES(a[255:128], b[255:128]) dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vunpcklpd
__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)
#include "immintrin.h"
Instruction: vunpcklpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave double-precision (64-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){
    dst[63:0] := src1[63:0]
    dst[127:64] := src2[63:0]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_QWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vunpcklps
__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)
#include "immintrin.h"
Instruction: vunpcklps ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave single-precision (32-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){
    dst[31:0] := src1[31:0]
    dst[63:32] := src2[31:0]
    dst[95:64] := src1[63:32]
    dst[127:96] := src2[63:32]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vxorpd
__m256d _mm256_xor_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_xor_pd (__m256d a, __m256d b)
#include "immintrin.h"
Instruction: vxorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] XOR b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vxorps
__m256 _mm256_xor_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_xor_ps (__m256 a, __m256 b)
#include "immintrin.h"
Instruction: vxorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise XOR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] XOR b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpxor
__m256i _mm256_xor_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_xor_si256 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpxor ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] XOR b[255:0])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
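The integer XOR maps directly onto C's `^` operator. A minimal scalar sketch (the helper name is made up) also illustrates why `_mm256_xor_si256(x, x)` is the standard register-zeroing idiom:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vpxor over one 64-bit chunk; the instruction simply
   XORs all 256 bits, i.e. four such chunks. XORing a value with itself
   always yields zero. */
static uint64_t xor_model(uint64_t a, uint64_t b) {
    return a ^ b;
}
```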
vzeroall
void _mm256_zeroall (void)

Synopsis

void _mm256_zeroall (void)
#include "immintrin.h"
Instruction: vzeroall
CPUID Flags: AVX

Description

Zero the contents of all XMM or YMM registers.

Operation

YMM0[MAX:0] := 0
YMM1[MAX:0] := 0
YMM2[MAX:0] := 0
YMM3[MAX:0] := 0
YMM4[MAX:0] := 0
YMM5[MAX:0] := 0
YMM6[MAX:0] := 0
YMM7[MAX:0] := 0
IF 64-bit mode
    YMM8[MAX:0] := 0
    YMM9[MAX:0] := 0
    YMM10[MAX:0] := 0
    YMM11[MAX:0] := 0
    YMM12[MAX:0] := 0
    YMM13[MAX:0] := 0
    YMM14[MAX:0] := 0
    YMM15[MAX:0] := 0
FI
vzeroupper
void _mm256_zeroupper (void)

Synopsis

void _mm256_zeroupper (void)
#include "immintrin.h"
Instruction: vzeroupper
CPUID Flags: AVX

Description

Zero the upper 128 bits of all YMM registers; the lower 128 bits of the registers are unmodified.

Operation

YMM0[MAX:128] := 0
YMM1[MAX:128] := 0
YMM2[MAX:128] := 0
YMM3[MAX:128] := 0
YMM4[MAX:128] := 0
YMM5[MAX:128] := 0
YMM6[MAX:128] := 0
YMM7[MAX:128] := 0
IF 64-bit mode
    YMM8[MAX:128] := 0
    YMM9[MAX:128] := 0
    YMM10[MAX:128] := 0
    YMM11[MAX:128] := 0
    YMM12[MAX:128] := 0
    YMM13[MAX:128] := 0
    YMM14[MAX:128] := 0
    YMM15[MAX:128] := 0
FI

Performance

Architecture Latency Throughput
Haswell 0 1
Ivy Bridge 0 1
Sandy Bridge 0 1

SSE4 — CPU instruction set used in the Intel Core microarchitecture and AMD

SSE4 consists of 54 instructions; 47 of them belong to SSE4.1 (available in Penryn processors). The full set (SSE4.1 plus SSE4.2, i.e. 47 + the remaining 7 instructions) is available in Intel processors based on the Nehalem microarchitecture, released in mid-November 2008, and in later designs. None of the SSE4 instructions operate on the 64-bit mmx registers (only on the 128-bit registers xmm0–15).

Intel's C compiler generates SSE4 instructions starting with version 10 when the -QxS option is given. Sun Microsystems' Sun Studio compiler generates SSE4 instructions from version 12 update 1 via the options -xarch=sse4_1 (SSE4.1) and -xarch=sse4_2 (SSE4.2). GCC supports SSE4.1 and SSE4.2 since version 4.3 with the -msse4.1 and -msse4.2 options, or -msse4, which enables both.

SSE4.1 instructions
Video acceleration

  • MPSADBW xmm1, xmm2/m128, imm8 — (Multiple Packed Sums of Absolute Difference)
    • Input — { A0, A1,… A14 }, { B0, B1,… B15 }, Shiftmode
    • Output — { SAD0, SAD1, SAD2,… SAD7 }

Computes eight sums of absolute differences (SAD) of shifted 4-byte unsigned groups. The placement of the operands for the 16-bit SADs is selected by 3 bits of the immediate argument imm8.

s1 = imm8[2]*4
s2 = imm8[1:0]*4
SAD0 = |A(s1+0)-B(s2+0)| + |A(s1+1)-B(s2+1)| + |A(s1+2)-B(s2+2)| + |A(s1+3)-B(s2+3)|
SAD1 = |A(s1+1)-B(s2+0)| + |A(s1+2)-B(s2+1)| + |A(s1+3)-B(s2+2)| + |A(s1+4)-B(s2+3)|
SAD2 = |A(s1+2)-B(s2+0)| + |A(s1+3)-B(s2+1)| + |A(s1+4)-B(s2+2)| + |A(s1+5)-B(s2+3)|
...
SAD7 = |A(s1+7)-B(s2+0)| + |A(s1+8)-B(s2+1)| + |A(s1+9)-B(s2+2)| + |A(s1+10)-B(s2+3)|
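The formulas above can be modeled directly in C (a sketch only; `mpsadbw_model` is an illustrative name, not a real API):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of MPSADBW: eight SADs of sliding 4-byte groups from A
   against one fixed 4-byte group from B. Bit 2 of imm8 selects the base
   offset s1 into A; bits 1:0 select the group s2 in B. */
static void mpsadbw_model(const uint8_t A[16], const uint8_t B[16],
                          unsigned imm8, uint16_t sad[8]) {
    unsigned s1 = ((imm8 >> 2) & 1) * 4;
    unsigned s2 = (imm8 & 3) * 4;
    for (unsigned i = 0; i < 8; ++i) {
        uint16_t sum = 0;
        for (unsigned k = 0; k < 4; ++k)
            sum += (uint16_t)abs((int)A[s1 + i + k] - (int)B[s2 + k]);
        sad[i] = sum;
    }
}
```

With s1 = 4 the highest byte touched is A[14], which is why the input is listed as { A0, A1,… A14 }.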
  • PHMINPOSUW xmm1, xmm2/m128 — (Packed Horizontal Word Minimum)
    • Input — { A0, A1,… A7 }
    • Output — { MinVal, MinPos, 0, 0… }

Finds, among the 16-bit unsigned fields A0…A7, the one with the minimum value (taking the lowest-numbered position if several fields share it). Returns the 16-bit value and its position.
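The tie-breaking rule (lowest position wins) is the subtle part; a scalar sketch (hypothetical helper name):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PHMINPOSUW: find the minimum 16-bit unsigned field and
   its position. The strict '<' makes the lowest index win on ties. */
static void phminposuw_model(const uint16_t a[8],
                             uint16_t *minval, uint16_t *minpos) {
    *minval = a[0];
    *minpos = 0;
    for (uint16_t i = 1; i < 8; ++i)
        if (a[i] < *minval) { *minval = a[i]; *minpos = i; }
}
```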

  • PMOV{SX,ZX}{B,W,D} xmm1, xmm2/m{64,32,16} — (Packed Move with Sign/Zero Extend)

A group of 12 instructions for widening packed fields. Packed 8-, 16-, or 32-bit fields from the low part of the argument are extended (with sign or with zero) into the 16-, 32-, or 64-bit fields of the result.

Widening variants (input format → result format):

  • 8 bit → 16 bit: PMOVSXBW, PMOVZXBW
  • 8 bit → 32 bit: PMOVSXBD, PMOVZXBD; 16 bit → 32 bit: PMOVSXWD, PMOVZXWD
  • 8 bit → 64 bit: PMOVSXBQ, PMOVZXBQ; 16 bit → 64 bit: PMOVSXWQ, PMOVZXWQ; 32 bit → 64 bit: PMOVSXDQ, PMOVZXDQ
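Sign versus zero extension gives different results for the same source byte whenever its top bit is set; a per-element sketch (the helper names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Per-element model of PMOVSXBW vs PMOVZXBW: the same byte widens
   differently depending on sign vs zero extension. */
static int16_t sx_byte_to_word(int8_t x)  { return (int16_t)x; }            /* sign-extend */
static int16_t zx_byte_to_word(uint8_t x) { return (int16_t)(uint16_t)x; }  /* zero-extend */
```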

Vector primitives

  • P{MIN,MAX}{SB,UW,SD,UD} xmm1, xmm2/m128 — (Minimum/Maximum of Packed Signed/Unsigned Byte/Word/DWord Integers)

Each field of the result is the minimum/maximum of the corresponding fields of the two arguments. Byte fields are treated only as signed numbers, 16-bit fields only as unsigned. For 32-bit packed fields both signed and unsigned variants are provided.

  • PMULDQ xmm1, xmm2/m128 — (Multiply Packed Signed Dword Integers)
    • Input — { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
    • Output — { A0*B0, A2*B2 }
      Multiplies signed 32-bit fields, producing the full 64-bit results (two multiplications, over fields 0 and 2 of the arguments).
  • PMULLD xmm1, xmm2/m128 — (Multiply Packed Signed Dword Integers and Store Low Result)
    • Input — { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
    • Output — { low32(A0*B0), low32(A1*B1), low32(A2*B2), low32(A3*B3) }
      Multiplies signed 32-bit fields, producing the low 32 bits of each result (four multiplications, over all fields of the arguments).
  • PACKUSDW xmm1, xmm2/m128 — (Pack with Unsigned Saturation)
    Packs signed 32-bit fields into unsigned 16-bit fields with saturation.
  • PCMPEQQ xmm1, xmm2/m128 — (Compare Packed Qword Data for Equal)
    Tests 64-bit fields for equality and produces 64-bit masks.
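The difference between the two multiplies is whether the full 64-bit product or only its low half is kept; a per-pair sketch (hypothetical helper names):

```c
#include <assert.h>
#include <stdint.h>

/* PMULDQ keeps the full 64-bit signed product of a dword pair. */
static int64_t pmuldq_pair(int32_t a, int32_t b) {
    return (int64_t)a * (int64_t)b;
}

/* PMULLD keeps only the low 32 bits of the product. */
static int32_t pmulld_pair(int32_t a, int32_t b) {
    return (int32_t)(uint32_t)((int64_t)a * (int64_t)b);
}
```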

Insertion/extraction

  • INSERTPS xmm1, xmm2/m32, imm8 — (Insert Packed Single Precision Floating-Point Value)

Inserts a 32-bit field from xmm2 (any of that register's 4 fields can be selected) or from a 32-bit memory location into an arbitrary field of the result. In addition, each field of the result can independently be cleared to +0.0.

  • EXTRACTPS r/m32, xmm, imm8 — (Extract Packed Single Precision Floating-Point Value)

Extracts a 32-bit field from an xmm register; the field number is given in the low 2 bits of imm8. If a 64-bit register is specified as the destination, its upper 32 bits are cleared (zero extension).

  • PINSR{B,D,Q} xmm, r/m*, imm8 — (Insert Byte/Dword/Qword)

Inserts an 8-, 32-, or 64-bit value into the specified field of an xmm register (the other fields are unchanged).

  • PEXTR{B,W,D,Q} r/m*, xmm, imm8 — (Extract Byte/Word/Dword/Qword)

Extracts an 8-, 16-, 32-, or 64-bit field of an xmm register, selected by imm8. If a register is specified as the destination, its upper part is cleared (zero extension).

Vector dot products

  • DPPS xmm1, xmm2/m128, imm8 — (Dot Product of Packed Single Precision Floating-Point Values)
  • DPPD xmm1, xmm2/m128, imm8 — (Dot Product of Packed Double Precision Floating-Point Values)

Dot product over 32/64-bit fields. A bit mask in imm8 selects which products of fields are summed and what is written to each field of the result: the sum of the selected products, or +0.0.
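A scalar sketch of the DPPS masking scheme (the nibble layout — high nibble selects products, low nibble selects destination fields — is an assumption stated here, and `dpps_model` is a made-up name):

```c
#include <assert.h>

/* Scalar model of DPPS: the high nibble of imm8 selects which products
   a[i]*b[i] enter the sum; the low nibble selects which result fields
   receive the sum (the rest are set to +0.0). */
static void dpps_model(const float a[4], const float b[4],
                       unsigned imm8, float dst[4]) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm8 & (0x10u << i))
            sum += a[i] * b[i];
    for (int i = 0; i < 4; ++i)
        dst[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```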

Blending

  • BLENDV{PS,PD} xmm1, xmm2/m128, <xmm0> — (Variable Blend Packed Single/Double Precision Floating-Point Values)

Each 32/64-bit field of the result is chosen from either the first or the second argument, depending on the sign of the same field in the implicit argument xmm0.

  • BLEND{PS,PD} xmm1, xmm2/m128, imm8 — (Blend Packed Single/Double Precision Floating-Point Values)

A bit mask (4 or 2 bits) in imm8 selects which argument each 32/64-bit field of the result is taken from.

  • PBLENDVB xmm1, xmm2/m128, <xmm0> — (Variable Blend Packed Bytes)

Each byte field of the result is chosen from either the first or the second argument, depending on the sign of the byte in the same field of the implicit argument xmm0.

  • PBLENDW xmm1, xmm2/m128, imm8 — (Blend Packed Words)

A bit mask (8 bits) in imm8 selects which argument each 16-bit field of the result is taken from.

Bit tests

  • PTEST xmm1, xmm2/m128 — (Logical Compare)

Sets the ZF flag only if all bits of xmm2/m128 marked by the mask in xmm1 are zero. If all the unmarked bits are zero, the CF flag is set. The remaining flags (AF, OF, PF, SF) are always cleared. The instruction does not modify xmm1.
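The two flag conditions can be sketched on a 64-bit chunk (the full instruction applies the same test over all 128 bits; `ptest_model` is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PTEST on one 64-bit chunk: xmm1 provides the mask,
   xmm2/m128 the source. ZF = all masked bits of src are zero;
   CF = all unmasked bits of src are zero. */
static void ptest_model(uint64_t mask, uint64_t src, int *zf, int *cf) {
    *zf = ((src & mask) == 0);
    *cf = ((src & ~mask) == 0);
}
```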

Rounding

  • ROUND{PS, PD} xmm1, xmm2/m128, imm8 — (Round Packed Single/Double Precision Floating-Point Values)

Rounds all 32/64-bit fields. The rounding mode (4 variants) is taken either from MXCSR.RC or specified directly in imm8. Generation of the precision-loss exception can also be suppressed.

  • ROUND{SS, SD} xmm1, xmm2/m128, imm8 — (Round Scalar Single/Double Precision Floating-Point Values)

Rounds only the low 32/64-bit field (the remaining bits are unchanged).

Reading WC memory

  • MOVNTDQA xmm1, m128 — (Load Double Quadword Non-Temporal Aligned Hint)

A load operation that can speed up (by up to 7.5×) work with write-combining memory regions.

New SSE4.2 instructions

String processing

These instructions perform arithmetic comparisons between all possible pairs of fields (64 or 256 comparisons!) from the two strings given by the contents of xmm1 and xmm2/m128. The boolean comparison results are then processed to produce the desired output. The immediate argument imm8 controls the element size (byte or unicode strings, up to 16/8 elements each), the signedness of the fields (string elements), the comparison type, and the interpretation of the results.

They can be used to search a string (memory region) for characters from a given set or within given ranges. Strings (memory regions) can be compared, or substring searches performed.

All of them affect the processor flags: SF is set if the string in xmm1 is not full-length, ZF if the string in xmm2/m128 is not full-length, CF if the result is non-zero, and OF if the low bit of the result is non-zero. The AF and PF flags are cleared.

  • PCMPESTRI <ecx>, xmm1, xmm2/m128, <eax>, <edx>, imm8 — (Packed Compare Explicit Length Strings, Return Index)

The string lengths are given explicitly in <eax> and <edx> (the absolute value of the registers is taken, saturated to 8/16 depending on the string element size). The result is placed in the ecx register.

  • PCMPESTRM <xmm0>, xmm1, xmm2/m128, <eax>, <edx>, imm8 — (Packed Compare Explicit Length Strings, Return Mask)

The string lengths are given explicitly in <eax> and <edx> (the absolute value of the registers is taken, saturated to 8/16 depending on the string element size). The result is placed in the xmm0 register.

  • PCMPISTRI <ecx>, xmm1, xmm2/m128, imm8 — (Packed Compare Implicit Length Strings, Return Index)

The string lengths are implicit (each string is scanned for null elements). The result is placed in the ecx register.

  • PCMPISTRM <xmm0>, xmm1, xmm2/m128, imm8 — (Packed Compare Implicit Length Strings, Return Mask)

The string lengths are implicit (each string is scanned for null elements). The result is placed in the xmm0 register.

CRC32 computation

  • CRC32 r32, r/m* — (Accumulate CRC32)

Accumulates a CRC-32C value (also known as CRC-32/ISCSI or CRC-32/CASTAGNOLI) over an 8-, 16-, 32-, or 64-bit argument (the polynomial 0x1EDC6F41 is used).
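A bit-at-a-time software model of the same checksum (a sketch: the instruction processes data bit-reflected, so the reversed polynomial 0x82F63B78 appears here; the raw instruction itself performs no initial/final complement — software conventionally applies the `~` around it, as below):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32C (Castagnoli), the checksum the CRC32 instruction
   accelerates. 0x82F63B78 is the bit-reversed form of 0x1EDC6F41. */
static uint32_t crc32c(uint32_t crc, const uint8_t *p, size_t n) {
    crc = ~crc;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; ++k)  /* one polynomial step per bit */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```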

Population count of set bits

  • POPCNT r, r/m* — (Return the Count of Number of Bits Set to 1)

Counts the number of set bits. Three variants of the instruction exist: for 16-, 32-, and 64-bit registers. Also present in AMD's SSE4A.
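A simple software equivalent (Kernighan's trick; a sketch of what the instruction computes, not how the hardware does it):

```c
#include <assert.h>
#include <stdint.h>

/* Software model of POPCNT: x &= x - 1 clears the lowest set bit, so
   the loop body runs exactly once per set bit. */
static unsigned popcount64_model(uint64_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        ++n;
    }
    return n;
}
```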

Vector primitives

  • PCMPGTQ xmm1, xmm2/m128 — (Compare Packed Qword Data for Greater Than)

Tests 64-bit fields for "greater than" and produces 64-bit masks.

SSE4a

The SSE4a instruction set was introduced by AMD in its Barcelona-architecture processors. This extension is not available in Intel processors. Support is indicated by the CPUID.80000001H:ECX.SSE4A[Bit 6] flag.

Instruction       Description
LZCNT/POPCNT      Count leading zero bits / set bits.
EXTRQ/INSERTQ     Combined mask-and-shift instructions
MOVNTSD/MOVNTSS   Scalar streaming-store instructions

AMD