Advanced Vector Extensions (AVX) — расширение системы команд x86 для микропроцессоров Intel и AMD.
AVX предоставляет различные улучшения, новые инструкции и новую схему кодирования машинных кодов.
Улучшения
- Новая схема кодирования инструкций VEX
- Ширина векторных регистров SIMD увеличивается со 128 (XMM) до 256 бит (регистры YMM0 — YMM15). Существующие 128-битные SSE-инструкции будут использовать младшую половину новых YMM регистров, не изменяя старшую часть. Для работы с YMM-регистрами добавлены новые 256-битные AVX-инструкции. В будущем возможно расширение векторных регистров SIMD до 512 или 1024 бит. Например, процессоры с архитектурой Xeon Phi уже в 2012 году имели векторные регистры (ZMM) шириной в 512 бит, и используют для работы с ними SIMD-команды с MVEX- и VEX-префиксами, но при этом они не поддерживают AVX.
- Неразрушающие операции. Набор AVX-инструкций использует трёхоперандный синтаксис. Например, вместо {\displaystyle a=a+b}
можно использовать {\displaystyle c=a+b}
, при этом регистр {\displaystyle a}
остаётся неизменённым. В случаях, когда значение {\displaystyle a}
используется дальше в вычислениях, это повышает производительность, так как избавляет от необходимости сохранять перед вычислением и восстанавливать после вычисления регистр, содержавший {\displaystyle a}
, из другого регистра или памяти.
- Для большинства новых инструкций отсутствуют требования к выравниванию операндов в памяти. Однако рекомендуется следить за выравниванием на размер операнда, во избежание значительного снижения производительности.
- Набор инструкций AVX содержит в себе аналоги 128-битных SSE инструкций для вещественных чисел. При этом, в отличие от оригиналов, сохранение 128-битного результата будет обнулять старшую половину YMM регистра. 128-битные AVX-инструкции сохраняют прочие преимущества AVX, такие, как новая схема кодирования, трехоперандный синтаксис и невыровненный доступ к памяти.
- Intel рекомендует отказаться от старых SSE инструкций в пользу новых 128-битных AVX-инструкций, даже если достаточно двух операндов.
Новая схема кодирования
Новая схема кодирования инструкций VEX использует VEX-префикс. В настоящий момент существуют два VEX-префикса, длиной 2 и 3 байта. Для 2-хбайтного VEX-префикса первый байт равен 0xC5, для 3-х байтного — 0xC4.
В 64-битном режиме первый байт VEX-префикса уникален. В 32-битном режиме возникает конфликт с инструкциями LES и LDS, который разрешается старшим битом второго байта, он имеет значение только в 64-битном режиме, через неподдерживаемые формы инструкций LES и LDS.
Длина существующих AVX-инструкций, вместе с VEX-префиксом, не превышает 11 байт. В следующих версиях ожидается появление более длинных инструкций.
Новые инструкции
Инструкция |
Описание |
VBROADCASTSS, VBROADCASTSD, VBROADCASTF128 |
Копирует 32-х-, 64-х- или 128-битный операнд из памяти во все элементы векторного регистра XMM или YMM. |
VINSERTF128 |
Замещает младшую или старшую половину 256-битного регистра YMM значением 128-битного операнда. Другая часть регистра-получателя не изменяется. |
VEXTRACTF128 |
Извлекает младшую или старшую половину 256-битного регистра YMM и копирует в 128-битный операнд-назначение. |
VMASKMOVPS, VMASKMOVPD |
Условно считывает любое количество элементов из векторного операнда из памяти в регистр-получатель, оставляя остальные элементы несчитанными и обнуляя соответствующие им элементы регистра-получателя. Также может условно записывать любое количество элементов из векторного регистра в векторный операнд в памяти, оставляя остальные элементы операнда памяти неизменёнными. |
VPERMILPS, VPERMILPD |
Переставляет 32-х или 64-х битные элементы вектора согласно операнду-селектору (из памяти или из регистра). |
VPERM2F128 |
Переставляет 4 128-битных элемента двух 256-битных регистров в 256-битный операнд-назначение с использованием непосредственной константы (imm) в качестве селектора. |
VZEROALL |
Обнуляет все YMM-регистры и помечает их как неиспользуемые. Используется при переключении между 128-битным режимом и 256-битным. |
VZEROUPPER |
Обнуляет старшие половины всех регистров YMM. Используется при переключении между 128-битным режимом и 256-битным. |
Также в спецификации AVX описана группа инструкций PCLMUL (Parallel Carry-Less Multiplication, Parallel CLMUL)
- PCLMULLQLQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 00]
- PCLMULHQLQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 01]
- PCLMULLQHQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 02]
- PCLMULHQHQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 03]
- PCLMULQDQ xmmreg, xmmrm, imm [rmi: 66 0f 3a 44 /r ib]
Применение
Подходит для интенсивных вычислений с плавающей точкой в мультимедиа-программах и научных задачах. Там, где возможна более высокая степень параллелизма, увеличивает производительность с вещественными числами.
Инструкции и примеры
__m256i _mm256_abs_epi16 (__m256i a)
Synopsis
__m256i _mm256_abs_epi16 (__m256i a)
#include «immintrin.h»
Instruction: vpabsw ymm, ymm
CPUID Flags: AVX2
Description
Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := ABS(a[i+15:i]) ENDFOR dst[MAX:256] := 0
Performance
vpabsd
__m256i _mm256_abs_epi32 (__m256i a)
Synopsis
__m256i _mm256_abs_epi32 (__m256i a)
#include «immintrin.h»
Instruction: vpabsd ymm, ymm
CPUID Flags: AVX2
Description
Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := ABS(a[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vpabsb
__m256i _mm256_abs_epi8 (__m256i a)
Synopsis
__m256i _mm256_abs_epi8 (__m256i a)
#include «immintrin.h»
Instruction: vpabsb ymm, ymm
CPUID Flags: AVX2
Description
Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := ABS(a[i+7:i]) ENDFOR dst[MAX:256] := 0
Performance
vpaddw
__m256i _mm256_add_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_add_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpaddw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Add packed 16-bit integers in a and b, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[i+15:i] + b[i+15:i] ENDFOR dst[MAX:256] := 0
Performance
vpaddd
__m256i _mm256_add_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_add_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpaddd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Add packed 32-bit integers in a and b, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] + b[i+31:i] ENDFOR dst[MAX:256] := 0
Performance
vpaddq
__m256i _mm256_add_epi64 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_add_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpaddq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Add packed 64-bit integers in a and b, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] + b[i+63:i] ENDFOR dst[MAX:256] := 0
Performance
vpaddb
__m256i _mm256_add_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_add_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpaddb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Add packed 8-bit integers in a and b, and store the results in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[i+7:i] + b[i+7:i] ENDFOR dst[MAX:256] := 0
Performance
vaddpd
__m256d _mm256_add_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_add_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vaddpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] + b[i+63:i] ENDFOR dst[MAX:256] := 0
Performance
vaddps
__m256 _mm256_add_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_add_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vaddps ymm, ymm, ymm
CPUID Flags: AVX
Description
Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] + b[i+31:i] ENDFOR dst[MAX:256] := 0
Performance
vpaddsw
__m256i _mm256_adds_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_adds_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpaddsw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Add packed 16-bit integers in a and b using saturation, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] ) ENDFOR dst[MAX:256] := 0
Performance
vpaddsb
__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpaddsb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Add packed 8-bit integers in a and b using saturation, and store the results in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] ) ENDFOR dst[MAX:256] := 0
Performance
vpaddusw
__m256i _mm256_adds_epu16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_adds_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpaddusw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Add packed unsigned 16-bit integers in a and b using saturation, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_UnsignedInt16( a[i+15:i] + b[i+15:i] ) ENDFOR dst[MAX:256] := 0
Performance
vpaddusb
__m256i _mm256_adds_epu8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_adds_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpaddusb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Add packed unsigned 8-bit integers in a and b using saturation, and store the results in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_UnsignedInt8( a[i+7:i] + b[i+7:i] ) ENDFOR dst[MAX:256] := 0
Performance
vaddsubpd
__m256d _mm256_addsub_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_addsub_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vaddsubpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Alternatively add and subtract packed double-precision (64-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 IF (j is even) dst[i+63:i] := a[i+63:i] — b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] + b[i+63:i] FI ENDFOR dst[MAX:256] := 0
Performance
vaddsubps
__m256 _mm256_addsub_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_addsub_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vaddsubps ymm, ymm, ymm
CPUID Flags: AVX
Description
Alternatively add and subtract packed single-precision (32-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 IF (j is even) dst[i+31:i] := a[i+31:i] — b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] + b[i+31:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpalignr
__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int count)
Synopsis
__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int count)
#include «immintrin.h»
Instruction: vpalignr ymm, ymm, ymm, imm
CPUID Flags: AVX2
Description
Concatenate pairs of 16-byte blocks in a and b into a 32-byte temporary result, shift the result right by count bytes, and store the low 16 bytes in dst.
Operation
FOR j := 0 to 1 i := j*128 tmp[255:0] := ((a[i+127:i] << 128) OR b[i+127:i]) >> (count[7:0]*8) dst[i+127:i] := tmp[127:0] ENDFOR dst[MAX:256] := 0
Performance
vandpd
__m256d _mm256_and_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_and_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vandpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := (a[i+63:i] AND b[i+63:i]) ENDFOR dst[MAX:256] := 0
Performance
vandps
__m256 _mm256_and_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_and_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vandps ymm, ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := (a[i+31:i] AND b[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vpand
__m256i _mm256_and_si256 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_and_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpand ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compute the bitwise AND of 256 bits (representing integer data) in a and b, and store the result in dst.
Operation
dst[255:0] := (a[255:0] AND b[255:0]) dst[MAX:256] := 0
Performance
vandnpd
__m256d _mm256_andnot_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_andnot_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vandnpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise NOT of packed double-precision (64-bit) floating-point elements in a and then AND with b, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := ((NOT a[i+63:i]) AND b[i+63:i]) ENDFOR dst[MAX:256] := 0
Performance
vandnps
__m256 _mm256_andnot_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_andnot_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vandnps ymm, ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise NOT of packed single-precision (32-bit) floating-point elements in a and then AND with b, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := ((NOT a[i+31:i]) AND b[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vpandn
__m256i _mm256_andnot_si256 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_andnot_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpandn ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compute the bitwise NOT of 256 bits (representing integer data) in a and then AND with b, and store the result in dst.
Operation
dst[255:0] := ((NOT a[255:0]) AND b[255:0]) dst[MAX:256] := 0
Performance
vpavgw
__m256i _mm256_avg_epu16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_avg_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpavgw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Average packed unsigned 16-bit integers in a and b, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := (a[i+15:i] + b[i+15:i] + 1) >> 1 ENDFOR dst[MAX:256] := 0
Performance
vpavgb
__m256i _mm256_avg_epu8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_avg_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpavgb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Average packed unsigned 8-bit integers in a and b, and store the results in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := (a[i+7:i] + b[i+7:i] + 1) >> 1 ENDFOR dst[MAX:256] := 0
Performance
vpblendw
__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)
Synopsis
__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)
#include «immintrin.h»
Instruction: vpblendw ymm, ymm, ymm, imm
CPUID Flags: AVX2
Description
Blend packed 16-bit integers from a and b using control mask imm8, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 IF imm8[j%8] dst[i+15:i] := b[i+15:i] ELSE dst[i+15:i] := a[i+15:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpblendd
__m128i _mm_blend_epi32 (__m128i a, __m128i b, const int imm8)
Synopsis
__m128i _mm_blend_epi32 (__m128i a, __m128i b, const int imm8)
#include «immintrin.h»
Instruction: vpblendd xmm, xmm, xmm, imm
CPUID Flags: AVX2
Description
Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:128] := 0
Performance
vpblendd
__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)
Synopsis
__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)
#include «immintrin.h»
Instruction: vpblendd ymm, ymm, ymm, imm
CPUID Flags: AVX2
Description
Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0
Performance
vblendpd
__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)
Synopsis
__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)
#include «immintrin.h»
Instruction: vblendpd ymm, ymm, ymm, imm
CPUID Flags: AVX
Description
Blend packed double-precision (64-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 IF imm8[j%8] dst[i+63:i] := b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] FI ENDFOR dst[MAX:256] := 0
Performance
vblendps
__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)
Synopsis
__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)
#include «immintrin.h»
Instruction: vblendps ymm, ymm, ymm, imm
CPUID Flags: AVX
Description
Blend packed single-precision (32-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 IF imm8[j%8] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpblendvb
__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)
Synopsis
__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)
#include «immintrin.h»
Instruction: vpblendvb ymm, ymm, ymm, ymm
CPUID Flags: AVX2
Description
Blend packed 8-bit integers from a and b using mask, and store the results in dst.
Operation
FOR j := 0 to 31 i := j*8 IF mask[i+7] dst[i+7:i] := b[i+7:i] ELSE dst[i+7:i] := a[i+7:i] FI ENDFOR dst[MAX:256] := 0
Performance
vblendvpd
__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)
Synopsis
__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)
#include «immintrin.h»
Instruction: vblendvpd ymm, ymm, ymm, ymm
CPUID Flags: AVX
Description
Blend packed double-precision (64-bit) floating-point elements from a and b using mask, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := b[i+63:i] ELSE dst[i+63:i] := a[i+63:i] FI ENDFOR dst[MAX:256] := 0
Performance
vblendvps
__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)
Synopsis
__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)
#include «immintrin.h»
Instruction: vblendvps ymm, ymm, ymm, ymm
CPUID Flags: AVX
Description
Blend packed single-precision (32-bit) floating-point elements from a and b using mask, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := b[i+31:i] ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0
Performance
vbroadcastf128
__m256d _mm256_broadcast_pd (__m128d const * mem_addr)
Synopsis
__m256d _mm256_broadcast_pd (__m128d const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastf128 ymm, m128
CPUID Flags: AVX
Description
Broadcast 128 bits from memory (composed of 2 packed double-precision (64-bit) floating-point elements) to all elements of dst.
Operation
tmp[127:0] = MEM[mem_addr+127:mem_addr] dst[127:0] := tmp[127:0] dst[255:128] := tmp[127:0] dst[MAX:256] := 0
Performance
vbroadcastf128
__m256 _mm256_broadcast_ps (__m128 const * mem_addr)
Synopsis
__m256 _mm256_broadcast_ps (__m128 const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastf128 ymm, m128
CPUID Flags: AVX
Description
Broadcast 128 bits from memory (composed of 4 packed single-precision (32-bit) floating-point elements) to all elements of dst.
Operation
tmp[127:0] = MEM[mem_addr+127:mem_addr] dst[127:0] := tmp[127:0] dst[255:128] := tmp[127:0] dst[MAX:256] := 0
Performance
vbroadcastsd
__m256d _mm256_broadcast_sd (double const * mem_addr)
Synopsis
__m256d _mm256_broadcast_sd (double const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastsd ymm, m64
CPUID Flags: AVX
Description
Broadcast a double-precision (64-bit) floating-point element from memory to all elements of dst.
Operation
tmp[63:0] = MEM[mem_addr+63:mem_addr] FOR j := 0 to 3 i := j*64 dst[i+63:i] := tmp[63:0] ENDFOR dst[MAX:256] := 0
Performance
vbroadcastss
__m128 _mm_broadcast_ss (float const * mem_addr)
Synopsis
__m128 _mm_broadcast_ss (float const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastss xmm, m32
CPUID Flags: AVX
Description
Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.
Operation
tmp[31:0] = MEM[mem_addr+31:mem_addr] FOR j := 0 to 3 i := j*32 dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:128] := 0
vbroadcastss
__m256 _mm256_broadcast_ss (float const * mem_addr)
Synopsis
__m256 _mm256_broadcast_ss (float const * mem_addr)
#include «immintrin.h»
Instruction: vbroadcastss ymm, m32
CPUID Flags: AVX
Description
Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.
Operation
tmp[31:0] = MEM[mem_addr+31:mem_addr] FOR j := 0 to 7 i := j*32 dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:256] := 0
Performance
vpbroadcastb
__m128i _mm_broadcastb_epi8 (__m128i a)
Synopsis
__m128i _mm_broadcastb_epi8 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastb xmm, xmm
CPUID Flags: AVX2
Description
Broadcast the low packed 8-bit integer from a to all elements of dst.
Operation
FOR j := 0 to 15 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:128] := 0
Performance
vpbroadcastb
__m256i _mm256_broadcastb_epi8 (__m128i a)
Synopsis
__m256i _mm256_broadcastb_epi8 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastb ymm, xmm
CPUID Flags: AVX2
Description
Broadcast the low packed 8-bit integer from a to all elements of dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:256] := 0
Performance
vpbroadcastd
__m128i _mm_broadcastd_epi32 (__m128i a)
Synopsis
__m128i _mm_broadcastd_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastd xmm, xmm
CPUID Flags: AVX2
Description
Broadcast the low packed 32-bit integer from a to all elements of dst.
Operation
FOR j := 0 to 3 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:128] := 0
Performance
vpbroadcastd
__m256i _mm256_broadcastd_epi32 (__m128i a)
Synopsis
__m256i _mm256_broadcastd_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastd ymm, xmm
CPUID Flags: AVX2
Description
Broadcast the low packed 32-bit integer from a to all elements of dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0
Performance
vpbroadcastq
__m128i _mm_broadcastq_epi64 (__m128i a)
Synopsis
__m128i _mm_broadcastq_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastq xmm, xmm
CPUID Flags: AVX2
Description
Broadcast the low packed 64-bit integer from a to all elements of dst.
Operation
FOR j := 0 to 1 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:128] := 0
Performance
vpbroadcastq
__m256i _mm256_broadcastq_epi64 (__m128i a)
Synopsis
__m256i _mm256_broadcastq_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastq ymm, xmm
CPUID Flags: AVX2
Description
Broadcast the low packed 64-bit integer from a to all elements of dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0
Performance
movddup
__m128d _mm_broadcastsd_pd (__m128d a)
Synopsis
__m128d _mm_broadcastsd_pd (__m128d a)
#include «immintrin.h»
Instruction: movddup xmm, xmm
CPUID Flags: AVX2
Description
Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.
Operation
FOR j := 0 to 1 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:128] := 0
Performance
vbroadcastsd
__m256d _mm256_broadcastsd_pd (__m128d a)
Synopsis
__m256d _mm256_broadcastsd_pd (__m128d a)
#include «immintrin.h»
Instruction: vbroadcastsd ymm, xmm
CPUID Flags: AVX2
Description
Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0
Performance
vbroadcasti128
__m256i _mm256_broadcastsi128_si256 (__m128i a)
Synopsis
__m256i _mm256_broadcastsi128_si256 (__m128i a)
#include «immintrin.h»
Instruction: vbroadcasti128 ymm, m128
CPUID Flags: AVX2
Description
Broadcast 128 bits of integer data from a to all 128-bit lanes in dst.
Operation
dst[127:0] := a[127:0] dst[255:128] := a[127:0] dst[MAX:256] := 0
vbroadcastss
__m128 _mm_broadcastss_ps (__m128 a)
Synopsis
__m128 _mm_broadcastss_ps (__m128 a)
#include «immintrin.h»
Instruction: vbroadcastss xmm, xmm
CPUID Flags: AVX2
Description
Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.
Operation
FOR j := 0 to 3 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:128] := 0
Performance
vbroadcastss
__m256 _mm256_broadcastss_ps (__m128 a)
Synopsis
__m256 _mm256_broadcastss_ps (__m128 a)
#include «immintrin.h»
Instruction: vbroadcastss ymm, xmm
CPUID Flags: AVX2
Description
Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0
Performance
vpbroadcastw
__m128i _mm_broadcastw_epi16 (__m128i a)
Synopsis
__m128i _mm_broadcastw_epi16 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastw xmm, xmm
CPUID Flags: AVX2
Description
Broadcast the low packed 16-bit integer from a to all elements of dst.
Operation
FOR j := 0 to 7 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:128] := 0
Performance
vpbroadcastw
__m256i _mm256_broadcastw_epi16 (__m128i a)
Synopsis
__m256i _mm256_broadcastw_epi16 (__m128i a)
#include «immintrin.h»
Instruction: vpbroadcastw ymm, xmm
CPUID Flags: AVX2
Description
Broadcast the low packed 16-bit integer from a to all elements of dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:256] := 0
Performance
vpslldq
__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)
Synopsis
__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpslldq ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.
Operation
tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] << (tmp*8) dst[255:128] := a[255:128] << (tmp*8) dst[MAX:256] := 0
Performance
vpsrldq
__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)
Synopsis
__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpsrldq ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.
Operation
tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] >> (tmp*8) dst[255:128] := a[255:128] >> (tmp*8) dst[MAX:256] := 0
Performance
__m256 _mm256_castpd_ps (__m256d a)
Synopsis
__m256 _mm256_castpd_ps (__m256d a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Cast vector of type __m256d to type __m256. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castpd_si256 (__m256d a)
Synopsis
__m256i _mm256_castpd_si256 (__m256d a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Casts vector of type __m256d to type __m256i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castpd128_pd256 (__m128d a)
Synopsis
__m256d _mm256_castpd128_pd256 (__m128d a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Casts vector of type __m128d to type __m256d; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128d _mm256_castpd256_pd128 (__m256d a)
Synopsis
__m128d _mm256_castpd256_pd128 (__m256d a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Casts vector of type __m256d to type __m128d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castps_pd (__m256 a)
Synopsis
__m256d _mm256_castps_pd (__m256 a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Cast vector of type __m256 to type __m256d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castps_si256 (__m256 a)
Synopsis
__m256i _mm256_castps_si256 (__m256 a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Casts vector of type __m256 to type __m256i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256 _mm256_castps128_ps256 (__m128 a)
Synopsis
__m256 _mm256_castps128_ps256 (__m128 a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Casts vector of type __m128 to type __m256; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128 _mm256_castps256_ps128 (__m256 a)
Synopsis
__m128 _mm256_castps256_ps128 (__m256 a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Casts vector of type __m256 to type __m128. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castsi128_si256 (__m128i a)
Synopsis
__m256i _mm256_castsi128_si256 (__m128i a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Casts vector of type __m128i to type __m256i; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castsi256_pd (__m256i a)
Synopsis
__m256d _mm256_castsi256_pd (__m256i a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Casts vector of type __m256i to type __m256d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256 _mm256_castsi256_ps (__m256i a)
Synopsis
__m256 _mm256_castsi256_ps (__m256i a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Casts vector of type __m256i to type __m256. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128i _mm256_castsi256_si128 (__m256i a)
Synopsis
__m128i _mm256_castsi256_si128 (__m256i a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Casts vector of type __m256i to type __m128i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
vroundpd
__m256d _mm256_ceil_pd (__m256d a)
Synopsis
__m256d _mm256_ceil_pd (__m256d a)
#include «immintrin.h»
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX
Description
Round the packed double-precision (64-bit) floating-point elements in a up to an integer value, and store the results as packed double-precision floating-point elements in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := CEIL(a[i+63:i]) ENDFOR dst[MAX:256] := 0
Performance
vroundps
__m256 _mm256_ceil_ps (__m256 a)
Synopsis
__m256 _mm256_ceil_ps (__m256 a)
#include «immintrin.h»
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX
Description
Round the packed single-precision (32-bit) floating-point elements in a up to an integer value, and store the results as packed single-precision floating-point elements in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := CEIL(a[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vcmppd
__m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)
Synopsis
__m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)
#include «immintrin.h»
Instruction: vcmppd xmm, xmm, xmm, imm
CPUID Flags: AVX
Description
Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.
Operation
CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 1 i := j*64 dst[i+63:i] := ( a[i+63:i] OP b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:128] := 0
Performance
vcmppd
__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)
Synopsis
__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)
#include «immintrin.h»
Instruction: vcmppd ymm, ymm, ymm, imm
CPUID Flags: AVX
Description
Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.
Operation
CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] OP b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0
Performance
vcmpps
__m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)
Synopsis
__m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)
#include «immintrin.h»
Instruction: vcmpps xmm, xmm, xmm, imm
CPUID Flags: AVX
Description
Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.
Operation
CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 3 i := j*32 dst[i+31:i] := ( a[i+31:i] OP b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:128] := 0
Performance
vcmpps
__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)
Synopsis
__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)
#include «immintrin.h»
Instruction: vcmpps ymm, ymm, ymm, imm
CPUID Flags: AVX
Description
Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.
Operation
CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] OP b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0
Performance
vcmpsd
__m128d _mm_cmp_sd (__m128d a, __m128d b, const int imm8)
Synopsis
__m128d _mm_cmp_sd (__m128d a, __m128d b, const int imm8)
#include «immintrin.h»
Instruction: vcmpsd xmm, xmm, xmm, imm
CPUID Flags: AVX
Description
Compare the lower double-precision (64-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of dst, and copy the upper element from a to the upper element of dst.
Operation
CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC dst[63:0] := ( a[63:0] OP b[63:0] ) ? 0xFFFFFFFFFFFFFFFF : 0 dst[127:64] := a[127:64] dst[MAX:128] := 0
Performance
vcmpss
__m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)
Synopsis
__m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)
#include «immintrin.h»
Instruction: vcmpss xmm, xmm, xmm, imm
CPUID Flags: AVX
Description
Compare the lower single-precision (32-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst.
Operation
CASE (imm8[7:0]) OF 0: OP := _CMP_EQ_OQ 1: OP := _CMP_LT_OS 2: OP := _CMP_LE_OS 3: OP := _CMP_UNORD_Q 4: OP := _CMP_NEQ_UQ 5: OP := _CMP_NLT_US 6: OP := _CMP_NLE_US 7: OP := _CMP_ORD_Q 8: OP := _CMP_EQ_UQ 9: OP := _CMP_NGE_US 10: OP := _CMP_NGT_US 11: OP := _CMP_FALSE_OQ 12: OP := _CMP_NEQ_OQ 13: OP := _CMP_GE_OS 14: OP := _CMP_GT_OS 15: OP := _CMP_TRUE_UQ 16: OP := _CMP_EQ_OS 17: OP := _CMP_LT_OQ 18: OP := _CMP_LE_OQ 19: OP := _CMP_UNORD_S 20: OP := _CMP_NEQ_US 21: OP := _CMP_NLT_UQ 22: OP := _CMP_NLE_UQ 23: OP := _CMP_ORD_S 24: OP := _CMP_EQ_US 25: OP := _CMP_NGE_UQ 26: OP := _CMP_NGT_UQ 27: OP := _CMP_FALSE_OS 28: OP := _CMP_NEQ_OS 29: OP := _CMP_GE_OQ 30: OP := _CMP_GT_OQ 31: OP := _CMP_TRUE_US ESAC dst[31:0] := ( a[31:0] OP b[31:0] ) ? 0xFFFFFFFF : 0 dst[127:32] := a[127:32] dst[MAX:128] := 0
Performance
vpcmpeqw
__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpeqw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 16-bit integers in a and b for equality, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := ( a[i+15:i] == b[i+15:i] ) ? 0xFFFF : 0 ENDFOR dst[MAX:256] := 0
Performance
vpcmpeqd
__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpeqd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 32-bit integers in a and b for equality, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] == b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0
Performance
vpcmpeqq
__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpeqq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 64-bit integers in a and b for equality, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] == b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0
Performance
vpcmpeqb
__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpeqb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 8-bit integers in a and b for equality, and store the results in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := ( a[i+7:i] == b[i+7:i] ) ? 0xFF : 0 ENDFOR dst[MAX:256] := 0
Performance
vpcmpgtw
__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpgtw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 16-bit integers in a and b for greater-than, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := ( a[i+15:i] > b[i+15:i] ) ? 0xFFFF : 0 ENDFOR dst[MAX:256] := 0
Performance
vpcmpgtd
__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpgtd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 32-bit integers in a and b for greater-than, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := ( a[i+31:i] > b[i+31:i] ) ? 0xFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0
Performance
vpcmpgtq
__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpgtq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 64-bit integers in a and b for greater-than, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := ( a[i+63:i] > b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0 ENDFOR dst[MAX:256] := 0
Performance
vpcmpgtb
__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpcmpgtb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 8-bit integers in a and b for greater-than, and store the results in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := ( a[i+7:i] > b[i+7:i] ) ? 0xFF : 0 ENDFOR dst[MAX:256] := 0
Performance
vpmovsxwd
__m256i _mm256_cvtepi16_epi32 (__m128i a)
Synopsis
__m256i _mm256_cvtepi16_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxwd ymm, xmm
CPUID Flags: AVX2
Description
Sign extend packed 16-bit integers in a to packed 32-bit integers, and store the results in dst.
Operation
FOR j:= 0 to 7 i := 32*j k := 16*j dst[i+31:i] := SignExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0
Performance
vpmovsxwq
__m256i _mm256_cvtepi16_epi64 (__m128i a)
Synopsis
__m256i _mm256_cvtepi16_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxwq ymm, xmm
CPUID Flags: AVX2
Description
Sign extend packed 16-bit integers in a to packed 64-bit integers, and store the results in dst.
Operation
FOR j:= 0 to 3 i := 64*j k := 16*j dst[i+63:i] := SignExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0
Performance
vpmovsxdq
__m256i _mm256_cvtepi32_epi64 (__m128i a)
Synopsis
__m256i _mm256_cvtepi32_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxdq ymm, xmm
CPUID Flags: AVX2
Description
Sign extend packed 32-bit integers in a to packed 64-bit integers, and store the results in dst.
Operation
FOR j:= 0 to 3 i := 64*j k := 32*j dst[i+63:i] := SignExtend(a[k+31:k]) ENDFOR dst[MAX:256] := 0
Performance
vcvtdq2pd
__m256d _mm256_cvtepi32_pd (__m128i a)
Synopsis
__m256d _mm256_cvtepi32_pd (__m128i a)
#include «immintrin.h»
Instruction: vcvtdq2pd ymm, xmm
CPUID Flags: AVX
Description
Convert packed 32-bit integers in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*32 m := j*64 dst[m+63:m] := Convert_Int32_To_FP64(a[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vcvtdq2ps
__m256 _mm256_cvtepi32_ps (__m256i a)
Synopsis
__m256 _mm256_cvtepi32_ps (__m256i a)
#include «immintrin.h»
Instruction: vcvtdq2ps ymm, ymm
CPUID Flags: AVX
Description
Convert packed 32-bit integers in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.
Operation
FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_Int32_To_FP32(a[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vpmovsxbw
__m256i _mm256_cvtepi8_epi16 (__m128i a)
Synopsis
__m256i _mm256_cvtepi8_epi16 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxbw ymm, xmm
CPUID Flags: AVX2
Description
Sign extend packed 8-bit integers in a to packed 16-bit integers, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*8 l := j*16 dst[l+15:l] := SignExtend(a[i+7:i]) ENDFOR dst[MAX:256] := 0
Performance
vpmovsxbd
__m256i _mm256_cvtepi8_epi32 (__m128i a)
Synopsis
__m256i _mm256_cvtepi8_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxbd ymm, xmm
CPUID Flags: AVX2
Description
Sign extend packed 8-bit integers in a to packed 32-bit integers, and store the results in dst.
Operation
FOR j := 0 to 7 i := 32*j k := 8*j dst[i+31:i] := SignExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0
Performance
vpmovsxbq
__m256i _mm256_cvtepi8_epi64 (__m128i a)
Synopsis
__m256i _mm256_cvtepi8_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovsxbq ymm, xmm
CPUID Flags: AVX2
Description
Sign extend packed 8-bit integers in the low 8 bytes of a to packed 64-bit integers, and store the results in dst.
Operation
FOR j := 0 to 3 i := 64*j k := 8*j dst[i+63:i] := SignExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0
Performance
vpmovzxwd
__m256i _mm256_cvtepu16_epi32 (__m128i a)
Synopsis
__m256i _mm256_cvtepu16_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxwd ymm, xmm
CPUID Flags: AVX2
Description
Zero extend packed unsigned 16-bit integers in a to packed 32-bit integers, and store the results in dst.
Operation
FOR j := 0 to 7 i := 32*j k := 16*j dst[i+31:i] := ZeroExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0
Performance
vpmovzxwq
__m256i _mm256_cvtepu16_epi64 (__m128i a)
Synopsis
__m256i _mm256_cvtepu16_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxwq ymm, xmm
CPUID Flags: AVX2
Description
Zero extend packed unsigned 16-bit integers in a to packed 64-bit integers, and store the results in dst.
Operation
FOR j:= 0 to 3 i := 64*j k := 16*j dst[i+63:i] := ZeroExtend(a[k+15:k]) ENDFOR dst[MAX:256] := 0
Performance
vpmovzxdq
__m256i _mm256_cvtepu32_epi64 (__m128i a)
Synopsis
__m256i _mm256_cvtepu32_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxdq ymm, xmm
CPUID Flags: AVX2
Description
Zero extend packed unsigned 32-bit integers in a to packed 64-bit integers, and store the results in dst.
Operation
FOR j:= 0 to 3 i := 64*j k := 32*j dst[i+63:i] := ZeroExtend(a[k+31:k]) ENDFOR dst[MAX:256] := 0
Performance
vpmovzxbw
__m256i _mm256_cvtepu8_epi16 (__m128i a)
Synopsis
__m256i _mm256_cvtepu8_epi16 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxbw ymm, xmm
CPUID Flags: AVX2
Description
Zero extend packed unsigned 8-bit integers in a to packed 16-bit integers, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*8 l := j*16 dst[l+15:l] := ZeroExtend(a[i+7:i]) ENDFOR dst[MAX:256] := 0
Performance
vpmovzxbd
__m256i _mm256_cvtepu8_epi32 (__m128i a)
Synopsis
__m256i _mm256_cvtepu8_epi32 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxbd ymm, xmm
CPUID Flags: AVX2
Description
Zero extend packed unsigned 8-bit integers in a to packed 32-bit integers, and store the results in dst.
Operation
FOR j := 0 to 7 i := 32*j k := 8*j dst[i+31:i] := ZeroExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0
Performance
vpmovzxbq
__m256i _mm256_cvtepu8_epi64 (__m128i a)
Synopsis
__m256i _mm256_cvtepu8_epi64 (__m128i a)
#include «immintrin.h»
Instruction: vpmovzxbq ymm, xmm
CPUID Flags: AVX2
Description
Zero extend packed unsigned 8-bit integers in the low 8 byte sof a to packed 64-bit integers, and store the results in dst.
Operation
FOR j := 0 to 3 i := 64*j k := 8*j dst[i+63:i] := ZeroExtend(a[k+7:k]) ENDFOR dst[MAX:256] := 0
Performance
vcvtpd2dq
__m128i _mm256_cvtpd_epi32 (__m256d a)
Synopsis
__m128i _mm256_cvtpd_epi32 (__m256d a)
#include «immintrin.h»
Instruction: vcvtpd2dq xmm, ymm
CPUID Flags: AVX
Description
Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.
Operation
FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_Int32(a[k+63:k]) ENDFOR dst[MAX:128] := 0
Performance
vcvtpd2ps
__m128 _mm256_cvtpd_ps (__m256d a)
Synopsis
__m128 _mm256_cvtpd_ps (__m256d a)
#include «immintrin.h»
Instruction: vcvtpd2ps xmm, ymm
CPUID Flags: AVX
Description
Convert packed double-precision (64-bit) floating-point elements in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.
Operation
FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_FP32(a[k+63:k]) ENDFOR dst[MAX:128] := 0
Performance
vcvtps2dq
__m256i _mm256_cvtps_epi32 (__m256 a)
Synopsis
__m256i _mm256_cvtps_epi32 (__m256 a)
#include «immintrin.h»
Instruction: vcvtps2dq ymm, ymm
CPUID Flags: AVX
Description
Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.
Operation
FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_FP32_To_Int32(a[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vcvtps2pd
__m256d _mm256_cvtps_pd (__m128 a)
Synopsis
__m256d _mm256_cvtps_pd (__m128 a)
#include «immintrin.h»
Instruction: vcvtps2pd ymm, xmm
CPUID Flags: AVX
Description
Convert packed single-precision (32-bit) floating-point elements in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.
Operation
FOR j := 0 to 3 i := 64*j k := 32*j dst[i+63:i] := Convert_FP32_To_FP64(a[k+31:k]) ENDFOR dst[MAX:256] := 0
Performance
vcvttpd2dq
__m128i _mm256_cvttpd_epi32 (__m256d a)
Synopsis
__m128i _mm256_cvttpd_epi32 (__m256d a)
#include «immintrin.h»
Instruction: vcvttpd2dq xmm, ymm
CPUID Flags: AVX
Description
Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.
Operation
FOR j := 0 to 3 i := 32*j k := 64*j dst[i+31:i] := Convert_FP64_To_Int32_Truncate(a[k+63:k]) ENDFOR dst[MAX:128] := 0
Performance
vcvttps2dq
__m256i _mm256_cvttps_epi32 (__m256 a)
Synopsis
__m256i _mm256_cvttps_epi32 (__m256 a)
#include «immintrin.h»
Instruction: vcvttps2dq ymm, ymm
CPUID Flags: AVX
Description
Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.
Operation
FOR j := 0 to 7 i := 32*j dst[i+31:i] := Convert_FP32_To_Int32_Truncate(a[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vdivpd
__m256d _mm256_div_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_div_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vdivpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Divide packed double-precision (64-bit) floating-point elements in a by packed elements in b, and store the results in dst.
Operation
FOR j := 0 to 3 i := 64*j dst[i+63:i] := a[i+63:i] / b[i+63:i] ENDFOR dst[MAX:256] := 0
Performance
vdivps
__m256 _mm256_div_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_div_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vdivps ymm, ymm, ymm
CPUID Flags: AVX
Description
Divide packed single-precision (32-bit) floating-point elements in a by packed elements in b, and store the results in dst.
Operation
FOR j := 0 to 7 i := 32*j dst[i+31:i] := a[i+31:i] / b[i+31:i] ENDFOR dst[MAX:256] := 0
Performance
vdpps
__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)
Synopsis
__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)
#include «immintrin.h»
Instruction: vdpps ymm, ymm, ymm, imm
CPUID Flags: AVX
Description
Conditionally multiply the packed single-precision (32-bit) floating-point elements in a and b using the high 4 bits in imm8, sum the four products, and conditionally store the sum in dst using the low 4 bits of imm8.
Operation
DP(a[127:0], b[127:0], imm8[7:0]) { FOR j := 0 to 3 i := j*32 IF imm8[(4+j)%8] temp[i+31:i] := a[i+31:i] * b[i+31:i] ELSE temp[i+31:i] := 0 FI ENDFOR sum[31:0] := (temp[127:96] + temp[95:64]) + (temp[63:32] + temp[31:0]) FOR j := 0 to 3 i := j*32 IF imm8[j%8] tmpdst[i+31:i] := sum[31:0] ELSE tmpdst[i+31:i] := 0 FI ENDFOR RETURN tmpdst[127:0] } dst[127:0] := DP(a[127:0], b[127:0], imm8[7:0]) dst[255:128] := DP(a[255:128], b[255:128], imm8[7:0]) dst[MAX:256] := 0
Performance
…
__int16 _mm256_extract_epi16 (__m256i a, const int index)
Synopsis
__int16 _mm256_extract_epi16 (__m256i a, const int index)
#include «immintrin.h»
CPUID Flags: AVX
Description
Extract a 16-bit integer from a, selected with index, and store the result in dst.
Operation
dst[15:0] := (a[255:0] >> (index * 16))[15:0]
…
__int32 _mm256_extract_epi32 (__m256i a, const int index)
Synopsis
__int32 _mm256_extract_epi32 (__m256i a, const int index)
#include «immintrin.h»
CPUID Flags: AVX
Description
Extract a 32-bit integer from a, selected with index, and store the result in dst.
Operation
dst[31:0] := (a[255:0] >> (index * 32))[31:0]
…
__int64 _mm256_extract_epi64 (__m256i a, const int index)
Synopsis
__int64 _mm256_extract_epi64 (__m256i a, const int index)
#include «immintrin.h»
CPUID Flags: AVX
Description
Extract a 64-bit integer from a, selected with index, and store the result in dst.
Operation
dst[63:0] := (a[255:0] >> (index * 64))[63:0]
…
__int8 _mm256_extract_epi8 (__m256i a, const int index)
Synopsis
__int8 _mm256_extract_epi8 (__m256i a, const int index)
#include «immintrin.h»
CPUID Flags: AVX
Description
Extract an 8-bit integer from a, selected with index, and store the result in dst.
Operation
dst[7:0] := (a[255:0] >> (index * 8))[7:0]
vextractf128
__m128d _mm256_extractf128_pd (__m256d a, const int imm8)
Synopsis
__m128d _mm256_extractf128_pd (__m256d a, const int imm8)
#include «immintrin.h»
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX
Description
Extract 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a, selected with imm8, and store the result in dst.
Operation
CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0
Performance
vextractf128
__m128 _mm256_extractf128_ps (__m256 a, const int imm8)
Synopsis
__m128 _mm256_extractf128_ps (__m256 a, const int imm8)
#include «immintrin.h»
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX
Description
Extract 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a, selected with imm8, and store the result in dst.
Operation
CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0
Performance
vextractf128
__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)
Synopsis
__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX
Description
Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.
Operation
CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0
Performance
vextracti128
__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)
Synopsis
__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vextracti128 xmm, ymm, imm
CPUID Flags: AVX2
Description
Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.
Operation
CASE imm8[7:0] of 0: dst[127:0] := a[127:0] 1: dst[127:0] := a[255:128] ESAC dst[MAX:128] := 0
Performance
vroundpd
__m256d _mm256_floor_pd (__m256d a)
Synopsis
__m256d _mm256_floor_pd (__m256d a)
#include «immintrin.h»
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX
Description
Round the packed double-precision (64-bit) floating-point elements in a down to an integer value, and store the results as packed double-precision floating-point elements in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := FLOOR(a[i+63:i]) ENDFOR dst[MAX:256] := 0
Performance
vroundps
__m256 _mm256_floor_ps (__m256 a)
Synopsis
__m256 _mm256_floor_ps (__m256 a)
#include «immintrin.h»
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX
Description
Round the packed single-precision (32-bit) floating-point elements in a down to an integer value, and store the results as packed single-precision floating-point elements in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := FLOOR(a[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vphaddw
__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphaddw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Horizontally add adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.
Operation
dst[15:0] := a[31:16] + a[15:0] dst[31:16] := a[63:48] + a[47:32] dst[47:32] := a[95:80] + a[79:64] dst[63:48] := a[127:112] + a[111:96] dst[79:64] := b[31:16] + b[15:0] dst[95:80] := b[63:48] + b[47:32] dst[111:96] := b[95:80] + b[79:64] dst[127:112] := b[127:112] + b[111:96] dst[143:128] := a[159:144] + a[143:128] dst[159:144] := a[191:176] + a[175:160] dst[175:160] := a[223:208] + a[207:192] dst[191:176] := a[255:240] + a[239:224] dst[207:192] := b[127:112] + b[143:128] dst[223:208] := b[159:144] + b[175:160] dst[239:224] := b[191:176] + b[207:192] dst[255:240] := b[223:208] + b[239:224] dst[MAX:256] := 0
Performance
vphaddd
__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphaddd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Horizontally add adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.
Operation
dst[31:0] := a[63:32] + a[31:0] dst[63:32] := a[127:96] + a[95:64] dst[95:64] := b[63:32] + b[31:0] dst[127:96] := b[127:96] + b[95:64] dst[159:128] := a[191:160] + a[159:128] dst[191:160] := a[255:224] + a[223:192] dst[223:192] := b[191:160] + b[159:128] dst[255:224] := b[255:224] + b[223:192] dst[MAX:256] := 0
Performance
vhaddpd
__m256d _mm256_hadd_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_hadd_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vhaddpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Horizontally add adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.
Operation
dst[63:0] := a[127:64] + a[63:0] dst[127:64] := b[127:64] + b[63:0] dst[191:128] := a[255:192] + a[191:128] dst[255:192] := b[255:192] + b[191:128] dst[MAX:256] := 0
Performance
vhaddps
__m256 _mm256_hadd_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_hadd_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vhaddps ymm, ymm, ymm
CPUID Flags: AVX
Description
Horizontally add adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.
Operation
dst[31:0] := a[63:32] + a[31:0] dst[63:32] := a[127:96] + a[95:64] dst[95:64] := b[63:32] + b[31:0] dst[127:96] := b[127:96] + b[95:64] dst[159:128] := a[191:160] + a[159:128] dst[191:160] := a[255:224] + a[223:192] dst[223:192] := b[191:160] + b[159:128] dst[255:224] := b[255:224] + b[223:192] dst[MAX:256] := 0
Performance
vphaddsw
__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphaddsw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Horizontally add adjacent pairs of 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.
Operation
dst[15:0]= Saturate_To_Int16(a[31:16] + a[15:0]) dst[31:16] = Saturate_To_Int16(a[63:48] + a[47:32]) dst[47:32] = Saturate_To_Int16(a[95:80] + a[79:64]) dst[63:48] = Saturate_To_Int16(a[127:112] + a[111:96]) dst[79:64] = Saturate_To_Int16(b[31:16] + b[15:0]) dst[95:80] = Saturate_To_Int16(b[63:48] + b[47:32]) dst[111:96] = Saturate_To_Int16(b[95:80] + b[79:64]) dst[127:112] = Saturate_To_Int16(b[127:112] + b[111:96]) dst[143:128] = Saturate_To_Int16(a[159:144] + a[143:128]) dst[159:144] = Saturate_To_Int16(a[191:176] + a[175:160]) dst[175:160] = Saturate_To_Int16( a[223:208] + a[207:192]) dst[191:176] = Saturate_To_Int16(a[255:240] + a[239:224]) dst[207:192] = Saturate_To_Int16(b[127:112] + b[143:128]) dst[223:208] = Saturate_To_Int16(b[159:144] + b[175:160]) dst[239:224] = Saturate_To_Int16(b[191-160] + b[159-128]) dst[255:240] = Saturate_To_Int16(b[255:240] + b[239:224]) dst[MAX:256] := 0
Performance
vphsubw
__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphsubw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Horizontally subtract adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.
Operation
dst[15:0] := a[15:0] — a[31:16] dst[31:16] := a[47:32] — a[63:48] dst[47:32] := a[79:64] — a[95:80] dst[63:48] := a[111:96] — a[127:112] dst[79:64] := b[15:0] — b[31:16] dst[95:80] := b[47:32] — b[63:48] dst[111:96] := b[79:64] — b[95:80] dst[127:112] := b[111:96] — b[127:112] dst[143:128] := a[143:128] — a[159:144] dst[159:144] := a[175:160] — a[191:176] dst[175:160] := a[207:192] — a[223:208] dst[191:176] := a[239:224] — a[255:240] dst[207:192] := b[143:128] — b[159:144] dst[223:208] := b[175:160] — b[191:176] dst[239:224] := b[207:192] — b[223:208] dst[255:240] := b[239:224] — b[255:240] dst[MAX:256] := 0
Performance
vphsubd
__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphsubd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Horizontally subtract adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.
Operation
dst[31:0] := a[31:0] — a[63:32] dst[63:32] := a[95:64] — a[127:96] dst[95:64] := b[31:0] — b[63:32] dst[127:96] := b[95:64] — b[127:96] dst[159:128] := a[159:128] — a[191:160] dst[191:160] := a[223:192] — a[255:224] dst[223:192] := b[159:128] — b[191:160] dst[255:224] := b[223:192] — b[255:224] dst[MAX:256] := 0
Performance
vhsubpd
__m256d _mm256_hsub_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_hsub_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vhsubpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Horizontally subtract adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.
Operation
dst[63:0] := a[63:0] — a[127:64] dst[127:64] := b[63:0] — b[127:64] dst[191:128] := a[191:128] — a[255:192] dst[255:192] := b[191:128] — b[255:192] dst[MAX:256] := 0
Performance
vhsubps
__m256 _mm256_hsub_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_hsub_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vhsubps ymm, ymm, ymm
CPUID Flags: AVX
Description
Horizontally add adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.
Operation
dst[31:0] := a[31:0] — a[63:32] dst[63:32] := a[95:64] — a[127:96] dst[95:64] := b[31:0] — b[63:32] dst[127:96] := b[95:64] — b[127:96] dst[159:128] := a[159:128] — a[191:160] dst[191:160] := a[223:192] — a[255:224] dst[223:192] := b[159:128] — b[191:160] dst[255:224] := b[223:192] — b[255:224] dst[MAX:256] := 0
Performance
vphsubsw
__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vphsubsw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Horizontally subtract adjacent pairs of 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.
Operation
dst[15:0]= Saturate_To_Int16(a[15:0] — a[31:16]) dst[31:16] = Saturate_To_Int16(a[47:32] — a[63:48]) dst[47:32] = Saturate_To_Int16(a[79:64] — a[95:80]) dst[63:48] = Saturate_To_Int16(a[111:96] — a[127:112]) dst[79:64] = Saturate_To_Int16(b[15:0] — b[31:16]) dst[95:80] = Saturate_To_Int16(b[47:32] — b[63:48]) dst[111:96] = Saturate_To_Int16(b[79:64] — b[95:80]) dst[127:112] = Saturate_To_Int16(b[111:96] — b[127:112]) dst[143:128]= Saturate_To_Int16(a[143:128] — a[159:144]) dst[159:144] = Saturate_To_Int16(a[175:160] — a[191:176]) dst[175:160] = Saturate_To_Int16(a[207:192] — a[223:208]) dst[191:176] = Saturate_To_Int16(a[239:224] — a[255:240]) dst[207:192] = Saturate_To_Int16(b[143:128] — b[159:144]) dst[223:208] = Saturate_To_Int16(b[175:160] — b[191:176]) dst[239:224] = Saturate_To_Int16(b[207:192] — b[223:208]) dst[255:240] = Saturate_To_Int16(b[239:224] — b[255:240]) dst[MAX:256] := 0
Performance
vpgatherdd
__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
Synopsis
__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherdd xmm, vm32x, xmm
CPUID Flags: AVX2
Description
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*32 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] ENDFOR dst[MAX:128] := 0
Performance
vpgatherdd
__m128i _mm_mask_i32gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128imask, const int scale)
Synopsis
__m128i _mm_mask_i32gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherdd xmm, vm32x, xmm
CPUID Flags: AVX2
Description
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from srcwhen the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0
Performance
vpgatherdd
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
Synopsis
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherdd ymm, vm32x, ymm
CPUID Flags: AVX2
Description
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] ENDFOR dst[MAX:256] := 0
Performance
vpgatherdd
__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256imask, const int scale)
Synopsis
__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherdd ymm, vm32x, ymm
CPUID Flags: AVX2
Description
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from srcwhen the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0
Performance
vpgatherdq
__m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
Synopsis
__m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherdq xmm, vm32x, xmm
CPUID Flags: AVX2
Description
Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*64 m := j*32 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] ENDFOR dst[MAX:128] := 0
Performance
vpgatherdq
__m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128imask, const int scale)
Synopsis
__m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherdq xmm, vm32x, xmm
CPUID Flags: AVX2
Description
Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from srcwhen the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*64 m := j*32 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0
Performance
vpgatherdq
__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
Synopsis
__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherdq ymm, vm32x, ymm
CPUID Flags: AVX2
Description
Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*64 m := j*32 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] ENDFOR dst[MAX:256] := 0
Performance
vpgatherdq
__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)
Synopsis
__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherdq ymm, vm32x, ymm
CPUID Flags: AVX2
Description
Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from srcwhen the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*64 m := j*32 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0
Performance
vgatherdpd
__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
Synopsis
__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherdpd xmm, vm32x, xmm
CPUID Flags: AVX2
Description
Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scaleshould be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*64 m := j*32 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] ENDFOR dst[MAX:128] := 0
Performance
vgatherdpd
__m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128dmask, const int scale)
Synopsis
__m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
#include «immintrin.h»
Instruction: vgatherdpd xmm, vm32x, xmm
CPUID Flags: AVX2
Description
Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*64 m := j*32 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0
Performance
vgatherdpd
__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
Synopsis
__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherdpd ymm, vm32x, ymm
CPUID Flags: AVX2
Description
Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scaleshould be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*64 m := j*32 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] ENDFOR dst[MAX:256] := 0
Performance
vgatherdpd
__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256dmask, const int scale)
Synopsis
__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256d mask, const int scale)
#include «immintrin.h»
Instruction: vgatherdpd ymm, vm32x, ymm
CPUID Flags: AVX2
Description
Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*64 m := j*32 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0
Performance
vgatherdps
__m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)
Synopsis
__m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherdps xmm, vm32x, xmm
CPUID Flags: AVX2
Description
Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scaleshould be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*32 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] ENDFOR dst[MAX:128] := 0
Performance
vgatherdps
__m128 _mm_mask_i32gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
Synopsis
__m128 _mm_mask_i32gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
#include «immintrin.h»
Instruction: vgatherdps xmm, vm32x, xmm
CPUID Flags: AVX2
Description
Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0
Performance
vgatherdps
__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)
Synopsis
__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherdps ymm, vm32x, ymm
CPUID Flags: AVX2
Description
Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scaleshould be 1, 2, 4 or 8.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] ENDFOR dst[MAX:256] := 0
Performance
vgatherdps
__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256mask, const int scale)
Synopsis
__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256 mask, const int scale)
#include «immintrin.h»
Instruction: vgatherdps ymm, vm32x, ymm
CPUID Flags: AVX2
Description
Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0
Performance
vpgatherqd
__m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
Synopsis
__m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherqd xmm, vm64x, xmm
CPUID Flags: AVX2
Description
Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*32 m := j*64 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:64] := 0
Performance
vpgatherqd
__m128i _mm_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128imask, const int scale)
Synopsis
__m128i _mm_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherqd xmm, vm64x, xmm
CPUID Flags: AVX2
Description
Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from srcwhen the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*32 m := j*64 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:64] := 0 dst[MAX:64] := 0
Performance
vpgatherqd
__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
Synopsis
__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherqd ymm, vm64x, ymm
CPUID Flags: AVX2
Description
Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*32 m := j*64 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:128] := 0
vpgatherqd
__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128imask, const int scale)
Synopsis
__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherqd ymm, vm64x, ymm
CPUID Flags: AVX2
Description
Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from srcwhen the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*32 m := j*64 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0
vpgatherqq
__m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
Synopsis
__m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherqq xmm, vm64x, xmm
CPUID Flags: AVX2
Description
Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*64 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:128] := 0
Performance
vpgatherqq
__m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128imask, const int scale)
Synopsis
__m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherqq xmm, vm64x, xmm
CPUID Flags: AVX2
Description
Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from srcwhen the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0
Performance
vpgatherqq
__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)
Synopsis
__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vpgatherqq ymm, vm64x, ymm
CPUID Flags: AVX2
Description
Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:256] := 0
Performance
vpgatherqq
__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)
Synopsis
__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include «immintrin.h»
Instruction: vpgatherqq ymm, vm64x, ymm
CPUID Flags: AVX2
Description
Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from srcwhen the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0
Performance
vgatherqpd
__m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)
Synopsis
__m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherqpd xmm, vm64x, xmm
CPUID Flags: AVX2
Description
Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scaleshould be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*64 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:128] := 0
Performance
vgatherqpd
__m128d _mm_mask_i64gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128dmask, const int scale)
Synopsis
__m128d _mm_mask_i64gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
#include «immintrin.h»
Instruction: vgatherqpd xmm, vm64x, xmm
CPUID Flags: AVX2
Description
Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0
Performance
vgatherqpd
__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)
Synopsis
__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherqpd ymm, vm64x, ymm
CPUID Flags: AVX2
Description
Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scaleshould be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:256] := 0
Performance
vgatherqpd
__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256dmask, const int scale)
Synopsis
__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256d mask, const int scale)
#include «immintrin.h»
Instruction: vgatherqpd ymm, vm64x, ymm
CPUID Flags: AVX2
Description
Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+63] := 0 ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR mask[MAX:256] := 0 dst[MAX:256] := 0
Performance
vgatherqps
__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)
Synopsis
__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherqps xmm, vm64x, xmm
CPUID Flags: AVX2
Description
Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scaleshould be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*32 m := j*64 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:64] := 0
Performance
vgatherqps
__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
Synopsis
__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
#include «immintrin.h»
Instruction: vgatherqps xmm, vm64x, xmm
CPUID Flags: AVX2
Description
Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 1 i := j*32 m := j*64 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:64] := 0 dst[MAX:64] := 0
Performance
vgatherqps
__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)
Synopsis
__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)
#include «immintrin.h»
Instruction: vgatherqps ymm, vm64x, ymm
CPUID Flags: AVX2
Description
Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scaleshould be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*32 m := j*64 dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] ENDFOR dst[MAX:128] := 0
Performance
vgatherqps
__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128mask, const int scale)
Synopsis
__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128 mask, const int scale)
#include «immintrin.h»
Instruction: vgatherqps ymm, vm64x, ymm
CPUID Flags: AVX2
Description
Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3 i := j*32 m := j*64 IF mask[i+31] dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale] mask[i+31] := 0 ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR mask[MAX:128] := 0 dst[MAX:128] := 0
Performance
…
__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)
Synopsis
__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)
#include «immintrin.h»
CPUID Flags: AVX
Description
Copy a to dst, and insert the 16-bit integer i into dst at the location specified by index.
Operation
dst[255:0] := a[255:0] sel := index*16 dst[sel+15:sel] := i[15:0]
…
__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)
Synopsis
__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)
#include «immintrin.h»
CPUID Flags: AVX
Description
Copy a to dst, and insert the 32-bit integer i into dst at the location specified by index.
Operation
dst[255:0] := a[255:0] sel := index*32 dst[sel+31:sel] := i[31:0]
…
__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)
Synopsis
__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)
#include «immintrin.h»
CPUID Flags: AVX
Description
Copy a to dst, and insert the 64-bit integer i into dst at the location specified by index.
Operation
dst[255:0] := a[255:0] sel := index*64 dst[sel+63:sel] := i[63:0]
…
__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)
Synopsis
__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)
#include «immintrin.h»
CPUID Flags: AVX
Description
Copy a to dst, and insert the 8-bit integer i into dst at the location specified by index.
Operation
dst[255:0] := a[255:0] sel := index*8 dst[sel+7:sel] := i[7:0]
vinsertf128
__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)
Synopsis
__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX
Description
Copy a to dst, then insert 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from b into dst at the location specified by imm8.
Operation
dst[255:0] := a[255:0] CASE imm8[7:0] of 0: dst[127:0] := b[127:0] 1: dst[255:128] := b[127:0] ESAC dst[MAX:256] := 0
Performance
vinsertf128
__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)
Synopsis
__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX
Description
Copy a to dst, then insert 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from b into dst at the location specified by imm8.
Operation
dst[255:0] := a[255:0] CASE (imm8[1:0]) of 0: dst[127:0] := b[127:0] 1: dst[255:128] := b[127:0] ESAC dst[MAX:256] := 0
Performance
vinsertf128
__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)
Synopsis
__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX
Description
Copy a to dst, then insert 128 bits from b into dst at the location specified by imm8.
Operation
dst[255:0] := a[255:0] CASE (imm8[1:0]) of 0: dst[127:0] := b[127:0] 1: dst[255:128] := b[127:0] ESAC dst[MAX:256] := 0
Performance
vinserti128
__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)
Synopsis
__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)
#include «immintrin.h»
Instruction: vinserti128 ymm, ymm, xmm, imm
CPUID Flags: AVX2
Description
Copy a to dst, then insert 128 bits (composed of integer data) from b into dst at the location specified by imm8.
Operation
dst[255:0] := a[255:0] CASE (imm8[1:0]) of 0: dst[127:0] := b[127:0] 1: dst[255:128] := b[127:0] ESAC dst[MAX:256] := 0
Performance
vlddqu
__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
Synopsis
__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
#include «immintrin.h»
Instruction: vlddqu ymm, m256
CPUID Flags: AVX
Description
Load 256-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary.
Operation
dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovapd
__m256d _mm256_load_pd (double const * mem_addr)
Synopsis
__m256d _mm256_load_pd (double const * mem_addr)
#include «immintrin.h»
Instruction: vmovapd ymm, m256
CPUID Flags: AVX
Description
Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Operation
dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovaps
__m256 _mm256_load_ps (float const * mem_addr)
Synopsis
__m256 _mm256_load_ps (float const * mem_addr)
#include «immintrin.h»
Instruction: vmovaps ymm, m256
CPUID Flags: AVX
Description
Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Operation
dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovdqa
__m256i _mm256_load_si256 (__m256i const * mem_addr)
Synopsis
__m256i _mm256_load_si256 (__m256i const * mem_addr)
#include «immintrin.h»
Instruction: vmovdqa ymm, m256
CPUID Flags: AVX
Description
Load 256-bits of integer data from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Operation
dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovupd
__m256d _mm256_loadu_pd (double const * mem_addr)
Synopsis
__m256d _mm256_loadu_pd (double const * mem_addr)
#include «immintrin.h»
Instruction: vmovupd ymm, m256
CPUID Flags: AVX
Description
Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.
Operation
dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovups
__m256 _mm256_loadu_ps (float const * mem_addr)
Synopsis
__m256 _mm256_loadu_ps (float const * mem_addr)
#include «immintrin.h»
Instruction: vmovups ymm, m256
CPUID Flags: AVX
Description
Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.
Operation
dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovdqu
__m256i _mm256_loadu_si256 (__m256i const * mem_addr)
Synopsis
__m256i _mm256_loadu_si256 (__m256i const * mem_addr)
#include «immintrin.h»
Instruction: vmovdqu ymm, m256
CPUID Flags: AVX
Description
Load 256-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.
Operation
dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
…
__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)
Synopsis
__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)
#include «immintrin.h»
CPUID Flags: AVX
Description
Load two 128-bit values (composed of 4 packed single-precision (32-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.
Operation
dst[127:0] := MEM[loaddr+127:loaddr] dst[255:128] := MEM[hiaddr+127:hiaddr] dst[MAX:256] := 0
…
__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)
Synopsis
__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)
#include «immintrin.h»
CPUID Flags: AVX
Description
Load two 128-bit values (composed of 2 packed double-precision (64-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.
Operation
dst[127:0] := MEM[loaddr+127:loaddr] dst[255:128] := MEM[hiaddr+127:hiaddr] dst[MAX:256] := 0
…
__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)
Synopsis
__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)
#include «immintrin.h»
CPUID Flags: AVX
Description
Load two 128-bit values (composed of integer data) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.
Operation
dst[127:0] := MEM[loaddr+127:loaddr] dst[255:128] := MEM[hiaddr+127:hiaddr] dst[MAX:256] := 0
vpmaddwd
__m256i _mm256_madd_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_madd_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaddwd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i+16]*b[i+31:i+16] + a[i+15:i]*b[i+15:i] ENDFOR dst[MAX:256] := 0
Performance
vpmaddubsw
__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaddubsw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Vertically multiply each unsigned 8-bit integer from a with the corresponding signed 8-bit integer from b, producing intermediate signed 16-bit integers. Horizontally add adjacent pairs of intermediate signed 16-bit integers, and pack the saturated results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_Int16( a[i+15:i+8]*b[i+15:i+8] + a[i+7:i]*b[i+7:i] ) ENDFOR dst[MAX:256] := 0
Performance
vpmaskmovd
__m128i _mm_maskload_epi32 (int const* mem_addr, __m128i mask)
Synopsis
__m128i _mm_maskload_epi32 (int const* mem_addr, __m128i mask)
#include «immintrin.h»
Instruction: vpmaskmovd xmm, xmm, m128
CPUID Flags: AVX2
Description
Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).
Operation
FOR j := 0 to 3 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:128] := 0
Performance
vpmaskmovd
__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)
Synopsis
__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)
#include «immintrin.h»
Instruction: vpmaskmovd ymm, ymm, m256
CPUID Flags: AVX2
Description
Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).
Operation
FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:256] := 0
Performance
vpmaskmovq
__m128i _mm_maskload_epi64 (__int64 const* mem_addr, __m128i mask)
Synopsis
__m128i _mm_maskload_epi64 (__int64 const* mem_addr, __m128i mask)
#include «immintrin.h»
Instruction: vpmaskmovq xmm, xmm, m128
CPUID Flags: AVX2
Description
Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).
Operation
FOR j := 0 to 1 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:128] := 0
Performance
vpmaskmovq
__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)
Synopsis
__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)
#include «immintrin.h»
Instruction: vpmaskmovq ymm, ymm, m256
CPUID Flags: AVX2
Description
Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).
Operation
FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:256] := 0
Performance
vmaskmovpd
__m128d _mm_maskload_pd (double const * mem_addr, __m128i mask)
Synopsis
__m128d _mm_maskload_pd (double const * mem_addr, __m128i mask)
#include «immintrin.h»
Instruction: vmaskmovpd xmm, xmm, m128
CPUID Flags: AVX
Description
Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).
Operation
FOR j := 0 to 1 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:128] := 0
Performance
vmaskmovpd
__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)
Synopsis
__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)
#include «immintrin.h»
Instruction: vmaskmovpd ymm, ymm, m256
CPUID Flags: AVX
Description
Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).
Operation
FOR j := 0 to 3 i := j*64 IF mask[i+63] dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:256] := 0
Performance
vmaskmovps
__m128 _mm_maskload_ps (float const * mem_addr, __m128i mask)
Synopsis
__m128 _mm_maskload_ps (float const * mem_addr, __m128i mask)
#include «immintrin.h»
Instruction: vmaskmovps xmm, xmm, m128
CPUID Flags: AVX
Description
Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).
Operation
FOR j := 0 to 3 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:128] := 0
Performance
vmaskmovps
__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)
Synopsis
__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)
#include «immintrin.h»
Instruction: vmaskmovps ymm, ymm, m256
CPUID Flags: AVX
Description
Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).
Operation
FOR j := 0 to 7 i := j*32 IF mask[i+31] dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:256] := 0
Performance
vpmaskmovd
void _mm_maskstore_epi32 (int* mem_addr, __m128i mask, __m128i a)
Synopsis
void _mm_maskstore_epi32 (int* mem_addr, __m128i mask, __m128i a)
#include «immintrin.h»
Instruction: vpmaskmovd m128, xmm, xmm
CPUID Flags: AVX2
Description
Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).
Operation
FOR j := 0 to 3 i := j*32 IF mask[i+31] MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i] FI ENDFOR
Performance
vpmaskmovd
void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)
Synopsis
void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)
#include «immintrin.h»
Instruction: vpmaskmovd m256, ymm, ymm
CPUID Flags: AVX2
Description
Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).
Operation
FOR j := 0 to 7 i := j*32 IF mask[i+31] MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i] FI ENDFOR
Performance
vpmaskmovq
void _mm_maskstore_epi64 (__int64* mem_addr, __m128i mask, __m128i a)
Synopsis
void _mm_maskstore_epi64 (__int64* mem_addr, __m128i mask, __m128i a)
#include «immintrin.h»
Instruction: vpmaskmovq m128, xmm, xmm
CPUID Flags: AVX2
Description
Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).
Operation
FOR j := 0 to 1 i := j*64 IF mask[i+63] MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i] FI ENDFOR
Performance
vpmaskmovq
void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)
Synopsis
void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)
#include «immintrin.h»
Instruction: vpmaskmovq m256, ymm, ymm
CPUID Flags: AVX2
Description
Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).
Operation
FOR j := 0 to 3 i := j*64 IF mask[i+63] MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i] FI ENDFOR
Performance
vmaskmovpd
void _mm_maskstore_pd (double * mem_addr, __m128i mask, __m128d a)
Synopsis
void _mm_maskstore_pd (double * mem_addr, __m128i mask, __m128d a)
#include «immintrin.h»
Instruction: vmaskmovpd m128, xmm, xmm
CPUID Flags: AVX
Description
Store packed double-precision (64-bit) floating-point elements from a into memory using mask.
Operation
FOR j := 0 to 1 i := j*64 IF mask[i+63] MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i] FI ENDFOR
Performance
vmaskmovpd
void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)
Synopsis
void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)
#include «immintrin.h»
Instruction: vmaskmovpd m256, ymm, ymm
CPUID Flags: AVX
Description
Store packed double-precision (64-bit) floating-point elements from a into memory using mask.
Operation
FOR j := 0 to 3 i := j*64 IF mask[i+63] MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i] FI ENDFOR
Performance
vmaskmovps
void _mm_maskstore_ps (float * mem_addr, __m128i mask, __m128 a)
Synopsis
void _mm_maskstore_ps (float * mem_addr, __m128i mask, __m128 a)
#include «immintrin.h»
Instruction: vmaskmovps m128, xmm, xmm
CPUID Flags: AVX
Description
Store packed single-precision (32-bit) floating-point elements from a into memory using mask.
Operation
FOR j := 0 to 3 i := j*32 IF mask[i+31] MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i] FI ENDFOR
Performance
vmaskmovps
void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)
Synopsis
void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)
#include «immintrin.h»
Instruction: vmaskmovps m256, ymm, ymm
CPUID Flags: AVX
Description
Store packed single-precision (32-bit) floating-point elements from a into memory using mask.
Operation
FOR j := 0 to 7 i := j*32 IF mask[i+31] MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i] FI ENDFOR
Performance
vpmaxsw
__m256i _mm256_max_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_max_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxsw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 16-bit integers in a and b, and store packed maximum values in dst.
Operation
FOR j := 0 to 15 i := j*16 IF a[i+15:i] > b[i+15:i] dst[i+15:i] := a[i+15:i] ELSE dst[i+15:i] := b[i+15:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpmaxsd
__m256i _mm256_max_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_max_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxsd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 32-bit integers in a and b, and store packed maximum values in dst.
Operation
FOR j := 0 to 7 i := j*32 IF a[i+31:i] > b[i+31:i] dst[i+31:i] := a[i+31:i] ELSE dst[i+31:i] := b[i+31:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpmaxsb
__m256i _mm256_max_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_max_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxsb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 8-bit integers in a and b, and store packed maximum values in dst.
Operation
FOR j := 0 to 31 i := j*8 IF a[i+7:i] > b[i+7:i] dst[i+7:i] := a[i+7:i] ELSE dst[i+7:i] := b[i+7:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpmaxuw
__m256i _mm256_max_epu16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_max_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxuw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed unsigned 16-bit integers in a and b, and store packed maximum values in dst.
Operation
FOR j := 0 to 15 i := j*16 IF a[i+15:i] > b[i+15:i] dst[i+15:i] := a[i+15:i] ELSE dst[i+15:i] := b[i+15:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpmaxud
__m256i _mm256_max_epu32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_max_epu32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxud ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed unsigned 32-bit integers in a and b, and store packed maximum values in dst.
Operation
FOR j := 0 to 7 i := j*32 IF a[i+31:i] > b[i+31:i] dst[i+31:i] := a[i+31:i] ELSE dst[i+31:i] := b[i+31:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpmaxub
__m256i _mm256_max_epu8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_max_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmaxub ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed unsigned 8-bit integers in a and b, and store packed maximum values in dst.
Operation
FOR j := 0 to 31 i := j*8 IF a[i+7:i] > b[i+7:i] dst[i+7:i] := a[i+7:i] ELSE dst[i+7:i] := b[i+7:i] FI ENDFOR dst[MAX:256] := 0
Performance
vmaxpd
__m256d _mm256_max_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_max_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vmaxpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed maximum values in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := MAX(a[i+63:i], b[i+63:i]) ENDFOR dst[MAX:256] := 0
Performance
vmaxps
__m256 _mm256_max_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_max_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vmaxps ymm, ymm, ymm
CPUID Flags: AVX
Description
Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed maximum values in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := MAX(a[i+31:i], b[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vpminsw
__m256i _mm256_min_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_min_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminsw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 16-bit integers in a and b, and store packed minimum values in dst.
Operation
FOR j := 0 to 15 i := j*16 IF a[i+15:i] < b[i+15:i] dst[i+15:i] := a[i+15:i] ELSE dst[i+15:i] := b[i+15:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpminsd
__m256i _mm256_min_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_min_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminsd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 32-bit integers in a and b, and store packed minimum values in dst.
Operation
FOR j := 0 to 7 i := j*32 IF a[i+31:i] < b[i+31:i] dst[i+31:i] := a[i+31:i] ELSE dst[i+31:i] := b[i+31:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpminsb
__m256i _mm256_min_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_min_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminsb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed 8-bit integers in a and b, and store packed minimum values in dst.
Operation
FOR j := 0 to 31 i := j*8 IF a[i+7:i] < b[i+7:i] dst[i+7:i] := a[i+7:i] ELSE dst[i+7:i] := b[i+7:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpminuw
__m256i _mm256_min_epu16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_min_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminuw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed unsigned 16-bit integers in a and b, and store packed minimum values in dst.
Operation
FOR j := 0 to 15 i := j*16 IF a[i+15:i] < b[i+15:i] dst[i+15:i] := a[i+15:i] ELSE dst[i+15:i] := b[i+15:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpminud
__m256i _mm256_min_epu32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_min_epu32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminud ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed unsigned 32-bit integers in a and b, and store packed minimum values in dst.
Operation
FOR j := 0 to 7 i := j*32 IF a[i+31:i] < b[i+31:i] dst[i+31:i] := a[i+31:i] ELSE dst[i+31:i] := b[i+31:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpminub
__m256i _mm256_min_epu8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_min_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpminub ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compare packed unsigned 8-bit integers in a and b, and store packed minimum values in dst.
Operation
FOR j := 0 to 31 i := j*8 IF a[i+7:i] < b[i+7:i] dst[i+7:i] := a[i+7:i] ELSE dst[i+7:i] := b[i+7:i] FI ENDFOR dst[MAX:256] := 0
Performance
vminpd
__m256d _mm256_min_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_min_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vminpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed minimum values in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := MIN(a[i+63:i], b[i+63:i]) ENDFOR dst[MAX:256] := 0
Performance
vminps
__m256 _mm256_min_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_min_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vminps ymm, ymm, ymm
CPUID Flags: AVX
Description
Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed minimum values in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := MIN(a[i+31:i], b[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vmovddup
__m256d _mm256_movedup_pd (__m256d a)
Synopsis
__m256d _mm256_movedup_pd (__m256d a)
#include «immintrin.h»
Instruction: vmovddup ymm, ymm
CPUID Flags: AVX
Description
Duplicate even-indexed double-precision (64-bit) floating-point elements from a, and store the results in dst.
Operation
dst[63:0] := a[63:0] dst[127:64] := a[63:0] dst[191:128] := a[191:128] dst[255:192] := a[191:128] dst[MAX:256] := 0
Performance
vmovshdup
__m256 _mm256_movehdup_ps (__m256 a)
Synopsis
__m256 _mm256_movehdup_ps (__m256 a)
#include «immintrin.h»
Instruction: vmovshdup ymm, ymm
CPUID Flags: AVX
Description
Duplicate odd-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.
Operation
dst[31:0] := a[63:32] dst[63:32] := a[63:32] dst[95:64] := a[127:96] dst[127:96] := a[127:96] dst[159:128] := a[191:160] dst[191:160] := a[191:160] dst[223:192] := a[255:224] dst[255:224] := a[255:224] dst[MAX:256] := 0
Performance
vmovsldup
__m256 _mm256_moveldup_ps (__m256 a)
Synopsis
__m256 _mm256_moveldup_ps (__m256 a)
#include «immintrin.h»
Instruction: vmovsldup ymm, ymm
CPUID Flags: AVX
Description
Duplicate even-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.
Operation
dst[31:0] := a[31:0] dst[63:32] := a[31:0] dst[95:64] := a[95:64] dst[127:96] := a[95:64] dst[159:128] := a[159:128] dst[191:160] := a[159:128] dst[223:192] := a[223:192] dst[255:224] := a[223:192] dst[MAX:256] := 0
Performance
vpmovmskb
int _mm256_movemask_epi8 (__m256i a)
Synopsis
int _mm256_movemask_epi8 (__m256i a)
#include «immintrin.h»
Instruction: vpmovmskb r32, ymm
CPUID Flags: AVX2
Description
Create mask from the most significant bit of each 8-bit element in a, and store the result in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[j] := a[i+7] ENDFOR
Performance
vmovmskpd
int _mm256_movemask_pd (__m256d a)
Synopsis
int _mm256_movemask_pd (__m256d a)
#include «immintrin.h»
Instruction: vmovmskpd r32, ymm
CPUID Flags: AVX
Description
Set each bit of mask dst based on the most significant bit of the corresponding packed double-precision (64-bit) floating-point element in a.
Operation
FOR j := 0 to 3 i := j*64 IF a[i+63] dst[j] := 1 ELSE dst[j] := 0 FI ENDFOR dst[MAX:4] := 0
Performance
vmovmskps
int _mm256_movemask_ps (__m256 a)
Synopsis
int _mm256_movemask_ps (__m256 a)
#include «immintrin.h»
Instruction: vmovmskps r32, ymm
CPUID Flags: AVX
Description
Set each bit of mask dst based on the most significant bit of the corresponding packed single-precision (32-bit) floating-point element in a.
Operation
FOR j := 0 to 7 i := j*32 IF a[i+31] dst[j] := 1 ELSE dst[j] := 0 FI ENDFOR dst[MAX:8] := 0
Performance
vmpsadbw
__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)
Synopsis
__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)
#include «immintrin.h»
Instruction: vmpsadbw ymm, ymm, ymm, imm
CPUID Flags: AVX2
Description
Compute the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in a compared to those in b, and store the 16-bit results in dst. Eight SADs are performed for each 128-bit lane using one quadruplet from b and eight quadruplets from a. One quadruplet is selected from bstarting at on the offset specified in imm8. Eight quadruplets are formed from sequential 8-bit integers selected from a starting at the offset specified in imm8.
Operation
MPSADBW(a[127:0], b[127:0], imm8[2:0]) { a_offset := imm8[2]*32 b_offset := imm8[1:0]*32 FOR j := 0 to 7 i := j*8 k := a_offset+i l := b_offset tmp[i+15:i] := ABS(a[k+7:k] — b[l+7:l]) + ABS(a[k+15:k+8] — b[l+15:l+8]) + ABS(a[k+23:k+16] — b[l+23:l+16]) + ABS(a[k+31:k+24] — b[l+31:l+24]) ENDFOR RETURN tmp[127:0] } dst[127:0] := MPSADBW(a[127:0], b[127:0], imm8[2:0]) dst[255:128] := MPSADBW(a[255:128], b[255:128], imm8[5:3]) dst[MAX:256] := 0
Performance
vpmuldq
__m256i _mm256_mul_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_mul_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmuldq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Multiply the low 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+31:i] * b[i+31:i] ENDFOR dst[MAX:256] := 0
Performance
vpmuludq
__m256i _mm256_mul_epu32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_mul_epu32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmuludq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Multiply the low unsigned 32-bit integers from each packed 64-bit element in a and b, and store the unsigned 64-bit results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+31:i] * b[i+31:i] ENDFOR dst[MAX:256] := 0
Performance
vmulpd
__m256d _mm256_mul_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_mul_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vmulpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Multiply packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] * b[i+63:i] ENDFOR dst[MAX:256] := 0
Performance
vmulps
__m256 _mm256_mul_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_mul_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vmulps ymm, ymm, ymm
CPUID Flags: AVX
Description
Multiply packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] * b[i+31:i] ENDFOR dst[MAX:256] := 0
Performance
vpmulhw
__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulhw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.
Operation
FOR j := 0 to 15 i := j*16 tmp[31:0] := a[i+15:i] * b[i+15:i] dst[i+15:i] := tmp[31:16] ENDFOR dst[MAX:256] := 0
Performance
vpmulhuw
__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulhuw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Multiply the packed unsigned 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.
Operation
FOR j := 0 to 15 i := j*16 tmp[31:0] := a[i+15:i] * b[i+15:i] dst[i+15:i] := tmp[31:16] ENDFOR dst[MAX:256] := 0
Performance
vpmulhrsw
__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulhrsw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Multiply packed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Truncate each intermediate integer to the 18 most significant bits, round by adding 1, and store bits [16:1] to dst.
Operation
FOR j := 0 to 15 i := j*16 tmp[31:0] := ((a[i+15:i] * b[i+15:i]) >> 14) + 1 dst[i+15:i] := tmp[16:1] ENDFOR dst[MAX:256] := 0
Performance
vpmullw
__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmullw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the low 16 bits of the intermediate integers in dst.
Operation
FOR j := 0 to 15 i := j*16 tmp[31:0] := a[i+15:i] * b[i+15:i] dst[i+15:i] := tmp[15:0] ENDFOR dst[MAX:256] := 0
Performance
vpmulld
__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpmulld ymm, ymm, ymm
CPUID Flags: AVX2
Description
Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.
Operation
FOR j := 0 to 7 i := j*32 tmp[63:0] := a[i+31:i] * b[i+31:i] dst[i+31:i] := tmp[31:0] ENDFOR dst[MAX:256] := 0
Performance
vorpd
__m256d _mm256_or_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_or_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vorpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise OR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] BITWISE OR b[i+63:i] ENDFOR dst[MAX:256] := 0
Performance
vorps
__m256 _mm256_or_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_or_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vorps ymm, ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise OR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] BITWISE OR b[i+31:i] ENDFOR dst[MAX:256] := 0
Performance
vpor
__m256i _mm256_or_si256 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_or_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpor ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compute the bitwise OR of 256 bits (representing integer data) in a and b, and store the result in dst.
Operation
dst[255:0] := (a[255:0] OR b[255:0]) dst[MAX:256] := 0
Performance
vpacksswb
__m256i _mm256_packs_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_packs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpacksswb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Convert packed 16-bit integers from a and b to packed 8-bit integers using signed saturation, and store the results in dst.
Operation
dst[7:0] := Saturate_Int16_To_Int8 (a[15:0]) dst[15:8] := Saturate_Int16_To_Int8 (a[31:16]) dst[23:16] := Saturate_Int16_To_Int8 (a[47:32]) dst[31:24] := Saturate_Int16_To_Int8 (a[63:48]) dst[39:32] := Saturate_Int16_To_Int8 (a[79:64]) dst[47:40] := Saturate_Int16_To_Int8 (a[95:80]) dst[55:48] := Saturate_Int16_To_Int8 (a[111:96]) dst[63:56] := Saturate_Int16_To_Int8 (a[127:112]) dst[71:64] := Saturate_Int16_To_Int8 (b[15:0]) dst[79:72] := Saturate_Int16_To_Int8 (b[31:16]) dst[87:80] := Saturate_Int16_To_Int8 (b[47:32]) dst[95:88] := Saturate_Int16_To_Int8 (b[63:48]) dst[103:96] := Saturate_Int16_To_Int8 (b[79:64]) dst[111:104] := Saturate_Int16_To_Int8 (b[95:80]) dst[119:112] := Saturate_Int16_To_Int8 (b[111:96]) dst[127:120] := Saturate_Int16_To_Int8 (b[127:112]) dst[135:128] := Saturate_Int16_To_Int8 (a[143:128]) dst[143:136] := Saturate_Int16_To_Int8 (a[159:144]) dst[151:144] := Saturate_Int16_To_Int8 (a[175:160]) dst[159:152] := Saturate_Int16_To_Int8 (a[191:176]) dst[167:160] := Saturate_Int16_To_Int8 (a[207:192]) dst[175:168] := Saturate_Int16_To_Int8 (a[223:208]) dst[183:176] := Saturate_Int16_To_Int8 (a[239:224]) dst[191:184] := Saturate_Int16_To_Int8 (a[255:240]) dst[199:192] := Saturate_Int16_To_Int8 (b[143:128]) dst[207:200] := Saturate_Int16_To_Int8 (b[159:144]) dst[215:208] := Saturate_Int16_To_Int8 (b[175:160]) dst[223:216] := Saturate_Int16_To_Int8 (b[191:176]) dst[231:224] := Saturate_Int16_To_Int8 (b[207:192]) dst[239:232] := Saturate_Int16_To_Int8 (b[223:208]) dst[247:240] := Saturate_Int16_To_Int8 (b[239:224]) dst[255:248] := Saturate_Int16_To_Int8 (b[255:240]) dst[MAX:256] := 0
Performance
vpackssdw
__m256i _mm256_packs_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_packs_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpackssdw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Convert packed 32-bit integers from a and b to packed 16-bit integers using signed saturation, and store the results in dst.
Operation
dst[15:0] := Saturate_Int32_To_Int16 (a[31:0]) dst[31:16] := Saturate_Int32_To_Int16 (a[63:32]) dst[47:32] := Saturate_Int32_To_Int16 (a[95:64]) dst[63:48] := Saturate_Int32_To_Int16 (a[127:96]) dst[79:64] := Saturate_Int32_To_Int16 (b[31:0]) dst[95:80] := Saturate_Int32_To_Int16 (b[63:32]) dst[111:96] := Saturate_Int32_To_Int16 (b[95:64]) dst[127:112] := Saturate_Int32_To_Int16 (b[127:96]) dst[143:128] := Saturate_Int32_To_Int16 (a[159:128]) dst[159:144] := Saturate_Int32_To_Int16 (a[191:160]) dst[175:160] := Saturate_Int32_To_Int16 (a[223:192]) dst[191:176] := Saturate_Int32_To_Int16 (a[255:224]) dst[207:192] := Saturate_Int32_To_Int16 (b[159:128]) dst[223:208] := Saturate_Int32_To_Int16 (b[191:160]) dst[239:224] := Saturate_Int32_To_Int16 (b[223:192]) dst[255:240] := Saturate_Int32_To_Int16 (b[255:224]) dst[MAX:256] := 0
Performance
vpackuswb
__m256i _mm256_packus_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_packus_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpackuswb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Convert packed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and store the results in dst.
Operation
dst[7:0] := Saturate_Int16_To_UnsignedInt8 (a[15:0]) dst[15:8] := Saturate_Int16_To_UnsignedInt8 (a[31:16]) dst[23:16] := Saturate_Int16_To_UnsignedInt8 (a[47:32]) dst[31:24] := Saturate_Int16_To_UnsignedInt8 (a[63:48]) dst[39:32] := Saturate_Int16_To_UnsignedInt8 (a[79:64]) dst[47:40] := Saturate_Int16_To_UnsignedInt8 (a[95:80]) dst[55:48] := Saturate_Int16_To_UnsignedInt8 (a[111:96]) dst[63:56] := Saturate_Int16_To_UnsignedInt8 (a[127:112]) dst[71:64] := Saturate_Int16_To_UnsignedInt8 (b[15:0]) dst[79:72] := Saturate_Int16_To_UnsignedInt8 (b[31:16]) dst[87:80] := Saturate_Int16_To_UnsignedInt8 (b[47:32]) dst[95:88] := Saturate_Int16_To_UnsignedInt8 (b[63:48]) dst[103:96] := Saturate_Int16_To_UnsignedInt8 (b[79:64]) dst[111:104] := Saturate_Int16_To_UnsignedInt8 (b[95:80]) dst[119:112] := Saturate_Int16_To_UnsignedInt8 (b[111:96]) dst[127:120] := Saturate_Int16_To_UnsignedInt8 (b[127:112]) dst[135:128] := Saturate_Int16_To_UnsignedInt8 (a[143:128]) dst[143:136] := Saturate_Int16_To_UnsignedInt8 (a[159:144]) dst[151:144] := Saturate_Int16_To_UnsignedInt8 (a[175:160]) dst[159:152] := Saturate_Int16_To_UnsignedInt8 (a[191:176]) dst[167:160] := Saturate_Int16_To_UnsignedInt8 (a[207:192]) dst[175:168] := Saturate_Int16_To_UnsignedInt8 (a[223:208]) dst[183:176] := Saturate_Int16_To_UnsignedInt8 (a[239:224]) dst[191:184] := Saturate_Int16_To_UnsignedInt8 (a[255:240]) dst[199:192] := Saturate_Int16_To_UnsignedInt8 (b[143:128]) dst[207:200] := Saturate_Int16_To_UnsignedInt8 (b[159:144]) dst[215:208] := Saturate_Int16_To_UnsignedInt8 (b[175:160]) dst[223:216] := Saturate_Int16_To_UnsignedInt8 (b[191:176]) dst[231:224] := Saturate_Int16_To_UnsignedInt8 (b[207:192]) dst[239:232] := Saturate_Int16_To_UnsignedInt8 (b[223:208]) dst[247:240] := Saturate_Int16_To_UnsignedInt8 (b[239:224]) dst[255:248] := Saturate_Int16_To_UnsignedInt8 (b[255:240]) dst[MAX:256] := 0
Performance
vpackusdw
__m256i _mm256_packus_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_packus_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpackusdw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Convert packed 32-bit integers from a and b to packed 16-bit integers using unsigned saturation, and store the results in dst.
Operation
dst[15:0] := Saturate_Int32_To_UnsignedInt16 (a[31:0]) dst[31:16] := Saturate_Int32_To_UnsignedInt16 (a[63:32]) dst[47:32] := Saturate_Int32_To_UnsignedInt16 (a[95:64]) dst[63:48] := Saturate_Int32_To_UnsignedInt16 (a[127:96]) dst[79:64] := Saturate_Int32_To_UnsignedInt16 (b[31:0]) dst[95:80] := Saturate_Int32_To_UnsignedInt16 (b[63:32]) dst[111:96] := Saturate_Int32_To_UnsignedInt16 (b[95:64]) dst[127:112] := Saturate_Int32_To_UnsignedInt16 (b[127:96]) dst[143:128] := Saturate_Int32_To_UnsignedInt16 (a[159:128]) dst[159:144] := Saturate_Int32_To_UnsignedInt16 (a[191:160]) dst[175:160] := Saturate_Int32_To_UnsignedInt16 (a[223:192]) dst[191:176] := Saturate_Int32_To_UnsignedInt16 (a[255:224]) dst[207:192] := Saturate_Int32_To_UnsignedInt16 (b[159:128]) dst[223:208] := Saturate_Int32_To_UnsignedInt16 (b[191:160]) dst[239:224] := Saturate_Int32_To_UnsignedInt16 (b[223:192]) dst[255:240] := Saturate_Int32_To_UnsignedInt16 (b[255:224]) dst[MAX:256] := 0
Performance
vpermilpd
__m128d _mm_permute_pd (__m128d a, int imm8)
Synopsis
__m128d _mm_permute_pd (__m128d a, int imm8)
#include «immintrin.h»
Instruction: vpermilpd xmm, xmm, imm
CPUID Flags: AVX
Description
Shuffle double-precision (64-bit) floating-point elements in a using the control in imm8, and store the results in dst.
Operation
IF (imm8[0] == 0) dst[63:0] := a[63:0] IF (imm8[0] == 1) dst[63:0] := a[127:64] IF (imm8[1] == 0) dst[127:64] := a[63:0] IF (imm8[1] == 1) dst[127:64] := a[127:64] dst[MAX:128] := 0
Performance
vpermilpd
__m256d _mm256_permute_pd (__m256d a, int imm8)
Synopsis
__m256d _mm256_permute_pd (__m256d a, int imm8)
#include «immintrin.h»
Instruction: vpermilpd ymm, ymm, imm
CPUID Flags: AVX
Description
Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.
Operation
IF (imm8[0] == 0) dst[63:0] := a[63:0] IF (imm8[0] == 1) dst[63:0] := a[127:64] IF (imm8[1] == 0) dst[127:64] := a[63:0] IF (imm8[1] == 1) dst[127:64] := a[127:64] IF (imm8[2] == 0) dst[191:128] := a[191:128] IF (imm8[2] == 1) dst[191:128] := a[255:192] IF (imm8[3] == 0) dst[255:192] := a[191:128] IF (imm8[3] == 1) dst[255:192] := a[255:192] dst[MAX:256] := 0
Performance
vpermilps
__m128 _mm_permute_ps (__m128 a, int imm8)
Synopsis
__m128 _mm_permute_ps (__m128 a, int imm8)
#include «immintrin.h»
Instruction: vpermilps xmm, xmm, imm
CPUID Flags: AVX
Description
Shuffle single-precision (32-bit) floating-point elements in a using the control in imm8, and store the results in dst.
Operation
SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[MAX:128] := 0
Performance
vpermilps
__m256 _mm256_permute_ps (__m256 a, int imm8)
Synopsis
__m256 _mm256_permute_ps (__m256 a, int imm8)
#include «immintrin.h»
Instruction: vpermilps ymm, ymm, imm
CPUID Flags: AVX
Description
Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.
Operation
SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[159:128] := SELECT4(a[255:128], imm8[1:0]) dst[191:160] := SELECT4(a[255:128], imm8[3:2]) dst[223:192] := SELECT4(a[255:128], imm8[5:4]) dst[255:224] := SELECT4(a[255:128], imm8[7:6]) dst[MAX:256] := 0
Performance
vperm2f128
__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)
Synopsis
__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)
#include «immintrin.h»
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX
Description
Shuffle 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.
Operation
SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0
Performance
vperm2f128
__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)
Synopsis
__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)
#include «immintrin.h»
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX
Description
Shuffle 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.
Operation
SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0
Performance
vperm2f128
__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)
Synopsis
__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)
#include «immintrin.h»
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX
Description
Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.
Operation
SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0
Performance
vperm2i128
__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)
Synopsis
__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)
#include «immintrin.h»
Instruction: vperm2i128 ymm, ymm, ymm, imm
CPUID Flags: AVX2
Description
Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.
Operation
SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0
Performance
vpermq
__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)
Synopsis
__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpermq ymm, ymm, imm
CPUID Flags: AVX2
Description
Shuffle 64-bit integers in a across lanes using the control in imm8, and store the results in dst.
Operation
SELECT4(src, control){ CASE(control[1:0]) 0: tmp[63:0] := src[63:0] 1: tmp[63:0] := src[127:64] 2: tmp[63:0] := src[191:128] 3: tmp[63:0] := src[255:192] ESAC RETURN tmp[63:0] } dst[63:0] := SELECT4(a[255:0], imm8[1:0]) dst[127:64] := SELECT4(a[255:0], imm8[3:2]) dst[191:128] := SELECT4(a[255:0], imm8[5:4]) dst[255:192] := SELECT4(a[255:0], imm8[7:6]) dst[MAX:256] := 0
Performance
vpermpd
__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)
Synopsis
__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)
#include «immintrin.h»
Instruction: vpermpd ymm, ymm, imm
CPUID Flags: AVX2
Description
Shuffle double-precision (64-bit) floating-point elements in a across lanes using the control in imm8, and store the results in dst.
Operation
SELECT4(src, control){ CASE(control[1:0]) 0: tmp[63:0] := src[63:0] 1: tmp[63:0] := src[127:64] 2: tmp[63:0] := src[191:128] 3: tmp[63:0] := src[255:192] ESAC RETURN tmp[63:0] } dst[63:0] := SELECT4(a[255:0], imm8[1:0]) dst[127:64] := SELECT4(a[255:0], imm8[3:2]) dst[191:128] := SELECT4(a[255:0], imm8[5:4]) dst[255:192] := SELECT4(a[255:0], imm8[7:6]) dst[MAX:256] := 0
Performance
vpermilpd
__m128d _mm_permutevar_pd (__m128d a, __m128i b)
Synopsis
__m128d _mm_permutevar_pd (__m128d a, __m128i b)
#include «immintrin.h»
Instruction: vpermilpd xmm, xmm, xmm
CPUID Flags: AVX
Description
Shuffle double-precision (64-bit) floating-point elements in a using the control in b, and store the results in dst.
Operation
IF (b[1] == 0) dst[63:0] := a[63:0] IF (b[1] == 1) dst[63:0] := a[127:64] IF (b[65] == 0) dst[127:64] := a[63:0] IF (b[65] == 1) dst[127:64] := a[127:64] dst[MAX:128] := 0
Performance
vpermilpd
__m256d _mm256_permutevar_pd (__m256d a, __m256i b)
Synopsis
__m256d _mm256_permutevar_pd (__m256d a, __m256i b)
#include «immintrin.h»
Instruction: vpermilpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.
Operation
IF (b[1] == 0) dst[63:0] := a[63:0] IF (b[1] == 1) dst[63:0] := a[127:64] IF (b[65] == 0) dst[127:64] := a[63:0] IF (b[65] == 1) dst[127:64] := a[127:64] IF (b[129] == 0) dst[191:128] := a[191:128] IF (b[129] == 1) dst[191:128] := a[255:192] IF (b[193] == 0) dst[255:192] := a[191:128] IF (b[193] == 1) dst[255:192] := a[255:192] dst[MAX:256] := 0
Performance
vpermilps
__m128 _mm_permutevar_ps (__m128 a, __m128i b)
Synopsis
__m128 _mm_permutevar_ps (__m128 a, __m128i b)
#include «immintrin.h»
Instruction: vpermilps xmm, xmm, xmm
CPUID Flags: AVX
Description
Shuffle single-precision (32-bit) floating-point elements in a using the control in b, and store the results in dst.
Operation
SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], b[1:0]) dst[63:32] := SELECT4(a[127:0], b[33:32]) dst[95:64] := SELECT4(a[127:0], b[65:64]) dst[127:96] := SELECT4(a[127:0], b[97:96]) dst[MAX:128] := 0
Performance
vpermilps
__m256 _mm256_permutevar_ps (__m256 a, __m256i b)
Synopsis
__m256 _mm256_permutevar_ps (__m256 a, __m256i b)
#include «immintrin.h»
Instruction: vpermilps ymm, ymm, ymm
CPUID Flags: AVX
Description
Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.
Operation
SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], b[1:0]) dst[63:32] := SELECT4(a[127:0], b[33:32]) dst[95:64] := SELECT4(a[127:0], b[65:64]) dst[127:96] := SELECT4(a[127:0], b[97:96]) dst[159:128] := SELECT4(a[255:128], b[129:128]) dst[191:160] := SELECT4(a[255:128], b[161:160]) dst[223:192] := SELECT4(a[255:128], b[193:192]) dst[255:224] := SELECT4(a[255:128], b[225:224]) dst[MAX:256] := 0
Performance
vpermd
__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)
Synopsis
__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)
#include «immintrin.h»
Instruction: vpermd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Shuffle 32-bit integers in a across lanes using the corresponding index in idx, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 id := idx[i+2:i]*32 dst[i+31:i] := a[id+31:id] ENDFOR dst[MAX:256] := 0
Performance
vpermps
__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)
Synopsis
__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)
#include «immintrin.h»
Instruction: vpermps ymm, ymm, ymm
CPUID Flags: AVX2
Description
Shuffle single-precision (32-bit) floating-point elements in a across lanes using the corresponding index in idx.
Operation
FOR j := 0 to 7 i := j*32 id := idx[i+2:i]*32 dst[i+31:i] := a[id+31:id] ENDFOR dst[MAX:256] := 0
Performance
vrcpps
__m256 _mm256_rcp_ps (__m256 a)
Synopsis
__m256 _mm256_rcp_ps (__m256 a)
#include «immintrin.h»
Instruction: vrcpps ymm, ymm
CPUID Flags: AVX
Description
Compute the approximate reciprocal of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := APPROXIMATE(1.0/a[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vroundpd
__m256d _mm256_round_pd (__m256d a, int rounding)
Synopsis
__m256d _mm256_round_pd (__m256d a, int rounding)
#include «immintrin.h»
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX
Description
Round the packed double-precision (64-bit) floating-point elements in
a using the
rounding parameter, and store the results as packed double-precision floating-point elements in
dst.
Rounding is done according to the
rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions (_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC) // round down, and suppress exceptions (_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC) // round up, and suppress exceptions (_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC) // truncate, and suppress exceptions _MM_FROUND_CUR_DIRECTION // use MXCSR.RC; see _MM_SET_ROUNDING_MODE
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := ROUND(a[i+63:i]) ENDFOR dst[MAX:256] := 0
Performance
vroundps
__m256 _mm256_round_ps (__m256 a, int rounding)
Synopsis
__m256 _mm256_round_ps (__m256 a, int rounding)
#include «immintrin.h»
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX
Description
Round the packed single-precision (32-bit) floating-point elements in
a using the
rounding parameter, and store the results as packed single-precision floating-point elements in
dst.
Rounding is done according to the
rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions (_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC) // round down, and suppress exceptions (_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC) // round up, and suppress exceptions (_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC) // truncate, and suppress exceptions _MM_FROUND_CUR_DIRECTION // use MXCSR.RC; see _MM_SET_ROUNDING_MODE
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := ROUND(a[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vrsqrtps
__m256 _mm256_rsqrt_ps (__m256 a)
Synopsis
__m256 _mm256_rsqrt_ps (__m256 a)
#include «immintrin.h»
Instruction: vrsqrtps ymm, ymm
CPUID Flags: AVX
Description
Compute the approximate reciprocal square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := APPROXIMATE(1.0 / SQRT(a[i+31:i])) ENDFOR dst[MAX:256] := 0
Performance
vpsadbw
__m256i _mm256_sad_epu8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_sad_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsadbw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce four unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
Operation
FOR j := 0 to 31 i := j*8 tmp[i+7:i] := ABS(a[i+7:i] — b[i+7:i]) ENDFOR FOR j := 0 to 4 i := j*64 dst[i+15:i] := tmp[i+7:i] + tmp[i+15:i+8] + tmp[i+23:i+16] + tmp[i+31:i+24] + tmp[i+39:i+32] + tmp[i+47:i+40] + tmp[i+55:i+48] + tmp[i+63:i+56] dst[i+63:i+16] := 0 ENDFOR dst[MAX:256] := 0
Performance
…
__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
Synopsis
__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed 16-bit integers in dst with the supplied values.
Operation
dst[15:0] := e0 dst[31:16] := e1 dst[47:32] := e2 dst[63:48] := e3 dst[79:64] := e4 dst[95:80] := e5 dst[111:96] := e6 dst[127:112] := e7 dst[145:128] := e8 dst[159:144] := e9 dst[175:160] := e10 dst[191:176] := e11 dst[207:192] := e12 dst[223:208] := e13 dst[239:224] := e14 dst[255:240] := e15 dst[MAX:256] := 0
…
__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
Synopsis
__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed 32-bit integers in dst with the supplied values.
Operation
dst[31:0] := e0 dst[63:32] := e1 dst[95:64] := e2 dst[127:96] := e3 dst[159:128] := e4 dst[191:160] := e5 dst[223:192] := e6 dst[255:224] := e7 dst[MAX:256] := 0
…
__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
Synopsis
__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed 64-bit integers in dst with the supplied values.
Operation
dst[63:0] := e0 dst[127:64] := e1 dst[191:128] := e2 dst[255:192] := e3 dst[MAX:256] := 0
…
__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, chare24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
Synopsis
__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, chare9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed 8-bit integers in dst with the supplied values in reverse order.
Operation
dst[7:0] := e0 dst[15:8] := e1 dst[23:16] := e2 dst[31:24] := e3 dst[39:32] := e4 dst[47:40] := e5 dst[55:48] := e6 dst[63:56] := e7 dst[71:64] := e8 dst[79:72] := e9 dst[87:80] := e10 dst[95:88] := e11 dst[103:96] := e12 dst[111:104] := e13 dst[119:112] := e14 dst[127:120] := e15 dst[135:128] := e16 dst[143:136] := e17 dst[151:144] := e18 dst[159:152] := e19 dst[167:160] := e20 dst[175:168] := e21 dst[183:176] := e22 dst[191:184] := e23 dst[199:192] := e24 dst[207:200] := e25 dst[215:208] := e26 dst[223:216] := e27 dst[231:224] := e28 dst[239:232] := e29 dst[247:240] := e30 dst[255:248] := e31 dst[MAX:256] := 0
vinsertf128
__m256 _mm256_set_m128 (__m128 hi, __m128 lo)
Synopsis
__m256 _mm256_set_m128 (__m128 hi, __m128 lo)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX
Description
Set packed __m256 vector dst with the supplied values.
Operation
dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0
Performance
vinsertf128
__m256d _mm256_set_m128d (__m128d hi, __m128d lo)
Synopsis
__m256d _mm256_set_m128d (__m128d hi, __m128d lo)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX
Description
Set packed __m256d vector dst with the supplied values.
Operation
dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0
Performance
vinsertf128
__m256i _mm256_set_m128i (__m128i hi, __m128i lo)
Synopsis
__m256i _mm256_set_m128i (__m128i hi, __m128i lo)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX
Description
Set packed __m256i vector dst with the supplied values.
Operation
dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0
Performance
…
__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)
Synopsis
__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed double-precision (64-bit) floating-point elements in dst with the supplied values.
Operation
dst[63:0] := e0 dst[127:64] := e1 dst[191:128] := e2 dst[255:192] := e3 dst[MAX:256] := 0
…
__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
Synopsis
__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed single-precision (32-bit) floating-point elements in dst with the supplied values.
Operation
dst[31:0] := e0 dst[63:32] := e1 dst[95:64] := e2 dst[127:96] := e3 dst[159:128] := e4 dst[191:160] := e5 dst[223:192] := e6 dst[255:224] := e7 dst[MAX:256] := 0
…
__m256i _mm256_set1_epi16 (short a)
Synopsis
__m256i _mm256_set1_epi16 (short a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Broadcast 16-bit integer a to all all elements of dst. This intrinsic may generate the vpbroadcastw.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[15:0] ENDFOR dst[MAX:256] := 0
…
__m256i _mm256_set1_epi32 (int a)
Synopsis
__m256i _mm256_set1_epi32 (int a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Broadcast 32-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastd.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0
…
__m256i _mm256_set1_epi64x (long long a)
Synopsis
__m256i _mm256_set1_epi64x (long long a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Broadcast 64-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastq.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0
…
__m256i _mm256_set1_epi8 (char a)
Synopsis
__m256i _mm256_set1_epi8 (char a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Broadcast 8-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastb.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[7:0] ENDFOR dst[MAX:256] := 0
…
__m256d _mm256_set1_pd (double a)
Synopsis
__m256d _mm256_set1_pd (double a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Broadcast double-precision (64-bit) floating-point value a to all elements of dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[63:0] ENDFOR dst[MAX:256] := 0
…
__m256 _mm256_set1_ps (float a)
Synopsis
__m256 _mm256_set1_ps (float a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Broadcast single-precision (32-bit) floating-point value a to all elements of dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[31:0] ENDFOR dst[MAX:256] := 0
…
__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
Synopsis
__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed 16-bit integers in dst with the supplied values in reverse order.
Operation
dst[15:0] := e15 dst[31:16] := e14 dst[47:32] := e13 dst[63:48] := e12 dst[79:64] := e11 dst[95:80] := e10 dst[111:96] := e9 dst[127:112] := e8 dst[145:128] := e7 dst[159:144] := e6 dst[175:160] := e5 dst[191:176] := e4 dst[207:192] := e3 dst[223:208] := e2 dst[239:224] := e1 dst[255:240] := e0 dst[MAX:256] := 0
…
__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
Synopsis
__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed 32-bit integers in dst with the supplied values in reverse order.
Operation
dst[31:0] := e7 dst[63:32] := e6 dst[95:64] := e5 dst[127:96] := e4 dst[159:128] := e3 dst[191:160] := e2 dst[223:192] := e1 dst[255:224] := e0 dst[MAX:256] := 0
…
__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
Synopsis
__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed 64-bit integers in dst with the supplied values in reverse order.
Operation
dst[63:0] := e3 dst[127:64] := e2 dst[191:128] := e1 dst[255:192] := e0 dst[MAX:256] := 0
…
__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, chare24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
Synopsis
__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, chare9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed 8-bit integers in dst with the supplied values in reverse order.
Operation
dst[7:0] := e31 dst[15:8] := e30 dst[23:16] := e29 dst[31:24] := e28 dst[39:32] := e27 dst[47:40] := e26 dst[55:48] := e25 dst[63:56] := e24 dst[71:64] := e23 dst[79:72] := e22 dst[87:80] := e21 dst[95:88] := e20 dst[103:96] := e19 dst[111:104] := e18 dst[119:112] := e17 dst[127:120] := e16 dst[135:128] := e15 dst[143:136] := e14 dst[151:144] := e13 dst[159:152] := e12 dst[167:160] := e11 dst[175:168] := e10 dst[183:176] := e9 dst[191:184] := e8 dst[199:192] := e7 dst[207:200] := e6 dst[215:208] := e5 dst[223:216] := e4 dst[231:224] := e3 dst[239:232] := e2 dst[247:240] := e1 dst[255:248] := e0 dst[MAX:256] := 0
vinsertf128
__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)
Synopsis
__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX
Description
Set packed __m256 vector dst with the supplied values.
Operation
dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0
Performance
vinsertf128
__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)
Synopsis
__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX
Description
Set packed __m256d vector dst with the supplied values.
Operation
dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0
Performance
vinsertf128
__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)
Synopsis
__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)
#include «immintrin.h»
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX
Description
Set packed __m256i vector dst with the supplied values.
Operation
dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0
Performance
…
__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)
Synopsis
__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed double-precision (64-bit) floating-point elements in dst with the supplied values in reverse order.
Operation
dst[63:0] := e3 dst[127:64] := e2 dst[191:128] := e1 dst[255:192] := e0 dst[MAX:256] := 0
…
__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
Synopsis
__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
#include «immintrin.h»
CPUID Flags: AVX
Description
Set packed single-precision (32-bit) floating-point elements in dst with the supplied values in reverse order.
Operation
dst[31:0] := e7 dst[63:32] := e6 dst[95:64] := e5 dst[127:96] := e4 dst[159:128] := e3 dst[191:160] := e2 dst[223:192] := e1 dst[255:224] := e0 dst[MAX:256] := 0
vxorpd
__m256d _mm256_setzero_pd (void)
Synopsis
__m256d _mm256_setzero_pd (void)
#include «immintrin.h»
Instruction: vxorpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Return vector of type __m256d with all elements set to zero.
Operation
dst[MAX:0] := 0
Performance
vxorps
__m256 _mm256_setzero_ps (void)
Synopsis
__m256 _mm256_setzero_ps (void)
#include «immintrin.h»
Instruction: vxorps ymm, ymm, ymm
CPUID Flags: AVX
Description
Return vector of type __m256 with all elements set to zero.
Operation
dst[MAX:0] := 0
Performance
vpxor
__m256i _mm256_setzero_si256 (void)
Synopsis
__m256i _mm256_setzero_si256 (void)
#include «immintrin.h»
Instruction: vpxor ymm, ymm, ymm
CPUID Flags: AVX
Description
Return vector of type __m256i with all elements set to zero.
Operation
dst[MAX:0] := 0
Performance
vpshufd
__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)
Synopsis
__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpshufd ymm, ymm, imm
CPUID Flags: AVX2
Description
Shuffle 32-bit integers in a within 128-bit lanes using the control in imm8, and store the results in dst.
Operation
SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(a[127:0], imm8[5:4]) dst[127:96] := SELECT4(a[127:0], imm8[7:6]) dst[159:128] := SELECT4(a[255:128], imm8[1:0]) dst[191:160] := SELECT4(a[255:128], imm8[3:2]) dst[223:192] := SELECT4(a[255:128], imm8[5:4]) dst[255:224] := SELECT4(a[255:128], imm8[7:6]) dst[MAX:256] := 0
Performance
vpshufb
__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpshufb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Shuffle 8-bit integers in a within 128-bit lanes according to shuffle control mask in the corresponding 8-bit element of b, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*8 IF b[i+7] == 1 dst[i+7:i] := 0 ELSE index[3:0] := b[i+3:i] dst[i+7:i] := a[index*8+7:index*8] FI IF b[128+i+7] == 1 dst[128+i+7:128+i] := 0 ELSE index[3:0] := b[128+i+3:128+i] dst[128+i+7:128+i] := a[128+index*8+7:128+index*8] FI ENDFOR dst[MAX:256] := 0
Performance
vshufpd
__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)
Synopsis
__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)
#include «immintrin.h»
Instruction: vshufpd ymm, ymm, ymm, imm
CPUID Flags: AVX
Description
Shuffle double-precision (64-bit) floating-point elements within 128-bit lanes using the control in imm8, and store the results in dst.
Operation
dst[63:0] := (imm8[0] == 0) ? a[63:0] : a[127:64] dst[127:64] := (imm8[1] == 0) ? b[63:0] : b[127:64] dst[191:128] := (imm8[2] == 0) ? a[191:128] : a[255:192] dst[255:192] := (imm8[3] == 0) ? b[191:128] : b[255:192] dst[MAX:256] := 0
Performance
vshufps
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
Synopsis
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
#include «immintrin.h»
Instruction: vshufps ymm, ymm, ymm, imm
CPUID Flags: AVX
Description
Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.
Operation
SELECT4(src, control){ CASE(control[1:0]) 0: tmp[31:0] := src[31:0] 1: tmp[31:0] := src[63:32] 2: tmp[31:0] := src[95:64] 3: tmp[31:0] := src[127:96] ESAC RETURN tmp[31:0] } dst[31:0] := SELECT4(a[127:0], imm8[1:0]) dst[63:32] := SELECT4(a[127:0], imm8[3:2]) dst[95:64] := SELECT4(b[127:0], imm8[5:4]) dst[127:96] := SELECT4(b[127:0], imm8[7:6]) dst[159:128] := SELECT4(a[255:128], imm8[1:0]) dst[191:160] := SELECT4(a[255:128], imm8[3:2]) dst[223:192] := SELECT4(b[255:128], imm8[5:4]) dst[255:224] := SELECT4(b[255:128], imm8[7:6]) dst[MAX:256] := 0
Performance
vpshufhw
__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)
Synopsis
__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpshufhw ymm, ymm, imm
CPUID Flags: AVX2
Description
Shuffle 16-bit integers in the high 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the high 64 bits of 128-bit lanes of dst, with the low 64 bits of 128-bit lanes being copied from from a to dst.
Operation
dst[63:0] := a[63:0] dst[79:64] := (a >> (imm8[1:0] * 16))[79:64] dst[95:80] := (a >> (imm8[3:2] * 16))[79:64] dst[111:96] := (a >> (imm8[5:4] * 16))[79:64] dst[127:112] := (a >> (imm8[7:6] * 16))[79:64] dst[191:128] := a[191:128] dst[207:192] := (a >> (imm8[1:0] * 16))[207:192] dst[223:208] := (a >> (imm8[3:2] * 16))[207:192] dst[239:224] := (a >> (imm8[5:4] * 16))[207:192] dst[255:240] := (a >> (imm8[7:6] * 16))[207:192] dst[MAX:256] := 0
Performance
vpshuflw
__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)
Synopsis
__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpshuflw ymm, ymm, imm
CPUID Flags: AVX2
Description
Shuffle 16-bit integers in the low 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the low 64 bits of 128-bit lanes of dst, with the high 64 bits of 128-bit lanes being copied from from a to dst.
Operation
dst[15:0] := (a >> (imm8[1:0] * 16))[15:0] dst[31:16] := (a >> (imm8[3:2] * 16))[15:0] dst[47:32] := (a >> (imm8[5:4] * 16))[15:0] dst[63:48] := (a >> (imm8[7:6] * 16))[15:0] dst[127:64] := a[127:64] dst[143:128] := (a >> (imm8[1:0] * 16))[143:128] dst[159:144] := (a >> (imm8[3:2] * 16))[143:128] dst[175:160] := (a >> (imm8[5:4] * 16))[143:128] dst[191:176] := (a >> (imm8[7:6] * 16))[143:128] dst[255:192] := a[255:192] dst[MAX:256] := 0
Performance
vpsignw
__m256i _mm256_sign_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_sign_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsignw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Negate packed 16-bit integers in a when the corresponding signed 16-bit integer in b is negative, and store the results in dst. Element in dst are zeroed out when the corresponding element in b is zero.
Operation
FOR j := 0 to 15 i := j*16 IF b[i+15:i] < 0 dst[i+15:i] := NEG(a[i+15:i]) ELSE IF b[i+15:i] = 0 dst[i+15:i] := 0 ELSE dst[i+15:i] := a[i+15:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpsignd
__m256i _mm256_sign_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_sign_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsignd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Negate packed 32-bit integers in a when the corresponding signed 32-bit integer in b is negative, and store the results in dst. Element in dst are zeroed out when the corresponding element in b is zero.
Operation
FOR j := 0 to 7 i := j*32 IF b[i+31:i] < 0 dst[i+31:i] := NEG(a[i+31:i]) ELSE IF b[i+31:i] = 0 dst[i+31:i] := 0 ELSE dst[i+31:i] := a[i+31:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpsignb
__m256i _mm256_sign_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_sign_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsignb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Negate packed 8-bit integers in a when the corresponding signed 8-bit integer in b is negative, and store the results in dst. Element in dst are zeroed out when the corresponding element in b is zero.
Operation
FOR j := 0 to 31 i := j*8 IF b[i+7:i] < 0 dst[i+7:i] := NEG(a[i+7:i]) ELSE IF b[i+7:i] = 0 dst[i+7:i] := 0 ELSE dst[i+7:i] := a[i+7:i] FI ENDFOR dst[MAX:256] := 0
Performance
vpsllw
__m256i _mm256_sll_epi16 (__m256i a, __m128i count)
Synopsis
__m256i _mm256_sll_epi16 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllw ymm, ymm, xmm
CPUID Flags: AVX2
Description
Shift packed 16-bit integers in a left by count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpslld
__m256i _mm256_sll_epi32 (__m256i a, __m128i count)
Synopsis
__m256i _mm256_sll_epi32 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpslld ymm, ymm, xmm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a left by count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsllq
__m256i _mm256_sll_epi64 (__m256i a, __m128i count)
Synopsis
__m256i _mm256_sll_epi64 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllq ymm, ymm, xmm
CPUID Flags: AVX2
Description
Shift packed 64-bit integers in a left by count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 IF count[63:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] << count[63:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsllw
__m256i _mm256_slli_epi16 (__m256i a, int imm8)
Synopsis
__m256i _mm256_slli_epi16 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsllw ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift packed 16-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpslld
__m256i _mm256_slli_epi32 (__m256i a, int imm8)
Synopsis
__m256i _mm256_slli_epi32 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpslld ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsllq
__m256i _mm256_slli_epi64 (__m256i a, int imm8)
Synopsis
__m256i _mm256_slli_epi64 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsllq ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift packed 64-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 IF imm8[7:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] << imm8[7:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpslldq
__m256i _mm256_slli_si256 (__m256i a, const int imm8)
Synopsis
__m256i _mm256_slli_si256 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpslldq ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.
Operation
tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] << (tmp*8) dst[255:128] := a[255:128] << (tmp*8) dst[MAX:256] := 0
Performance
vpsllvd
__m128i _mm_sllv_epi32 (__m128i a, __m128i count)
Synopsis
__m128i _mm_sllv_epi32 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllvd xmm, xmm, xmm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] << count[i+31:i]) ENDFOR dst[MAX:128] := 0
Performance
vpsllvd
__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)
Synopsis
__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsllvd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] << count[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vpsllvq
__m128i _mm_sllv_epi64 (__m128i a, __m128i count)
Synopsis
__m128i _mm_sllv_epi64 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsllvq xmm, xmm, xmm
CPUID Flags: AVX2
Description
Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 1 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] << count[i+63:i]) ENDFOR dst[MAX:128] := 0
Performance
vpsllvq
__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)
Synopsis
__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsllvq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] << count[i+63:i]) ENDFOR dst[MAX:256] := 0
Performance
vsqrtpd
__m256d _mm256_sqrt_pd (__m256d a)
Synopsis
__m256d _mm256_sqrt_pd (__m256d a)
#include «immintrin.h»
Instruction: vsqrtpd ymm, ymm
CPUID Flags: AVX
Description
Compute the square root of packed double-precision (64-bit) floating-point elements in a, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := SQRT(a[i+63:i]) ENDFOR dst[MAX:256] := 0
Performance
vsqrtps
__m256 _mm256_sqrt_ps (__m256 a)
Synopsis
__m256 _mm256_sqrt_ps (__m256 a)
#include «immintrin.h»
Instruction: vsqrtps ymm, ymm
CPUID Flags: AVX
Description
Compute the square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := SQRT(a[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vpsraw
__m256i _mm256_sra_epi16 (__m256i a, __m128i count)
Synopsis
__m256i _mm256_sra_epi16 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsraw ymm, ymm, xmm
CPUID Flags: AVX2
Description
Shift packed 16-bit integers in a right by count while shifting in sign bits, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := SignBit ELSE dst[i+15:i] := SignExtend(a[i+15:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsrad
__m256i _mm256_sra_epi32 (__m256i a, __m128i count)
Synopsis
__m256i _mm256_sra_epi32 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrad ymm, ymm, xmm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a right by count while shifting in sign bits, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := SignBit ELSE dst[i+31:i] := SignExtend(a[i+31:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsraw
__m256i _mm256_srai_epi16 (__m256i a, int imm8)
Synopsis
__m256i _mm256_srai_epi16 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsraw ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift packed 16-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := SignBit ELSE dst[i+15:i] := SignExtend(a[i+15:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsrad
__m256i _mm256_srai_epi32 (__m256i a, int imm8)
Synopsis
__m256i _mm256_srai_epi32 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrad ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := SignBit ELSE dst[i+31:i] := SignExtend(a[i+31:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsravd
__m128i _mm_srav_epi32 (__m128i a, __m128i count)
Synopsis
__m128i _mm_srav_epi32 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsravd xmm, xmm, xmm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*32 dst[i+31:i] := SignExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:128] := 0
Performance
vpsravd
__m256i _mm256_srav_epi32 (__m256i a, __m256i count)
Synopsis
__m256i _mm256_srav_epi32 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsravd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := SignExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vpsrlw
__m256i _mm256_srl_epi16 (__m256i a, __m128i count)
Synopsis
__m256i _mm256_srl_epi16 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlw ymm, ymm, xmm
CPUID Flags: AVX2
Description
Shift packed 16-bit integers in a right by count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 IF count[63:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsrld
__m256i _mm256_srl_epi32 (__m256i a, __m128i count)
Synopsis
__m256i _mm256_srl_epi32 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrld ymm, ymm, xmm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a right by count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 IF count[63:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsrlq
__m256i _mm256_srl_epi64 (__m256i a, __m128i count)
Synopsis
__m256i _mm256_srl_epi64 (__m256i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlq ymm, ymm, xmm
CPUID Flags: AVX2
Description
Shift packed 64-bit integers in a right by count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 IF count[63:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[63:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsrlw
__m256i _mm256_srli_epi16 (__m256i a, int imm8)
Synopsis
__m256i _mm256_srli_epi16 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrlw ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift packed 16-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsrld
__m256i _mm256_srli_epi32 (__m256i a, int imm8)
Synopsis
__m256i _mm256_srli_epi32 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrld ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 IF imm8[7:0] > 31 dst[i+31:i] := 0 ELSE dst[i+31:i] := ZeroExtend(a[i+31:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsrlq
__m256i _mm256_srli_epi64 (__m256i a, int imm8)
Synopsis
__m256i _mm256_srli_epi64 (__m256i a, int imm8)
#include «immintrin.h»
Instruction: vpsrlq ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift packed 64-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 IF imm8[7:0] > 63 dst[i+63:i] := 0 ELSE dst[i+63:i] := ZeroExtend(a[i+63:i] >> imm8[7:0]) FI ENDFOR dst[MAX:256] := 0
Performance
vpsrldq
__m256i _mm256_srli_si256 (__m256i a, const int imm8)
Synopsis
__m256i _mm256_srli_si256 (__m256i a, const int imm8)
#include «immintrin.h»
Instruction: vpsrldq ymm, ymm, imm
CPUID Flags: AVX2
Description
Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.
Operation
tmp := imm8[7:0] IF tmp > 15 tmp := 16 FI dst[127:0] := a[127:0] >> (tmp*8) dst[255:128] := a[255:128] >> (tmp*8) dst[MAX:256] := 0
Performance
vpsrlvd
__m128i _mm_srlv_epi32 (__m128i a, __m128i count)
Synopsis
__m128i _mm_srlv_epi32 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlvd xmm, xmm, xmm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:128] := 0
Performance
vpsrlvd
__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)
Synopsis
__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsrlvd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[i+31:i]) ENDFOR dst[MAX:256] := 0
Performance
vpsrlvq
__m128i _mm_srlv_epi64 (__m128i a, __m128i count)
Synopsis
__m128i _mm_srlv_epi64 (__m128i a, __m128i count)
#include «immintrin.h»
Instruction: vpsrlvq xmm, xmm, xmm
CPUID Flags: AVX2
Description
Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 1 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[i+63:i]) ENDFOR dst[MAX:128] := 0
Performance
vpsrlvq
__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)
Synopsis
__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)
#include «immintrin.h»
Instruction: vpsrlvq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[i+63:i]) ENDFOR dst[MAX:256] := 0
Performance
vmovapd
void _mm256_store_pd (double * mem_addr, __m256d a)
Synopsis
void _mm256_store_pd (double * mem_addr, __m256d a)
#include «immintrin.h»
Instruction: vmovapd m256, ymm
CPUID Flags: AVX
Description
Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Operation
MEM[mem_addr+255:mem_addr] := a[255:0]
vmovaps
void _mm256_store_ps (float * mem_addr, __m256 a)
Synopsis
void _mm256_store_ps (float * mem_addr, __m256 a)
#include «immintrin.h»
Instruction: vmovaps m256, ymm
CPUID Flags: AVX
Description
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Operation
MEM[mem_addr+255:mem_addr] := a[255:0]
vmovdqa
void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
Synopsis
void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
#include «immintrin.h»
Instruction: vmovdqa m256, ymm
CPUID Flags: AVX
Description
Store 256-bits of integer data from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Operation
MEM[mem_addr+255:mem_addr] := a[255:0]
vmovupd
void _mm256_storeu_pd (double * mem_addr, __m256d a)
Synopsis
void _mm256_storeu_pd (double * mem_addr, __m256d a)
#include «immintrin.h»
Instruction: vmovupd m256, ymm
CPUID Flags: AVX
Description
Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.
Operation
MEM[mem_addr+255:mem_addr] := a[255:0]
vmovups
void _mm256_storeu_ps (float * mem_addr, __m256 a)
Synopsis
void _mm256_storeu_ps (float * mem_addr, __m256 a)
#include «immintrin.h»
Instruction: vmovups m256, ymm
CPUID Flags: AVX
Description
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.
Operation
MEM[mem_addr+255:mem_addr] := a[255:0]
vmovdqu
void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
Synopsis
void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
#include «immintrin.h»
Instruction: vmovdqu m256, ymm
CPUID Flags: AVX
Description
Store 256-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.
Operation
MEM[mem_addr+255:mem_addr] := a[255:0]
…
void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)
Synopsis
void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Store the high and low 128-bit halves (each composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory two different 128-bit locations. hiaddr and loaddr do not need to be aligned on any particular boundary.
Operation
MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
…
void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)
Synopsis
void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Store the high and low 128-bit halves (each composed of 2 packed double-precision (64-bit) floating-point elements) from a into memory two different 128-bit locations. hiaddr and loaddr do not need to be aligned on any particular boundary.
Operation
MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
…
void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)
Synopsis
void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)
#include «immintrin.h»
CPUID Flags: AVX
Description
Store the high and low 128-bit halves (each composed of integer data) from a into memory two different 128-bit locations. hiaddr and loaddr do not need to be aligned on any particular boundary.
Operation
MEM[loaddr+127:loaddr] := a[127:0] MEM[hiaddr+127:hiaddr] := a[255:128]
vmovntdqa
__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)
Synopsis
__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)
#include «immintrin.h»
Instruction: vmovntdqa ymm, m256
CPUID Flags: AVX2
Description
Load 256-bits of integer data from memory into dst using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Operation
dst[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0
vmovntpd
void _mm256_stream_pd (double * mem_addr, __m256d a)
Synopsis
void _mm256_stream_pd (double * mem_addr, __m256d a)
#include «immintrin.h»
Instruction: vmovntpd m256, ymm
CPUID Flags: AVX
Description
Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint.mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Operation
MEM[mem_addr+255:mem_addr] := a[255:0]
vmovntps
void _mm256_stream_ps (float * mem_addr, __m256 a)
Synopsis
void _mm256_stream_ps (float * mem_addr, __m256 a)
#include «immintrin.h»
Instruction: vmovntps m256, ymm
CPUID Flags: AVX
Description
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory using a non-temporal memory hint.mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Operation
MEM[mem_addr+255:mem_addr] := a[255:0]
vmovntdq
void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
Synopsis
void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
#include «immintrin.h»
Instruction: vmovntdq m256, ymm
CPUID Flags: AVX
Description
Store 256-bits of integer data from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Operation
MEM[mem_addr+255:mem_addr] := a[255:0]
vpsubw
__m256i _mm256_sub_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_sub_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Subtract packed 16-bit integers in b from packed 16-bit integers in a, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[i+15:i] — b[i+15:i] ENDFOR dst[MAX:256] := 0
Performance
vpsubd
__m256i _mm256_sub_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_sub_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Subtract packed 32-bit integers in b from packed 32-bit integers in a, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] — b[i+31:i] ENDFOR dst[MAX:256] := 0
Performance
vpsubq
__m256i _mm256_sub_epi64 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_sub_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Subtract packed 64-bit integers in b from packed 64-bit integers in a, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] — b[i+63:i] ENDFOR dst[MAX:256] := 0
Performance
vpsubb
__m256i _mm256_sub_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_sub_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Subtract packed 8-bit integers in b from packed 8-bit integers in a, and store the results in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[i+7:i] — b[i+7:i] ENDFOR dst[MAX:256] := 0
Performance
vsubpd
__m256d _mm256_sub_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_sub_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vsubpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Subtract packed double-precision (64-bit) floating-point elements in b from packed double-precision (64-bit) floating-point elements in a, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] — b[i+63:i] ENDFOR dst[MAX:256] := 0
Performance
vsubps
__m256 _mm256_sub_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_sub_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vsubps ymm, ymm, ymm
CPUID Flags: AVX
Description
Subtract packed single-precision (32-bit) floating-point elements in b from packed single-precision (32-bit) floating-point elements in a, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] — b[i+31:i] ENDFOR dst[MAX:256] := 0
Performance
vpsubsw
__m256i _mm256_subs_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_subs_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubsw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Subtract packed 16-bit integers in b from packed 16-bit integers in a using saturation, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_Int16(a[i+15:i] — b[i+15:i]) ENDFOR dst[MAX:256] := 0
Performance
vpsubsb
__m256i _mm256_subs_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_subs_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubsb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Subtract packed 8-bit integers in b from packed 8-bit integers in a using saturation, and store the results in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_Int8(a[i+7:i] — b[i+7:i]) ENDFOR dst[MAX:256] := 0
Performance
vpsubusw
__m256i _mm256_subs_epu16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_subs_epu16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubusw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Subtract packed unsigned 16-bit integers in b from packed unsigned 16-bit integers in a using saturation, and store the results in dst.
Operation
FOR j := 0 to 15 i := j*16 dst[i+15:i] := Saturate_To_UnsignedInt16(a[i+15:i] — b[i+15:i]) ENDFOR dst[MAX:256] := 0
Performance
vpsubusb
__m256i _mm256_subs_epu8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_subs_epu8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpsubusb ymm, ymm, ymm
CPUID Flags: AVX2
Description
Subtract packed unsigned 8-bit integers in b from packed unsigned 8-bit integers in a using saturation, and store the results in dst.
Operation
FOR j := 0 to 31 i := j*8 dst[i+7:i] := Saturate_To_UnsignedInt8(a[i+7:i] — b[i+7:i]) ENDFOR dst[MAX:256] := 0
Performance
vtestpd
int _mm_testc_pd (__m128d a, __m128d b)
Synopsis
int _mm_testc_pd (__m128d a, __m128d b)
#include «immintrin.h»
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX
Description
Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.
Operation
tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN CF
Performance
vtestpd
int _mm256_testc_pd (__m256d a, __m256d b)
Synopsis
int _mm256_testc_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.
Operation
tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN CF
Performance
vtestps
int _mm_testc_ps (__m128 a, __m128 b)
Synopsis
int _mm_testc_ps (__m128 a, __m128 b)
#include «immintrin.h»
Instruction: vtestps xmm, xmm
CPUID Flags: AVX
Description
Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.
Operation
tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN CF
Performance
vtestps
int _mm256_testc_ps (__m256 a, __m256 b)
Synopsis
int _mm256_testc_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vtestps ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.
Operation
tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN CF
Performance
vptest
int _mm256_testc_si256 (__m256i a, __m256i b)
Synopsis
int _mm256_testc_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vptest ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the CF value.
Operation
IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI RETURN CF
Performance
vtestpd
int _mm_testnzc_pd (__m128d a, __m128d b)
Synopsis
int _mm_testnzc_pd (__m128d a, __m128d b)
#include «immintrin.h»
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX
Description
Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.
Operation
tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI
Performance
vtestpd
int _mm256_testnzc_pd (__m256d a, __m256d b)
Synopsis
int _mm256_testnzc_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.
Operation
tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI
Performance
vtestps
int _mm_testnzc_ps (__m128 a, __m128 b)
Synopsis
int _mm_testnzc_ps (__m128 a, __m128 b)
#include «immintrin.h»
Instruction: vtestps xmm, xmm
CPUID Flags: AVX
Description
Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.
Operation
tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI
Performance
vtestps
int _mm256_testnzc_ps (__m256 a, __m256 b)
Synopsis
int _mm256_testnzc_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vtestps ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.
Operation
tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI
Performance
vptest
int _mm256_testnzc_si256 (__m256i a, __m256i b)
Synopsis
int _mm256_testnzc_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vptest ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.
Operation
IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI IF (ZF == 0 && CF == 0) RETURN 1 ELSE RETURN 0 FI
Performance
vtestpd
int _mm_testz_pd (__m128d a, __m128d b)
Synopsis
int _mm_testz_pd (__m128d a, __m128d b)
#include «immintrin.h»
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX
Description
Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.
Operation
tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[63] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[63] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF
Performance
vtestpd
int _mm256_testz_pd (__m256d a, __m256d b)
Synopsis
int _mm256_testz_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.
Operation
tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF
Performance
vtestps
int _mm_testz_ps (__m128 a, __m128 b)
Synopsis
int _mm_testz_ps (__m128 a, __m128 b)
#include «immintrin.h»
Instruction: vtestps xmm, xmm
CPUID Flags: AVX
Description
Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.
Operation
tmp[127:0] := a[127:0] AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) ZF := 1 ELSE ZF := 0 FI tmp[127:0] := (NOT a[127:0]) AND b[127:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF
Performance
vtestps
int _mm256_testz_ps (__m256 a, __m256 b)
Synopsis
int _mm256_testz_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vtestps ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.
Operation
tmp[255:0] := a[255:0] AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) ZF := 1 ELSE ZF := 0 FI tmp[255:0] := (NOT a[255:0]) AND b[255:0] IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF
Performance
vptest
int _mm256_testz_si256 (__m256i a, __m256i b)
Synopsis
int _mm256_testz_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vptest ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the ZF value.
Operation
IF (a[255:0] AND b[255:0] == 0) ZF := 1 ELSE ZF := 0 FI IF ((NOT a[255:0]) AND b[255:0] == 0) CF := 1 ELSE CF := 0 FI RETURN ZF
Performance
__m256d _mm256_undefined_pd (void)
Synopsis
__m256d _mm256_undefined_pd (void)
#include «immintrin.h»
CPUID Flags: AVX
Description
Return vector of type __m256d with undefined elements.
__m256 _mm256_undefined_ps (void)
Synopsis
__m256 _mm256_undefined_ps (void)
#include «immintrin.h»
CPUID Flags: AVX
Description
Return vector of type __m256 with undefined elements.
__m256i _mm256_undefined_si256 (void)
Synopsis
__m256i _mm256_undefined_si256 (void)
#include «immintrin.h»
CPUID Flags: AVX
Description
Return vector of type __m256i with undefined elements.
vpunpckhwd
__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhwd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Unpack and interleave 16-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_HIGH_WORDS(src1[127:0], src2[127:0]){ dst[15:0] := src1[79:64] dst[31:16] := src2[79:64] dst[47:32] := src1[95:80] dst[63:48] := src2[95:80] dst[79:64] := src1[111:96] dst[95:80] := src2[111:96] dst[111:96] := src1[127:112] dst[127:112] := src2[127:112] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_WORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_WORDS(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vpunpckhdq
__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhdq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Unpack and interleave 32-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[95:64] dst[63:32] := src2[95:64] dst[95:64] := src1[127:96] dst[127:96] := src2[127:96] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vpunpckhqdq
__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhqdq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Unpack and interleave 64-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[127:64] dst[127:64] := src2[127:64] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vpunpckhbw
__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckhbw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Unpack and interleave 8-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_HIGH_BYTES(src1[127:0], src2[127:0]){ dst[7:0] := src1[71:64] dst[15:8] := src2[71:64] dst[23:16] := src1[79:72] dst[31:24] := src2[79:72] dst[39:32] := src1[87:80] dst[47:40] := src2[87:80] dst[55:48] := src1[95:88] dst[63:56] := src2[95:88] dst[71:64] := src1[103:96] dst[79:72] := src2[103:96] dst[87:80] := src1[111:104] dst[95:88] := src2[111:104] dst[103:96] := src1[119:112] dst[111:104] := src2[119:112] dst[119:112] := src1[127:120] dst[127:120] := src2[127:120] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_BYTES(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_BYTES(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vunpckhpd
__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vunpckhpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Unpack and interleave double-precision (64-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[127:64] dst[127:64] := src2[127:64] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vunpckhps
__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vunpckhps ymm, ymm, ymm
CPUID Flags: AVX
Description
Unpack and interleave single-precision (32-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[95:64] dst[63:32] := src2[95:64] dst[95:64] := src1[127:96] dst[127:96] := src2[127:96] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_HIGH_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vpunpcklwd
__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpcklwd ymm, ymm, ymm
CPUID Flags: AVX2
Description
Unpack and interleave 16-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_WORDS(src1[127:0], src2[127:0]){ dst[15:0] := src1[15:0] dst[31:16] := src2[15:0] dst[47:32] := src1[31:16] dst[63:48] := src2[31:16] dst[79:64] := src1[47:32] dst[95:80] := src2[47:32] dst[111:96] := src1[63:48] dst[127:112] := src2[63:48] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_WORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_WORDS(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vpunpckldq
__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpckldq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[31:0] dst[63:32] := src2[31:0] dst[95:64] := src1[63:32] dst[127:96] := src2[63:32] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vpunpcklqdq
__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpcklqdq ymm, ymm, ymm
CPUID Flags: AVX2
Description
Unpack and interleave 64-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[63:0] dst[127:64] := src2[63:0] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vpunpcklbw
__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpunpcklbw ymm, ymm, ymm
CPUID Flags: AVX2
Description
Unpack and interleave 8-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_BYTES(src1[127:0], src2[127:0]){ dst[7:0] := src1[7:0] dst[15:8] := src2[7:0] dst[23:16] := src1[15:8] dst[31:24] := src2[15:8] dst[39:32] := src1[23:16] dst[47:40] := src2[23:16] dst[55:48] := src1[31:24] dst[63:56] := src2[31:24] dst[71:64] := src1[39:32] dst[79:72] := src2[39:32] dst[87:80] := src1[47:40] dst[95:88] := src2[47:40] dst[103:96] := src1[55:48] dst[111:104] := src2[55:48] dst[119:112] := src1[63:56] dst[127:120] := src2[63:56] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_BYTES(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_BYTES(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vunpcklpd
__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vunpcklpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Unpack and interleave double-precision (64-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[63:0] dst[127:64] := src2[63:0] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_QWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vunpcklps
__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vunpcklps ymm, ymm, ymm
CPUID Flags: AVX
Description
Unpack and interleave single-precision (32-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.
Operation
INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[31:0] dst[63:32] := src2[31:0] dst[95:64] := src1[63:32] dst[127:96] := src2[63:32] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0]) dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128]) dst[MAX:256] := 0
Performance
vxorpd
__m256d _mm256_xor_pd (__m256d a, __m256d b)
Synopsis
__m256d _mm256_xor_pd (__m256d a, __m256d b)
#include «immintrin.h»
Instruction: vxorpd ymm, ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] XOR b[i+63:i] ENDFOR dst[MAX:256] := 0
Performance
vxorps
__m256 _mm256_xor_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_xor_ps (__m256 a, __m256 b)
#include «immintrin.h»
Instruction: vxorps ymm, ymm, ymm
CPUID Flags: AVX
Description
Compute the bitwise XOR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] XOR b[i+31:i] ENDFOR dst[MAX:256] := 0
Performance
vpxor
__m256i _mm256_xor_si256 (__m256i a, __m256i b)
Synopsis
__m256i _mm256_xor_si256 (__m256i a, __m256i b)
#include «immintrin.h»
Instruction: vpxor ymm, ymm, ymm
CPUID Flags: AVX2
Description
Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.
Operation
dst[255:0] := (a[255:0] XOR b[255:0]) dst[MAX:256] := 0
Performance
vzeroall
void _mm256_zeroall (void)
Synopsis
void _mm256_zeroall (void)
#include «immintrin.h»
Instruction: vzeroall
CPUID Flags: AVX
Description
Zero the contents of all XMM or YMM registers.
Operation
YMM0[MAX:0] := 0 YMM1[MAX:0] := 0 YMM2[MAX:0] := 0 YMM3[MAX:0] := 0 YMM4[MAX:0] := 0 YMM5[MAX:0] := 0 YMM6[MAX:0] := 0 YMM7[MAX:0] := 0 IF 64-bit mode YMM8[MAX:0] := 0 YMM9[MAX:0] := 0 YMM10[MAX:0] := 0 YMM11[MAX:0] := 0 YMM12[MAX:0] := 0 YMM13[MAX:0] := 0 YMM14[MAX:0] := 0 YMM15[MAX:0] := 0 FI
vzeroupper
void _mm256_zeroupper (void)
Synopsis
void _mm256_zeroupper (void)
#include «immintrin.h»
Instruction: vzeroupper
CPUID Flags: AVX
Description
Zero the upper 128 bits of all YMM registers; the lower 128-bits of the registers are unmodified.
Operation
YMM0[MAX:128] := 0 YMM1[MAX:128] := 0 YMM2[MAX:128] := 0 YMM3[MAX:128] := 0 YMM4[MAX:128] := 0 YMM5[MAX:128] := 0 YMM6[MAX:128] := 0 YMM7[MAX:128] := 0 IF 64-bit mode YMM8[MAX:128] := 0 YMM9[MAX:128] := 0 YMM10[MAX:128] := 0 YMM11[MAX:128] := 0 YMM12[MAX:128] := 0 YMM13[MAX:128] := 0 YMM14[MAX:128] := 0 YMM15[MAX:128] := 0 FI
Performance
Понравилось это:
Нравится Загрузка...
Похожее