AVX: Advanced Vector Extensions, an extension of the x86 instruction set architecture for microprocessors from Intel and AMD

Advanced Vector Extensions (AVX) is an extension of the x86 instruction set for Intel and AMD microprocessors.

AVX provides a number of improvements, new instructions, and a new machine-code encoding scheme.

Improvements

  • A new instruction encoding scheme, VEX.
  • The width of the SIMD vector registers grows from 128 bits (XMM) to 256 bits (registers YMM0 to YMM15). Existing 128-bit SSE instructions operate on the lower half of the new YMM registers and leave the upper half unchanged. New 256-bit AVX instructions were added to work with the YMM registers. The SIMD vector registers may be widened further, to 512 or 1024 bits, in the future. For example, processors of the Xeon Phi architecture already had 512-bit vector registers (ZMM) in 2012 and use SIMD instructions with MVEX and VEX prefixes to work with them, yet they do not support AVX.
  • Non-destructive operations. AVX instructions use a three-operand syntax. For example, instead of a = a + b one can write c = a + b, and register a remains unchanged. When the value of a is needed later in the computation, this improves performance, because a no longer has to be saved to another register or to memory before the operation and restored afterwards.
  • Most of the new instructions place no alignment requirements on memory operands. It is still advisable to keep operands aligned to their size to avoid a significant performance penalty.
  • The AVX instruction set includes counterparts of the 128-bit SSE floating-point instructions. Unlike the originals, writing a 128-bit result zeroes the upper half of the YMM register. The 128-bit AVX instructions keep the other benefits of AVX, such as the new encoding scheme, the three-operand syntax, and unaligned memory access.
  • Intel recommends dropping the legacy SSE instructions in favor of the new 128-bit AVX instructions, even where two operands are sufficient.
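
The non-destructive three-operand form and the relaxed alignment rules can be sketched with intrinsics. This is a minimal illustration, not a tuned implementation: the function names add8_avx and add8 are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions assumed here so that the file builds without extra compiler flags.

```c
#include <immintrin.h>

/* Compiled for AVX only inside this function (GCC/Clang extension). */
__attribute__((target("avx")))
static void add8_avx(const float *a, const float *b, float *c) {
    __m256 va = _mm256_loadu_ps(a);     /* unaligned load: no 32-byte alignment needed */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* vaddps ymm, ymm, ymm: va and vb stay intact */
    _mm256_storeu_ps(c, vc);
}

/* Hypothetical wrapper: scalar fallback when the CPU lacks AVX. */
static void add8(const float *a, const float *b, float *c) {
    if (__builtin_cpu_supports("avx")) {
        add8_avx(a, b, c);
        return;
    }
    for (int i = 0; i < 8; i++)
        c[i] = a[i] + b[i];
}
```

Both inputs can be reused after the addition, which is the point of the three-operand form; with a two-operand SSE instruction the compiler would first have to copy one of them.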

New encoding scheme

The new VEX instruction encoding scheme uses a VEX prefix. At present there are two VEX prefixes, 2 and 3 bytes long. The first byte of the 2-byte VEX prefix is 0xC5, and the first byte of the 3-byte prefix is 0xC4.

In 64-bit mode the first byte of the VEX prefix is unique. In 32-bit mode it conflicts with the LES and LDS instructions. The conflict is resolved through the high bits of the second byte: they carry meaning only in 64-bit mode, and in 32-bit mode the values required by VEX correspond to forms of LES and LDS that are not supported.

The length of existing AVX instructions, including the VEX prefix, does not exceed 11 bytes. Longer instructions are expected in future versions.
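
The disambiguation rule can be sketched as a small classifier. This is a simplified illustration under stated assumptions: the function vex_prefix_length is hypothetical, it looks only at the first two bytes, and it ignores legacy prefixes and all other decoding rules. It relies on the fact that LES and LDS require a memory operand, so a second byte whose top two bits are both set cannot be a valid ModRM byte for them.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 2 or 3 if the bytes at p begin a VEX prefix, otherwise 0.
   is_64bit is nonzero when the bytes are decoded as 64-bit code. */
static int vex_prefix_length(const uint8_t *p, size_t n, int is_64bit) {
    if (n < 2)
        return 0;
    if (p[0] != 0xC4 && p[0] != 0xC5)
        return 0;
    /* In 32-bit code 0xC4/0xC5 are also the LES/LDS opcodes. Those take a
       memory operand (ModRM.mod != 3), so only a second byte with both top
       bits set is treated as part of a VEX prefix. */
    if (!is_64bit && (p[1] & 0xC0) != 0xC0)
        return 0;
    int len = (p[0] == 0xC5) ? 2 : 3;
    return n >= (size_t)len ? len : 0;
}
```

For instance, the byte sequence C5 F4 58 C2 (vaddps ymm0, ymm1, ymm2) is recognized as carrying a 2-byte prefix in both modes, while C4 03 is LES in 32-bit code.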

New instructions

Instruction | Description
VBROADCASTSS, VBROADCASTSD, VBROADCASTF128 | Copies a 32-, 64-, or 128-bit memory operand into all elements of an XMM or YMM vector register.
VINSERTF128 | Replaces the lower or upper half of a 256-bit YMM register with the value of a 128-bit operand. The other half of the destination register is unchanged.
VEXTRACTF128 | Extracts the lower or upper half of a 256-bit YMM register and copies it into a 128-bit destination operand.
VMASKMOVPS, VMASKMOVPD | Conditionally loads any number of elements from a vector operand in memory into the destination register, leaving the remaining elements unread and zeroing the corresponding elements of the destination register. Can also conditionally store any number of elements from a vector register into a vector operand in memory, leaving the remaining elements of the memory operand unchanged.
VPERMILPS, VPERMILPD | Permutes the 32- or 64-bit elements of a vector according to a selector operand (from memory or from a register).
VPERM2F128 | Shuffles the four 128-bit elements of two 256-bit registers into a 256-bit destination operand, using an immediate constant (imm) as the selector.
VZEROALL | Zeroes all YMM registers and marks them as unused. Used when switching between 128-bit and 256-bit mode.
VZEROUPPER | Zeroes the upper halves of all YMM registers. Used when switching between 128-bit and 256-bit mode.
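
The conditional load and store of VMASKMOVPS are exposed as the _mm256_maskload_ps and _mm256_maskstore_ps intrinsics. The sketch below scales only the first n elements of a buffer; the names scale_prefix_avx and scale_prefix are hypothetical, n is assumed to lie in the range 0 to 8, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <immintrin.h>

__attribute__((target("avx")))
static void scale_prefix_avx(float *data, int n, float factor) {
    /* Element i is active when the top bit of mask element i is set. */
    __m256i mask = _mm256_setr_epi32(-(n > 0), -(n > 1), -(n > 2), -(n > 3),
                                     -(n > 4), -(n > 5), -(n > 6), -(n > 7));
    __m256 v = _mm256_maskload_ps(data, mask);   /* inactive elements read as 0 */
    v = _mm256_mul_ps(v, _mm256_set1_ps(factor));
    _mm256_maskstore_ps(data, mask, v);          /* inactive elements are not written */
}

/* Hypothetical wrapper with a scalar fallback; n must be between 0 and 8. */
static void scale_prefix(float *data, int n, float factor) {
    if (__builtin_cpu_supports("avx")) {
        scale_prefix_avx(data, n, factor);
        return;
    }
    for (int i = 0; i < n; i++)
        data[i] *= factor;
}
```

A typical use is the tail of a loop whose length is not a multiple of eight, where a full-width store would overwrite neighbouring data.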

The AVX specification also describes the PCLMUL group of instructions (Parallel Carry-Less Multiplication, Parallel CLMUL):

  • PCLMULLQLQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 00]
  • PCLMULHQLQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 01]
  • PCLMULLQHQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 10]
  • PCLMULHQHQDQ xmmreg, xmmrm [rm: 66 0f 3a 44 /r 11]
  • PCLMULQDQ xmmreg, xmmrm, imm [rmi: 66 0f 3a 44 /r ib]
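
Carry-less multiplication treats its operands as polynomials over GF(2), so partial products are combined with XOR instead of addition. The scalar routine below is a reference sketch of the 64-bit by 64-bit product that these instructions compute on the selected halves of their operands; clmul64 is a hypothetical helper, not part of any library, and it does not use the instruction itself.

```c
#include <stdint.h>

/* Carry-less product of a and b; the 128-bit result is returned as hi:lo. */
static void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi) {
    uint64_t l = 0, h = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            l ^= a << i;              /* low part of (a shifted left by i) */
            if (i != 0)
                h ^= a >> (64 - i);   /* bits shifted out of the low part */
        }
    }
    *lo = l;
    *hi = h;
}
```

For example, 3 times 3 gives 5 here rather than 9, because (x + 1)(x + 1) = x^2 + 1 when coefficients are reduced modulo 2.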

Applications

AVX is suited to floating-point-intensive computation in multimedia software and scientific workloads. Where a higher degree of parallelism is available, it increases floating-point performance.

Instructions and examples

__m256i _mm256_abs_epi16 (__m256i a)

Synopsis

__m256i _mm256_abs_epi16 (__m256i a)
#include <immintrin.h>
Instruction: vpabsw ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := ABS(a[i+15:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
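
A short sketch of _mm256_abs_epi16 in use follows. It also shows why the description speaks of unsigned results: the absolute value of -32768 does not fit in a signed 16-bit element and comes back as 0x8000. The names abs16_avx2 and abs16 are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stdint.h>

__attribute__((target("avx2")))
static void abs16_avx2(const int16_t *in, uint16_t *out) {
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, _mm256_abs_epi16(v));
}

/* Hypothetical wrapper: processes exactly 16 elements. */
static void abs16(const int16_t *in, uint16_t *out) {
    if (__builtin_cpu_supports("avx2")) {
        abs16_avx2(in, out);
        return;
    }
    for (int i = 0; i < 16; i++) {
        /* Unsigned arithmetic so that -32768 maps to 0x8000, like vpabsw. */
        uint16_t u = (uint16_t)in[i];
        out[i] = in[i] < 0 ? (uint16_t)(0u - u) : u;
    }
}
```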
vpabsd
__m256i _mm256_abs_epi32 (__m256i a)

Synopsis

__m256i _mm256_abs_epi32 (__m256i a)
#include <immintrin.h>
Instruction: vpabsd ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ABS(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpabsb
__m256i _mm256_abs_epi8 (__m256i a)

Synopsis

__m256i _mm256_abs_epi8 (__m256i a)
#include <immintrin.h>
Instruction: vpabsb ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := ABS(a[i+7:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddw
__m256i _mm256_add_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 16-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := a[i+15:i] + b[i+15:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddd
__m256i _mm256_add_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 32-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] + b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddq
__m256i _mm256_add_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 64-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] + b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpaddb
__m256i _mm256_add_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_add_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 8-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := a[i+7:i] + b[i+7:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vaddpd
__m256d _mm256_add_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_add_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vaddpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] + b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vaddps
__m256 _mm256_add_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_add_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vaddps ymm, ymm, ymm
CPUID Flags: AVX

Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] + b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpaddsw
__m256i _mm256_adds_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddsb
__m256i _mm256_adds_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
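
Saturating addition clamps a result that would overflow instead of letting it wrap around. The sketch below applies _mm256_adds_epi8 to 32 signed bytes; the names adds8_avx2 and adds8 are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stdint.h>

__attribute__((target("avx2")))
static void adds8_avx2(const int8_t *a, const int8_t *b, int8_t *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_adds_epi8(va, vb));
}

/* Hypothetical wrapper: processes exactly 32 elements. */
static void adds8(const int8_t *a, const int8_t *b, int8_t *out) {
    if (__builtin_cpu_supports("avx2")) {
        adds8_avx2(a, b, out);
        return;
    }
    for (int i = 0; i < 32; i++) {
        int s = a[i] + b[i];   /* computed in int, so no overflow here */
        out[i] = (int8_t)(s > 127 ? 127 : (s < -128 ? -128 : s));
    }
}
```

With plain _mm256_add_epi8 the sum 100 + 100 would wrap to -56; the saturating form returns 127.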
vpaddusw
__m256i _mm256_adds_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddusw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed unsigned 16-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_UnsignedInt16( a[i+15:i] + b[i+15:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpaddusb
__m256i _mm256_adds_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_adds_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpaddusb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Add packed unsigned 8-bit integers in a and b using saturation, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := Saturate_To_UnsignedInt8( a[i+7:i] + b[i+7:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vaddsubpd
__m256d _mm256_addsub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_addsub_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vaddsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Alternately add and subtract packed double-precision (64-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    IF (j is even)
        dst[i+63:i] := a[i+63:i] - b[i+63:i]
    ELSE
        dst[i+63:i] := a[i+63:i] + b[i+63:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
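
The alternating pattern of _mm256_addsub_pd is easiest to see on concrete values: even-indexed elements are subtracted, odd-indexed elements are added. In this sketch the names addsub4_avx and addsub4 are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <immintrin.h>

__attribute__((target("avx")))
static void addsub4_avx(const double *a, const double *b, double *out) {
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_addsub_pd(va, vb));
}

/* Hypothetical wrapper: processes exactly 4 elements. */
static void addsub4(const double *a, const double *b, double *out) {
    if (__builtin_cpu_supports("avx")) {
        addsub4_avx(a, b, out);
        return;
    }
    for (int i = 0; i < 4; i++)
        out[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];
}
```

This pattern is commonly used in complex-number multiplication, where the real part needs a difference and the imaginary part a sum.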
vaddsubps
__m256 _mm256_addsub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_addsub_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vaddsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Alternately add and subtract packed single-precision (32-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF (j is even)
        dst[i+31:i] := a[i+31:i] - b[i+31:i]
    ELSE
        dst[i+31:i] := a[i+31:i] + b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpalignr
__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int count)

Synopsis

__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int count)
#include <immintrin.h>
Instruction: vpalignr ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Concatenate pairs of 16-byte blocks in a and b into a 32-byte temporary result, shift the result right by count bytes, and store the low 16 bytes in dst.

Operation

FOR j := 0 to 1
    i := j*128
    tmp[255:0] := ((a[i+127:i] << 128) OR b[i+127:i]) >> (count[7:0]*8)
    dst[i+127:i] := tmp[127:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vandpd
__m256d _mm256_and_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_and_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vandpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := (a[i+63:i] AND b[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vandps
__m256 _mm256_and_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_and_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vandps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := (a[i+31:i] AND b[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpand
__m256i _mm256_and_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_and_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpand ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] AND b[255:0])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vandnpd
__m256d _mm256_andnot_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_andnot_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vandnpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise NOT of packed double-precision (64-bit) floating-point elements in a and then AND with b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := ((NOT a[i+63:i]) AND b[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
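
A common use of _mm256_andnot_pd is clearing the sign bit, which yields the absolute value of each element: with a set to -0.0 (only the sign bit set), (NOT a) AND b keeps every bit of b except the sign. The names fabs4_avx and fabs4 below are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

__attribute__((target("avx")))
static void fabs4_avx(const double *in, double *out) {
    __m256d sign = _mm256_set1_pd(-0.0);   /* only bit 63 set in each element */
    __m256d v = _mm256_loadu_pd(in);
    _mm256_storeu_pd(out, _mm256_andnot_pd(sign, v));
}

/* Hypothetical wrapper: processes exactly 4 elements. */
static void fabs4(const double *in, double *out) {
    if (__builtin_cpu_supports("avx")) {
        fabs4_avx(in, out);
        return;
    }
    for (int i = 0; i < 4; i++) {
        uint64_t bits;
        memcpy(&bits, &in[i], sizeof bits);
        bits &= ~(UINT64_C(1) << 63);      /* same bit operation, one element */
        memcpy(&out[i], &bits, sizeof bits);
    }
}
```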
vandnps
__m256 _mm256_andnot_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_andnot_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vandnps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise NOT of packed single-precision (32-bit) floating-point elements in a and then AND with b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ((NOT a[i+31:i]) AND b[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpandn
__m256i _mm256_andnot_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_andnot_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpandn ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise NOT of 256 bits (representing integer data) in a and then AND with b, and store the result in dst.

Operation

dst[255:0] := ((NOT a[255:0]) AND b[255:0])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpavgw
__m256i _mm256_avg_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_avg_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpavgw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Average packed unsigned 16-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := (a[i+15:i] + b[i+15:i] + 1) >> 1
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpavgb
__m256i _mm256_avg_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_avg_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpavgb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Average packed unsigned 8-bit integers in a and b, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := (a[i+7:i] + b[i+7:i] + 1) >> 1
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
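
The "+ 1" in the operation means that the average rounds up on ties, and the intermediate sum does not overflow the 8-bit element. The sketch below shows this for _mm256_avg_epu8; the names avg8_avx2 and avg8 are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stdint.h>

__attribute__((target("avx2")))
static void avg8_avx2(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_avg_epu8(va, vb));
}

/* Hypothetical wrapper: processes exactly 32 elements. */
static void avg8(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    if (__builtin_cpu_supports("avx2")) {
        avg8_avx2(a, b, out);
        return;
    }
    for (int i = 0; i < 32; i++)
        out[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);   /* sum computed in int */
}
```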
vpblendw
__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)
#include <immintrin.h>
Instruction: vpblendw ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Blend packed 16-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF imm8[j%8]
        dst[i+15:i] := b[i+15:i]
    ELSE
        dst[i+15:i] := a[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpblendd
__m128i _mm_blend_epi32 (__m128i a, __m128i b, const int imm8)

Synopsis

__m128i _mm_blend_epi32 (__m128i a, __m128i b, const int imm8)
#include <immintrin.h>
Instruction: vpblendd xmm, xmm, xmm, imm
CPUID Flags: AVX2

Description

Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*32
    IF imm8[j%8]
        dst[i+31:i] := b[i+31:i]
    ELSE
        dst[i+31:i] := a[i+31:i]
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vpblendd
__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)
#include <immintrin.h>
Instruction: vpblendd ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF imm8[j%8]
        dst[i+31:i] := b[i+31:i]
    ELSE
        dst[i+31:i] := a[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vblendpd
__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)
#include <immintrin.h>
Instruction: vblendpd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Blend packed double-precision (64-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    IF imm8[j%8]
        dst[i+63:i] := b[i+63:i]
    ELSE
        dst[i+63:i] := a[i+63:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
Ivy Bridge 1 0.5
Sandy Bridge 1 0.5
vblendps
__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vblendps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Blend packed single-precision (32-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF imm8[j%8]
        dst[i+31:i] := b[i+31:i]
    ELSE
        dst[i+31:i] := a[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
Ivy Bridge 1 0.5
Sandy Bridge 1 0.5
vpblendvb
__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)

Synopsis

__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)
#include <immintrin.h>
Instruction: vpblendvb ymm, ymm, ymm, ymm
CPUID Flags: AVX2

Description

Blend packed 8-bit integers from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF mask[i+7]
        dst[i+7:i] := b[i+7:i]
    ELSE
        dst[i+7:i] := a[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vblendvpd
__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)

Synopsis

__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)
#include <immintrin.h>
Instruction: vblendvpd ymm, ymm, ymm, ymm
CPUID Flags: AVX

Description

Blend packed double-precision (64-bit) floating-point elements from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := b[i+63:i]
    ELSE
        dst[i+63:i] := a[i+63:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
Ivy Bridge 2 1
Sandy Bridge 2 1
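
Because _mm256_blendv_pd looks only at the top (sign) bit of each mask element, a vector can serve as its own mask. The sketch below uses that to replace every element whose sign bit is set with zero; the names clamp_neg4_avx and clamp_neg4 are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

__attribute__((target("avx")))
static void clamp_neg4_avx(const double *in, double *out) {
    __m256d v = _mm256_loadu_pd(in);
    /* Sign bit of v set: take the element from the zero vector, else from v. */
    _mm256_storeu_pd(out, _mm256_blendv_pd(v, _mm256_setzero_pd(), v));
}

/* Hypothetical wrapper: processes exactly 4 elements. */
static void clamp_neg4(const double *in, double *out) {
    if (__builtin_cpu_supports("avx")) {
        clamp_neg4_avx(in, out);
        return;
    }
    for (int i = 0; i < 4; i++) {
        uint64_t bits;
        memcpy(&bits, &in[i], sizeof bits);
        out[i] = (bits >> 63) ? 0.0 : in[i];   /* test the sign bit only */
    }
}
```

Unlike the immediate-controlled _mm256_blend_pd, the selection here is made at run time, per element.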
vblendvps
__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)

Synopsis

__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)
#include <immintrin.h>
Instruction: vblendvps ymm, ymm, ymm, ymm
CPUID Flags: AVX

Description

Blend packed single-precision (32-bit) floating-point elements from a and b using mask, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := b[i+31:i]
    ELSE
        dst[i+31:i] := a[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
Ivy Bridge 2 1
Sandy Bridge 2 1
vbroadcastf128
__m256d _mm256_broadcast_pd (__m128d const * mem_addr)

Synopsis

__m256d _mm256_broadcast_pd (__m128d const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastf128 ymm, m128
CPUID Flags: AVX

Description

Broadcast 128 bits from memory (composed of 2 packed double-precision (64-bit) floating-point elements) to all elements of dst.

Operation

tmp[127:0] := MEM[mem_addr+127:mem_addr]
dst[127:0] := tmp[127:0]
dst[255:128] := tmp[127:0]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastf128
__m256 _mm256_broadcast_ps (__m128 const * mem_addr)

Synopsis

__m256 _mm256_broadcast_ps (__m128 const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastf128 ymm, m128
CPUID Flags: AVX

Description

Broadcast 128 bits from memory (composed of 4 packed single-precision (32-bit) floating-point elements) to all elements of dst.

Operation

tmp[127:0] := MEM[mem_addr+127:mem_addr]
dst[127:0] := tmp[127:0]
dst[255:128] := tmp[127:0]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastsd
__m256d _mm256_broadcast_sd (double const * mem_addr)

Synopsis

__m256d _mm256_broadcast_sd (double const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastsd ymm, m64
CPUID Flags: AVX

Description

Broadcast a double-precision (64-bit) floating-point element from memory to all elements of dst.

Operation

tmp[63:0] := MEM[mem_addr+63:mem_addr]
FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := tmp[63:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vbroadcastss
__m128 _mm_broadcast_ss (float const * mem_addr)

Synopsis

__m128 _mm_broadcast_ss (float const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastss xmm, m32
CPUID Flags: AVX

Description

Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.

Operation

tmp[31:0] := MEM[mem_addr+31:mem_addr]
FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := tmp[31:0]
ENDFOR
dst[MAX:128] := 0
vbroadcastss
__m256 _mm256_broadcast_ss (float const * mem_addr)

Synopsis

__m256 _mm256_broadcast_ss (float const * mem_addr)
#include <immintrin.h>
Instruction: vbroadcastss ymm, m32
CPUID Flags: AVX

Description

Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.

Operation

tmp[31:0] := MEM[mem_addr+31:mem_addr]
FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := tmp[31:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Ivy Bridge 1
Sandy Bridge 1
vpbroadcastb
__m128i _mm_broadcastb_epi8 (__m128i a)

Synopsis

__m128i _mm_broadcastb_epi8 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastb xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 8-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 15
    i := j*8
    dst[i+7:i] := a[7:0]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastb
__m256i _mm256_broadcastb_epi8 (__m128i a)

Synopsis

__m256i _mm256_broadcastb_epi8 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastb ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 8-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := a[7:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastd
__m128i _mm_broadcastd_epi32 (__m128i a)

Synopsis

__m128i _mm_broadcastd_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastd xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 32-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := a[31:0]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastd
__m256i _mm256_broadcastd_epi32 (__m128i a)

Synopsis

__m256i _mm256_broadcastd_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastd ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 32-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[31:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastq
__m128i _mm_broadcastq_epi64 (__m128i a)

Synopsis

__m128i _mm_broadcastq_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastq xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 64-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 1
    i := j*64
    dst[i+63:i] := a[63:0]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastq
__m256i _mm256_broadcastq_epi64 (__m128i a)

Synopsis

__m256i _mm256_broadcastq_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastq ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 64-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[63:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
movddup
__m128d _mm_broadcastsd_pd (__m128d a)

Synopsis

__m128d _mm_broadcastsd_pd (__m128d a)
#include <immintrin.h>
Instruction: movddup xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 1
    i := j*64
    dst[i+63:i] := a[63:0]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
Westmere 1
Nehalem 1
vbroadcastsd
__m256d _mm256_broadcastsd_pd (__m128d a)

Synopsis

__m256d _mm256_broadcastsd_pd (__m128d a)
#include <immintrin.h>
Instruction: vbroadcastsd ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[63:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vbroadcasti128
__m256i _mm256_broadcastsi128_si256 (__m128i a)

Synopsis

__m256i _mm256_broadcastsi128_si256 (__m128i a)
#include <immintrin.h>
Instruction: vbroadcasti128 ymm, m128
CPUID Flags: AVX2

Description

Broadcast 128 bits of integer data from a to all 128-bit lanes in dst.

Operation

dst[127:0] := a[127:0]
dst[255:128] := a[127:0]
dst[MAX:256] := 0
vbroadcastss
__m128 _mm_broadcastss_ps (__m128 a)

Synopsis

__m128 _mm_broadcastss_ps (__m128 a)
#include <immintrin.h>
Instruction: vbroadcastss xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := a[31:0]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
vbroadcastss
__m256 _mm256_broadcastss_ps (__m128 a)

Synopsis

__m256 _mm256_broadcastss_ps (__m128 a)
#include <immintrin.h>
Instruction: vbroadcastss ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[31:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpbroadcastw
__m128i _mm_broadcastw_epi16 (__m128i a)

Synopsis

__m128i _mm_broadcastw_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastw xmm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 16-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 7
    i := j*16
    dst[i+15:i] := a[15:0]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpbroadcastw
__m256i _mm256_broadcastw_epi16 (__m128i a)

Synopsis

__m256i _mm256_broadcastw_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpbroadcastw ymm, xmm
CPUID Flags: AVX2

Description

Broadcast the low packed 16-bit integer from a to all elements of dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := a[15:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpslldq
__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpslldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0]
IF tmp > 15
    tmp := 16
FI
dst[127:0] := a[127:0] << (tmp*8)
dst[255:128] := a[255:128] << (tmp*8)
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
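
The byte shift works on each 128-bit lane independently: bytes shifted out of the lower lane do not enter the upper lane. The sketch below shifts both lanes left by 4 bytes with _mm256_bslli_epi128. The shift count must be a compile-time constant; the names shl4_lanes_avx2 and shl4_lanes are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stdint.h>

__attribute__((target("avx2")))
static void shl4_lanes_avx2(const uint8_t *in, uint8_t *out) {
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, _mm256_bslli_epi128(v, 4));
}

/* Hypothetical wrapper: processes exactly 32 bytes (two 16-byte lanes). */
static void shl4_lanes(const uint8_t *in, uint8_t *out) {
    if (__builtin_cpu_supports("avx2")) {
        shl4_lanes_avx2(in, out);
        return;
    }
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 16; i++)
            out[lane * 16 + i] = (i >= 4) ? in[lane * 16 + i - 4] : 0;
}
```

In memory order, a left shift moves every byte of a lane to a higher index and fills the lowest four bytes of that lane with zeros.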
vpsrldq
__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpsrldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0]
IF tmp > 15
    tmp := 16
FI
dst[127:0] := a[127:0] >> (tmp*8)
dst[255:128] := a[255:128] >> (tmp*8)
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
__m256 _mm256_castpd_ps (__m256d a)

Synopsis

__m256 _mm256_castpd_ps (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Cast vector of type __m256d to type __m256. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castpd_si256 (__m256d a)

Synopsis

__m256i _mm256_castpd_si256 (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256d to type __m256i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castpd128_pd256 (__m128d a)

Synopsis

__m256d _mm256_castpd128_pd256 (__m128d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m128d to type __m256d; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128d _mm256_castpd256_pd128 (__m256d a)

Synopsis

__m128d _mm256_castpd256_pd128 (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256d to type __m128d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castps_pd (__m256 a)

Synopsis

__m256d _mm256_castps_pd (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Cast vector of type __m256 to type __m256d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castps_si256 (__m256 a)

Synopsis

__m256i _mm256_castps_si256 (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256 to type __m256i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256 _mm256_castps128_ps256 (__m128 a)

Synopsis

__m256 _mm256_castps128_ps256 (__m128 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m128 to type __m256; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128 _mm256_castps256_ps128 (__m256 a)

Synopsis

__m128 _mm256_castps256_ps128 (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256 to type __m128. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256i _mm256_castsi128_si256 (__m128i a)

Synopsis

__m256i _mm256_castsi128_si256 (__m128i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m128i to type __m256i; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256d _mm256_castsi256_pd (__m256i a)

Synopsis

__m256d _mm256_castsi256_pd (__m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m256d. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m256 _mm256_castsi256_ps (__m256i a)

Synopsis

__m256 _mm256_castsi256_ps (__m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m256. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
__m128i _mm256_castsi256_si128 (__m256i a)

Synopsis

__m128i _mm256_castsi256_si128 (__m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Casts vector of type __m256i to type __m128i. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
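
Example

The cast intrinsics above only change the type the compiler sees; the bits stay the same. The sketch below contrasts a cast with a value conversion. The helper names float_bits8 and float_to_int8 are hypothetical, and the target attribute is a GCC/Clang extension assumed here so the code builds without a global -mavx flag.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: exposes the raw bit patterns of 8 floats.
   __attribute__((target("avx"))) is a GCC/Clang extension (assumption). */
__attribute__((target("avx")))
void float_bits8(const float *in, int32_t *out)
{
    assert(in && out);
    __m256  v    = _mm256_loadu_ps(in);
    __m256i bits = _mm256_castps_si256(v);   /* no instruction, same bits */
    _mm256_storeu_si256((__m256i *)out, bits);
}

/* Hypothetical helper: converts 8 floats to 8 integers by value. */
__attribute__((target("avx")))
void float_to_int8(const float *in, int32_t *out)
{
    assert(in && out);
    __m256i vals = _mm256_cvtps_epi32(_mm256_loadu_ps(in)); /* vcvtps2dq */
    _mm256_storeu_si256((__m256i *)out, vals);
}
```

For an input of 1.0f, the cast yields the IEEE 754 bit pattern 0x3F800000, while the conversion yields the integer 1.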
vroundpd
__m256d _mm256_ceil_pd (__m256d a)

Synopsis

__m256d _mm256_ceil_pd (__m256d a)
#include <immintrin.h>
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a up to an integer value, and store the results as packed double-precision floating-point elements in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := CEIL(a[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_ceil_ps (__m256 a)

Synopsis

__m256 _mm256_ceil_ps (__m256 a)
#include <immintrin.h>
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a up to an integer value, and store the results as packed single-precision floating-point elements in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := CEIL(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcmppd
__m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)

Synopsis

__m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)
#include <immintrin.h>
Instruction: vcmppd xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ
1: OP := _CMP_LT_OS
2: OP := _CMP_LE_OS
3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ
5: OP := _CMP_NLT_US
6: OP := _CMP_NLE_US
7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ
9: OP := _CMP_NGE_US
10: OP := _CMP_NGT_US
11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ
13: OP := _CMP_GE_OS
14: OP := _CMP_GT_OS
15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS
17: OP := _CMP_LT_OQ
18: OP := _CMP_LE_OQ
19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US
21: OP := _CMP_NLT_UQ
22: OP := _CMP_NLE_UQ
23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US
25: OP := _CMP_NGE_UQ
26: OP := _CMP_NGT_UQ
27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS
29: OP := _CMP_GE_OQ
30: OP := _CMP_GT_OQ
31: OP := _CMP_TRUE_US
ESAC
FOR j := 0 to 1
    i := j*64
    dst[i+63:i] := ( a[i+63:i] OP b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmppd
__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)
#include <immintrin.h>
Instruction: vcmppd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ
1: OP := _CMP_LT_OS
2: OP := _CMP_LE_OS
3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ
5: OP := _CMP_NLT_US
6: OP := _CMP_NLE_US
7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ
9: OP := _CMP_NGE_US
10: OP := _CMP_NGT_US
11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ
13: OP := _CMP_GE_OS
14: OP := _CMP_GT_OS
15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS
17: OP := _CMP_LT_OQ
18: OP := _CMP_LE_OQ
19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US
21: OP := _CMP_NLT_UQ
22: OP := _CMP_NLE_UQ
23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US
25: OP := _CMP_NGE_UQ
26: OP := _CMP_NGT_UQ
27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS
29: OP := _CMP_GE_OQ
30: OP := _CMP_GT_OQ
31: OP := _CMP_TRUE_US
ESAC
FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := ( a[i+63:i] OP b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
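
Example

The predicate names in the table above encode both the relation and the NaN behaviour: an ordered (O) predicate is false when either input is NaN, an unordered (U) predicate is true. The sketch below shows the difference for _CMP_LT_OQ and _CMP_NLT_UQ. The helper names are hypothetical, and the target attribute is a GCC/Clang extension assumed here in place of a global -mavx flag.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: bit j of the result is set when a[j] < b[j].
   A NaN in either input leaves the bit clear (ordered predicate). */
__attribute__((target("avx")))
int lt_ordered_mask4(const double *a, const double *b)
{
    assert(a && b);
    __m256d m = _mm256_cmp_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b),
                              _CMP_LT_OQ);
    return _mm256_movemask_pd(m);   /* collects the sign bit of each lane */
}

/* Hypothetical helper: bit j is set when NOT (a[j] < b[j]).
   A NaN in either input sets the bit (unordered predicate). */
__attribute__((target("avx")))
int nlt_unordered_mask4(const double *a, const double *b)
{
    assert(a && b);
    __m256d m = _mm256_cmp_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b),
                              _CMP_NLT_UQ);
    return _mm256_movemask_pd(m);
}
```

The imm8 argument must be a compile-time constant, which is why the sketch uses one function per predicate instead of passing the predicate as a parameter.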
vcmpps
__m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)

Synopsis

__m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)
#include <immintrin.h>
Instruction: vcmpps xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ
1: OP := _CMP_LT_OS
2: OP := _CMP_LE_OS
3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ
5: OP := _CMP_NLT_US
6: OP := _CMP_NLE_US
7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ
9: OP := _CMP_NGE_US
10: OP := _CMP_NGT_US
11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ
13: OP := _CMP_GE_OS
14: OP := _CMP_GT_OS
15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS
17: OP := _CMP_LT_OQ
18: OP := _CMP_LE_OQ
19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US
21: OP := _CMP_NLT_UQ
22: OP := _CMP_NLE_UQ
23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US
25: OP := _CMP_NGE_UQ
26: OP := _CMP_NGT_UQ
27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS
29: OP := _CMP_GE_OQ
30: OP := _CMP_GT_OQ
31: OP := _CMP_TRUE_US
ESAC
FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := ( a[i+31:i] OP b[i+31:i] ) ? 0xFFFFFFFF : 0
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmpps
__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vcmpps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ
1: OP := _CMP_LT_OS
2: OP := _CMP_LE_OS
3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ
5: OP := _CMP_NLT_US
6: OP := _CMP_NLE_US
7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ
9: OP := _CMP_NGE_US
10: OP := _CMP_NGT_US
11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ
13: OP := _CMP_GE_OS
14: OP := _CMP_GT_OS
15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS
17: OP := _CMP_LT_OQ
18: OP := _CMP_LE_OQ
19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US
21: OP := _CMP_NLT_UQ
22: OP := _CMP_NLE_UQ
23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US
25: OP := _CMP_NGE_UQ
26: OP := _CMP_NGT_UQ
27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS
29: OP := _CMP_GE_OQ
30: OP := _CMP_GT_OQ
31: OP := _CMP_TRUE_US
ESAC
FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ( a[i+31:i] OP b[i+31:i] ) ? 0xFFFFFFFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
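
Example

Because each result element is either all ones or all zeros, the comparison result can be used directly as a bit mask. The sketch below keeps the positive elements of a vector and zeroes the rest with a bitwise AND. The helper name zero_negatives8 is hypothetical, and the target attribute is a GCC/Clang extension assumed here in place of a global -mavx flag.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: out[j] = in[j] if in[j] > 0, otherwise 0. */
__attribute__((target("avx")))
void zero_negatives8(const float *in, float *out)
{
    assert(in && out);
    __m256 v    = _mm256_loadu_ps(in);
    /* keep[j] is 0xFFFFFFFF where v[j] > 0 and 0 elsewhere (NaN gives 0). */
    __m256 keep = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_GT_OQ);
    /* AND with all ones preserves the element, AND with 0 clears it. */
    _mm256_storeu_ps(out, _mm256_and_ps(v, keep));
}
```

This branch-free select is a common use of the comparison masks; the choice of _CMP_GT_OQ here means NaN inputs are also replaced by zero.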
vcmpsd
__m128d _mm_cmp_sd (__m128d a, __m128d b, const int imm8)

Synopsis

__m128d _mm_cmp_sd (__m128d a, __m128d b, const int imm8)
#include <immintrin.h>
Instruction: vcmpsd xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare the lower double-precision (64-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of dst, and copy the upper element from a to the upper element of dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ
1: OP := _CMP_LT_OS
2: OP := _CMP_LE_OS
3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ
5: OP := _CMP_NLT_US
6: OP := _CMP_NLE_US
7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ
9: OP := _CMP_NGE_US
10: OP := _CMP_NGT_US
11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ
13: OP := _CMP_GE_OS
14: OP := _CMP_GT_OS
15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS
17: OP := _CMP_LT_OQ
18: OP := _CMP_LE_OQ
19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US
21: OP := _CMP_NLT_UQ
22: OP := _CMP_NLE_UQ
23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US
25: OP := _CMP_NGE_UQ
26: OP := _CMP_NGT_UQ
27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS
29: OP := _CMP_GE_OQ
30: OP := _CMP_GT_OQ
31: OP := _CMP_TRUE_US
ESAC
dst[63:0] := ( a[63:0] OP b[63:0] ) ? 0xFFFFFFFFFFFFFFFF : 0
dst[127:64] := a[127:64]
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vcmpss
__m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)

Synopsis

__m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)
#include <immintrin.h>
Instruction: vcmpss xmm, xmm, xmm, imm
CPUID Flags: AVX

Description

Compare the lower single-precision (32-bit) floating-point element in a and b based on the comparison operand specified by imm8, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst.

Operation

CASE (imm8[7:0]) OF
0: OP := _CMP_EQ_OQ
1: OP := _CMP_LT_OS
2: OP := _CMP_LE_OS
3: OP := _CMP_UNORD_Q
4: OP := _CMP_NEQ_UQ
5: OP := _CMP_NLT_US
6: OP := _CMP_NLE_US
7: OP := _CMP_ORD_Q
8: OP := _CMP_EQ_UQ
9: OP := _CMP_NGE_US
10: OP := _CMP_NGT_US
11: OP := _CMP_FALSE_OQ
12: OP := _CMP_NEQ_OQ
13: OP := _CMP_GE_OS
14: OP := _CMP_GT_OS
15: OP := _CMP_TRUE_UQ
16: OP := _CMP_EQ_OS
17: OP := _CMP_LT_OQ
18: OP := _CMP_LE_OQ
19: OP := _CMP_UNORD_S
20: OP := _CMP_NEQ_US
21: OP := _CMP_NLT_UQ
22: OP := _CMP_NLE_UQ
23: OP := _CMP_ORD_S
24: OP := _CMP_EQ_US
25: OP := _CMP_NGE_UQ
26: OP := _CMP_NGT_UQ
27: OP := _CMP_FALSE_OS
28: OP := _CMP_NEQ_OS
29: OP := _CMP_GE_OQ
30: OP := _CMP_GT_OQ
31: OP := _CMP_TRUE_US
ESAC
dst[31:0] := ( a[31:0] OP b[31:0] ) ? 0xFFFFFFFF : 0
dst[127:32] := a[127:32]
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 3
Sandy Bridge 3
vpcmpeqw
__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := ( a[i+15:i] == b[i+15:i] ) ? 0xFFFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqd
__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ( a[i+31:i] == b[i+31:i] ) ? 0xFFFFFFFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqq
__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 64-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := ( a[i+63:i] == b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpcmpeqb
__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpeqb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b for equality, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := ( a[i+7:i] == b[i+7:i] ) ? 0xFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
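
Example

A typical use of the byte-wise equality comparison is searching a block of memory for a value. The sketch below checks 32 bytes at once and returns the index of the first match. The helper name find_byte32 is hypothetical; the target attribute and __builtin_ctz are GCC/Clang extensions assumed here, and the code needs a CPU with AVX2 at run time.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: index of the first byte equal to needle in a
   32-byte block, or -1 if there is none. */
__attribute__((target("avx2")))
int find_byte32(const unsigned char *block, unsigned char needle)
{
    assert(block);
    __m256i data = _mm256_loadu_si256((const __m256i *)block);
    __m256i eq   = _mm256_cmpeq_epi8(data, _mm256_set1_epi8((char)needle));
    /* One mask bit per byte: bit j is set when block[j] == needle. */
    unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
    return mask ? __builtin_ctz(mask) : -1;
}
```

Counting trailing zeros of the mask gives the lowest matching index, because bit 0 of the mask corresponds to byte 0 of the vector.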
vpcmpgtw
__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 16-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := ( a[i+15:i] > b[i+15:i] ) ? 0xFFFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpcmpgtd
__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 32-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ( a[i+31:i] > b[i+31:i] ) ? 0xFFFFFFFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
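
Example

The greater-than comparison treats its inputs as signed integers. The sketch below counts how many of 8 signed 32-bit values exceed a threshold by turning the comparison result into a bit mask. The helper name count_greater8 is hypothetical; the target attribute and __builtin_popcount are GCC/Clang extensions assumed here, and the code needs a CPU with AVX2 at run time.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: number of elements of v[0..7] that are greater
   than threshold (signed comparison). */
__attribute__((target("avx2")))
int count_greater8(const int32_t *v, int32_t threshold)
{
    assert(v);
    __m256i data = _mm256_loadu_si256((const __m256i *)v);
    __m256i gt   = _mm256_cmpgt_epi32(data, _mm256_set1_epi32(threshold));
    /* Reinterpret as floats only to collect one sign bit per 32-bit lane. */
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(gt));
    return __builtin_popcount((unsigned)mask);
}
```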
vpcmpgtq
__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 64-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := ( a[i+63:i] > b[i+63:i] ) ? 0xFFFFFFFFFFFFFFFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpcmpgtb
__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpcmpgtb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed 8-bit integers in a and b for greater-than, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := ( a[i+7:i] > b[i+7:i] ) ? 0xFF : 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmovsxwd
__m256i _mm256_cvtepi16_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepi16_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxwd ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 16-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7
    i := 32*j
    k := 16*j
    dst[i+31:i] := SignExtend(a[k+15:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxwq
__m256i _mm256_cvtepi16_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi16_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxwq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 16-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 64*j
    k := 16*j
    dst[i+63:i] := SignExtend(a[k+15:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxdq
__m256i _mm256_cvtepi32_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi32_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxdq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 32-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 64*j
    k := 32*j
    dst[i+63:i] := SignExtend(a[k+31:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vcvtdq2pd
__m256d _mm256_cvtepi32_pd (__m128i a)

Synopsis

__m256d _mm256_cvtepi32_pd (__m128i a)
#include <immintrin.h>
Instruction: vcvtdq2pd ymm, xmm
CPUID Flags: AVX

Description

Convert packed 32-bit integers in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*32
    m := j*64
    dst[m+63:m] := Convert_Int32_To_FP64(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtdq2ps
__m256 _mm256_cvtepi32_ps (__m256i a)

Synopsis

__m256 _mm256_cvtepi32_ps (__m256i a)
#include <immintrin.h>
Instruction: vcvtdq2ps ymm, ymm
CPUID Flags: AVX

Description

Convert packed 32-bit integers in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 7
    i := 32*j
    dst[i+31:i] := Convert_Int32_To_FP32(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpmovsxbw
__m256i _mm256_cvtepi8_epi16 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxbw ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in a to packed 16-bit integers, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*8
    l := j*16
    dst[l+15:l] := SignExtend(a[i+7:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxbd
__m256i _mm256_cvtepi8_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxbd ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7
    i := 32*j
    k := 8*j
    dst[i+31:i] := SignExtend(a[k+7:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovsxbq
__m256i _mm256_cvtepi8_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepi8_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovsxbq ymm, xmm
CPUID Flags: AVX2

Description

Sign extend packed 8-bit integers in the low 4 bytes of a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 64*j
    k := 8*j
    dst[i+63:i] := SignExtend(a[k+7:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxwd
__m256i _mm256_cvtepu16_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepu16_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxwd ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 16-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7
    i := 32*j
    k := 16*j
    dst[i+31:i] := ZeroExtend(a[k+15:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxwq
__m256i _mm256_cvtepu16_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu16_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxwq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 16-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 64*j
    k := 16*j
    dst[i+63:i] := ZeroExtend(a[k+15:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxdq
__m256i _mm256_cvtepu32_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu32_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxdq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 32-bit integers in a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 64*j
    k := 32*j
    dst[i+63:i] := ZeroExtend(a[k+31:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbw
__m256i _mm256_cvtepu8_epi16 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi16 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxbw ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in a to packed 16-bit integers, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*8
    l := j*16
    dst[l+15:l] := ZeroExtend(a[i+7:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbd
__m256i _mm256_cvtepu8_epi32 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi32 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxbd ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7
    i := 32*j
    k := 8*j
    dst[i+31:i] := ZeroExtend(a[k+7:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpmovzxbq
__m256i _mm256_cvtepu8_epi64 (__m128i a)

Synopsis

__m256i _mm256_cvtepu8_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpmovzxbq ymm, xmm
CPUID Flags: AVX2

Description

Zero extend packed unsigned 8-bit integers in the low 4 bytes of a to packed 64-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 64*j
    k := 8*j
    dst[i+63:i] := ZeroExtend(a[k+7:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
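
Example

The cvtepi8 and cvtepu8 families differ only in how they fill the new upper bits: with copies of the sign bit, or with zeros. The sketch below widens the same 8 bytes both ways. The helper name widen_bytes8 is hypothetical; the target attribute is a GCC/Clang extension assumed here, and the code needs a CPU with AVX2 at run time.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: widens 8 bytes to 8 32-bit integers, once with
   sign extension and once with zero extension. */
__attribute__((target("avx2")))
void widen_bytes8(const uint8_t *in, int32_t *as_signed, int32_t *as_unsigned)
{
    assert(in && as_signed && as_unsigned);
    /* Load only the low 8 bytes, which is all the 8-lane forms consume. */
    __m128i bytes = _mm_loadl_epi64((const __m128i *)in);
    _mm256_storeu_si256((__m256i *)as_signed,   _mm256_cvtepi8_epi32(bytes));
    _mm256_storeu_si256((__m256i *)as_unsigned, _mm256_cvtepu8_epi32(bytes));
}
```

The byte 200 (0xC8) becomes -56 under sign extension and stays 200 under zero extension; values below 128 are identical in both results.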
vcvtpd2dq
__m128i _mm256_cvtpd_epi32 (__m256d a)

Synopsis

__m128i _mm256_cvtpd_epi32 (__m256d a)
#include <immintrin.h>
Instruction: vcvtpd2dq xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 32*j
    k := 64*j
    dst[i+31:i] := Convert_FP64_To_Int32(a[k+63:k])
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtpd2ps
__m128 _mm256_cvtpd_ps (__m256d a)

Synopsis

__m128 _mm256_cvtpd_ps (__m256d a)
#include <immintrin.h>
Instruction: vcvtpd2ps xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 32*j
    k := 64*j
    dst[i+31:i] := Convert_FP64_To_FP32(a[k+63:k])
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvtps2dq
__m256i _mm256_cvtps_epi32 (__m256 a)

Synopsis

__m256i _mm256_cvtps_epi32 (__m256 a)
#include <immintrin.h>
Instruction: vcvtps2dq ymm, ymm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.

Operation

FOR j := 0 to 7
    i := 32*j
    dst[i+31:i] := Convert_FP32_To_Int32(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vcvtps2pd
__m256d _mm256_cvtps_pd (__m128 a)

Synopsis

__m256d _mm256_cvtps_pd (__m128 a)
#include <immintrin.h>
Instruction: vcvtps2pd ymm, xmm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 64*j
    k := 32*j
    dst[i+63:i] := Convert_FP32_To_FP64(a[k+31:k])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 1
Ivy Bridge 2 1
Sandy Bridge 2 1
vcvttpd2dq
__m128i _mm256_cvttpd_epi32 (__m256d a)

Synopsis

__m128i _mm256_cvttpd_epi32 (__m256d a)
#include <immintrin.h>
Instruction: vcvttpd2dq xmm, ymm
CPUID Flags: AVX

Description

Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 32*j
    k := 64*j
    dst[i+31:i] := Convert_FP64_To_Int32_Truncate(a[k+63:k])
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 4 1
Ivy Bridge 4 1
Sandy Bridge 4 1
vcvttps2dq
__m256i _mm256_cvttps_epi32 (__m256 a)

Synopsis

__m256i _mm256_cvttps_epi32 (__m256 a)
#include <immintrin.h>
Instruction: vcvttps2dq ymm, ymm
CPUID Flags: AVX

Description

Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.

Operation

FOR j := 0 to 7
    i := 32*j
    dst[i+31:i] := Convert_FP32_To_Int32_Truncate(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
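
Example

The cvt and cvtt conversions differ in rounding: cvt uses the current MXCSR rounding mode, while cvtt always truncates toward zero. The sketch below applies both to the same input. It assumes the default rounding mode (round to nearest, ties to even); the helper name round_and_truncate8 is hypothetical, and the target attribute is a GCC/Clang extension assumed here in place of a global -mavx flag.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: converts 8 floats to integers twice, once with the
   current rounding mode and once with truncation toward zero. */
__attribute__((target("avx")))
void round_and_truncate8(const float *in, int32_t *rounded, int32_t *truncated)
{
    assert(in && rounded && truncated);
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_si256((__m256i *)rounded,   _mm256_cvtps_epi32(v));
    _mm256_storeu_si256((__m256i *)truncated, _mm256_cvttps_epi32(v));
}
```

Under the default mode, 2.5 rounds to 2 and 3.5 rounds to 4 (ties go to the even integer), whereas truncation gives 2 and 3.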
vdivpd
__m256d _mm256_div_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_div_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vdivpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Divide packed double-precision (64-bit) floating-point elements in a by packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := 64*j
    dst[i+63:i] := a[i+63:i] / b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 35 25
Ivy Bridge 35 28
Sandy Bridge 43 44
vdivps
__m256 _mm256_div_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_div_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vdivps ymm, ymm, ymm
CPUID Flags: AVX

Description

Divide packed single-precision (32-bit) floating-point elements in a by packed elements in b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := 32*j
    dst[i+31:i] := a[i+31:i] / b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 21 13
Ivy Bridge 21 14
Sandy Bridge 29 28
vdpps
__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vdpps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Within each 128-bit lane, conditionally multiply the packed single-precision (32-bit) floating-point elements in a and b using the high 4 bits in imm8, sum the four products, and conditionally store the sum in dst using the low 4 bits of imm8.

Operation

DP(a[127:0], b[127:0], imm8[7:0]) {
    FOR j := 0 to 3
        i := j*32
        IF imm8[(4+j)%8]
            temp[i+31:i] := a[i+31:i] * b[i+31:i]
        ELSE
            temp[i+31:i] := 0
        FI
    ENDFOR
    sum[31:0] := (temp[127:96] + temp[95:64]) + (temp[63:32] + temp[31:0])
    FOR j := 0 to 3
        i := j*32
        IF imm8[j%8]
            tmpdst[i+31:i] := sum[31:0]
        ELSE
            tmpdst[i+31:i] := 0
        FI
    ENDFOR
    RETURN tmpdst[127:0]
}
dst[127:0] := DP(a[127:0], b[127:0], imm8[7:0])
dst[255:128] := DP(a[255:128], b[255:128], imm8[7:0])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 14 2
Ivy Bridge 12 2
Sandy Bridge 12 2
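
Example

As the Operation pseudocode shows, the dot product is computed separately for the lower and upper 128 bits, so a full 8-element dot product needs one extra addition. The sketch below uses imm8 = 0xF1 (multiply all four pairs, store the sum in element 0 of each half) and then adds the two partial sums. The helper name dot8 is hypothetical, and the target attribute is a GCC/Clang extension assumed here in place of a global -mavx flag.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: dot product of two 8-element float arrays. */
__attribute__((target("avx")))
float dot8(const float *a, const float *b)
{
    assert(a && b);
    /* High nibble 0xF: use all four products of each 128-bit half.
       Low nibble 0x1: write the sum to element 0 of each half only. */
    __m256 dp = _mm256_dp_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b), 0xF1);
    __m128 lo = _mm256_castps256_ps128(dp);     /* partial sum of a[0..3] */
    __m128 hi = _mm256_extractf128_ps(dp, 1);   /* partial sum of a[4..7] */
    return _mm_cvtss_f32(_mm_add_ss(lo, hi));
}
```

Given the latency listed in the table above, a multiply followed by a separate reduction may be preferable in tight loops; this sketch only illustrates the imm8 encoding.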
__int16 _mm256_extract_epi16 (__m256i a, const int index)

Synopsis

__int16 _mm256_extract_epi16 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract a 16-bit integer from a, selected with index, and store the result in dst.

Operation

dst[15:0] := (a[255:0] >> (index * 16))[15:0]
__int32 _mm256_extract_epi32 (__m256i a, const int index)

Synopsis

__int32 _mm256_extract_epi32 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract a 32-bit integer from a, selected with index, and store the result in dst.

Operation

dst[31:0] := (a[255:0] >> (index * 32))[31:0]
__int64 _mm256_extract_epi64 (__m256i a, const int index)

Synopsis

__int64 _mm256_extract_epi64 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract a 64-bit integer from a, selected with index, and store the result in dst.

Operation

dst[63:0] := (a[255:0] >> (index * 64))[63:0]
__int8 _mm256_extract_epi8 (__m256i a, const int index)

Synopsis

__int8 _mm256_extract_epi8 (__m256i a, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Extract an 8-bit integer from a, selected with index, and store the result in dst.

Operation

dst[7:0] := (a[255:0] >> (index * 8))[7:0]
vextractf128
__m128d _mm256_extractf128_pd (__m256d a, const int imm8)

Synopsis

__m128d _mm256_extractf128_pd (__m256d a, const int imm8)
#include <immintrin.h>
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of
0: dst[127:0] := a[127:0]
1: dst[127:0] := a[255:128]
ESAC
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextractf128
__m128 _mm256_extractf128_ps (__m256 a, const int imm8)

Synopsis

__m128 _mm256_extractf128_ps (__m256 a, const int imm8)
#include <immintrin.h>
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of
0: dst[127:0] := a[127:0]
1: dst[127:0] := a[255:128]
ESAC
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
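
Example

Extracting the two 128-bit halves is a common first step when reducing a 256-bit vector. The sketch below adds the upper four floats to the lower four. The helper name add_halves8 is hypothetical, and the target attribute is a GCC/Clang extension assumed here in place of a global -mavx flag.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: out4[j] = in[j] + in[j + 4] for j = 0..3. */
__attribute__((target("avx")))
void add_halves8(const float *in, float *out4)
{
    assert(in && out4);
    __m256 v  = _mm256_loadu_ps(in);
    __m128 lo = _mm256_extractf128_ps(v, 0);   /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);   /* elements 4..7 */
    _mm_storeu_ps(out4, _mm_add_ps(lo, hi));
}
```

For the lower half, _mm256_castps256_ps128 gives the same value without an instruction; extractf128 with index 0 is used here only to show both selector values.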
vextractf128
__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)

Synopsis

__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vextractf128 xmm, ymm, imm
CPUID Flags: AVX

Description

Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of
0: dst[127:0] := a[127:0]
1: dst[127:0] := a[255:128]
ESAC
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vextracti128
__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)

Synopsis

__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vextracti128 xmm, ymm, imm
CPUID Flags: AVX2

Description

Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.

Operation

CASE imm8[7:0] of
0: dst[127:0] := a[127:0]
1: dst[127:0] := a[255:128]
ESAC
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vroundpd
__m256d _mm256_floor_pd (__m256d a)

Synopsis

__m256d _mm256_floor_pd (__m256d a)
#include <immintrin.h>
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a down to an integer value, and store the results as packed double-precision floating-point elements in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := FLOOR(a[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_floor_ps (__m256 a)

Synopsis

__m256 _mm256_floor_ps (__m256 a)
#include <immintrin.h>
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a down to an integer value, and store the results as packed single-precision floating-point elements in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := FLOOR(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
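
Example

Floor and ceil both map to the same vroundps instruction with a different immediate, and both keep the result in floating-point form. The sketch below applies the two to one input so the difference for negative values is visible. The helper name floor_ceil8 is hypothetical, and the target attribute is a GCC/Clang extension assumed here in place of a global -mavx flag.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: lo[j] = floor(in[j]), hi[j] = ceil(in[j]). */
__attribute__((target("avx")))
void floor_ceil8(const float *in, float *lo, float *hi)
{
    assert(in && lo && hi);
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(lo, _mm256_floor_ps(v));   /* rounds toward -infinity */
    _mm256_storeu_ps(hi, _mm256_ceil_ps(v));    /* rounds toward +infinity */
}
```

For -1.5 the floor is -2 and the ceiling is -1; a value that is already integral, such as 2.0, is returned unchanged by both.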
vphaddw
__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphaddw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := a[31:16] + a[15:0]
dst[31:16] := a[63:48] + a[47:32]
dst[47:32] := a[95:80] + a[79:64]
dst[63:48] := a[127:112] + a[111:96]
dst[79:64] := b[31:16] + b[15:0]
dst[95:80] := b[63:48] + b[47:32]
dst[111:96] := b[95:80] + b[79:64]
dst[127:112] := b[127:112] + b[111:96]
dst[143:128] := a[159:144] + a[143:128]
dst[159:144] := a[191:176] + a[175:160]
dst[175:160] := a[223:208] + a[207:192]
dst[191:176] := a[255:240] + a[239:224]
dst[207:192] := b[159:144] + b[143:128]
dst[223:208] := b[191:176] + b[175:160]
dst[239:224] := b[223:208] + b[207:192]
dst[255:240] := b[255:240] + b[239:224]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
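
The element order is easy to misread because each 128-bit lane is processed on its own: within a lane, the low four results come from a and the high four from b. A scalar sketch of that layout, assuming arrays indexed from the least significant element (hadd_epi16_model is a hypothetical name, not part of immintrin.h):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the 256-bit horizontal add on 16-bit elements.
 * Index 0 is the least significant element of the vector. */
void hadd_epi16_model(const int16_t a[16], const int16_t b[16], int16_t dst[16])
{
    assert(dst != a && dst != b);          /* the model does not support aliasing */
    for (int lane = 0; lane < 2; ++lane) { /* two independent 128-bit lanes */
        const int base = lane * 8;         /* eight 16-bit elements per lane */
        for (int k = 0; k < 4; ++k) {
            /* Sums keep only the low 16 bits (the narrowing cast wraps on
             * mainstream compilers). */
            dst[base + k]     = (int16_t)(a[base + 2 * k] + a[base + 2 * k + 1]);
            dst[base + 4 + k] = (int16_t)(b[base + 2 * k] + b[base + 2 * k + 1]);
        }
    }
}
```

For a = {0, 1, ..., 15} and b = {100, 101, ..., 115}, the low lane of the result is {1, 5, 9, 13, 201, 205, 209, 213} and the high lane is {17, 21, 25, 29, 217, 221, 225, 229}.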
vphaddd
__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphaddd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.

Operation

dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]
dst[159:128] := a[191:160] + a[159:128]
dst[191:160] := a[255:224] + a[223:192]
dst[223:192] := b[191:160] + b[159:128]
dst[255:224] := b[255:224] + b[223:192]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
vhaddpd
__m256d _mm256_hadd_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_hadd_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vhaddpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally add adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[63:0] := a[127:64] + a[63:0]
dst[127:64] := b[127:64] + b[63:0]
dst[191:128] := a[255:192] + a[191:128]
dst[255:192] := b[255:192] + b[191:128]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
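
With only two doubles per 128-bit lane, the result interleaves one sum from a and one from b in each lane. The following is a scalar sketch of the Operation above, assuming index 0 is the least significant element; hadd_pd_model is a hypothetical name used only for illustration.

```c
#include <assert.h>

/* Scalar model of the 256-bit horizontal add on doubles:
 * dst = { a0+a1, b0+b1, a2+a3, b2+b3 }. */
void hadd_pd_model(const double a[4], const double b[4], double dst[4])
{
    assert(dst != a && dst != b);          /* the model does not support aliasing */
    for (int lane = 0; lane < 2; ++lane) { /* two independent 128-bit lanes */
        const int base = lane * 2;         /* two doubles per lane */
        dst[base]     = a[base] + a[base + 1];
        dst[base + 1] = b[base] + b[base + 1];
    }
}
```

For a = {1, 2, 3, 4} and b = {10, 20, 30, 40} the model yields {3, 30, 7, 70}. Because no sum crosses a lane boundary, reducing a whole vector to one value needs a further step that combines the two lanes.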
vhaddps
__m256 _mm256_hadd_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_hadd_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vhaddps ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally add adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]
dst[159:128] := a[191:160] + a[159:128]
dst[191:160] := a[255:224] + a[223:192]
dst[223:192] := b[191:160] + b[159:128]
dst[255:224] := b[255:224] + b[223:192]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vphaddsw
__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphaddsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally add adjacent pairs of 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := Saturate_To_Int16(a[31:16] + a[15:0])
dst[31:16] := Saturate_To_Int16(a[63:48] + a[47:32])
dst[47:32] := Saturate_To_Int16(a[95:80] + a[79:64])
dst[63:48] := Saturate_To_Int16(a[127:112] + a[111:96])
dst[79:64] := Saturate_To_Int16(b[31:16] + b[15:0])
dst[95:80] := Saturate_To_Int16(b[63:48] + b[47:32])
dst[111:96] := Saturate_To_Int16(b[95:80] + b[79:64])
dst[127:112] := Saturate_To_Int16(b[127:112] + b[111:96])
dst[143:128] := Saturate_To_Int16(a[159:144] + a[143:128])
dst[159:144] := Saturate_To_Int16(a[191:176] + a[175:160])
dst[175:160] := Saturate_To_Int16(a[223:208] + a[207:192])
dst[191:176] := Saturate_To_Int16(a[255:240] + a[239:224])
dst[207:192] := Saturate_To_Int16(b[159:144] + b[143:128])
dst[223:208] := Saturate_To_Int16(b[191:176] + b[175:160])
dst[239:224] := Saturate_To_Int16(b[223:208] + b[207:192])
dst[255:240] := Saturate_To_Int16(b[255:240] + b[239:224])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 2
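
Saturation means a sum that does not fit in 16 bits is clamped to the nearest representable value instead of wrapping around. A minimal scalar sketch, assuming the same in-lane element order as the Operation above; saturate_to_int16 mirrors the pseudocode helper, and hadds_epi16_model is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate to the int16_t range. */
int16_t saturate_to_int16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Scalar model of the 256-bit saturating horizontal add on 16-bit elements. */
void hadds_epi16_model(const int16_t a[16], const int16_t b[16], int16_t dst[16])
{
    assert(dst != a && dst != b);          /* the model does not support aliasing */
    for (int lane = 0; lane < 2; ++lane) { /* two independent 128-bit lanes */
        const int base = lane * 8;
        for (int k = 0; k < 4; ++k) {
            /* Add in 32 bits so the true sum is available before clamping. */
            dst[base + k]     = saturate_to_int16((int32_t)a[base + 2 * k] + a[base + 2 * k + 1]);
            dst[base + 4 + k] = saturate_to_int16((int32_t)b[base + 2 * k] + b[base + 2 * k + 1]);
        }
    }
}
```

For example, 30000 + 30000 yields 32767 and -30000 + -30000 yields -32768, while small sums such as 5 + 7 pass through unchanged.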
vphsubw
__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphsubw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := a[15:0] - a[31:16]
dst[31:16] := a[47:32] - a[63:48]
dst[47:32] := a[79:64] - a[95:80]
dst[63:48] := a[111:96] - a[127:112]
dst[79:64] := b[15:0] - b[31:16]
dst[95:80] := b[47:32] - b[63:48]
dst[111:96] := b[79:64] - b[95:80]
dst[127:112] := b[111:96] - b[127:112]
dst[143:128] := a[143:128] - a[159:144]
dst[159:144] := a[175:160] - a[191:176]
dst[175:160] := a[207:192] - a[223:208]
dst[191:176] := a[239:224] - a[255:240]
dst[207:192] := b[143:128] - b[159:144]
dst[223:208] := b[175:160] - b[191:176]
dst[239:224] := b[207:192] - b[223:208]
dst[255:240] := b[239:224] - b[255:240]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vphsubd
__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphsubd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.

Operation

dst[31:0] := a[31:0] - a[63:32]
dst[63:32] := a[95:64] - a[127:96]
dst[95:64] := b[31:0] - b[63:32]
dst[127:96] := b[95:64] - b[127:96]
dst[159:128] := a[159:128] - a[191:160]
dst[191:160] := a[223:192] - a[255:224]
dst[223:192] := b[159:128] - b[191:160]
dst[255:224] := b[223:192] - b[255:224]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vhsubpd
__m256d _mm256_hsub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_hsub_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vhsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally subtract adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[63:0] := a[63:0] - a[127:64]
dst[127:64] := b[63:0] - b[127:64]
dst[191:128] := a[191:128] - a[255:192]
dst[255:192] := b[191:128] - b[255:192]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
vhsubps
__m256 _mm256_hsub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_hsub_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vhsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Horizontally subtract adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.

Operation

dst[31:0] := a[31:0] - a[63:32]
dst[63:32] := a[95:64] - a[127:96]
dst[95:64] := b[31:0] - b[63:32]
dst[127:96] := b[95:64] - b[127:96]
dst[159:128] := a[159:128] - a[191:160]
dst[191:160] := a[223:192] - a[255:224]
dst[223:192] := b[159:128] - b[191:160]
dst[255:224] := b[223:192] - b[255:224]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
Ivy Bridge 5
Sandy Bridge 5
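
For the horizontal subtract forms the operand order matters: in every pair the lower-indexed element is the minuend and the higher-indexed element is subtracted from it. A scalar sketch of the Operation above, assuming index 0 is the least significant element (hsub_ps_model is a hypothetical name):

```c
#include <assert.h>

/* Scalar model of the 256-bit horizontal subtract on floats.
 * Each pair contributes (even element) - (odd element). */
void hsub_ps_model(const float a[8], const float b[8], float dst[8])
{
    assert(dst != a && dst != b);          /* the model does not support aliasing */
    for (int lane = 0; lane < 2; ++lane) { /* two independent 128-bit lanes */
        const int base = lane * 4;         /* four floats per lane */
        dst[base + 0] = a[base + 0] - a[base + 1];
        dst[base + 1] = a[base + 2] - a[base + 3];
        dst[base + 2] = b[base + 0] - b[base + 1];
        dst[base + 3] = b[base + 2] - b[base + 3];
    }
}
```

For a = {10, 1, 20, 2, 30, 3, 40, 4} and b = {5, 1, 6, 2, 7, 3, 8, 4} the model yields {9, 18, 4, 4, 27, 36, 4, 4}.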
vphsubsw
__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vphsubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Horizontally subtract adjacent pairs of 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.

Operation

dst[15:0] := Saturate_To_Int16(a[15:0] - a[31:16])
dst[31:16] := Saturate_To_Int16(a[47:32] - a[63:48])
dst[47:32] := Saturate_To_Int16(a[79:64] - a[95:80])
dst[63:48] := Saturate_To_Int16(a[111:96] - a[127:112])
dst[79:64] := Saturate_To_Int16(b[15:0] - b[31:16])
dst[95:80] := Saturate_To_Int16(b[47:32] - b[63:48])
dst[111:96] := Saturate_To_Int16(b[79:64] - b[95:80])
dst[127:112] := Saturate_To_Int16(b[111:96] - b[127:112])
dst[143:128] := Saturate_To_Int16(a[143:128] - a[159:144])
dst[159:144] := Saturate_To_Int16(a[175:160] - a[191:176])
dst[175:160] := Saturate_To_Int16(a[207:192] - a[223:208])
dst[191:176] := Saturate_To_Int16(a[239:224] - a[255:240])
dst[207:192] := Saturate_To_Int16(b[143:128] - b[159:144])
dst[223:208] := Saturate_To_Int16(b[175:160] - b[191:176])
dst[239:224] := Saturate_To_Int16(b[207:192] - b[223:208])
dst[255:240] := Saturate_To_Int16(b[239:224] - b[255:240])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpgatherdd
__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vpgatherdd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
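
Since scale multiplies each index to form a byte offset, indexing an int array by element number means passing scale = 4. The sketch below wraps the intrinsic in a hypothetical helper, gather4_i32. It assumes the intrinsic is only available when the compiler defines __AVX2__ (for example with -mavx2 on GCC or Clang) and otherwise falls back to a scalar loop with the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Gather four 32-bit integers: dst[j] = value at byte offset idx[j]*4 from base. */
void gather4_i32(const int32_t *base, const int32_t idx[4], int32_t dst[4])
{
    assert(base != NULL);
#if defined(__AVX2__)
    /* Unaligned load of the four indices, gather, unaligned store. */
    __m128i vindex = _mm_loadu_si128((const __m128i *)idx);
    __m128i v = _mm_i32gather_epi32((const int *)base, vindex, 4);
    _mm_storeu_si128((__m128i *)dst, v);
#else
    /* Scalar fallback with the same addressing rule. */
    for (int j = 0; j < 4; ++j) {
        memcpy(&dst[j], (const char *)base + (ptrdiff_t)idx[j] * 4, sizeof dst[j]);
    }
#endif
}
```

With a table {10, 11, ..., 17} and indices {7, 0, 3, 3} the helper returns {17, 10, 13, 13}. Indices may repeat and need not be sorted. The scale argument of the intrinsic has to be a compile-time constant, which is why it is written as a literal here.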
vpgatherdd
__m128i _mm_mask_i32gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i32gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include <immintrin.h>
Instruction: vpgatherdd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
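
In the masked forms only the most significant bit of each mask element matters, and according to the Operation above memory is read only for elements whose bit is set. A scalar sketch of that rule, assuming element indices into an int array (that is, scale = 4); mask_gather4_i32_model is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a masked gather of four 32-bit integers. */
void mask_gather4_i32_model(const int32_t src[4], const int32_t *base,
                            const int32_t idx[4], const int32_t mask[4],
                            int32_t dst[4])
{
    assert(base != 0);
    for (int j = 0; j < 4; ++j) {
        if (mask[j] < 0) {            /* bit 31 set: load from memory */
            dst[j] = base[idx[j]];    /* element index, equivalent to scale = 4 */
        } else {                      /* bit 31 clear: no load, keep src */
            dst[j] = src[j];
        }
    }
}
```

In the model, an out-of-range index is harmless as long as its mask element has the top bit clear, because that element is never dereferenced. Any other bits of the mask, such as 0x7FFFFFFF, do not enable a load.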
vpgatherdd
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include <immintrin.h>
Instruction: vpgatherdd ymm, vm32y, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdd
__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include <immintrin.h>
Instruction: vpgatherdd ymm, vm32y, ymm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vpgatherdq xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    m := j*32
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include <immintrin.h>
Instruction: vpgatherdq xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    m := j*32
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherdq
__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vpgatherdq ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    m := j*32
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
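
When the indices are narrower than the gathered elements, four 32-bit indices from a 128-bit vindex address four 64-bit values. Each index is sign-extended before scaling, so negative indices reach below base_addr. A scalar sketch of the addressing rule, treating scale as a run-time parameter for illustration only (the intrinsic requires a constant); i32gather_epi64_model is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* dst[j] = 64-bit value at byte address base_addr + sign_extend(vindex[j]) * scale */
void i32gather_epi64_model(const void *base_addr, const int32_t vindex[4],
                           int scale, int64_t dst[4])
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (int j = 0; j < 4; ++j) {
        /* Widen first (SignExtend), then multiply, so the offset cannot
         * overflow 32 bits. */
        int64_t offset = (int64_t)vindex[j] * scale;
        memcpy(&dst[j], (const char *)base_addr + offset, sizeof dst[j]);
    }
}
```

With scale = 8 the indices count 64-bit elements; with scale = 1 they are plain byte offsets, so {5, 0, 2, 2} at scale 8 and {40, 0, 16, 16} at scale 1 select the same values.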
vpgatherdq
__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)
#include <immintrin.h>
Instruction: vpgatherdq ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    m := j*32
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vgatherdpd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    m := j*32
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)

Synopsis

__m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
#include <immintrin.h>
Instruction: vgatherdpd xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    m := j*32
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vgatherdpd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    m := j*32
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdpd
__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256d mask, const int scale)

Synopsis

__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256d mask, const int scale)
#include <immintrin.h>
Instruction: vgatherdpd ymm, vm32x, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    m := j*32
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[m+31:m])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vgatherdps xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m128 _mm_mask_i32gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm_mask_i32gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
#include <immintrin.h>
Instruction: vgatherdps xmm, vm32x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)
#include <immintrin.h>
Instruction: vgatherdps ymm, vm32y, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherdps
__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256 mask, const int scale)

Synopsis

__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256 mask, const int scale)
#include <immintrin.h>
Instruction: vgatherdps ymm, vm32y, ymm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqd
__m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vpgatherqd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*32
    m := j*64
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
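
When the indices are wider than the gathered elements, the result holds fewer elements than its register type suggests: two 64-bit indices produce two 32-bit values in the low half of the 128-bit result, and the upper half is zeroed (dst[MAX:64] := 0). A scalar sketch of that behaviour, with scale as a run-time parameter for illustration only; i64gather_epi32_model is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the 128-bit form: two 64-bit indices select two 32-bit
 * values; elements 2 and 3 of the four-element result are set to zero. */
void i64gather_epi32_model(const void *base_addr, const int64_t vindex[2],
                           int scale, int32_t dst[4])
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (int j = 0; j < 2; ++j) {
        /* Byte offset = index * scale; the index is already 64 bits wide. */
        memcpy(&dst[j], (const char *)base_addr + vindex[j] * scale, sizeof dst[j]);
    }
    dst[2] = 0;   /* upper 64 bits of the result are cleared */
    dst[3] = 0;
}
```

For a table {7, 8, 9, 10}, indices {3, 1} and scale = 4, the model yields {10, 8, 0, 0} regardless of what dst held before.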
vpgatherqd
__m128i _mm_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include <immintrin.h>
Instruction: vpgatherqd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*32
    m := j*64
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:64] := 0
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqd
__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

Synopsis

__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include <immintrin.h>
Instruction: vpgatherqd xmm, vm64y, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    m := j*64
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:128] := 0
vpgatherqd
__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128i mask, const int scale)
#include <immintrin.h>
Instruction: vpgatherqd xmm, vm64y, xmm
CPUID Flags: AVX2

Description

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    m := j*64
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0
vpgatherqq
__m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vpgatherqq xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)

Synopsis

__m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const* base_addr, __m128i vindex, __m128i mask, const int scale)
#include <immintrin.h>
Instruction: vpgatherqq xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)
#include <immintrin.h>
Instruction: vpgatherqq ymm, vm64y, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vpgatherqq
__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)

Synopsis

__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include <immintrin.h>
Instruction: vpgatherqq ymm, vm64y, ymm
CPUID Flags: AVX2

Description

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vgatherqpd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m128d _mm_mask_i64gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)

Synopsis

__m128d _mm_mask_i64gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
#include <immintrin.h>
Instruction: vgatherqpd xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)

Synopsis

__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)
#include <immintrin.h>
Instruction: vgatherqpd ymm, vm64y, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqpd
__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256d mask, const int scale)

Synopsis

__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256d mask, const int scale)
#include <immintrin.h>
Instruction: vgatherqpd ymm, vm64y, ymm
CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
        mask[i+63] := 0
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6
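
Example

An illustrative sketch of the masked form: lanes with a negative index are skipped and take a fallback value from src instead. The helper name gather4_f64_or and the "negative index means missing" convention are assumptions made for this example only. The mask is built so that its highest bit is set exactly in the lanes that should be loaded.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: out[k] = idx[k] >= 0 ? table[idx[k]] : fallback. */
void gather4_f64_or(const double *table, const int64_t idx[4],
                    double fallback, double out[4])
{
#if defined(__AVX2__)
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* All-ones (highest bit set) where idx > -1, zero elsewhere */
    __m256i sel = _mm256_cmpgt_epi64(vindex, _mm256_set1_epi64x(-1));
    __m256d mask = _mm256_castsi256_pd(sel);
    __m256d src = _mm256_set1_pd(fallback);
    __m256d v = _mm256_mask_i64gather_pd(src, table, vindex, mask, 8);
    _mm256_storeu_pd(out, v);
#else
    /* Scalar fallback with the same result */
    for (int k = 0; k < 4; ++k)
        out[k] = (idx[k] >= 0) ? table[idx[k]] : fallback;
#endif
}
```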
vgatherqps
__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)

Synopsis

__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)
#include <immintrin.h>
Instruction: vgatherqps xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*32
    m := j*64
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
#include <immintrin.h>
Instruction: vgatherqps xmm, vm64x, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 1
    i := j*32
    m := j*64
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:64] := 0
dst[MAX:64] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)

Synopsis

__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)
#include <immintrin.h>
Instruction: vgatherqps xmm, vm64y, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    m := j*64
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
vgatherqps
__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128 mask, const int scale)

Synopsis

__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128 mask, const int scale)
#include <immintrin.h>
Instruction: vgatherqps xmm, vm64y, xmm
CPUID Flags: AVX2

Description

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
    i := j*32
    m := j*64
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[m+63:m])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:128] := 0
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 6
__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)

Synopsis

__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Copy a to dst, and insert the 16-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*16
dst[sel+15:sel] := i[15:0]
__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)

Synopsis

__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Copy a to dst, and insert the 32-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*32
dst[sel+31:sel] := i[31:0]
__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)

Synopsis

__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Copy a to dst, and insert the 64-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*64
dst[sel+63:sel] := i[63:0]
__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)

Synopsis

__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)
#include <immintrin.h>
CPUID Flags: AVX

Description

Copy a to dst, and insert the 8-bit integer i into dst at the location specified by index.

Operation

dst[255:0] := a[255:0]
sel := index*8
dst[sel+7:sel] := i[7:0]
vinsertf128
__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)

Synopsis

__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[0]) OF
0: dst[127:0] := b[127:0]
1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
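
Example

An illustrative sketch of a common use of _mm256_insertf128_pd: joining two 128-bit halves into one 256-bit vector. The helper name concat2x2_f64 is hypothetical. The intrinsic path is compiled only when the compiler defines __AVX__; the scalar path produces the same result.

```c
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: out = { lo[0], lo[1], hi[0], hi[1] }. */
void concat2x2_f64(const double lo[2], const double hi[2], double out[4])
{
#if defined(__AVX__)
    __m128d vlo = _mm_loadu_pd(lo);
    __m128d vhi = _mm_loadu_pd(hi);
    /* Widen lo to 256 bits (upper half undefined), then overwrite
       the upper half with hi by selecting position imm8 = 1. */
    __m256d v = _mm256_castpd128_pd256(vlo);
    v = _mm256_insertf128_pd(v, vhi, 1);
    _mm256_storeu_pd(out, v);
#else
    out[0] = lo[0];
    out[1] = lo[1];
    out[2] = hi[0];
    out[3] = hi[1];
#endif
}
```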
vinsertf128
__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)

Synopsis

__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[0]) OF
0: dst[127:0] := b[127:0]
1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)

Synopsis

__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Copy a to dst, then insert 128 bits from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[0]) OF
0: dst[127:0] := b[127:0]
1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinserti128
__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)

Synopsis

__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)
#include <immintrin.h>
Instruction: vinserti128 ymm, ymm, xmm, imm
CPUID Flags: AVX2

Description

Copy a to dst, then insert 128 bits (composed of integer data) from b into dst at the location specified by imm8.

Operation

dst[255:0] := a[255:0]
CASE (imm8[0]) OF
0: dst[127:0] := b[127:0]
1: dst[255:128] := b[127:0]
ESAC
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vlddqu
__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
#include <immintrin.h>
Instruction: vlddqu ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovapd
__m256d _mm256_load_pd (double const * mem_addr)

Synopsis

__m256d _mm256_load_pd (double const * mem_addr)
#include <immintrin.h>
Instruction: vmovapd ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovaps
__m256 _mm256_load_ps (float const * mem_addr)

Synopsis

__m256 _mm256_load_ps (float const * mem_addr)
#include <immintrin.h>
Instruction: vmovaps ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovdqa
__m256i _mm256_load_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_load_si256 (__m256i const * mem_addr)
#include <immintrin.h>
Instruction: vmovdqa ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovupd
__m256d _mm256_loadu_pd (double const * mem_addr)

Synopsis

__m256d _mm256_loadu_pd (double const * mem_addr)
#include <immintrin.h>
Instruction: vmovupd ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
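
Example

An illustrative sketch contrasting _mm256_load_pd with _mm256_loadu_pd. The helper name sum4_f64 is hypothetical. It checks the pointer at run time and uses the aligned load only when the address is a multiple of 32 bytes; real code would normally know the alignment statically and pick one form.

```c
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: returns p[0] + p[1] + p[2] + p[3]. */
double sum4_f64(const double *p)
{
#if defined(__AVX__)
    __m256d v;
    if (((uintptr_t)p & 31u) == 0)
        v = _mm256_load_pd(p);   /* vmovapd: needs 32-byte alignment */
    else
        v = _mm256_loadu_pd(p);  /* vmovupd: any alignment */
    double lanes[4];
    _mm256_storeu_pd(lanes, v);
    return ((lanes[0] + lanes[1]) + lanes[2]) + lanes[3];
#else
    /* Scalar fallback, same summation order */
    return ((p[0] + p[1]) + p[2]) + p[3];
#endif
}
```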
vmovups
__m256 _mm256_loadu_ps (float const * mem_addr)

Synopsis

__m256 _mm256_loadu_ps (float const * mem_addr)
#include <immintrin.h>
Instruction: vmovups ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovdqu
__m256i _mm256_loadu_si256 (__m256i const * mem_addr)

Synopsis

__m256i _mm256_loadu_si256 (__m256i const * mem_addr)
#include <immintrin.h>
Instruction: vmovdqu ymm, m256
CPUID Flags: AVX

Description

Load 256-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)

Synopsis

__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)
#include <immintrin.h>
CPUID Flags: AVX

Description

Load two 128-bit values (composed of 4 packed single-precision (32-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr]
dst[255:128] := MEM[hiaddr+127:hiaddr]
dst[MAX:256] := 0
__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)

Synopsis

__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)
#include <immintrin.h>
CPUID Flags: AVX

Description

Load two 128-bit values (composed of 2 packed double-precision (64-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr]
dst[255:128] := MEM[hiaddr+127:hiaddr]
dst[MAX:256] := 0
__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)

Synopsis

__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)
#include <immintrin.h>
CPUID Flags: AVX

Description

Load two 128-bit values (composed of integer data) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

dst[127:0] := MEM[loaddr+127:loaddr]
dst[255:128] := MEM[hiaddr+127:hiaddr]
dst[MAX:256] := 0
vpmaddwd
__m256i _mm256_madd_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_madd_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaddwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i+16]*b[i+31:i+16] + a[i+15:i]*b[i+15:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
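
Example

An illustrative sketch of the typical use of _mm256_madd_epi16: an integer dot product. The helper name dot16_i16 is hypothetical, and the final reduction over the eight 32-bit partial sums is done with a plain loop for clarity rather than speed.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: returns sum of a[k]*b[k] for k = 0..15. */
int32_t dot16_i16(const int16_t a[16], const int16_t b[16])
{
    int32_t sum = 0;
#if defined(__AVX2__)
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* Eight 32-bit lanes, each holding a[2j]*b[2j] + a[2j+1]*b[2j+1] */
    __m256i pairs = _mm256_madd_epi16(va, vb);
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, pairs);
    for (int j = 0; j < 8; ++j)
        sum += lanes[j];
#else
    /* Scalar fallback with the same result */
    for (int k = 0; k < 16; ++k)
        sum += (int32_t)a[k] * (int32_t)b[k];
#endif
    return sum;
}
```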
vpmaddubsw
__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaddubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Vertically multiply each unsigned 8-bit integer from a with the corresponding signed 8-bit integer from b, producing intermediate signed 16-bit integers. Horizontally add adjacent pairs of intermediate signed 16-bit integers, and pack the saturated results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_Int16( a[i+15:i+8]*b[i+15:i+8] + a[i+7:i]*b[i+7:i] )
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
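
Example

An illustrative sketch that makes the operand signedness and the saturation of _mm256_maddubs_epi16 visible. The helper name maddubs32 is hypothetical. Bytes of a are treated as unsigned and bytes of b as signed, and each pair sum is clamped to the 16-bit signed range.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: out[k] = saturate16(a[2k]*b[2k] + a[2k+1]*b[2k+1]). */
void maddubs32(const uint8_t a[32], const int8_t b[32], int16_t out[16])
{
#if defined(__AVX2__)
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_maddubs_epi16(va, vb));
#else
    /* Scalar fallback mirroring the Operation pseudocode */
    for (int k = 0; k < 16; ++k) {
        int32_t s = (int32_t)a[2 * k] * b[2 * k]
                  + (int32_t)a[2 * k + 1] * b[2 * k + 1];
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        out[k] = (int16_t)s;
    }
#endif
}
```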
vpmaskmovd
__m128i _mm_maskload_epi32 (int const* mem_addr, __m128i mask)

Synopsis

__m128i _mm_maskload_epi32 (int const* mem_addr, __m128i mask)
#include <immintrin.h>
Instruction: vpmaskmovd xmm, xmm, m128
CPUID Flags: AVX2

Description

Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovd
__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)

Synopsis

__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)
#include <immintrin.h>
Instruction: vpmaskmovd ymm, ymm, m256
CPUID Flags: AVX2

Description

Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovq
__m128i _mm_maskload_epi64 (__int64 const* mem_addr, __m128i mask)

Synopsis

__m128i _mm_maskload_epi64 (__int64 const* mem_addr, __m128i mask)
#include <immintrin.h>
Instruction: vpmaskmovq xmm, xmm, m128
CPUID Flags: AVX2

Description

Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpmaskmovq
__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)

Synopsis

__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)
#include <immintrin.h>
Instruction: vpmaskmovq ymm, ymm, m256
CPUID Flags: AVX2

Description

Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vmaskmovpd
__m128d _mm_maskload_pd (double const * mem_addr, __m128i mask)

Synopsis

__m128d _mm_maskload_pd (double const * mem_addr, __m128i mask)
#include <immintrin.h>
Instruction: vmaskmovpd xmm, xmm, m128
CPUID Flags: AVX

Description

Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovpd
__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)

Synopsis

__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)
#include <immintrin.h>
Instruction: vmaskmovpd ymm, ymm, m256
CPUID Flags: AVX

Description

Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        dst[i+63:i] := MEM[mem_addr+i+63:mem_addr+i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovps
__m128 _mm_maskload_ps (float const * mem_addr, __m128i mask)

Synopsis

__m128 _mm_maskload_ps (float const * mem_addr, __m128i mask)
#include <immintrin.h>
Instruction: vmaskmovps xmm, xmm, m128
CPUID Flags: AVX

Description

Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
vmaskmovps
__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)

Synopsis

__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)
#include <immintrin.h>
Instruction: vmaskmovps ymm, ymm, m256
CPUID Flags: AVX

Description

Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
Ivy Bridge 2
Sandy Bridge 2
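
Example

An illustrative sketch of a masked load used to handle a short array tail. The helper name sum_first_n is hypothetical, and it assumes 0 <= n <= 8. The mask is built with an AVX2 integer compare, so the intrinsic path is guarded by __AVX2__ even though _mm256_maskload_ps itself only needs AVX. Lanes at or beyond n come back as zero and do not affect the sum.

```c
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: returns p[0] + ... + p[n-1] for 0 <= n <= 8. */
float sum_first_n(const float *p, int n)
{
    float lanes[8] = {0};
#if defined(__AVX2__)
    const __m256i lane_id = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* Lane k is all-ones (high bit set) exactly when k < n */
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane_id);
    _mm256_storeu_ps(lanes, _mm256_maskload_ps(p, mask));
#else
    /* Scalar fallback: copy only the first n elements */
    for (int k = 0; k < n; ++k)
        lanes[k] = p[k];
#endif
    float s = 0.0f;
    for (int k = 0; k < 8; ++k)
        s += lanes[k];
    return s;
}
```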
vpmaskmovd
void _mm_maskstore_epi32 (int* mem_addr, __m128i mask, __m128i a)

Synopsis

void _mm_maskstore_epi32 (int* mem_addr, __m128i mask, __m128i a)
#include <immintrin.h>
Instruction: vpmaskmovd m128, xmm, xmm
CPUID Flags: AVX2

Description

Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovd
void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)

Synopsis

void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)
#include <immintrin.h>
Instruction: vpmaskmovd m256, ymm, ymm
CPUID Flags: AVX2

Description

Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovq
void _mm_maskstore_epi64 (__int64* mem_addr, __m128i mask, __m128i a)

Synopsis

void _mm_maskstore_epi64 (__int64* mem_addr, __m128i mask, __m128i a)
#include <immintrin.h>
Instruction: vpmaskmovq m128, xmm, xmm
CPUID Flags: AVX2

Description

Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vpmaskmovq
void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)

Synopsis

void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)
#include <immintrin.h>
Instruction: vpmaskmovq m256, ymm, ymm
CPUID Flags: AVX2

Description

Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
vmaskmovpd
void _mm_maskstore_pd (double * mem_addr, __m128i mask, __m128d a)

Synopsis

void _mm_maskstore_pd (double * mem_addr, __m128i mask, __m128d a)
#include <immintrin.h>
Instruction: vmaskmovpd m128, xmm, xmm
CPUID Flags: AVX

Description

Store packed double-precision (64-bit) floating-point elements from a into memory using mask (elements are not stored when the high bit of the corresponding mask element is not set).

Operation

FOR j := 0 to 1
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovpd
void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)

Synopsis

void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)
#include <immintrin.h>
Instruction: vmaskmovpd m256, ymm, ymm
CPUID Flags: AVX

Description

Store packed double-precision (64-bit) floating-point elements from a into memory using mask (elements are not stored when the high bit of the corresponding mask element is not set).

Operation

FOR j := 0 to 3
    i := j*64
    IF mask[i+63]
        MEM[mem_addr+i+63:mem_addr+i] := a[i+63:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovps
void _mm_maskstore_ps (float * mem_addr, __m128i mask, __m128 a)

Synopsis

void _mm_maskstore_ps (float * mem_addr, __m128i mask, __m128 a)
#include <immintrin.h>
Instruction: vmaskmovps m128, xmm, xmm
CPUID Flags: AVX

Description

Store packed single-precision (32-bit) floating-point elements from a into memory using mask (elements are not stored when the high bit of the corresponding mask element is not set).

Operation

FOR j := 0 to 3
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
vmaskmovps
void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)

Synopsis

void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)
#include <immintrin.h>
Instruction: vmaskmovps m256, ymm, ymm
CPUID Flags: AVX

Description

Store packed single-precision (32-bit) floating-point elements from a into memory using mask (elements are not stored when the high bit of the corresponding mask element is not set).

Operation

FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        MEM[mem_addr+i+31:mem_addr+i] := a[i+31:i]
    FI
ENDFOR

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 1
Sandy Bridge 1
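
Example

An illustrative sketch pairing a masked load with _mm256_maskstore_ps to update only the first n elements of a buffer in place. The helper name scale_first_n is hypothetical, and it assumes 0 <= n <= 8. As in the Operation pseudocode, memory behind lanes whose mask bit is clear is left unchanged.

```c
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: p[k] *= factor for k < n, where 0 <= n <= 8. */
void scale_first_n(float *p, int n, float factor)
{
#if defined(__AVX2__)
    const __m256i lane_id = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* Lane k is all-ones (high bit set) exactly when k < n */
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane_id);
    __m256 v = _mm256_maskload_ps(p, mask);
    v = _mm256_mul_ps(v, _mm256_set1_ps(factor));
    /* Only lanes k < n are written back */
    _mm256_maskstore_ps(p, mask, v);
#else
    /* Scalar fallback with the same result */
    for (int k = 0; k < n; ++k)
        p[k] *= factor;
#endif
}
```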
vpmaxsw
__m256i _mm256_max_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed signed 16-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] > b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxsd
__m256i _mm256_max_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxsd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed signed 32-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] > b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpmaxsb
__m256i _mm256_max_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed signed 8-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] > b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxuw
__m256i _mm256_max_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 16-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] > b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpmaxud
__m256i _mm256_max_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxud ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 32-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] > b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpmaxub
__m256i _mm256_max_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_max_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmaxub ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 8-bit integers in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] > b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
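
Example

An illustrative sketch of why the signed (epi8) and unsigned (epu8) maximum variants differ even though their Operation pseudocode looks identical: the comparison interprets the same bits differently. The helper name max_bytes32 is hypothetical. The byte 0xFF is -1 when read as signed and 255 when read as unsigned.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: element-wise maximum of the same 32 bytes,
   once under a signed reading and once under an unsigned reading. */
void max_bytes32(const uint8_t a[32], const uint8_t b[32],
                 uint8_t as_signed[32], uint8_t as_unsigned[32])
{
#if defined(__AVX2__)
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)as_signed, _mm256_max_epi8(va, vb));
    _mm256_storeu_si256((__m256i *)as_unsigned, _mm256_max_epu8(va, vb));
#else
    for (int k = 0; k < 32; ++k) {
        /* Assumes the usual two's-complement reinterpretation */
        int8_t sa = (int8_t)a[k];
        int8_t sb = (int8_t)b[k];
        as_signed[k] = (uint8_t)(sa > sb ? sa : sb);
        as_unsigned[k] = a[k] > b[k] ? a[k] : b[k];
    }
#endif
}
```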
vmaxpd
__m256d _mm256_max_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_max_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vmaxpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := MAX(a[i+63:i], b[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vmaxps
__m256 _mm256_max_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_max_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vmaxps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed maximum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MAX(a[i+31:i], b[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpminsw
__m256i _mm256_min_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed signed 16-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] < b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminsd
__m256i _mm256_min_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminsd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed signed 32-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] < b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminsb
__m256i _mm256_min_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed signed 8-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] < b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminuw
__m256i _mm256_min_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 16-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF a[i+15:i] < b[i+15:i]
        dst[i+15:i] := a[i+15:i]
    ELSE
        dst[i+15:i] := b[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminud
__m256i _mm256_min_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminud ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 32-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31:i] < b[i+31:i]
        dst[i+31:i] := a[i+31:i]
    ELSE
        dst[i+31:i] := b[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpminub
__m256i _mm256_min_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_min_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpminub ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compare packed unsigned 8-bit integers in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 31
    i := j*8
    IF a[i+7:i] < b[i+7:i]
        dst[i+7:i] := a[i+7:i]
    ELSE
        dst[i+7:i] := b[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vminpd
__m256d _mm256_min_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_min_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vminpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := MIN(a[i+63:i], b[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vminps
__m256 _mm256_min_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_min_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vminps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed minimum values in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MIN(a[i+31:i], b[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vmovddup
__m256d _mm256_movedup_pd (__m256d a)

Synopsis

__m256d _mm256_movedup_pd (__m256d a)
#include <immintrin.h>
Instruction: vmovddup ymm, ymm
CPUID Flags: AVX

Description

Duplicate even-indexed double-precision (64-bit) floating-point elements from a, and store the results in dst.

Operation

dst[63:0] := a[63:0]
dst[127:64] := a[63:0]
dst[191:128] := a[191:128]
dst[255:192] := a[191:128]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vmovshdup
__m256 _mm256_movehdup_ps (__m256 a)

Synopsis

__m256 _mm256_movehdup_ps (__m256 a)
#include <immintrin.h>
Instruction: vmovshdup ymm, ymm
CPUID Flags: AVX

Description

Duplicate odd-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.

Operation

dst[31:0] := a[63:32]
dst[63:32] := a[63:32]
dst[95:64] := a[127:96]
dst[127:96] := a[127:96]
dst[159:128] := a[191:160]
dst[191:160] := a[191:160]
dst[223:192] := a[255:224]
dst[255:224] := a[255:224]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vmovsldup
__m256 _mm256_moveldup_ps (__m256 a)

Synopsis

__m256 _mm256_moveldup_ps (__m256 a)
#include <immintrin.h>
Instruction: vmovsldup ymm, ymm
CPUID Flags: AVX

Description

Duplicate even-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.

Operation

dst[31:0] := a[31:0]
dst[63:32] := a[31:0]
dst[95:64] := a[95:64]
dst[127:96] := a[95:64]
dst[159:128] := a[159:128]
dst[191:160] := a[159:128]
dst[223:192] := a[223:192]
dst[255:224] := a[223:192]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpmovmskb
int _mm256_movemask_epi8 (__m256i a)

Synopsis

int _mm256_movemask_epi8 (__m256i a)
#include <immintrin.h>
Instruction: vpmovmskb r32, ymm
CPUID Flags: AVX2

Description

Create mask from the most significant bit of each 8-bit element in a, and store the result in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[j] := a[i+7]
ENDFOR

Performance

Architecture Latency Throughput
Haswell 3
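
The Operation pseudocode for _mm256_movemask_epi8 can be read as a plain loop over bytes. The following is only a portable scalar sketch of that pseudocode, not the intrinsic itself: the function name movemask_epi8_model and the representation of the 256-bit vector as an array of 32 bytes (element 0 being the lowest byte) are assumptions made for illustration.

```c
#include <stdint.h>

/* Scalar model of the vpmovmskb pseudocode (hypothetical helper, not part
 * of immintrin.h). Bit j of the result is the most significant bit of
 * byte j of the input. */
uint32_t movemask_epi8_model(const uint8_t a[32])
{
    uint32_t dst = 0;
    for (int j = 0; j < 32; j++) {
        /* dst[j] := a[j*8+7] */
        dst |= (uint32_t)(a[j] >> 7) << j;
    }
    return dst;
}
```

A typical use of the real intrinsic is to turn the result of a byte-wise comparison into a bit mask that ordinary integer code can test or scan.
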
vmovmskpd
int _mm256_movemask_pd (__m256d a)

Synopsis

int _mm256_movemask_pd (__m256d a)
#include <immintrin.h>
Instruction: vmovmskpd r32, ymm
CPUID Flags: AVX

Description

Set each bit of mask dst based on the most significant bit of the corresponding packed double-precision (64-bit) floating-point element in a.

Operation

FOR j := 0 to 3
    i := j*64
    IF a[i+63]
        dst[j] := 1
    ELSE
        dst[j] := 0
    FI
ENDFOR
dst[MAX:4] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 2
Sandy Bridge 2
vmovmskps
int _mm256_movemask_ps (__m256 a)

Synopsis

int _mm256_movemask_ps (__m256 a)
#include <immintrin.h>
Instruction: vmovmskps r32, ymm
CPUID Flags: AVX

Description

Set each bit of mask dst based on the most significant bit of the corresponding packed single-precision (32-bit) floating-point element in a.

Operation

FOR j := 0 to 7
    i := j*32
    IF a[i+31]
        dst[j] := 1
    ELSE
        dst[j] := 0
    FI
ENDFOR
dst[MAX:8] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 2
Sandy Bridge 2
vmpsadbw
__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)
#include <immintrin.h>
Instruction: vmpsadbw ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Compute the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in a compared to those in b, and store the 16-bit results in dst. Eight SADs are performed for each 128-bit lane using one quadruplet from b and eight quadruplets from a. One quadruplet is selected from b starting at the offset specified in imm8. Eight quadruplets are formed from sequential 8-bit integers selected from a starting at the offset specified in imm8.

Operation

MPSADBW(a[127:0], b[127:0], imm8[2:0]) {
    a_offset := imm8[2]*32
    b_offset := imm8[1:0]*32
    FOR j := 0 to 7
        i := j*8
        k := a_offset+i
        l := b_offset
        tmp[i*2+15:i*2] := ABS(a[k+7:k] - b[l+7:l]) + ABS(a[k+15:k+8] - b[l+15:l+8]) + ABS(a[k+23:k+16] - b[l+23:l+16]) + ABS(a[k+31:k+24] - b[l+31:l+24])
    ENDFOR
    RETURN tmp[127:0]
}
dst[127:0] := MPSADBW(a[127:0], b[127:0], imm8[2:0])
dst[255:128] := MPSADBW(a[255:128], b[255:128], imm8[5:3])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 2
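
How the offsets in imm8 select the quadruplets is easier to follow in scalar form. The sketch below models a single 128-bit lane of vmpsadbw; it is an illustration under stated assumptions, not the intrinsic: mpsadbw_lane_model is a hypothetical name, the lane is represented as 16 bytes, and ctrl stands for the 3-bit control field of that lane (imm8[2:0] for the low lane, imm8[5:3] for the high lane).

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of one 128-bit lane of vmpsadbw (hypothetical helper).
 * ctrl bit 2 selects the starting byte in a (0 or 4);
 * ctrl bits 1:0 select which quadruplet of b is used (byte 0, 4, 8 or 12). */
void mpsadbw_lane_model(const uint8_t a[16], const uint8_t b[16],
                        unsigned ctrl, uint16_t dst[8])
{
    unsigned a_off = ((ctrl >> 2) & 1u) * 4;  /* byte offset into a */
    unsigned b_off = (ctrl & 3u) * 4;         /* byte offset into b */

    for (unsigned j = 0; j < 8; j++) {
        unsigned sum = 0;
        /* SAD of the sliding quadruplet a[a_off+j ..] against the fixed one in b */
        for (unsigned n = 0; n < 4; n++) {
            sum += (unsigned)abs((int)a[a_off + j + n] - (int)b[b_off + n]);
        }
        dst[j] = (uint16_t)sum;  /* at most 4*255, so it fits in 16 bits */
    }
}
```

The sliding window over a is what makes this instruction useful for block-matching searches such as motion estimation.
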
vpmuldq
__m256i _mm256_mul_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mul_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmuldq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the low signed 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+31:i] * b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmuludq
__m256i _mm256_mul_epu32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mul_epu32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmuludq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the low unsigned 32-bit integers from each packed 64-bit element in a and b, and store the unsigned 64-bit results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+31:i] * b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
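
The pseudocode for _mm256_mul_epi32 and _mm256_mul_epu32 looks identical; the difference lies in how the low 32 bits of each 64-bit element are widened before the multiplication. The following scalar sketch shows that difference for a single 64-bit element. The function names are hypothetical, and the signed variant assumes the usual two's-complement conversion from uint32_t to int32_t.

```c
#include <stdint.h>

/* Model of one element of vpmuldq: the low 32 bits are treated as signed. */
int64_t mul_epi32_elem(uint64_t a, uint64_t b)
{
    /* The upper 32 bits of each element are ignored. The conversion to
     * int32_t relies on two's-complement wrap-around (an assumption that
     * holds on mainstream compilers). */
    int32_t lo_a = (int32_t)(uint32_t)a;
    int32_t lo_b = (int32_t)(uint32_t)b;
    return (int64_t)lo_a * (int64_t)lo_b;
}

/* Model of one element of vpmuludq: the low 32 bits are treated as unsigned. */
uint64_t mul_epu32_elem(uint64_t a, uint64_t b)
{
    return (uint64_t)(uint32_t)a * (uint64_t)(uint32_t)b;
}
```

For an element whose low half is 0xFFFFFFFF, the signed model multiplies by -1 while the unsigned model multiplies by 4294967295, so the two results differ even though the input bits are the same.
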
vmulpd
__m256d _mm256_mul_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_mul_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vmulpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Multiply packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] * b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 0.5
Ivy Bridge 5 1
Sandy Bridge 5 1
vmulps
__m256 _mm256_mul_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_mul_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vmulps ymm, ymm, ymm
CPUID Flags: AVX

Description

Multiply packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] * b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 0.5
Ivy Bridge 5 1
Sandy Bridge 5 1
vpmulhw
__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmulhw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15
    i := j*16
    tmp[31:0] := a[i+15:i] * b[i+15:i]
    dst[i+15:i] := tmp[31:16]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmulhuw
__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmulhuw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed unsigned 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15
    i := j*16
    tmp[31:0] := a[i+15:i] * b[i+15:i]
    dst[i+15:i] := tmp[31:16]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5
vpmulhrsw
__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmulhrsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply packed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Truncate each intermediate integer to the 18 most significant bits, round by adding 1, and store bits [16:1] to dst.

Operation

FOR j := 0 to 15
    i := j*16
    tmp[31:0] := ((a[i+15:i] * b[i+15:i]) >> 14) + 1
    dst[i+15:i] := tmp[16:1]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
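
The rounding step of _mm256_mulhrs_epi16 is compact in the pseudocode but easy to misread. Below is a scalar sketch of one 16-bit element, written as an illustration only: mulhrs_epi16_elem is a hypothetical name, and the shift is done on an unsigned value so that the result does not depend on how the compiler shifts negative numbers. Bits [16:1] of the intermediate value are the same either way.

```c
#include <stdint.h>

/* Scalar model of one element of vpmulhrsw (hypothetical helper). */
int16_t mulhrs_epi16_elem(int16_t a, int16_t b)
{
    /* The full signed product fits in 32 bits (magnitude at most 2^30). */
    int32_t prod = (int32_t)a * (int32_t)b;

    /* tmp := (prod >> 14) + 1, computed on the unsigned bit pattern. */
    uint32_t tmp = ((uint32_t)prod >> 14) + 1u;

    /* dst := tmp[16:1]; the final conversion to int16_t assumes
     * two's-complement wrap-around. */
    return (int16_t)(uint16_t)(tmp >> 1);
}
```

If the inputs are interpreted as Q15 fixed-point fractions, this is a rounded fractional multiply: 0x4000 (0.5) times 0x4000 gives 0x2000 (0.25). The one overflow case is -32768 times -32768, which wraps to -32768 rather than saturating.
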
vpmullw
__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmullw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the low 16 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 15
    i := j*16
    tmp[31:0] := a[i+15:i] * b[i+15:i]
    dst[i+15:i] := tmp[15:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
vpmulld
__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmulld ymm, ymm, ymm
CPUID Flags: AVX2

Description

Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.

Operation

FOR j := 0 to 7
    i := j*32
    tmp[63:0] := a[i+31:i] * b[i+31:i]
    dst[i+31:i] := tmp[31:0]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 10 1
vorpd
__m256d _mm256_or_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_or_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise OR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] BITWISE OR b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vorps
__m256 _mm256_or_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_or_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise OR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] BITWISE OR b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpor
__m256i _mm256_or_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_or_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpor ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise OR of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] OR b[255:0])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.33
vpacksswb
__m256i _mm256_packs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpacksswb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 16-bit integers from a and b to packed 8-bit integers using signed saturation, and store the results in dst.

Operation

dst[7:0] := Saturate_Int16_To_Int8(a[15:0])
dst[15:8] := Saturate_Int16_To_Int8(a[31:16])
dst[23:16] := Saturate_Int16_To_Int8(a[47:32])
dst[31:24] := Saturate_Int16_To_Int8(a[63:48])
dst[39:32] := Saturate_Int16_To_Int8(a[79:64])
dst[47:40] := Saturate_Int16_To_Int8(a[95:80])
dst[55:48] := Saturate_Int16_To_Int8(a[111:96])
dst[63:56] := Saturate_Int16_To_Int8(a[127:112])
dst[71:64] := Saturate_Int16_To_Int8(b[15:0])
dst[79:72] := Saturate_Int16_To_Int8(b[31:16])
dst[87:80] := Saturate_Int16_To_Int8(b[47:32])
dst[95:88] := Saturate_Int16_To_Int8(b[63:48])
dst[103:96] := Saturate_Int16_To_Int8(b[79:64])
dst[111:104] := Saturate_Int16_To_Int8(b[95:80])
dst[119:112] := Saturate_Int16_To_Int8(b[111:96])
dst[127:120] := Saturate_Int16_To_Int8(b[127:112])
dst[135:128] := Saturate_Int16_To_Int8(a[143:128])
dst[143:136] := Saturate_Int16_To_Int8(a[159:144])
dst[151:144] := Saturate_Int16_To_Int8(a[175:160])
dst[159:152] := Saturate_Int16_To_Int8(a[191:176])
dst[167:160] := Saturate_Int16_To_Int8(a[207:192])
dst[175:168] := Saturate_Int16_To_Int8(a[223:208])
dst[183:176] := Saturate_Int16_To_Int8(a[239:224])
dst[191:184] := Saturate_Int16_To_Int8(a[255:240])
dst[199:192] := Saturate_Int16_To_Int8(b[143:128])
dst[207:200] := Saturate_Int16_To_Int8(b[159:144])
dst[215:208] := Saturate_Int16_To_Int8(b[175:160])
dst[223:216] := Saturate_Int16_To_Int8(b[191:176])
dst[231:224] := Saturate_Int16_To_Int8(b[207:192])
dst[239:232] := Saturate_Int16_To_Int8(b[223:208])
dst[247:240] := Saturate_Int16_To_Int8(b[239:224])
dst[255:248] := Saturate_Int16_To_Int8(b[255:240])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
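
The long assignment list for _mm256_packs_epi16 hides a simple pattern: within each 128-bit lane, eight saturated values from a are followed by eight saturated values from b. The scalar sketch below expresses that pattern as two nested loops. It is an illustration, not the intrinsic; the names saturate_i16_to_i8 and packs_epi16_model and the array representation of the vectors are assumptions.

```c
#include <stdint.h>

/* Clamp a signed 16-bit value into the signed 8-bit range. */
static int8_t saturate_i16_to_i8(int16_t x)
{
    if (x > INT8_MAX) return INT8_MAX;
    if (x < INT8_MIN) return INT8_MIN;
    return (int8_t)x;
}

/* Scalar model of vpacksswb on 256-bit operands (hypothetical helper).
 * a and b hold 16 signed 16-bit elements each; dst receives 32 bytes. */
void packs_epi16_model(const int16_t a[16], const int16_t b[16], int8_t dst[32])
{
    for (int lane = 0; lane < 2; lane++) {
        for (int k = 0; k < 8; k++) {
            /* low half of the lane comes from a, high half from b */
            dst[lane * 16 + k]     = saturate_i16_to_i8(a[lane * 8 + k]);
            dst[lane * 16 + 8 + k] = saturate_i16_to_i8(b[lane * 8 + k]);
        }
    }
}
```

Because the packing happens per lane, the output is not simply all of a followed by all of b; code that needs that order has to apply an additional cross-lane permutation afterwards.
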
vpackssdw
__m256i _mm256_packs_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packs_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpackssdw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 32-bit integers from a and b to packed 16-bit integers using signed saturation, and store the results in dst.

Operation

dst[15:0] := Saturate_Int32_To_Int16(a[31:0])
dst[31:16] := Saturate_Int32_To_Int16(a[63:32])
dst[47:32] := Saturate_Int32_To_Int16(a[95:64])
dst[63:48] := Saturate_Int32_To_Int16(a[127:96])
dst[79:64] := Saturate_Int32_To_Int16(b[31:0])
dst[95:80] := Saturate_Int32_To_Int16(b[63:32])
dst[111:96] := Saturate_Int32_To_Int16(b[95:64])
dst[127:112] := Saturate_Int32_To_Int16(b[127:96])
dst[143:128] := Saturate_Int32_To_Int16(a[159:128])
dst[159:144] := Saturate_Int32_To_Int16(a[191:160])
dst[175:160] := Saturate_Int32_To_Int16(a[223:192])
dst[191:176] := Saturate_Int32_To_Int16(a[255:224])
dst[207:192] := Saturate_Int32_To_Int16(b[159:128])
dst[223:208] := Saturate_Int32_To_Int16(b[191:160])
dst[239:224] := Saturate_Int32_To_Int16(b[223:192])
dst[255:240] := Saturate_Int32_To_Int16(b[255:224])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpackuswb
__m256i _mm256_packus_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packus_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpackuswb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and store the results in dst.

Operation

dst[7:0] := Saturate_Int16_To_UnsignedInt8(a[15:0])
dst[15:8] := Saturate_Int16_To_UnsignedInt8(a[31:16])
dst[23:16] := Saturate_Int16_To_UnsignedInt8(a[47:32])
dst[31:24] := Saturate_Int16_To_UnsignedInt8(a[63:48])
dst[39:32] := Saturate_Int16_To_UnsignedInt8(a[79:64])
dst[47:40] := Saturate_Int16_To_UnsignedInt8(a[95:80])
dst[55:48] := Saturate_Int16_To_UnsignedInt8(a[111:96])
dst[63:56] := Saturate_Int16_To_UnsignedInt8(a[127:112])
dst[71:64] := Saturate_Int16_To_UnsignedInt8(b[15:0])
dst[79:72] := Saturate_Int16_To_UnsignedInt8(b[31:16])
dst[87:80] := Saturate_Int16_To_UnsignedInt8(b[47:32])
dst[95:88] := Saturate_Int16_To_UnsignedInt8(b[63:48])
dst[103:96] := Saturate_Int16_To_UnsignedInt8(b[79:64])
dst[111:104] := Saturate_Int16_To_UnsignedInt8(b[95:80])
dst[119:112] := Saturate_Int16_To_UnsignedInt8(b[111:96])
dst[127:120] := Saturate_Int16_To_UnsignedInt8(b[127:112])
dst[135:128] := Saturate_Int16_To_UnsignedInt8(a[143:128])
dst[143:136] := Saturate_Int16_To_UnsignedInt8(a[159:144])
dst[151:144] := Saturate_Int16_To_UnsignedInt8(a[175:160])
dst[159:152] := Saturate_Int16_To_UnsignedInt8(a[191:176])
dst[167:160] := Saturate_Int16_To_UnsignedInt8(a[207:192])
dst[175:168] := Saturate_Int16_To_UnsignedInt8(a[223:208])
dst[183:176] := Saturate_Int16_To_UnsignedInt8(a[239:224])
dst[191:184] := Saturate_Int16_To_UnsignedInt8(a[255:240])
dst[199:192] := Saturate_Int16_To_UnsignedInt8(b[143:128])
dst[207:200] := Saturate_Int16_To_UnsignedInt8(b[159:144])
dst[215:208] := Saturate_Int16_To_UnsignedInt8(b[175:160])
dst[223:216] := Saturate_Int16_To_UnsignedInt8(b[191:176])
dst[231:224] := Saturate_Int16_To_UnsignedInt8(b[207:192])
dst[239:232] := Saturate_Int16_To_UnsignedInt8(b[223:208])
dst[247:240] := Saturate_Int16_To_UnsignedInt8(b[239:224])
dst[255:248] := Saturate_Int16_To_UnsignedInt8(b[255:240])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpackusdw
__m256i _mm256_packus_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_packus_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpackusdw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Convert packed 32-bit integers from a and b to packed 16-bit integers using unsigned saturation, and store the results in dst.

Operation

dst[15:0] := Saturate_Int32_To_UnsignedInt16(a[31:0])
dst[31:16] := Saturate_Int32_To_UnsignedInt16(a[63:32])
dst[47:32] := Saturate_Int32_To_UnsignedInt16(a[95:64])
dst[63:48] := Saturate_Int32_To_UnsignedInt16(a[127:96])
dst[79:64] := Saturate_Int32_To_UnsignedInt16(b[31:0])
dst[95:80] := Saturate_Int32_To_UnsignedInt16(b[63:32])
dst[111:96] := Saturate_Int32_To_UnsignedInt16(b[95:64])
dst[127:112] := Saturate_Int32_To_UnsignedInt16(b[127:96])
dst[143:128] := Saturate_Int32_To_UnsignedInt16(a[159:128])
dst[159:144] := Saturate_Int32_To_UnsignedInt16(a[191:160])
dst[175:160] := Saturate_Int32_To_UnsignedInt16(a[223:192])
dst[191:176] := Saturate_Int32_To_UnsignedInt16(a[255:224])
dst[207:192] := Saturate_Int32_To_UnsignedInt16(b[159:128])
dst[223:208] := Saturate_Int32_To_UnsignedInt16(b[191:160])
dst[239:224] := Saturate_Int32_To_UnsignedInt16(b[223:192])
dst[255:240] := Saturate_Int32_To_UnsignedInt16(b[255:224])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpermilpd
__m128d _mm_permute_pd (__m128d a, int imm8)

Synopsis

__m128d _mm_permute_pd (__m128d a, int imm8)
#include <immintrin.h>
Instruction: vpermilpd xmm, xmm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a using the control in imm8, and store the results in dst.

Operation

IF (imm8[0] == 0) dst[63:0] := a[63:0]
IF (imm8[0] == 1) dst[63:0] := a[127:64]
IF (imm8[1] == 0) dst[127:64] := a[63:0]
IF (imm8[1] == 1) dst[127:64] := a[127:64]
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilpd
__m256d _mm256_permute_pd (__m256d a, int imm8)

Synopsis

__m256d _mm256_permute_pd (__m256d a, int imm8)
#include <immintrin.h>
Instruction: vpermilpd ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

IF (imm8[0] == 0) dst[63:0] := a[63:0]
IF (imm8[0] == 1) dst[63:0] := a[127:64]
IF (imm8[1] == 0) dst[127:64] := a[63:0]
IF (imm8[1] == 1) dst[127:64] := a[127:64]
IF (imm8[2] == 0) dst[191:128] := a[191:128]
IF (imm8[2] == 1) dst[191:128] := a[255:192]
IF (imm8[3] == 0) dst[255:192] := a[191:128]
IF (imm8[3] == 1) dst[255:192] := a[255:192]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m128 _mm_permute_ps (__m128 a, int imm8)

Synopsis

__m128 _mm_permute_ps (__m128 a, int imm8)
#include <immintrin.h>
Instruction: vpermilps xmm, xmm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control) {
    CASE(control[1:0])
    0: tmp[31:0] := src[31:0]
    1: tmp[31:0] := src[63:32]
    2: tmp[31:0] := src[95:64]
    3: tmp[31:0] := src[127:96]
    ESAC
    RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm8[1:0])
dst[63:32] := SELECT4(a[127:0], imm8[3:2])
dst[95:64] := SELECT4(a[127:0], imm8[5:4])
dst[127:96] := SELECT4(a[127:0], imm8[7:6])
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m256 _mm256_permute_ps (__m256 a, int imm8)

Synopsis

__m256 _mm256_permute_ps (__m256 a, int imm8)
#include <immintrin.h>
Instruction: vpermilps ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control) {
    CASE(control[1:0])
    0: tmp[31:0] := src[31:0]
    1: tmp[31:0] := src[63:32]
    2: tmp[31:0] := src[95:64]
    3: tmp[31:0] := src[127:96]
    ESAC
    RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm8[1:0])
dst[63:32] := SELECT4(a[127:0], imm8[3:2])
dst[95:64] := SELECT4(a[127:0], imm8[5:4])
dst[127:96] := SELECT4(a[127:0], imm8[7:6])
dst[159:128] := SELECT4(a[255:128], imm8[1:0])
dst[191:160] := SELECT4(a[255:128], imm8[3:2])
dst[223:192] := SELECT4(a[255:128], imm8[5:4])
dst[255:224] := SELECT4(a[255:128], imm8[7:6])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vperm2f128
__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)

Synopsis

__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)
#include <immintrin.h>
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control) {
    CASE(control[1:0])
    0: tmp[127:0] := src1[127:0]
    1: tmp[127:0] := src1[255:128]
    2: tmp[127:0] := src2[127:0]
    3: tmp[127:0] := src2[255:128]
    ESAC
    IF control[3]
        tmp[127:0] := 0
    FI
    RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2f128
__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)

Synopsis

__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)
#include <immintrin.h>
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control) {
    CASE(control[1:0])
    0: tmp[127:0] := src1[127:0]
    1: tmp[127:0] := src1[255:128]
    2: tmp[127:0] := src2[127:0]
    3: tmp[127:0] := src2[255:128]
    ESAC
    IF control[3]
        tmp[127:0] := 0
    FI
    RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vperm2f128
__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)

Synopsis

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)
#include <immintrin.h>
Instruction: vperm2f128 ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control) {
    CASE(control[1:0])
    0: tmp[127:0] := src1[127:0]
    1: tmp[127:0] := src1[255:128]
    2: tmp[127:0] := src2[127:0]
    3: tmp[127:0] := src2[255:128]
    ESAC
    IF control[3]
        tmp[127:0] := 0
    FI
    RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
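
All three _mm256_permute2f128 variants decode imm8 the same way: each 4-bit half of imm8 picks one of the four available 128-bit halves, or zero. The following scalar sketch walks through that decoding. It is an illustration under assumptions, not the intrinsic: the 256-bit vectors are modelled as arrays of four 64-bit words, select_half and permute2f128_model are hypothetical names, and dst must not overlap a or b.

```c
#include <stdint.h>

/* Pick one 128-bit half (two 64-bit words) according to a 4-bit control:
 * bit 1 chooses the source (0 = a, 1 = b), bit 0 chooses its low or high
 * half, and bit 3 forces the result to zero. Bit 2 is ignored. */
static void select_half(const uint64_t a[4], const uint64_t b[4],
                        unsigned control, uint64_t out[2])
{
    const uint64_t *src = (control & 2u) ? b : a;
    unsigned half = control & 1u;

    out[0] = src[half * 2];
    out[1] = src[half * 2 + 1];
    if (control & 8u) {
        out[0] = 0;
        out[1] = 0;
    }
}

/* Scalar model of vperm2f128 (hypothetical helper). */
void permute2f128_model(const uint64_t a[4], const uint64_t b[4],
                        unsigned imm8, uint64_t dst[4])
{
    select_half(a, b, imm8 & 0xFu, &dst[0]);         /* dst[127:0]   */
    select_half(a, b, (imm8 >> 4) & 0xFu, &dst[2]);  /* dst[255:128] */
}
```

Two control values come up often: 0x20 concatenates the low halves of a and b, and 0x01 with both operands equal to a swaps the two halves of a.
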
vperm2i128
__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)

Synopsis

__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)
#include <immintrin.h>
Instruction: vperm2i128 ymm, ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.

Operation

SELECT4(src1, src2, control) {
    CASE(control[1:0])
    0: tmp[127:0] := src1[127:0]
    1: tmp[127:0] := src1[255:128]
    2: tmp[127:0] := src2[127:0]
    3: tmp[127:0] := src2[255:128]
    ESAC
    IF control[3]
        tmp[127:0] := 0
    FI
    RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpermq
__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpermq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 64-bit integers in a across lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control) {
    CASE(control[1:0])
    0: tmp[63:0] := src[63:0]
    1: tmp[63:0] := src[127:64]
    2: tmp[63:0] := src[191:128]
    3: tmp[63:0] := src[255:192]
    ESAC
    RETURN tmp[63:0]
}
dst[63:0] := SELECT4(a[255:0], imm8[1:0])
dst[127:64] := SELECT4(a[255:0], imm8[3:2])
dst[191:128] := SELECT4(a[255:0], imm8[5:4])
dst[255:192] := SELECT4(a[255:0], imm8[7:6])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpermpd
__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)

Synopsis

__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)
#include <immintrin.h>
Instruction: vpermpd ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle double-precision (64-bit) floating-point elements in a across lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control) {
    CASE(control[1:0])
    0: tmp[63:0] := src[63:0]
    1: tmp[63:0] := src[127:64]
    2: tmp[63:0] := src[191:128]
    3: tmp[63:0] := src[255:192]
    ESAC
    RETURN tmp[63:0]
}
dst[63:0] := SELECT4(a[255:0], imm8[1:0])
dst[127:64] := SELECT4(a[255:0], imm8[3:2])
dst[191:128] := SELECT4(a[255:0], imm8[5:4])
dst[255:192] := SELECT4(a[255:0], imm8[7:6])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
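
For _mm256_permute4x64_epi64 and _mm256_permute4x64_pd, each pair of bits in imm8 is simply an index into the four 64-bit source elements. A scalar sketch of that decoding follows. The name permute4x64_model and the array representation are assumptions for illustration; the sketch moves raw 64-bit words, which covers both the integer and the double-precision variant, and dst must not overlap a.

```c
#include <stdint.h>

/* Scalar model of vpermq / vpermpd with an immediate control
 * (hypothetical helper). Bits [2j+1:2j] of imm8 select the source
 * element for destination element j. */
void permute4x64_model(const uint64_t a[4], unsigned imm8, uint64_t dst[4])
{
    for (int j = 0; j < 4; j++) {
        dst[j] = a[(imm8 >> (2 * j)) & 3u];
    }
}
```

With this encoding, 0xE4 is the identity permutation, 0x1B reverses the four elements, and 0x00 broadcasts element 0 to all positions.
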
vpermilpd
__m128d _mm_permutevar_pd (__m128d a, __m128i b)

Synopsis

__m128d _mm_permutevar_pd (__m128d a, __m128i b)
#include <immintrin.h>
Instruction: vpermilpd xmm, xmm, xmm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a using the control in b, and store the results in dst.

Operation

IF (b[1] == 0) dst[63:0] := a[63:0]
IF (b[1] == 1) dst[63:0] := a[127:64]
IF (b[65] == 0) dst[127:64] := a[63:0]
IF (b[65] == 1) dst[127:64] := a[127:64]
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilpd
__m256d _mm256_permutevar_pd (__m256d a, __m256i b)

Synopsis

__m256d _mm256_permutevar_pd (__m256d a, __m256i b)
#include <immintrin.h>
Instruction: vpermilpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.

Operation

IF (b[1] == 0) dst[63:0] := a[63:0]
IF (b[1] == 1) dst[63:0] := a[127:64]
IF (b[65] == 0) dst[127:64] := a[63:0]
IF (b[65] == 1) dst[127:64] := a[127:64]
IF (b[129] == 0) dst[191:128] := a[191:128]
IF (b[129] == 1) dst[191:128] := a[255:192]
IF (b[193] == 0) dst[255:192] := a[191:128]
IF (b[193] == 1) dst[255:192] := a[255:192]
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
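
One detail of the _mm256_permutevar_pd pseudocode is easy to overlook: the selector is bit 1 of each 64-bit control element (b[1], b[65], b[129], b[193]), not bit 0. The scalar sketch below makes that explicit. It is an illustration only; permutevar_pd_model is a hypothetical name, the vectors are modelled as arrays of four elements, and dst must not overlap a.

```c
#include <stdint.h>

/* Scalar model of the 256-bit vpermilpd with a vector control
 * (hypothetical helper). Each destination element picks one of the two
 * doubles in its own 128-bit lane. */
void permutevar_pd_model(const double a[4], const uint64_t b[4], double dst[4])
{
    for (int j = 0; j < 4; j++) {
        int lane_base = (j / 2) * 2;        /* first element of this lane */
        int sel = (int)((b[j] >> 1) & 1u);  /* control is bit 1, not bit 0 */
        dst[j] = a[lane_base + sel];
    }
}
```

A control value of 1 therefore selects the low element of the lane, and a value of 2 selects the high element.
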
vpermilps
__m128 _mm_permutevar_ps (__m128 a, __m128i b)

Synopsis

__m128 _mm_permutevar_ps (__m128 a, __m128i b)
#include <immintrin.h>
Instruction: vpermilps xmm, xmm, xmm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a using the control in b, and store the results in dst.

Operation

SELECT4(src, control) {
    CASE(control[1:0])
    0: tmp[31:0] := src[31:0]
    1: tmp[31:0] := src[63:32]
    2: tmp[31:0] := src[95:64]
    3: tmp[31:0] := src[127:96]
    ESAC
    RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], b[1:0])
dst[63:32] := SELECT4(a[127:0], b[33:32])
dst[95:64] := SELECT4(a[127:0], b[65:64])
dst[127:96] := SELECT4(a[127:0], b[97:96])
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
Ivy Bridge 1
Sandy Bridge 1
vpermilps
__m256 _mm256_permutevar_ps (__m256 a, __m256i b)

Synopsis

__m256 _mm256_permutevar_ps (__m256 a, __m256i b)
#include <immintrin.h>
Instruction: vpermilps ymm, ymm, ymm
CPUID Flags: AVX

Description

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.

Operation

SELECT4(src, control) {
    CASE(control[1:0])
    0: tmp[31:0] := src[31:0]
    1: tmp[31:0] := src[63:32]
    2: tmp[31:0] := src[95:64]
    3: tmp[31:0] := src[127:96]
    ESAC
    RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], b[1:0])
dst[63:32] := SELECT4(a[127:0], b[33:32])
dst[95:64] := SELECT4(a[127:0], b[65:64])
dst[127:96] := SELECT4(a[127:0], b[97:96])
dst[159:128] := SELECT4(a[255:128], b[129:128])
dst[191:160] := SELECT4(a[255:128], b[161:160])
dst[223:192] := SELECT4(a[255:128], b[193:192])
dst[255:224] := SELECT4(a[255:128], b[225:224])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpermd
__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)

Synopsis

__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)
#include <immintrin.h>
Instruction: vpermd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle 32-bit integers in a across lanes using the corresponding index in idx, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    id := idx[i+2:i]*32
    dst[i+31:i] := a[id+31:id]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
vpermps
__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)

Synopsis

__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)
#include <immintrin.h>
Instruction: vpermps ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle single-precision (32-bit) floating-point elements in a across lanes using the corresponding index in idx, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    id := idx[i+2:i]*32
    dst[i+31:i] := a[id+31:id]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
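
For _mm256_permutevar8x32_epi32 and _mm256_permutevar8x32_ps, the pseudocode uses only the low three bits of each index element (idx[i+2:i]); the remaining bits are ignored. A scalar sketch of that behaviour is given below. The name permutevar8x32_model and the array representation are assumptions for illustration; the sketch moves raw 32-bit words, and dst must not overlap a.

```c
#include <stdint.h>

/* Scalar model of vpermd / vpermps (hypothetical helper). Destination
 * element j receives the source element named by the low 3 bits of idx[j],
 * so any of the eight elements can be picked regardless of lane. */
void permutevar8x32_model(const uint32_t a[8], const uint32_t idx[8],
                          uint32_t dst[8])
{
    for (int j = 0; j < 8; j++) {
        dst[j] = a[idx[j] & 7u];
    }
}
```

Unlike the in-lane vpermilps forms, this shuffle crosses the 128-bit lane boundary, which is why it is listed with a higher latency on Haswell.
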
vrcpps
__m256 _mm256_rcp_ps (__m256 a)

Synopsis

__m256 _mm256_rcp_ps (__m256 a)
#include <immintrin.h>
Instruction: vrcpps ymm, ymm
CPUID Flags: AVX

Description

Compute the approximate reciprocal of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := APPROXIMATE(1.0/a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 1
Ivy Bridge 7 1
Sandy Bridge 7 1
vroundpd
__m256d _mm256_round_pd (__m256d a, int rounding)

Synopsis

__m256d _mm256_round_pd (__m256d a, int rounding)
#include <immintrin.h>
Instruction: vroundpd ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed double-precision (64-bit) floating-point elements in a using the rounding parameter, and store the results as packed double-precision floating-point elements in dst.
Rounding is done according to the rounding parameter, which can be one of:

(_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC) // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC) // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC) // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := ROUND(a[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vroundps
__m256 _mm256_round_ps (__m256 a, int rounding)

Synopsis

__m256 _mm256_round_ps (__m256 a, int rounding)
#include <immintrin.h>
Instruction: vroundps ymm, ymm, imm
CPUID Flags: AVX

Description

Round the packed single-precision (32-bit) floating-point elements in a using the rounding parameter, and store the results as packed single-precision floating-point elements in dst.
Rounding is done according to the rounding parameter, which can be one of:

(_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC) // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC) // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC) // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ROUND(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 6 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vrsqrtps
__m256 _mm256_rsqrt_ps (__m256 a)

Synopsis

__m256 _mm256_rsqrt_ps (__m256 a)
#include <immintrin.h>
Instruction: vrsqrtps ymm, ymm
CPUID Flags: AVX

Description

Compute the approximate reciprocal square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := APPROXIMATE(1.0 / SQRT(a[i+31:i]))
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 7 1
Ivy Bridge 7 1
Sandy Bridge 7 1
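
A minimal sketch of the approximate reciprocal square root follows. The helper name rsqrt8 and the TARGET_AVX macro are assumptions for illustration (the macro uses a GCC/Clang attribute so no -mavx flag is needed). Because the result is only an approximation within the error bound stated above, callers should compare against a tolerance rather than expect exact values.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: approximate 1/sqrt(x) for eight floats at once. */
TARGET_AVX
void rsqrt8(const float *in, float *out)
{
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_rsqrt_ps(v));
}
```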
vpsadbw
__m256i _mm256_sad_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sad_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsadbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce four unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.

Operation

FOR j := 0 to 31
    i := j*8
    tmp[i+7:i] := ABS(a[i+7:i] - b[i+7:i])
ENDFOR
FOR j := 0 to 3
    i := j*64
    dst[i+15:i] := tmp[i+7:i] + tmp[i+15:i+8] + tmp[i+23:i+16] + tmp[i+31:i+24] + tmp[i+39:i+32] + tmp[i+47:i+40] + tmp[i+55:i+48] + tmp[i+63:i+56]
    dst[i+63:i+16] := 0
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 5 1
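
The sketch below shows one way the four partial sums produced by _mm256_sad_epu8 might be combined into a single sum of absolute differences over 32 bytes. The helper name sad32 and the TARGET_AVX2 macro are assumptions for this example (the macro uses a GCC/Clang attribute so the snippet can build without -mavx2).

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX2 enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: sum of |a[i] - b[i]| over 32 unsigned bytes. */
TARGET_AVX2
unsigned long long sad32(const unsigned char *a, const unsigned char *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* Each 64-bit element holds the sum for one group of 8 bytes. */
    __m256i s = _mm256_sad_epu8(va, vb);
    unsigned long long part[4];
    _mm256_storeu_si256((__m256i *)part, s);
    return part[0] + part[1] + part[2] + part[3];
}
```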
__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)

Synopsis

__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 16-bit integers in dst with the supplied values.

Operation

dst[15:0] := e0 dst[31:16] := e1 dst[47:32] := e2 dst[63:48] := e3 dst[79:64] := e4 dst[95:80] := e5 dst[111:96] := e6 dst[127:112] := e7 dst[143:128] := e8 dst[159:144] := e9 dst[175:160] := e10 dst[191:176] := e11 dst[207:192] := e12 dst[223:208] := e13 dst[239:224] := e14 dst[255:240] := e15 dst[MAX:256] := 0
__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)

Synopsis

__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 32-bit integers in dst with the supplied values.

Operation

dst[31:0] := e0 dst[63:32] := e1 dst[95:64] := e2 dst[127:96] := e3 dst[159:128] := e4 dst[191:160] := e5 dst[223:192] := e6 dst[255:224] := e7 dst[MAX:256] := 0
__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)

Synopsis

__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 64-bit integers in dst with the supplied values.

Operation

dst[63:0] := e0 dst[127:64] := e1 dst[191:128] := e2 dst[255:192] := e3 dst[MAX:256] := 0
__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)

Synopsis

__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 8-bit integers in dst with the supplied values.

Operation

dst[7:0] := e0 dst[15:8] := e1 dst[23:16] := e2 dst[31:24] := e3 dst[39:32] := e4 dst[47:40] := e5 dst[55:48] := e6 dst[63:56] := e7 dst[71:64] := e8 dst[79:72] := e9 dst[87:80] := e10 dst[95:88] := e11 dst[103:96] := e12 dst[111:104] := e13 dst[119:112] := e14 dst[127:120] := e15 dst[135:128] := e16 dst[143:136] := e17 dst[151:144] := e18 dst[159:152] := e19 dst[167:160] := e20 dst[175:168] := e21 dst[183:176] := e22 dst[191:184] := e23 dst[199:192] := e24 dst[207:200] := e25 dst[215:208] := e26 dst[223:216] := e27 dst[231:224] := e28 dst[239:232] := e29 dst[247:240] := e30 dst[255:248] := e31 dst[MAX:256] := 0
vinsertf128
__m256 _mm256_set_m128 (__m128 hi, __m128 lo)

Synopsis

__m256 _mm256_set_m128 (__m128 hi, __m128 lo)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256 vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256d _mm256_set_m128d (__m128d hi, __m128d lo)

Synopsis

__m256d _mm256_set_m128d (__m128d hi, __m128d lo)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256d vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_set_m128i (__m128i hi, __m128i lo)

Synopsis

__m256i _mm256_set_m128i (__m128i hi, __m128i lo)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256i vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
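
To make the argument order of the _mm256_set_m128* family concrete, the sketch below joins two 128-bit halves into one 256-bit vector. The helper name join_halves and the TARGET_AVX macro are assumptions for this example (the macro uses a GCC/Clang attribute so the snippet can build without -mavx), and it assumes a compiler whose immintrin.h provides _mm256_set_m128i.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: out8[0..3] comes from lo4, out8[4..7] from hi4. */
TARGET_AVX
void join_halves(const int *lo4, const int *hi4, int *out8)
{
    __m128i lo = _mm_loadu_si128((const __m128i *)lo4);
    __m128i hi = _mm_loadu_si128((const __m128i *)hi4);
    /* The high half is the first argument, the low half the second. */
    __m256i v = _mm256_set_m128i(hi, lo);
    _mm256_storeu_si256((__m256i *)out8, v);
}
```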
__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)

Synopsis

__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed double-precision (64-bit) floating-point elements in dst with the supplied values.

Operation

dst[63:0] := e0 dst[127:64] := e1 dst[191:128] := e2 dst[255:192] := e3 dst[MAX:256] := 0
__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)

Synopsis

__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed single-precision (32-bit) floating-point elements in dst with the supplied values.

Operation

dst[31:0] := e0 dst[63:32] := e1 dst[95:64] := e2 dst[127:96] := e3 dst[159:128] := e4 dst[191:160] := e5 dst[223:192] := e6 dst[255:224] := e7 dst[MAX:256] := 0
__m256i _mm256_set1_epi16 (short a)

Synopsis

__m256i _mm256_set1_epi16 (short a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast 16-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastw instruction.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := a[15:0]
ENDFOR
dst[MAX:256] := 0
__m256i _mm256_set1_epi32 (int a)

Synopsis

__m256i _mm256_set1_epi32 (int a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast 32-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastd instruction.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[31:0]
ENDFOR
dst[MAX:256] := 0
__m256i _mm256_set1_epi64x (long long a)

Synopsis

__m256i _mm256_set1_epi64x (long long a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast 64-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastq instruction.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[63:0]
ENDFOR
dst[MAX:256] := 0
__m256i _mm256_set1_epi8 (char a)

Synopsis

__m256i _mm256_set1_epi8 (char a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast 8-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastb instruction.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := a[7:0]
ENDFOR
dst[MAX:256] := 0
__m256d _mm256_set1_pd (double a)

Synopsis

__m256d _mm256_set1_pd (double a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast double-precision (64-bit) floating-point value a to all elements of dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[63:0]
ENDFOR
dst[MAX:256] := 0
__m256 _mm256_set1_ps (float a)

Synopsis

__m256 _mm256_set1_ps (float a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Broadcast single-precision (32-bit) floating-point value a to all elements of dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[31:0]
ENDFOR
dst[MAX:256] := 0
__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)

Synopsis

__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 16-bit integers in dst with the supplied values in reverse order.

Operation

dst[15:0] := e15 dst[31:16] := e14 dst[47:32] := e13 dst[63:48] := e12 dst[79:64] := e11 dst[95:80] := e10 dst[111:96] := e9 dst[127:112] := e8 dst[143:128] := e7 dst[159:144] := e6 dst[175:160] := e5 dst[191:176] := e4 dst[207:192] := e3 dst[223:208] := e2 dst[239:224] := e1 dst[255:240] := e0 dst[MAX:256] := 0
__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)

Synopsis

__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 32-bit integers in dst with the supplied values in reverse order.

Operation

dst[31:0] := e7 dst[63:32] := e6 dst[95:64] := e5 dst[127:96] := e4 dst[159:128] := e3 dst[191:160] := e2 dst[223:192] := e1 dst[255:224] := e0 dst[MAX:256] := 0
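
The difference between the set and setr variants is only the order in which arguments map to elements. The sketch below builds the same vector both ways; the helper name set_orders and the TARGET_AVX macro are assumptions for this example (the macro uses a GCC/Clang attribute so the snippet can build without -mavx).

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: fill both outputs with 0..7 in memory order. */
TARGET_AVX
void set_orders(int *via_set, int *via_setr)
{
    /* set: the LAST argument lands in bits 31:0. */
    __m256i a = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    /* setr: the FIRST argument lands in bits 31:0. */
    __m256i b = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    _mm256_storeu_si256((__m256i *)via_set, a);
    _mm256_storeu_si256((__m256i *)via_setr, b);
}
```

When the values are meant to mirror an array in memory, the setr form tends to read more naturally because its argument order matches the element order.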
__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)

Synopsis

__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 64-bit integers in dst with the supplied values in reverse order.

Operation

dst[63:0] := e3 dst[127:64] := e2 dst[191:128] := e1 dst[255:192] := e0 dst[MAX:256] := 0
__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)

Synopsis

__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed 8-bit integers in dst with the supplied values in reverse order.

Operation

dst[7:0] := e31 dst[15:8] := e30 dst[23:16] := e29 dst[31:24] := e28 dst[39:32] := e27 dst[47:40] := e26 dst[55:48] := e25 dst[63:56] := e24 dst[71:64] := e23 dst[79:72] := e22 dst[87:80] := e21 dst[95:88] := e20 dst[103:96] := e19 dst[111:104] := e18 dst[119:112] := e17 dst[127:120] := e16 dst[135:128] := e15 dst[143:136] := e14 dst[151:144] := e13 dst[159:152] := e12 dst[167:160] := e11 dst[175:168] := e10 dst[183:176] := e9 dst[191:184] := e8 dst[199:192] := e7 dst[207:200] := e6 dst[215:208] := e5 dst[223:216] := e4 dst[231:224] := e3 dst[239:232] := e2 dst[247:240] := e1 dst[255:248] := e0 dst[MAX:256] := 0
vinsertf128
__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)

Synopsis

__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256 vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)

Synopsis

__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256d vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vinsertf128
__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)

Synopsis

__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)
#include <immintrin.h>
Instruction: vinsertf128 ymm, ymm, xmm, imm
CPUID Flags: AVX

Description

Set packed __m256i vector dst with the supplied values.

Operation

dst[127:0] := lo[127:0] dst[255:128] := hi[127:0] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)

Synopsis

__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed double-precision (64-bit) floating-point elements in dst with the supplied values in reverse order.

Operation

dst[63:0] := e3 dst[127:64] := e2 dst[191:128] := e1 dst[255:192] := e0 dst[MAX:256] := 0
__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)

Synopsis

__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
#include <immintrin.h>
CPUID Flags: AVX

Description

Set packed single-precision (32-bit) floating-point elements in dst with the supplied values in reverse order.

Operation

dst[31:0] := e7 dst[63:32] := e6 dst[95:64] := e5 dst[127:96] := e4 dst[159:128] := e3 dst[191:160] := e2 dst[223:192] := e1 dst[255:224] := e0 dst[MAX:256] := 0
vxorpd
__m256d _mm256_setzero_pd (void)

Synopsis

__m256d _mm256_setzero_pd (void)
#include <immintrin.h>
Instruction: vxorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256d with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vxorps
__m256 _mm256_setzero_ps (void)

Synopsis

__m256 _mm256_setzero_ps (void)
#include <immintrin.h>
Instruction: vxorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256 with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpxor
__m256i _mm256_setzero_si256 (void)

Synopsis

__m256i _mm256_setzero_si256 (void)
#include <immintrin.h>
Instruction: vpxor ymm, ymm, ymm
CPUID Flags: AVX

Description

Return vector of type __m256i with all elements set to zero.

Operation

dst[MAX:0] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpshufd
__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpshufd ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 32-bit integers in a within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

SELECT4(src, control) {
    CASE(control[1:0])
    0: tmp[31:0] := src[31:0]
    1: tmp[31:0] := src[63:32]
    2: tmp[31:0] := src[95:64]
    3: tmp[31:0] := src[127:96]
    ESAC
    RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm8[1:0])
dst[63:32] := SELECT4(a[127:0], imm8[3:2])
dst[95:64] := SELECT4(a[127:0], imm8[5:4])
dst[127:96] := SELECT4(a[127:0], imm8[7:6])
dst[159:128] := SELECT4(a[255:128], imm8[1:0])
dst[191:160] := SELECT4(a[255:128], imm8[3:2])
dst[223:192] := SELECT4(a[255:128], imm8[5:4])
dst[255:224] := SELECT4(a[255:128], imm8[7:6])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
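
The SELECT4 logic above can be exercised with a small sketch that reverses the four 32-bit integers inside each 128-bit lane. The helper name reverse_in_lanes and the TARGET_AVX2 macro are assumptions for this example (the macro uses a GCC/Clang attribute so the snippet can build without -mavx2). The control 0x1B encodes the selectors 3, 2, 1, 0 for destination elements 0 to 3.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX2 enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: reverse element order within each 128-bit lane. */
TARGET_AVX2
void reverse_in_lanes(const int *in, int *out)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    /* 0x1B = 00 01 10 11b: dst[0] <- src[3], dst[1] <- src[2], and so on. */
    __m256i r = _mm256_shuffle_epi32(v, 0x1B);
    _mm256_storeu_si256((__m256i *)out, r);
}
```

Note that the same control is applied to both lanes, so elements never move between the low and high halves.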
vpshufb
__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpshufb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shuffle 8-bit integers in a within 128-bit lanes according to shuffle control mask in the corresponding 8-bit element of b, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i]
        dst[i+7:i] := a[index*8+7:index*8]
    FI
    IF b[128+i+7] == 1
        dst[128+i+7:128+i] := 0
    ELSE
        index[3:0] := b[128+i+3:128+i]
        dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
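
A short sketch of the byte shuffle follows, with the control vector supplied by the caller. The helper name shuffle_bytes and the TARGET_AVX2 macro are assumptions for this example (the macro uses a GCC/Clang attribute so the snippet can build without -mavx2). Each control byte selects a source byte from the same 128-bit lane using its low 4 bits, and a control byte with its top bit set produces zero.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX2 enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: out[i] = src byte chosen by ctrl[i], lane-local. */
TARGET_AVX2
void shuffle_bytes(const unsigned char *src, const unsigned char *ctrl,
                   unsigned char *out)
{
    __m256i a = _mm256_loadu_si256((const __m256i *)src);
    __m256i b = _mm256_loadu_si256((const __m256i *)ctrl);
    _mm256_storeu_si256((__m256i *)out, _mm256_shuffle_epi8(a, b));
}
```

A control of 15, 14, ..., 0 repeated in both lanes reverses the bytes of each lane separately; it does not reverse the whole 32-byte vector.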
vshufpd
__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)

Synopsis

__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)
#include <immintrin.h>
Instruction: vshufpd ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

Shuffle double-precision (64-bit) floating-point elements within 128-bit lanes using the control in imm8, and store the results in dst.

Operation

dst[63:0] := (imm8[0] == 0) ? a[63:0] : a[127:64] dst[127:64] := (imm8[1] == 0) ? b[63:0] : b[127:64] dst[191:128] := (imm8[2] == 0) ? a[191:128] : a[255:192] dst[255:192] := (imm8[3] == 0) ? b[191:128] : b[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vshufps
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)

Synopsis

__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
#include <immintrin.h>
Instruction: vshufps ymm, ymm, ymm, imm
CPUID Flags: AVX

Description

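Shuffle single-precision (32-bit) floating-point elements in a and b within 128-bit lanes using the control in imm8, and store the results in dst. In each lane, the two low elements of dst are selected from a and the two high elements from b.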

Operation

SELECT4(src, control) {
    CASE(control[1:0])
    0: tmp[31:0] := src[31:0]
    1: tmp[31:0] := src[63:32]
    2: tmp[31:0] := src[95:64]
    3: tmp[31:0] := src[127:96]
    ESAC
    RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm8[1:0])
dst[63:32] := SELECT4(a[127:0], imm8[3:2])
dst[95:64] := SELECT4(b[127:0], imm8[5:4])
dst[127:96] := SELECT4(b[127:0], imm8[7:6])
dst[159:128] := SELECT4(a[255:128], imm8[1:0])
dst[191:160] := SELECT4(a[255:128], imm8[3:2])
dst[223:192] := SELECT4(b[255:128], imm8[5:4])
dst[255:224] := SELECT4(b[255:128], imm8[7:6])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
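
The two-source behaviour of _mm256_shuffle_ps is easy to misread, so here is a small sketch. The helper name mix_halves and the TARGET_AVX macro are assumptions for this example (the macro uses a GCC/Clang attribute so the snippet can build without -mavx). _MM_SHUFFLE(1, 0, 3, 2) builds the control so that, per lane, the result is a[2], a[3], b[0], b[1].

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: per lane, take the upper pair of a and the lower pair of b. */
TARGET_AVX
void mix_halves(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    /* The two low selectors index into a, the two high selectors into b. */
    __m256 r = _mm256_shuffle_ps(va, vb, _MM_SHUFFLE(1, 0, 3, 2));
    _mm256_storeu_ps(out, r);
}
```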
vpshufhw
__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpshufhw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 16-bit integers in the high 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the high 64 bits of 128-bit lanes of dst, with the low 64 bits of 128-bit lanes being copied from a to dst.

Operation

dst[63:0] := a[63:0] dst[79:64] := (a >> (imm8[1:0] * 16))[79:64] dst[95:80] := (a >> (imm8[3:2] * 16))[79:64] dst[111:96] := (a >> (imm8[5:4] * 16))[79:64] dst[127:112] := (a >> (imm8[7:6] * 16))[79:64] dst[191:128] := a[191:128] dst[207:192] := (a >> (imm8[1:0] * 16))[207:192] dst[223:208] := (a >> (imm8[3:2] * 16))[207:192] dst[239:224] := (a >> (imm8[5:4] * 16))[207:192] dst[255:240] := (a >> (imm8[7:6] * 16))[207:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpshuflw
__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpshuflw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shuffle 16-bit integers in the low 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the low 64 bits of 128-bit lanes of dst, with the high 64 bits of 128-bit lanes being copied from a to dst.

Operation

dst[15:0] := (a >> (imm8[1:0] * 16))[15:0] dst[31:16] := (a >> (imm8[3:2] * 16))[15:0] dst[47:32] := (a >> (imm8[5:4] * 16))[15:0] dst[63:48] := (a >> (imm8[7:6] * 16))[15:0] dst[127:64] := a[127:64] dst[143:128] := (a >> (imm8[1:0] * 16))[143:128] dst[159:144] := (a >> (imm8[3:2] * 16))[143:128] dst[175:160] := (a >> (imm8[5:4] * 16))[143:128] dst[191:176] := (a >> (imm8[7:6] * 16))[143:128] dst[255:192] := a[255:192] dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpsignw
__m256i _mm256_sign_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsignw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 16-bit integers in a when the corresponding signed 16-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 15
    i := j*16
    IF b[i+15:i] < 0
        dst[i+15:i] := NEG(a[i+15:i])
    ELSE IF b[i+15:i] = 0
        dst[i+15:i] := 0
    ELSE
        dst[i+15:i] := a[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpsignd
__m256i _mm256_sign_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsignd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 32-bit integers in a when the corresponding signed 32-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 7
    i := j*32
    IF b[i+31:i] < 0
        dst[i+31:i] := NEG(a[i+31:i])
    ELSE IF b[i+31:i] = 0
        dst[i+31:i] := 0
    ELSE
        dst[i+31:i] := a[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
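
The three-way behaviour of the sign intrinsics (negate, zero, or pass through) can be sketched as follows. The helper name apply_sign and the TARGET_AVX2 macro are assumptions for this example (the macro uses a GCC/Clang attribute so the snippet can build without -mavx2).

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX2 enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: out[i] = -a[i] if b[i] < 0, 0 if b[i] == 0, else a[i]. */
TARGET_AVX2
void apply_sign(const int *a, const int *b, int *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_sign_epi32(va, vb));
}
```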
vpsignb
__m256i _mm256_sign_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sign_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsignb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Negate packed 8-bit integers in a when the corresponding signed 8-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

Operation

FOR j := 0 to 31
    i := j*8
    IF b[i+7:i] < 0
        dst[i+7:i] := NEG(a[i+7:i])
    ELSE IF b[i+7:i] = 0
        dst[i+7:i] := 0
    ELSE
        dst[i+7:i] := a[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 0.5
vpsllw
__m256i _mm256_sll_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi16 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsllw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF count[63:0] > 15
        dst[i+15:i] := 0
    ELSE
        dst[i+15:i] := ZeroExtend(a[i+15:i] << count[63:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpslld
__m256i _mm256_sll_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi32 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpslld ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF count[63:0] > 31
        dst[i+31:i] := 0
    ELSE
        dst[i+31:i] := ZeroExtend(a[i+31:i] << count[63:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpsllq
__m256i _mm256_sll_epi64 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sll_epi64 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsllq ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    IF count[63:0] > 63
        dst[i+63:i] := 0
    ELSE
        dst[i+63:i] := ZeroExtend(a[i+63:i] << count[63:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4 0.5
vpsllw
__m256i _mm256_slli_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi16 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsllw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF imm8[7:0] > 15
        dst[i+15:i] := 0
    ELSE
        dst[i+15:i] := ZeroExtend(a[i+15:i] << imm8[7:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpslld
__m256i _mm256_slli_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi32 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpslld ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF imm8[7:0] > 31
        dst[i+31:i] := 0
    ELSE
        dst[i+31:i] := ZeroExtend(a[i+31:i] << imm8[7:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsllq
__m256i _mm256_slli_epi64 (__m256i a, int imm8)

Synopsis

__m256i _mm256_slli_epi64 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsllq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    IF imm8[7:0] > 63
        dst[i+63:i] := 0
    ELSE
        dst[i+63:i] := ZeroExtend(a[i+63:i] << imm8[7:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpslldq
__m256i _mm256_slli_si256 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_slli_si256 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpslldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0]
IF tmp > 15
    tmp := 16
FI
dst[127:0] := a[127:0] << (tmp*8)
dst[255:128] := a[255:128] << (tmp*8)
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
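
Because _mm256_slli_si256 shifts each 128-bit lane independently, bytes shifted out of the low lane do not enter the high lane. The sketch below illustrates this with a 4-byte shift over 32-bit elements; the helper name shift_lanes_4 and the TARGET_AVX2 macro are assumptions for this example (the macro uses a GCC/Clang attribute so the snippet can build without -mavx2).

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX2 enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: shift each 128-bit lane left by 4 bytes (one int). */
TARGET_AVX2
void shift_lanes_4(const int *in, int *out)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    /* The top int of each lane is discarded; a zero enters at the bottom. */
    _mm256_storeu_si256((__m256i *)out, _mm256_slli_si256(v, 4));
}
```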
vpsllvd
__m128i _mm_sllv_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_sllv_epi32 (__m128i a, __m128i count)
#include <immintrin.h>
Instruction: vpsllvd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := ZeroExtend(a[i+31:i] << count[i+31:i])
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
vpsllvd
__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)
#include <immintrin.h>
Instruction: vpsllvd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ZeroExtend(a[i+31:i] << count[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2 2
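
The per-element variable shift can be sketched as below, where every element gets its own shift count. The helper name shift_each and the TARGET_AVX2 macro are assumptions for this example (the macro uses a GCC/Clang attribute so the snippet can build without -mavx2). As defined for the underlying vpsllvd instruction, a count above 31 yields zero for that element.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX2 enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: out[i] = a[i] << count[i], per 32-bit element. */
TARGET_AVX2
void shift_each(const unsigned *a, const unsigned *count, unsigned *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vc = _mm256_loadu_si256((const __m256i *)count);
    _mm256_storeu_si256((__m256i *)out, _mm256_sllv_epi32(va, vc));
}
```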
vpsllvq
__m128i _mm_sllv_epi64 (__m128i a, __m128i count)

Synopsis

__m128i _mm_sllv_epi64 (__m128i a, __m128i count)
#include <immintrin.h>
Instruction: vpsllvq xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 1
    i := j*64
    dst[i+63:i] := ZeroExtend(a[i+63:i] << count[i+63:i])
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsllvq
__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)
#include <immintrin.h>
Instruction: vpsllvq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := ZeroExtend(a[i+63:i] << count[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vsqrtpd
__m256d _mm256_sqrt_pd (__m256d a)

Synopsis

__m256d _mm256_sqrt_pd (__m256d a)
#include <immintrin.h>
Instruction: vsqrtpd ymm, ymm
CPUID Flags: AVX

Description

Compute the square root of packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := SQRT(a[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 35 28
Ivy Bridge 35 28
Sandy Bridge 43 44
vsqrtps
__m256 _mm256_sqrt_ps (__m256 a)

Synopsis

__m256 _mm256_sqrt_ps (__m256 a)
#include <immintrin.h>
Instruction: vsqrtps ymm, ymm
CPUID Flags: AVX

Description

Compute the square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := SQRT(a[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 21 14
Ivy Bridge 21 14
Sandy Bridge 29 28
vpsraw
__m256i _mm256_sra_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sra_epi16 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsraw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF count[63:0] > 15
        dst[i+15:i] := SignBit
    ELSE
        dst[i+15:i] := SignExtend(a[i+15:i] >> count[63:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrad
__m256i _mm256_sra_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_sra_epi32 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrad ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF count[63:0] > 31
        dst[i+31:i] := SignBit
    ELSE
        dst[i+31:i] := SignExtend(a[i+31:i] >> count[63:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsraw
__m256i _mm256_srai_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srai_epi16 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsraw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF imm8[7:0] > 15
        dst[i+15:i] := SignBit
    ELSE
        dst[i+15:i] := SignExtend(a[i+15:i] >> imm8[7:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrad
__m256i _mm256_srai_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srai_epi32 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsrad ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF imm8[7:0] > 31
        dst[i+31:i] := SignBit
    ELSE
        dst[i+31:i] := SignExtend(a[i+31:i] >> imm8[7:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
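
An arithmetic right shift keeps the sign of each element, which the sketch below demonstrates with a shift by 2. The helper name sar2 and the TARGET_AVX2 macro are assumptions for this example (the macro uses a GCC/Clang attribute so the snippet can build without -mavx2).

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute; other compilers need AVX2 enabled via build flags. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: arithmetic shift right by 2 on eight 32-bit ints. */
TARGET_AVX2
void sar2(const int *in, int *out)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    /* Sign bits are shifted in, so negative inputs stay negative. */
    _mm256_storeu_si256((__m256i *)out, _mm256_srai_epi32(v, 2));
}
```

For negative inputs the result rounds toward negative infinity, so -3 becomes -1 rather than 0.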
vpsravd
__m128i _mm_srav_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srav_epi32 (__m128i a, __m128i count)
#include <immintrin.h>
Instruction: vpsravd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := SignExtend(a[i+31:i] >> count[i+31:i])
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsravd
__m256i _mm256_srav_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srav_epi32 (__m256i a, __m256i count)
#include <immintrin.h>
Instruction: vpsravd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := SignExtend(a[i+31:i] >> count[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
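A per-element arithmetic shift might be used as in the sketch below. The function name `shift_right_arith_var` is hypothetical, and `__AVX2__` is assumed to be defined when the compiler targets AVX2. The pseudocode above does not spell out what happens for counts above 31; the scalar branch assumes the behaviour described in the Intel instruction reference for vpsravd, where such elements are filled with the sign bit.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Shift each of eight 32-bit integers right by its own count,
   shifting in sign bits. Name is an illustrative assumption. */
void shift_right_arith_var(const int32_t in[8], const uint32_t counts[8],
                           int32_t out[8])
{
#if defined(__AVX2__)
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i c = _mm256_loadu_si256((const __m256i *)counts);
    _mm256_storeu_si256((__m256i *)out, _mm256_srav_epi32(v, c));
#else
    for (int j = 0; j < 8; j++) {
        /* Assumption: counts above 31 leave only sign bits, which is the
           same result as shifting by 31. */
        uint32_t n = counts[j] > 31 ? 31 : counts[j];
        uint32_t u = (uint32_t)in[j];
        uint32_t fill = (in[j] < 0 && n > 0) ? ~(0xFFFFFFFFu >> n) : 0u;
        out[j] = (int32_t)((u >> n) | fill);
    }
#endif
}
```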
vpsrlw
__m256i _mm256_srl_epi16 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi16 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrlw ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF count[63:0] > 15
        dst[i+15:i] := 0
    ELSE
        dst[i+15:i] := ZeroExtend(a[i+15:i] >> count[63:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrld
__m256i _mm256_srl_epi32 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi32 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrld ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF count[63:0] > 31
        dst[i+31:i] := 0
    ELSE
        dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[63:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
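The count operand of this form is a whole __m128i, but only its low 64 bits are read and the same count applies to every element. A minimal sketch, assuming a hypothetical wrapper `shift_right_logical` and a compiler that defines `__AVX2__` when AVX2 is enabled:

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Logical right shift of eight 32-bit integers by one shared run-time
   count. The wrapper name is an illustrative assumption. */
void shift_right_logical(const uint32_t in[8], uint64_t count, uint32_t out[8])
{
#if defined(__AVX2__)
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    /* Place the count in bits 63:0 of an XMM register; upper bits are zero. */
    __m128i c = _mm_set_epi64x(0, (long long)count);
    _mm256_storeu_si256((__m256i *)out, _mm256_srl_epi32(v, c));
#else
    /* Scalar model of the Operation pseudocode: counts above 31 give 0. */
    for (int j = 0; j < 8; j++)
        out[j] = (count > 31) ? 0u : (in[j] >> count);
#endif
}
```

Unlike the arithmetic variants, the top bit is not replicated: 0x80000000 shifted right by 4 becomes 0x08000000.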
vpsrlq
__m256i _mm256_srl_epi64 (__m256i a, __m128i count)

Synopsis

__m256i _mm256_srl_epi64 (__m256i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrlq ymm, ymm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    IF count[63:0] > 63
        dst[i+63:i] := 0
    ELSE
        dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[63:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 4
vpsrlw
__m256i _mm256_srli_epi16 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi16 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsrlw ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 16-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    IF imm8[7:0] > 15
        dst[i+15:i] := 0
    ELSE
        dst[i+15:i] := ZeroExtend(a[i+15:i] >> imm8[7:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrld
__m256i _mm256_srli_epi32 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi32 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsrld ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    IF imm8[7:0] > 31
        dst[i+31:i] := 0
    ELSE
        dst[i+31:i] := ZeroExtend(a[i+31:i] >> imm8[7:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlq
__m256i _mm256_srli_epi64 (__m256i a, int imm8)

Synopsis

__m256i _mm256_srli_epi64 (__m256i a, int imm8)
#include <immintrin.h>
Instruction: vpsrlq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    IF imm8[7:0] > 63
        dst[i+63:i] := 0
    ELSE
        dst[i+63:i] := ZeroExtend(a[i+63:i] >> imm8[7:0])
    FI
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrldq
__m256i _mm256_srli_si256 (__m256i a, const int imm8)

Synopsis

__m256i _mm256_srli_si256 (__m256i a, const int imm8)
#include <immintrin.h>
Instruction: vpsrldq ymm, ymm, imm
CPUID Flags: AVX2

Description

Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.

Operation

tmp := imm8[7:0]
IF tmp > 15
    tmp := 16
FI
dst[127:0] := a[127:0] >> (tmp*8)
dst[255:128] := a[255:128] >> (tmp*8)
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
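Because the byte shift works on each 128-bit lane separately, bytes never cross from the upper lane into the lower one. The sketch below illustrates this with a fixed shift of 4 bytes; the function name `shift_bytes_right_by4` is hypothetical, and `__AVX2__` is assumed to be defined when the compiler targets AVX2.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Shift each 16-byte lane of a 32-byte vector right by 4 bytes,
   filling with zeros. Name and count are illustrative assumptions. */
void shift_bytes_right_by4(const uint8_t in[32], uint8_t out[32])
{
#if defined(__AVX2__)
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, _mm256_srli_si256(v, 4));
#else
    /* Scalar model: byte k of a lane takes byte k+4 of the same lane,
       or zero when k+4 falls outside the lane. */
    for (int lane = 0; lane < 2; lane++)
        for (int k = 0; k < 16; k++)
            out[lane * 16 + k] = (k + 4 < 16) ? in[lane * 16 + k + 4] : 0;
#endif
}
```

With input bytes 1 to 32, output byte 12 is 0 rather than 17, which shows that the upper lane does not feed the lower one.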
vpsrlvd
__m128i _mm_srlv_epi32 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srlv_epi32 (__m128i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrlvd xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[i+31:i])
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsrlvd
__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)
#include <immintrin.h>
Instruction: vpsrlvd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ZeroExtend(a[i+31:i] >> count[i+31:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 2
vpsrlvq
__m128i _mm_srlv_epi64 (__m128i a, __m128i count)

Synopsis

__m128i _mm_srlv_epi64 (__m128i a, __m128i count)
#include <immintrin.h>
Instruction: vpsrlvq xmm, xmm, xmm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 1
    i := j*64
    dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[i+63:i])
ENDFOR
dst[MAX:128] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsrlvq
__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)

Synopsis

__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)
#include <immintrin.h>
Instruction: vpsrlvq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := ZeroExtend(a[i+63:i] >> count[i+63:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vmovapd
void _mm256_store_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_store_pd (double * mem_addr, __m256d a)
#include <immintrin.h>
Instruction: vmovapd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovaps
void _mm256_store_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_store_ps (float * mem_addr, __m256 a)
#include <immintrin.h>
Instruction: vmovaps m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovdqa
void _mm256_store_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
#include <immintrin.h>
Instruction: vmovdqa m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovupd
void _mm256_storeu_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_storeu_pd (double * mem_addr, __m256d a)
#include <immintrin.h>
Instruction: vmovupd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovups
void _mm256_storeu_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_storeu_ps (float * mem_addr, __m256 a)
#include <immintrin.h>
Instruction: vmovups m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovdqu
void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
#include <immintrin.h>
Instruction: vmovdqu m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
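The difference between the aligned and unaligned store forms can be sketched as below. This is only an illustration: the helper `store8` is hypothetical, and `__AVX__` is assumed to be defined when the compiler targets AVX. The helper checks the destination address at run time and uses the aligned store only when the 32-byte requirement holds.

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Store eight floats to dst. Returns 1 when dst is 32-byte aligned
   (aligned store path), 0 otherwise. The helper name is an assumption. */
int store8(float *dst, const float src[8])
{
    int aligned = ((uintptr_t)dst % 32) == 0;
#if defined(__AVX__)
    __m256 v = _mm256_loadu_ps(src);
    if (aligned)
        _mm256_store_ps(dst, v);    /* vmovaps: may fault on a misaligned address */
    else
        _mm256_storeu_ps(dst, v);   /* vmovups: accepts any address */
#else
    memcpy(dst, src, 8 * sizeof(float));   /* portable fallback */
#endif
    return aligned;
}
```

In practice many programs simply use the unaligned form everywhere and keep the aligned form for buffers whose alignment is guaranteed by construction.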
void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)

Synopsis

void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of 4 packed single-precision (32-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0]
MEM[hiaddr+127:hiaddr] := a[255:128]
void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)

Synopsis

void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of 2 packed double-precision (64-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0]
MEM[hiaddr+127:hiaddr] := a[255:128]
void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)

Synopsis

void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)
#include <immintrin.h>
CPUID Flags: AVX

Description

Store the high and low 128-bit halves (each composed of integer data) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

Operation

MEM[loaddr+127:loaddr] := a[127:0]
MEM[hiaddr+127:hiaddr] := a[255:128]
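The three storeu2 entries list no single instruction, which suggests they expand to a short sequence. One plausible equivalent for the float variant, built only from other AVX intrinsics, is sketched below. The function name `store_halves` is hypothetical, `__AVX__` is assumed to be defined when the compiler targets AVX, and this is not claimed to be the sequence any particular compiler emits.

```c
#include <string.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Write the low four floats of src to loaddr and the high four to hiaddr.
   Neither address needs any particular alignment. Name is an assumption. */
void store_halves(float *hiaddr, float *loaddr, const float src[8])
{
#if defined(__AVX__)
    __m256 a = _mm256_loadu_ps(src);
    _mm_storeu_ps(loaddr, _mm256_castps256_ps128(a));     /* a[127:0]   */
    _mm_storeu_ps(hiaddr, _mm256_extractf128_ps(a, 1));   /* a[255:128] */
#else
    /* Scalar model of the Operation pseudocode. */
    memcpy(loaddr, src, 4 * sizeof(float));
    memcpy(hiaddr, src + 4, 4 * sizeof(float));
#endif
}
```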
vmovntdqa
__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)

Synopsis

__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)
#include <immintrin.h>
Instruction: vmovntdqa ymm, m256
CPUID Flags: AVX2

Description

Load 256-bits of integer data from memory into dst using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
vmovntpd
void _mm256_stream_pd (double * mem_addr, __m256d a)

Synopsis

void _mm256_stream_pd (double * mem_addr, __m256d a)
#include <immintrin.h>
Instruction: vmovntpd m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovntps
void _mm256_stream_ps (float * mem_addr, __m256 a)

Synopsis

void _mm256_stream_ps (float * mem_addr, __m256 a)
#include <immintrin.h>
Instruction: vmovntps m256, ymm
CPUID Flags: AVX

Description

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
vmovntdq
void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)

Synopsis

void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
#include <immintrin.h>
Instruction: vmovntdq m256, ymm
CPUID Flags: AVX

Description

Store 256-bits of integer data from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
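A typical use of the non-temporal stores is filling a large buffer that will not be read again soon. The sketch below is a hedged illustration: `fill_stream` is a hypothetical helper, `__AVX__` is assumed to be defined when the compiler targets AVX, and the trailing `_mm_sfence()` reflects the usual practice of fencing after weakly ordered streaming stores rather than anything stated in the entries above.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Fill n floats at dst with value using streaming stores.
   Returns -1 if dst is not 32-byte aligned or n is not a multiple of 8,
   0 on success. The helper name is an assumption. */
int fill_stream(float *dst, size_t n, float value)
{
    if (((uintptr_t)dst % 32) != 0 || (n % 8) != 0)
        return -1;   /* the aligned streaming store could fault otherwise */
#if defined(__AVX__)
    __m256 v = _mm256_set1_ps(value);
    for (size_t i = 0; i < n; i += 8)
        _mm256_stream_ps(dst + i, v);   /* vmovntps: non-temporal hint */
    _mm_sfence();   /* order the streaming stores before later stores */
#else
    for (size_t i = 0; i < n; i++)
        dst[i] = value;   /* portable fallback with ordinary stores */
#endif
    return 0;
}
```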
vpsubw
__m256i _mm256_sub_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 16-bit integers in b from packed 16-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := a[i+15:i] - b[i+15:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubd
__m256i _mm256_sub_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 32-bit integers in b from packed 32-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] - b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubq
__m256i _mm256_sub_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 64-bit integers in b from packed 64-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] - b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubb
__m256i _mm256_sub_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_sub_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 8-bit integers in b from packed 8-bit integers in a, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := a[i+7:i] - b[i+7:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vsubpd
__m256d _mm256_sub_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_sub_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vsubpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Subtract packed double-precision (64-bit) floating-point elements in b from packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] - b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vsubps
__m256 _mm256_sub_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_sub_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vsubps ymm, ymm, ymm
CPUID Flags: AVX

Description

Subtract packed single-precision (32-bit) floating-point elements in b from packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] - b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 3 1
Sandy Bridge 3 1
vpsubsw
__m256i _mm256_subs_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubsw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 16-bit integers in b from packed 16-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_Int16(a[i+15:i] - b[i+15:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubsb
__m256i _mm256_subs_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubsb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed 8-bit integers in b from packed 8-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := Saturate_To_Int8(a[i+7:i] - b[i+7:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubusw
__m256i _mm256_subs_epu16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epu16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubusw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed unsigned 16-bit integers in b from packed unsigned 16-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_UnsignedInt16(a[i+15:i] - b[i+15:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vpsubusb
__m256i _mm256_subs_epu8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_subs_epu8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpsubusb ymm, ymm, ymm
CPUID Flags: AVX2

Description

Subtract packed unsigned 8-bit integers in b from packed unsigned 8-bit integers in a using saturation, and store the results in dst.

Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := Saturate_To_UnsignedInt8(a[i+7:i] - b[i+7:i])
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
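The difference between the wrapping and saturating subtractions is easiest to see side by side on unsigned bytes. The following is a minimal sketch, assuming a hypothetical helper `sub_u8` and a compiler that defines `__AVX2__` when AVX2 is enabled; the scalar branch models the two Operation listings.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Compute a - b on 32 unsigned bytes twice: once with wraparound
   (sub_epi8) and once with unsigned saturation (subs_epu8).
   The helper name is an illustrative assumption. */
void sub_u8(const uint8_t a[32], const uint8_t b[32],
            uint8_t wrap[32], uint8_t sat[32])
{
#if defined(__AVX2__)
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)wrap, _mm256_sub_epi8(va, vb));
    _mm256_storeu_si256((__m256i *)sat, _mm256_subs_epu8(va, vb));
#else
    for (int j = 0; j < 32; j++) {
        wrap[j] = (uint8_t)(a[j] - b[j]);             /* modulo 256 */
        sat[j] = (a[j] > b[j]) ? (uint8_t)(a[j] - b[j]) : 0;  /* clamp at 0 */
    }
#endif
}
```

For 10 - 20 the wrapping form yields 246, while the saturating form yields 0.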
vtestpd
int _mm_testc_pd (__m128d a, __m128d b)

Synopsis

int _mm_testc_pd (__m128d a, __m128d b)
#include <immintrin.h>
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0]
IF (tmp[63] == tmp[127] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[127:0] := (NOT a[127:0]) AND b[127:0]
IF (tmp[63] == tmp[127] == 0)
    CF := 1
ELSE
    CF := 0
FI
RETURN CF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testc_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testc_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0]
IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[255:0] := (NOT a[255:0]) AND b[255:0]
IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0)
    CF := 1
ELSE
    CF := 0
FI
RETURN CF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testc_ps (__m128 a, __m128 b)

Synopsis

int _mm_testc_ps (__m128 a, __m128 b)
#include <immintrin.h>
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[127:0] := (NOT a[127:0]) AND b[127:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0)
    CF := 1
ELSE
    CF := 0
FI
RETURN CF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testc_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testc_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[255:0] := (NOT a[255:0]) AND b[255:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0)
    CF := 1
ELSE
    CF := 0
FI
RETURN CF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testc_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testc_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the CF value.

Operation

IF (a[255:0] AND b[255:0] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
IF ((NOT a[255:0]) AND b[255:0] == 0)
    CF := 1
ELSE
    CF := 0
FI
RETURN CF

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
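Since CF is set when (NOT a) AND b is zero, the testc form answers the question "is every bit set in b also set in a?". A minimal sketch of that reading follows; the name `covers_all_bits` is hypothetical, and `__AVX__` is assumed to be defined when the compiler targets AVX.

```c
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Return 1 if every bit set in b is also set in a (256 bits each),
   0 otherwise. The function name is an illustrative assumption. */
int covers_all_bits(const uint64_t a[4], const uint64_t b[4])
{
#if defined(__AVX__)
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    return _mm256_testc_si256(va, vb);   /* returns the CF value */
#else
    /* Scalar model: CF := ((NOT a) AND b) == 0 */
    uint64_t acc = 0;
    for (int j = 0; j < 4; j++)
        acc |= ~a[j] & b[j];
    return acc == 0;
#endif
}
```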
vtestpd
int _mm_testnzc_pd (__m128d a, __m128d b)

Synopsis

int _mm_testnzc_pd (__m128d a, __m128d b)
#include <immintrin.h>
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[127:0] := a[127:0] AND b[127:0]
IF (tmp[63] == tmp[127] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[127:0] := (NOT a[127:0]) AND b[127:0]
IF (tmp[63] == tmp[127] == 0)
    CF := 1
ELSE
    CF := 0
FI
IF (ZF == 0 && CF == 0)
    RETURN 1
ELSE
    RETURN 0
FI

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testnzc_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testnzc_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[255:0] := a[255:0] AND b[255:0]
IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[255:0] := (NOT a[255:0]) AND b[255:0]
IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0)
    CF := 1
ELSE
    CF := 0
FI
IF (ZF == 0 && CF == 0)
    RETURN 1
ELSE
    RETURN 0
FI

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testnzc_ps (__m128 a, __m128 b)

Synopsis

int _mm_testnzc_ps (__m128 a, __m128 b)
#include <immintrin.h>
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[127:0] := a[127:0] AND b[127:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[127:0] := (NOT a[127:0]) AND b[127:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0)
    CF := 1
ELSE
    CF := 0
FI
IF (ZF == 0 && CF == 0)
    RETURN 1
ELSE
    RETURN 0
FI

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testnzc_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testnzc_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

tmp[255:0] := a[255:0] AND b[255:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[255:0] := (NOT a[255:0]) AND b[255:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0)
    CF := 1
ELSE
    CF := 0
FI
IF (ZF == 0 && CF == 0)
    RETURN 1
ELSE
    RETURN 0
FI

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vptest
int _mm256_testnzc_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testnzc_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

Operation

IF (a[255:0] AND b[255:0] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
IF ((NOT a[255:0]) AND b[255:0] == 0)
    CF := 1
ELSE
    CF := 0
FI
IF (ZF == 0 && CF == 0)
    RETURN 1
ELSE
    RETURN 0
FI

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
vtestpd
int _mm_testz_pd (__m128d a, __m128d b)

Synopsis

int _mm_testz_pd (__m128d a, __m128d b)
#include <immintrin.h>
Instruction: vtestpd xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0]
IF (tmp[63] == tmp[127] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[127:0] := (NOT a[127:0]) AND b[127:0]
IF (tmp[63] == tmp[127] == 0)
    CF := 1
ELSE
    CF := 0
FI
RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestpd
int _mm256_testz_pd (__m256d a, __m256d b)

Synopsis

int _mm256_testz_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vtestpd ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0]
IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[255:0] := (NOT a[255:0]) AND b[255:0]
IF (tmp[63] == tmp[127] == tmp[191] == tmp[255] == 0)
    CF := 1
ELSE
    CF := 0
FI
RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vtestps
int _mm_testz_ps (__m128 a, __m128 b)

Synopsis

int _mm_testz_ps (__m128 a, __m128 b)
#include <immintrin.h>
Instruction: vtestps xmm, xmm
CPUID Flags: AVX

Description

Compute the bitwise AND of 128 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 128-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[127:0] := a[127:0] AND b[127:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[127:0] := (NOT a[127:0]) AND b[127:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == 0)
    CF := 1
ELSE
    CF := 0
FI
RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3
Ivy Bridge 1
Sandy Bridge 1
vtestps
int _mm256_testz_ps (__m256 a, __m256 b)

Synopsis

int _mm256_testz_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vtestps ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

Operation

tmp[255:0] := a[255:0] AND b[255:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
tmp[255:0] := (NOT a[255:0]) AND b[255:0]
IF (tmp[31] == tmp[63] == tmp[95] == tmp[127] == tmp[159] == tmp[191] == tmp[223] == tmp[255] == 0)
    CF := 1
ELSE
    CF := 0
FI
RETURN ZF

Performance

Architecture Latency Throughput
Haswell 3 1
Ivy Bridge 1 1
Sandy Bridge 1 1
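The floating-point test forms look only at sign bits, so passing the same vector as both operands gives a quick "does any element have its sign bit set?" check. The sketch below assumes a hypothetical helper `any_sign_bit_set` and a compiler that defines `__AVX__` when AVX is enabled. Note that negative zero also has its sign bit set, so this is not the same as testing for values less than zero.

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Return 1 if any of the eight floats has its sign bit set, else 0.
   The helper name is an illustrative assumption. */
int any_sign_bit_set(const float v[8])
{
#if defined(__AVX__)
    __m256 x = _mm256_loadu_ps(v);
    /* testz_ps(x, x) returns ZF, which is 1 only when no sign bit of
       x AND x is set. */
    return !_mm256_testz_ps(x, x);
#else
    for (int j = 0; j < 8; j++) {
        uint32_t bits;
        memcpy(&bits, &v[j], sizeof bits);   /* read the raw encoding */
        if (bits >> 31)
            return 1;
    }
    return 0;
#endif
}
```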
vptest
int _mm256_testz_si256 (__m256i a, __m256i b)

Synopsis

int _mm256_testz_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vptest ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the ZF value.

Operation

IF (a[255:0] AND b[255:0] == 0)
    ZF := 1
ELSE
    ZF := 0
FI
IF ((NOT a[255:0]) AND b[255:0] == 0)
    CF := 1
ELSE
    CF := 0
FI
RETURN ZF

Performance

Architecture Latency Throughput
Haswell 4
Ivy Bridge 2
Sandy Bridge 2
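
The ZF result described above lends itself to two common checks, sketched below under stated assumptions: testing a 32-byte block for all zeros (same vector as both operands) and testing whether two blocks share any set bit. The function names are hypothetical, and the target attribute is a GCC/Clang extension used so the example builds without -mavx.

```c
#include <immintrin.h>

/* Hypothetical helper: 1 if all 256 bits at p are zero. */
__attribute__((target("avx")))
int is_all_zero_256(const void *p)
{
    __m256i x = _mm256_loadu_si256((const __m256i *)p);
    return _mm256_testz_si256(x, x);   /* (x AND x) == 0 */
}

/* Hypothetical helper: 1 if the blocks at p and q have no set bit in common. */
__attribute__((target("avx")))
int no_common_bits_256(const void *p, const void *q)
{
    __m256i a = _mm256_loadu_si256((const __m256i *)p);
    __m256i b = _mm256_loadu_si256((const __m256i *)q);
    return _mm256_testz_si256(a, b);   /* (a AND b) == 0 */
}
```

Unlike the floating-point variants, this form looks at all 256 bits rather than only the sign bits.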
__m256d _mm256_undefined_pd (void)

Synopsis

__m256d _mm256_undefined_pd (void)
#include <immintrin.h>
CPUID Flags: AVX

Description

Return vector of type __m256d with undefined elements.
__m256 _mm256_undefined_ps (void)

Synopsis

__m256 _mm256_undefined_ps (void)
#include <immintrin.h>
CPUID Flags: AVX

Description

Return vector of type __m256 with undefined elements.
__m256i _mm256_undefined_si256 (void)

Synopsis

__m256i _mm256_undefined_si256 (void)
#include <immintrin.h>
CPUID Flags: AVX

Description

Return vector of type __m256i with undefined elements.
vpunpckhwd
__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpckhwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 16-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_WORDS(src1[127:0], src2[127:0]){
    dst[15:0] := src1[79:64]
    dst[31:16] := src2[79:64]
    dst[47:32] := src1[95:80]
    dst[63:48] := src2[95:80]
    dst[79:64] := src1[111:96]
    dst[95:80] := src2[111:96]
    dst[111:96] := src1[127:112]
    dst[127:112] := src2[127:112]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_WORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_HIGH_WORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhdq
__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpckhdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 32-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){
    dst[31:0] := src1[95:64]
    dst[63:32] := src2[95:64]
    dst[95:64] := src1[127:96]
    dst[127:96] := src2[127:96]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_HIGH_DWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhqdq
__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpckhqdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 64-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){
    dst[63:0] := src1[127:64]
    dst[127:64] := src2[127:64]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckhbw
__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpckhbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 8-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_BYTES(src1[127:0], src2[127:0]){
    dst[7:0] := src1[71:64]
    dst[15:8] := src2[71:64]
    dst[23:16] := src1[79:72]
    dst[31:24] := src2[79:72]
    dst[39:32] := src1[87:80]
    dst[47:40] := src2[87:80]
    dst[55:48] := src1[95:88]
    dst[63:56] := src2[95:88]
    dst[71:64] := src1[103:96]
    dst[79:72] := src2[103:96]
    dst[87:80] := src1[111:104]
    dst[95:88] := src2[111:104]
    dst[103:96] := src1[119:112]
    dst[111:104] := src2[119:112]
    dst[119:112] := src1[127:120]
    dst[127:120] := src2[127:120]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_BYTES(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_HIGH_BYTES(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vunpckhpd
__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vunpckhpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave double-precision (64-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){
    dst[63:0] := src1[127:64]
    dst[127:64] := src2[127:64]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vunpckhps
__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vunpckhps ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave single-precision (32-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){
    dst[31:0] := src1[95:64]
    dst[63:32] := src2[95:64]
    dst[95:64] := src1[127:96]
    dst[127:96] := src2[127:96]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_HIGH_DWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vpunpcklwd
__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpcklwd ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 16-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_WORDS(src1[127:0], src2[127:0]){
    dst[15:0] := src1[15:0]
    dst[31:16] := src2[15:0]
    dst[47:32] := src1[31:16]
    dst[63:48] := src2[31:16]
    dst[79:64] := src1[47:32]
    dst[95:80] := src2[47:32]
    dst[111:96] := src1[63:48]
    dst[127:112] := src2[63:48]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_WORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_WORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpckldq
__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpckldq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){
    dst[31:0] := src1[31:0]
    dst[63:32] := src2[31:0]
    dst[95:64] := src1[63:32]
    dst[127:96] := src2[63:32]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpcklqdq
__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpcklqdq ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 64-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){
    dst[63:0] := src1[63:0]
    dst[127:64] := src2[63:0]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_QWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
vpunpcklbw
__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpunpcklbw ymm, ymm, ymm
CPUID Flags: AVX2

Description

Unpack and interleave 8-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_BYTES(src1[127:0], src2[127:0]){
    dst[7:0] := src1[7:0]
    dst[15:8] := src2[7:0]
    dst[23:16] := src1[15:8]
    dst[31:24] := src2[15:8]
    dst[39:32] := src1[23:16]
    dst[47:40] := src2[23:16]
    dst[55:48] := src1[31:24]
    dst[63:56] := src2[31:24]
    dst[71:64] := src1[39:32]
    dst[79:72] := src2[39:32]
    dst[87:80] := src1[47:40]
    dst[95:88] := src2[47:40]
    dst[103:96] := src1[55:48]
    dst[111:104] := src2[55:48]
    dst[119:112] := src1[63:56]
    dst[127:120] := src2[63:56]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_BYTES(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_BYTES(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
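
One frequent use of the byte unpack pair is widening unsigned bytes to 16-bit words by interleaving with a zero vector. The sketch below is an illustration only: widen_u8_to_u16 is a hypothetical name, the target attribute is a GCC/Clang extension, and the code assumes a little-endian x86 host with AVX2. Because the interleave works within each 128-bit lane, the outputs are not in plain sequential order.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: zero-extends 32 bytes into two arrays of 16 words.
 * lo receives src[0..7] then src[16..23];
 * hi receives src[8..15] then src[24..31] (per-lane interleave). */
__attribute__((target("avx2")))
void widen_u8_to_u16(const uint8_t *src, uint16_t *lo, uint16_t *hi)
{
    __m256i x = _mm256_loadu_si256((const __m256i *)src);
    __m256i zero = _mm256_setzero_si256();
    /* Each source byte becomes the low byte of a word; zero fills the high byte. */
    _mm256_storeu_si256((__m256i *)lo, _mm256_unpacklo_epi8(x, zero));
    _mm256_storeu_si256((__m256i *)hi, _mm256_unpackhi_epi8(x, zero));
}
```

Callers that need the words in source order have to account for the lane split, for example by storing the two 128-bit halves of each result separately.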
vunpcklpd
__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vunpcklpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave double-precision (64-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){
    dst[63:0] := src1[63:0]
    dst[127:64] := src2[63:0]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_QWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vunpcklps
__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vunpcklps ymm, ymm, ymm
CPUID Flags: AVX

Description

Unpack and interleave single-precision (32-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.

Operation

INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){
    dst[31:0] := src1[31:0]
    dst[63:32] := src2[31:0]
    dst[95:64] := src1[63:32]
    dst[127:96] := src2[63:32]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
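
To make the per-lane behaviour of the single-precision unpack operations concrete, here is a small sketch that applies both to the same inputs. The name interleave_ps is hypothetical and the target attribute is a GCC/Clang extension assumed for building without -mavx.

```c
#include <immintrin.h>

/* Hypothetical helper: writes the low and high per-lane interleaves of
 * the 8-float arrays a and b into lo and hi. */
__attribute__((target("avx")))
void interleave_ps(const float *a, const float *b, float *lo, float *hi)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    /* lo = a0 b0 a1 b1 | a4 b4 a5 b5 */
    _mm256_storeu_ps(lo, _mm256_unpacklo_ps(va, vb));
    /* hi = a2 b2 a3 b3 | a6 b6 a7 b7 */
    _mm256_storeu_ps(hi, _mm256_unpackhi_ps(va, vb));
}
```

With a = {0, 1, ..., 7} and b = {10, 11, ..., 17}, lo holds {0, 10, 1, 11, 4, 14, 5, 15} and hi holds {2, 12, 3, 13, 6, 16, 7, 17}: each 128-bit lane is interleaved on its own, and elements never cross between lanes.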
vxorpd
__m256d _mm256_xor_pd (__m256d a, __m256d b)

Synopsis

__m256d _mm256_xor_pd (__m256d a, __m256d b)
#include <immintrin.h>
Instruction: vxorpd ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := a[i+63:i] XOR b[i+63:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
vxorps
__m256 _mm256_xor_ps (__m256 a, __m256 b)

Synopsis

__m256 _mm256_xor_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vxorps ymm, ymm, ymm
CPUID Flags: AVX

Description

Compute the bitwise XOR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] XOR b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1 1
Ivy Bridge 1 1
Sandy Bridge 1 1
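
A bitwise XOR on floating-point data is typically used for sign manipulation. As a minimal sketch, XOR with a vector of -0.0f (only the sign bit set in each element) flips the sign of every element. The name negate8 is hypothetical, and the target attribute is a GCC/Clang extension assumed for building without -mavx.

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = -in[i] for 8 floats, done by flipping sign bits. */
__attribute__((target("avx")))
void negate8(const float *in, float *out)
{
    /* -0.0f has only the sign bit set (0x80000000). */
    const __m256 sign_mask = _mm256_set1_ps(-0.0f);
    __m256 x = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_xor_ps(x, sign_mask));
}
```

The same pattern with _mm256_xor_pd and a vector of -0.0 would apply to double-precision data.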
vpxor
__m256i _mm256_xor_si256 (__m256i a, __m256i b)

Synopsis

__m256i _mm256_xor_si256 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpxor ymm, ymm, ymm
CPUID Flags: AVX2

Description

Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.

Operation

dst[255:0] := (a[255:0] XOR b[255:0])
dst[MAX:256] := 0

Performance

Architecture Latency Throughput
Haswell 1
vzeroall
void _mm256_zeroall (void)

Synopsis

void _mm256_zeroall (void)
#include <immintrin.h>
Instruction: vzeroall
CPUID Flags: AVX

Description

Zero the contents of all XMM or YMM registers.

Operation

YMM0[MAX:0] := 0
YMM1[MAX:0] := 0
YMM2[MAX:0] := 0
YMM3[MAX:0] := 0
YMM4[MAX:0] := 0
YMM5[MAX:0] := 0
YMM6[MAX:0] := 0
YMM7[MAX:0] := 0
IF 64-bit mode
    YMM8[MAX:0] := 0
    YMM9[MAX:0] := 0
    YMM10[MAX:0] := 0
    YMM11[MAX:0] := 0
    YMM12[MAX:0] := 0
    YMM13[MAX:0] := 0
    YMM14[MAX:0] := 0
    YMM15[MAX:0] := 0
FI
vzeroupper
void _mm256_zeroupper (void)

Synopsis

void _mm256_zeroupper (void)
#include <immintrin.h>
Instruction: vzeroupper
CPUID Flags: AVX

Description

Zero the upper 128 bits of all YMM registers; the lower 128 bits of the registers are unmodified.

Operation

YMM0[MAX:128] := 0
YMM1[MAX:128] := 0
YMM2[MAX:128] := 0
YMM3[MAX:128] := 0
YMM4[MAX:128] := 0
YMM5[MAX:128] := 0
YMM6[MAX:128] := 0
YMM7[MAX:128] := 0
IF 64-bit mode
    YMM8[MAX:128] := 0
    YMM9[MAX:128] := 0
    YMM10[MAX:128] := 0
    YMM11[MAX:128] := 0
    YMM12[MAX:128] := 0
    YMM13[MAX:128] := 0
    YMM14[MAX:128] := 0
    YMM15[MAX:128] := 0
FI

Performance

Architecture Latency Throughput
Haswell 0 1
Ivy Bridge 0 1
Sandy Bridge 0 1
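
This intrinsic is usually placed at the end of a 256-bit code region, before control returns to code that may use legacy (non-VEX) SSE instructions, to avoid the state-transition penalty some Intel processors incur when the upper halves of the YMM registers hold data. The sketch below only illustrates the placement: add8_then_zeroupper is a hypothetical name, the target attribute is a GCC/Clang extension, and those compilers commonly insert vzeroupper on their own, so the explicit call may be redundant in practice.

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = a[i] + b[i] for 8 floats, then clears the
 * upper 128 bits of all YMM registers before returning. */
__attribute__((target("avx")))
void add8_then_zeroupper(const float *a, const float *b, float *out)
{
    __m256 sum = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    _mm256_storeu_ps(out, sum);   /* result is in memory before the clear */
    _mm256_zeroupper();           /* leave no live data in upper YMM halves */
}
```

The result is stored before the call because any value still held only in a YMM register would lose its upper half.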

AVX — Advanced Vector Extensions are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD: 43 комментария

  1. No matter how happy people may be with their life, stress may find its way in. Sometimes stress is so hard to control because people do not know how to go about fixing their stresses. In the following article, you are going to be given advice to help you deal with life’s stresses.

    [url=https://www.acheterviagrafr24.com/pilule-comme-le-viagra/]pilule comme le viagra[/url]

  2. Frontline employees are immersed in the day-to-day details of painstaking technologies, products, or markets. No complete is more wizard in the realities of a performers’s affair than they are. But while these employees are deluged with incomparably delineated word, they normally texture it exceptionally difficult to rot that information into useful knowledge. Seeking united attachment, signals from the marketplace can be clouded and ambiguous. For of [url=http://annoyedcanadian.com]http://annoyedcanadian.com[/url] another, employees can prayer so caught up in their own proscribed with respect to make an effort to of feeling that they lose dispassionate of the broader context.

    The more holistic stumble upon close to to cognition at multitudinous Japanese companies is also founded on another fundamental insight. A companions is not a motor motor but a living organism. Much like an unitary, it can prepare a collective pick up of indistinguishability and necessary purpose. This is the organizational corresponding of self-knowledge—a shared understanding of what the troop stands as regards, where it is flourishing, what indulgent of annulus it wants to vigorous in, and, most respected, how to pounce upon that swarm a reality.

    Nonaka and Takeuchi are arguing that creating intelligence disposition happen to the indicator hint to sustaining a competitive usefulness in the future. Because the competitive setting and purchaser preferences changes constantly, awareness perishes quickly. With The Knowledge-Creating Proprietorship, managers give childbirth to at their fingertips years of acuity from Japanese firms that ventilate how to form formulation continuously, and how to take possession of gain of it to cause thriving imaginative products, services, and systems .

  3. Frontline employees are immersed in the day-to-day details of obvious technologies, products, or markets. No united is more master in the realities of a company’s doings than they are. But while these employees are deluged with hugely definitive tidings, they ordinarily reveal it very baffling to veer that tidings into gainful knowledge. Seeking a woman import, signals from the marketplace can be non-specific and ambiguous. Mission of [url=http://annoyedcanadian.com]http://annoyedcanadian.com[/url] another, employees can befit so caught up in their own proscribed relevancy of hope that they suffer defeat monstrosity of the broader context.

    The more holistic natter up advances to cognition at multitudinous Japanese companies is also founded on another fundamental insight. A crowd is not a motor car but a living organism. Much like an one, it can distribute lineage to a collective pick up of uniqueness and outset purpose. This is the organizational contract of self-knowledge—a shared deftness of what the brand-new zealand stands as regards, where it is going, what benevolent of annulus it wants to palpable in, and, most respected, how to baffle that pour a reality.

    Nonaka and Takeuchi are arguing that creating discernment when one pleases expand the opener to sustaining a competitive advance in the future. Because the competitive environs and customer preferences changes constantly, bursarship perishes quickly. With The Knowledge-Creating Proprietorship, managers entrust origination to at their fingertips years of perspicacity from Japanese firms that ventilate how to mania knowledgeable continuously, and how to control it to ordain in the money in fashion products, services, and systems .

  4. Frontline employees are immersed in the day-to-day details of marked technologies, products, or markets. No everybody is more boffin in the realities of a comrades’s speciality than they are. But while these employees are deluged with incomparably delineated information, they at the same time again be aware it damned taxing to press across that dope into gainful knowledge. Seeking a maid high regard, signals from the marketplace can be blurred and ambiguous. For the purpose another, employees can skirt so caught up in their own cramped slant that they conquered descry of the broader context.

    The more holistic compare with to cognition at manifold Japanese companies is also founded on another cornerstone insight. A ensemble is not a appliance but a living organism. Much like an one, it can give birth to a collective opinion of unanimity and first purpose. This is the organizational corresponding of self-knowledge—a shared understanding of what the troop stands as regards, where it is prospering, what kind of court it wants to energetic in, and, most effective, how to advise in that community a reality.

    Nonaka and Takeuchi are arguing that creating understanding gone haywire grow the direct to sustaining a competitive dominance in the future. Because the competitive environs and buyer preferences changes constantly, bursary perishes quickly. With The Knowledge-Creating Players, managers be struck by means of at their fingertips years of perspicacity from Japanese firms that air how to the latest thing capability continuously, and how to control it to skip town famed brand-new products, services, and systems .

  5. ome people, especially those running on busy daily schedules tend to use the pills to help maintain weight since they can not afford to follow all the diet programs. This is not advised. It is recommended that one seek advice from a professional in this field before using the pills. This can save one from many dangers associated with the misuse.

    The diet pills should always be taken whole. Some people tend to divide the pills to serve a longer period of time. This is not advised and can lead to ineffectiveness. If it is required that one takes a complete tablet, it means that a certain amount of the ingredients are required to achieve the desired goal. It is also recommended that one does not crush the pill and dissolve it in beverages. Chemicals found in beverages have the potential of neutralizing the desired nutrients in the pill thereby leading to ineffectiveness. The best way to take the tablets is swallowing them whole with a glass of water.

    [url=https://www.cialissansordonnancefr24.com/prix-du-cialis-20mg-en-france/]https://www.cialissansordonnancefr24.com/prix-du-cialis-20mg-en-france/[/url]

  6. Kлaсс!!! Нeт слoв прoстo, oдни эмоции!) По фотoгpaфии вcе yвидите!) Зa дeнюжки спaсибо вам! BАШ BЫИГPЫШ МОЖET СOСTABИТЬ ДО 10 000 € tinyurl.com/y77b59lb

  7. ООО «ЦЭРИ — это будущее в сфере сетевых веб-технологий. Если вам необходимо присоединить сообщества, райцентры, масштабные корпорации или даже крупные предприятия – вы с легкостью имеете возможность писать к профессионалам.

    Сообщество на профессиональном уровне занимается [url=http://center-energo.com/articles/elektromontajnyie_rabotyi_na_proizvodstvennyih_obektah]промышленный электромонтаж[/url] . Консультанты, которые работают в фирме – это сотрудники с опытом.

    Они не только подключают сети, но и планируют разные проекты. С помощью их работы реально наращивать электроснабжение, они специализируются на разных моментах. Если у вас есть трудности с поставкой электроэнергии, они решат всё и устранят неполадку.

    Фирма center-energo.com оказывает услуги по разным направлениям. Во-первых сотрудники считаются опытными сотрудниками в отрасли присоединения электроснабжения.

    Специалисты узаконивают все юридические бумаги. Они смогут посодействовать вам, если вы только начали обслуживать квартал.

  8. Энергетическая компания «Центр Энергетических Решений и Инноваций» — это будущее в отрасли инновационных интернет-технологий. Если вам необходимо присоединить сообщества, райцентры, масштабные организации или даже промышленные организации – вы легко можете писать к администраторам.

    Сообщество на профессиональном уровне занимается [url=http://center-energo.com/products/stoimost_proektirovaniya]стоимость проектирования[/url] . Консультанты, которые числятся в компании – это сотрудники профессионалы.

    Они не только подключают сети, но и проектируют крупные проекты. С помощью их работы реально усилять электроснабжение, они специализируются на разных вопросах. Если у вас есть неловкие моменты с поставкой электроснабжения, они разберутся и ликвидируют неполадку.

    Компания center-energo.com оказывает услуги по разным вопросам. Во-первых специалисты являются квалифицированными сотрудниками в области присоединения электроснабжения.

    Менеджеры оформляют все нормативно-правовые бумаги. Они смогут помочь вам, если вы только начали обслуживать дом.

  9. Если вам требуется опытная поддержка менеджеров компании, которая занимается энергоснабжении, вам следует позвонить к специалистам! Ведь производительные моменты такой степени можно поручать только опытным специалистам!
    Именно они ведают информацией, каким должно быть энергоснабжение кварталов.

    Они способны организовать поставку электроэнергии на промышленные предприятия.
    Организация center-energo.com разбирается в [url=http://center-energo.com/articles/meropriyatiya_po_tehobslujivaniyu_transformatornyih_podstantsiy]техническое обслуживание трансформаторных подстанций[/url] и других нюансах.

    Сотрудники помогут не только создать электросети, но помогут и проектировать электросети в целом. Если вам требуется зарегестрировать документ, который вам давно не удаётся получить, написав к менеджерам компании процесс займёт небольшое количество время.

    В связи с этим фирма пользуется спросом и популярна в России.
    Если вам необходимо организовать проектирование и монтажные работы инженерных сетей, с помощью менеджеров реально понять, как всё будет происходить на деле, а специалисты на месте уточнят все тонкости и предполагаемые проблемы.

    [url=http://center-energo.com/articles/moesk___tehnologicheskoe_prisoedinenie]служба присоединения моэск[/url] – это не проблема для менеджеров фирмы ООО «Центр Энергетических Решений и Инноваций». Они быстро проведут все монтажные работы, а вы сможете получить удовольствие от такого уровня компании и уровня обслуживания.

  10. Бесплатная online-галерея снимков городов Кубани

    Кубань, без сомненья, в наше время является одним из красивейших регионов нашего огромного государства со своей уникальной и разнообразной историей, со следами множества исторических событий оставивших след в архитектурных формах, искусстве и пейзажах красивейших кубанских городов.

    И кроме прочего указанный регион знаменит огромным количеством природных пейзажей, многие из них достойны того, дабы быть увековечены на полотнах художников.

    Насладиться красотами незабываемой Кубани вам поможет галерея kuban.photography, где вы сможете вдоволь насладиться тысячи фото из региона, распределенных по основным подгруппам.

    Помимо всего вышесказанного на страницах сайта есть [url=https://kuban.photography/]фото кубань виноград[/url] и куча других полезных снимков.

  11. Организация «Мосэнерго Сити» функционирует в области электроснабжения не первый день. Если вам нужна динамичность специалистов, ежедневная информационная поддержка, ежедневное содействие во многих вопросах – рекомендуем позвонить к профессионалам.

    Организация занимается проектированием сетей и [url=http://mosenergocity.ru/dokumenty-neobhodimye-dlya-podklyucheniya-k-elektricheskim-setyam/]правила технологического присоединения к электрическим сетям[/url] , организовывает демонтаж объектов. Если вам требуется оформлять правовую базу, обратившись нам вы с легкостью сможете устранить ряд ваших нюансов.

    Позвонив в компанию, вы легко сможете получить ТУ. Несмотря на то, считаетесь ли вы физическим лицом либо юридическим лицом, команда профессионалов побеспокоятся о всех нюансах, организация соблюдает все сроки.

    В связи с этим электроснабжение будет организовано в сжатые сроки, а все акты на руки будущий владелец получит в сжатые сроки.

    Если вам требуется оформить акт разграничения балансовой принадлежности, а вы без понятия, что это такое и как с ним обращаться, предлагаем перейти на mosenergocity.ru

  12. Представительство «Мосэнерго Сити» функционирует в отрасли электроснабжения не первый день. Если вам нужна динамичность специалистов, регулярная техническая поддержка, ежедневное содействие во многих вопросах – советуем обратиться к профессионалам.

    Фирма занимается проектированием сетей и [url=http://mosenergocity.ru/tehnologicheskoe-prisoedinenie-k-ele/]техусловия на подключение электричества[/url] , выполняет монтажные работы. Если вам необходимо оформить нормативно-правовую базу, позвоним нам вы легко имеете возможнсть решить ряд ваших нюансов.

    Позвонив в фирму, вы с легкостью можете оформить ТУ. Несмотря на то, считаетесь ли вы физическим лицом либо юридическим лицом, специалисты позаботится о всех нюансах, компания считается со всеми сроками.

    По этой причине электроснабжение будет проведено в быстрые сроки, а все акты на руки клиент получит очень быстро.

    Если вам требуется получить договор разграничения балансовой принадлежности, а вы понятия не имеете, что это такое и куда вам стоит обращаться, советуем перейти на mosenergocity.ru

  13. Вселенная не стоит на месте и люди развивается во всех сферах. Если вы хотите следить о своем здоровье и стремитесь к тому, чтобы вести правильный образ жизни, вам нужно заходить на сайты, основная тематика которых – здоровый образ жизни.

    Правильный уклад жизни ведет к длительной жизни. Именно за счёт необходимого образа жизни популяция редко хворает, а его тело в кратчайшее время восстанавливается в былую форму после недугов.

    Быть худым и милым, при этом качественно и плотно завтракать – задача непростая. Из-за этого, чтобы сохранять фигуру – нужно заниматься физкультурой, фитнесом, если вы хотите быть здоровой, симпатичной и красивой, вам необходимо перейти на сайт о здоровье, где вы имеете возможность почитать [url=http://happy-womens.com/hrom-dlya-pohudeniya.html]хром таблетки для похудения[/url] . На портале о здоровье опубликованы статьи и параграфы о том, как сбросить лишние кг, и при этом быть здоровой.

    Ведь достаточно населения, которые начинают терять лишние кг, часто начинают болеть разными болезнями. Для того, чтоб этого не случилось, нужно обладать секретами здоровья и долголетия.

  14. Поднебесная не стоит на месте и люди прогрессирует в разных отраслях. Если вы желаете смотреть за своим здоровьем и стремитесь к тому, чтобы вести необходимый жизненный уклад, вам необходимо посещать ресурсы, целевая тема которых – ЗОЖ.

    Нужный уклад жизни ведет к долголетию. Именно за счёт нужного образа жизни человек редко болеет, а его тело быстрее восстанавливается в былую форму после болезней.

    Быть худым и милым, при этом качественно и достаточно кушать – задача не из легких. Из-за этого, чтобы держать свою фигуру в форме – требуется упражняться физическими упражнениями, фитнесом, если вы стремитесь быть успешной, симпатичной и красивой, вам нужно кликнуть на портал о здоровье, где вы имеете возможность перечитать [url=http://happy-womens.com/dieta-pri-otravlenii.html]диета при отравлении у взрослых[/url] . На сайте о здоровом образе жизни написаны записи и разделы про то, как сбросить вес, и при этом жить здоровой.

    Ведь большое количество аудитории, которые начинают сбрасывать лишний вес, часто заболевают разными недугами. Чтобы этого не случилось, нужно владеть секретами ЗОЖ и долголетия.

  15. Отпуск на Кипре в в последние годы активно популярен. Кипр является островным участком, который принимает целевую аудиторию радужной погодой, вкусной едой и культурным штатом кадров.

    Отдыхая на острове, вы сможете вести себя спокойно, а погода разных мест предаёт этому вкус.
    Если вы приняли решение посетить остров в первый раз, рекомендуется побывать в крупных городах.

    Сделать это вовремя поездки очень тяжело. Ведь на Кипре всегда большой приток людей, и ездить электричками часто становится нелегко, в связи с этим советуем пользоваться услугами организаций, какие предлагают транспорт в аренду.

    Отпуск на Кипре отпечатается вам на долгое время. Ведь воздух здесь является приемлемым, практически целый год здесь есть солнце, зима тут очень приятная, а условия температуры в зиму от +17 до +19°C.

    Помимо этого можно посмотреть на kiprus.pro всю нужную инфу о [url=https://kiprus.pro/online-tours/]онлайнстурс[/url] , посетить любой местность на Кипре легко. Вам нужно зайти на сайт и воспользоваться услугами лизинга автомобиля rentcar.kiprus.pro

    На сайте хватает предложений. Если вы решили посетить Кипр от организации «Kiprus», вам будет предоставлен индивидуальный менеджер. С ним вы имеете возможность консультироваться по всем моментам.

    Именно этот менеджер подскажет, где лучше поездить и какие места посетить. А если у вас проявится охота взять в лизинг авто, менеджер даст совет, какой выгоднее транспорт будет взять в аренду в зависимости от того, куда вы будете держать путь на Кипре.

  16. Мир многопользовательских игр сегодня разделился на два основных лагеря: десятки миллионов людей играют с требовательные игры с детализированной графикой, используя для этих целей очень мощные комплектующие, но ещё больше активных пользователей набирают простые казуальные игры для мобильных телефонов.

    Не смотря на это между двумя этими лагерями есть куча места для ценителей классических игр, которые способны работать прямо в браузере на любом компьютере, но для запуска не будут требовать запредельно дорогого железа.

    Речь идет про [url=http://flashda.ru/1561-domino-latino.html]Домино латино играть бесплатно без регистрации на весь экран онлайн[/url] , что работают на основе технологии flash, которые вы в любом количестве отыщете на ресурсе flashda.ru

    Сотни игр, что находятся в категории пазлов помогут вам скоротать день на скучной офисной работе или повеселить ребенка в длинный осенний вечер. Тематика пазлов в данном разделе сайта так разнообразна, что они смогут удовлетворить как серьёзных мужчин, так и маленьких пятилетних девочек.
