AVX-512 Extension Proposals

This article presents proposals for possible future AVX-512 extensions that would help with either general use cases or very specialized ones. These extensions do not currently exist. Each section names a possible extension, lists the instructions it would provide, and shows example code that uses them.

AVX512_VPMOVH (General Purpose)

Widening moves of the high part of a register:

VPMOVHSXBW xmm1 {k1}{z}, xmm2 ; sign-extend high 8-bit integers to 16-bit integers
VPMOVHSXBW ymm1 {k1}{z}, ymm2 ; sign-extend high 8-bit integers to 16-bit integers
VPMOVHSXBW zmm1 {k1}{z}, zmm2 ; sign-extend high 8-bit integers to 16-bit integers
VPMOVHSXWD xmm1 {k1}{z}, xmm2 ; sign-extend high 16-bit integers to 32-bit integers
VPMOVHSXWD ymm1 {k1}{z}, ymm2 ; sign-extend high 16-bit integers to 32-bit integers
VPMOVHSXWD zmm1 {k1}{z}, zmm2 ; sign-extend high 16-bit integers to 32-bit integers
VPMOVHSXDQ xmm1 {k1}{z}, xmm2 ; sign-extend high 32-bit integers to 64-bit integers
VPMOVHSXDQ ymm1 {k1}{z}, ymm2 ; sign-extend high 32-bit integers to 64-bit integers
VPMOVHSXDQ zmm1 {k1}{z}, zmm2 ; sign-extend high 32-bit integers to 64-bit integers

VPMOVHZXBW xmm1 {k1}{z}, xmm2 ; zero-extend high 8-bit integers to 16-bit integers
VPMOVHZXBW ymm1 {k1}{z}, ymm2 ; zero-extend high 8-bit integers to 16-bit integers
VPMOVHZXBW zmm1 {k1}{z}, zmm2 ; zero-extend high 8-bit integers to 16-bit integers
VPMOVHZXWD xmm1 {k1}{z}, xmm2 ; zero-extend high 16-bit integers to 32-bit integers
VPMOVHZXWD ymm1 {k1}{z}, ymm2 ; zero-extend high 16-bit integers to 32-bit integers
VPMOVHZXWD zmm1 {k1}{z}, zmm2 ; zero-extend high 16-bit integers to 32-bit integers
VPMOVHZXDQ xmm1 {k1}{z}, xmm2 ; zero-extend high 32-bit integers to 64-bit integers
VPMOVHZXDQ ymm1 {k1}{z}, ymm2 ; zero-extend high 32-bit integers to 64-bit integers
VPMOVHZXDQ zmm1 {k1}{z}, zmm2 ; zero-extend high 32-bit integers to 64-bit integers

These instructions would complement the existing integer-widening instructions. SSE4.1 initially introduced widening moves (from 8-bit, 16-bit, and 32-bit quantities), and AVX2 and AVX-512 later extended them. However, these widening instructions only widen the low part of the input register, so any data in the high part must first be extracted into a separate register. That extraction costs one extra instruction, and such instructions usually fall into the permutation category, which is costly in terms of latency.

; Widening 8-bit integers in zmm2 into 16-bit integers in zmm0 and zmm1 in the current AVX-512:
VEXTRACTI64X4 ymm1, zmm2, 1 ; Extract the high half of the input register
VPMOVZXBW zmm0, ymm2     ; Widen the low part of zmm2
VPMOVZXBW zmm1, ymm1     ; Widen the extracted high part (ymm1) of zmm2

; Widening 8-bit integers in zmm2 into 16-bit integers in zmm0 and zmm1 with the proposed extension:
VPMOVZXBW zmm0, ymm2     ; Widen the low part of zmm2
VPMOVHZXBW zmm1, zmm2    ; Widen the high part of zmm2
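
For comparison in C, here is a minimal sketch of the same current-ISA sequence using existing AVX-512 intrinsics (only the status quo can be expressed, as the proposed instruction has no intrinsic; the helper name is mine):

#include <immintrin.h>

// Widen 64 unsigned bytes in `src` into two vectors of 16-bit integers.
// The high half must first be extracted (a lane-crossing operation)
// before it can be widened - exactly the extra step VPMOVHZXBW removes.
static inline void widen_u8_to_u16(__m512i src, __m512i *lo, __m512i *hi) {
  __m256i src_lo = _mm512_castsi512_si256(src);       // low 32 bytes (free)
  __m256i src_hi = _mm512_extracti64x4_epi64(src, 1); // high 32 bytes (permute)
  *lo = _mm512_cvtepu8_epi16(src_lo);                 // VPMOVZXBW
  *hi = _mm512_cvtepu8_epi16(src_hi);                 // VPMOVZXBW
}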

Narrowing moves of two registers into one:

VPMOV2WB xmm1 {k1}{z}, xmm2, xmm3 ; truncate 16-bit integers into 8-bit integers (two-source version)
VPMOV2WB ymm1 {k1}{z}, ymm2, ymm3 ; truncate 16-bit integers into 8-bit integers (two-source version)
VPMOV2WB zmm1 {k1}{z}, zmm2, zmm3 ; truncate 16-bit integers into 8-bit integers (two-source version)
VPMOV2DW xmm1 {k1}{z}, xmm2, xmm3 ; truncate 32-bit integers into 16-bit integers (two-source version)
VPMOV2DW ymm1 {k1}{z}, ymm2, ymm3 ; truncate 32-bit integers into 16-bit integers (two-source version)
VPMOV2DW zmm1 {k1}{z}, zmm2, zmm3 ; truncate 32-bit integers into 16-bit integers (two-source version)
VPMOV2QD xmm1 {k1}{z}, xmm2, xmm3 ; truncate 64-bit integers into 32-bit integers (two-source version)
VPMOV2QD ymm1 {k1}{z}, ymm2, ymm3 ; truncate 64-bit integers into 32-bit integers (two-source version)
VPMOV2QD zmm1 {k1}{z}, zmm2, zmm3 ; truncate 64-bit integers into 32-bit integers (two-source version)

VPMOV2SWB xmm1 {k1}{z}, xmm2, xmm3 ; saturate signed 16-bit integers into signed 8-bit integers (two-source version)
VPMOV2SWB ymm1 {k1}{z}, ymm2, ymm3 ; saturate signed 16-bit integers into signed 8-bit integers (two-source version)
VPMOV2SWB zmm1 {k1}{z}, zmm2, zmm3 ; saturate signed 16-bit integers into signed 8-bit integers (two-source version)
VPMOV2SDW xmm1 {k1}{z}, xmm2, xmm3 ; saturate signed 32-bit integers into signed 16-bit integers (two-source version)
VPMOV2SDW ymm1 {k1}{z}, ymm2, ymm3 ; saturate signed 32-bit integers into signed 16-bit integers (two-source version)
VPMOV2SDW zmm1 {k1}{z}, zmm2, zmm3 ; saturate signed 32-bit integers into signed 16-bit integers (two-source version)
VPMOV2SQD xmm1 {k1}{z}, xmm2, xmm3 ; saturate signed 64-bit integers into signed 32-bit integers (two-source version)
VPMOV2SQD ymm1 {k1}{z}, ymm2, ymm3 ; saturate signed 64-bit integers into signed 32-bit integers (two-source version)
VPMOV2SQD zmm1 {k1}{z}, zmm2, zmm3 ; saturate signed 64-bit integers into signed 32-bit integers (two-source version)

VPMOV2USWB xmm1 {k1}{z}, xmm2, xmm3 ; saturate unsigned 16-bit integers into unsigned 8-bit integers (two-source version)
VPMOV2USWB ymm1 {k1}{z}, ymm2, ymm3 ; saturate unsigned 16-bit integers into unsigned 8-bit integers (two-source version)
VPMOV2USWB zmm1 {k1}{z}, zmm2, zmm3 ; saturate unsigned 16-bit integers into unsigned 8-bit integers (two-source version)
VPMOV2USDW xmm1 {k1}{z}, xmm2, xmm3 ; saturate unsigned 32-bit integers into unsigned 16-bit integers (two-source version)
VPMOV2USDW ymm1 {k1}{z}, ymm2, ymm3 ; saturate unsigned 32-bit integers into unsigned 16-bit integers (two-source version)
VPMOV2USDW zmm1 {k1}{z}, zmm2, zmm3 ; saturate unsigned 32-bit integers into unsigned 16-bit integers (two-source version)
VPMOV2USQD xmm1 {k1}{z}, xmm2, xmm3 ; saturate unsigned 64-bit integers into unsigned 32-bit integers (two-source version)
VPMOV2USQD ymm1 {k1}{z}, ymm2, ymm3 ; saturate unsigned 64-bit integers into unsigned 32-bit integers (two-source version)
VPMOV2USQD zmm1 {k1}{z}, zmm2, zmm3 ; saturate unsigned 64-bit integers into unsigned 32-bit integers (two-source version)

These instructions would complement the existing VPMOVxxx instructions, which narrow integers from one register and store the result into the low part of the destination. With a second input register, the narrowing operation can fill the full width of the destination, as there is now enough input data.

; Narrowing 16-bit integers from zmm1 and zmm2 into zmm0 in the current AVX-512:
VPMOVWB ymm0, zmm1               ; Narrowing the first register (zmm1) into ymm0
VPMOVWB ymm9, zmm2               ; Narrowing the second register into a temporary (ymm9)
VINSERTI64X4 zmm0, zmm0, ymm9, 1 ; Combining the two registers together

; Narrowing 16-bit integers from zmm1 and zmm2 into zmm0 with the proposed extension:
VPMOV2WB zmm0, zmm1, zmm2        ; Narrow both sources into a single destination
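
The same pattern expressed in C with existing intrinsics (truncating variant; a sketch, the helper name is mine):

#include <immintrin.h>

// Truncate 16-bit elements of `a` and `b` to bytes and pack both results
// into one 64-byte vector - the three-instruction sequence that the
// proposed VPMOV2WB would collapse into one.
static inline __m512i narrow_2x_u16_to_u8(__m512i a, __m512i b) {
  __m256i lo = _mm512_cvtepi16_epi8(a); // VPMOVWB
  __m256i hi = _mm512_cvtepi16_epi8(b); // VPMOVWB
  return _mm512_inserti64x4(_mm512_castsi256_si512(lo), hi, 1); // VINSERTI64X4
}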

Conclusion (AVX512_VPMOVH)

These may look like very minor additions, but I have seen a lot of code with exactly these constructs, where better widening and narrowing operations would reduce both the latency of the code and the number of memory operations (today it's often more practical to load 256-bit data and widen it into a 512-bit register, and to narrow 512-bit data into a 256-bit memory destination). It seems that architecturally these should not be a problem, as some new instructions introduced by AVX10.2 already provide two-source narrowing operations, albeit for a different data type (FP8).

AVX512_VPACCSHIFT (General Purpose)

Shifts with accumulation:

VPADDSRA[B|W|D|Q] xmm1 {k1}{z}, xmm2, xmm3, imm ; shift xmm3 right (arithmetic) by imm bits, add the result to xmm2, and store in xmm1
VPADDSRA[B|W|D|Q] ymm1 {k1}{z}, ymm2, ymm3, imm ; shift ymm3 right (arithmetic) by imm bits, add the result to ymm2, and store in ymm1
VPADDSRA[B|W|D|Q] zmm1 {k1}{z}, zmm2, zmm3, imm ; shift zmm3 right (arithmetic) by imm bits, add the result to zmm2, and store in zmm1

VPADDSRL[B|W|D|Q] xmm1 {k1}{z}, xmm2, xmm3, imm ; shift xmm3 right (logical) by imm bits, add the result to xmm2, and store in xmm1
VPADDSRL[B|W|D|Q] ymm1 {k1}{z}, ymm2, ymm3, imm ; shift ymm3 right (logical) by imm bits, add the result to ymm2, and store in ymm1
VPADDSRL[B|W|D|Q] zmm1 {k1}{z}, zmm2, zmm3, imm ; shift zmm3 right (logical) by imm bits, add the result to zmm2, and store in zmm1

VPADDSLL[B|W|D|Q] xmm1 {k1}{z}, xmm2, xmm3, imm ; shift xmm3 left (logical) by imm bits, add the result to xmm2, and store in xmm1
VPADDSLL[B|W|D|Q] ymm1 {k1}{z}, ymm2, ymm3, imm ; shift ymm3 left (logical) by imm bits, add the result to ymm2, and store in ymm1
VPADDSLL[B|W|D|Q] zmm1 {k1}{z}, zmm2, zmm3, imm ; shift zmm3 left (logical) by imm bits, add the result to zmm2, and store in zmm1

VPSUBSRA[B|W|D|Q] xmm1 {k1}{z}, xmm2, xmm3, imm ; shift xmm3 right (arithmetic) by imm bits, subtract the result from xmm2, and store in xmm1
VPSUBSRA[B|W|D|Q] ymm1 {k1}{z}, ymm2, ymm3, imm ; shift ymm3 right (arithmetic) by imm bits, subtract the result from ymm2, and store in ymm1
VPSUBSRA[B|W|D|Q] zmm1 {k1}{z}, zmm2, zmm3, imm ; shift zmm3 right (arithmetic) by imm bits, subtract the result from zmm2, and store in zmm1

VPSUBSRL[B|W|D|Q] xmm1 {k1}{z}, xmm2, xmm3, imm ; shift xmm3 right (logical) by imm bits, subtract the result from xmm2, and store in xmm1
VPSUBSRL[B|W|D|Q] ymm1 {k1}{z}, ymm2, ymm3, imm ; shift ymm3 right (logical) by imm bits, subtract the result from ymm2, and store in ymm1
VPSUBSRL[B|W|D|Q] zmm1 {k1}{z}, zmm2, zmm3, imm ; shift zmm3 right (logical) by imm bits, subtract the result from zmm2, and store in zmm1

VPSUBSLL[B|W|D|Q] xmm1 {k1}{z}, xmm2, xmm3, imm ; shift xmm3 left (logical) by imm bits, subtract the result from xmm2, and store in xmm1
VPSUBSLL[B|W|D|Q] ymm1 {k1}{z}, ymm2, ymm3, imm ; shift ymm3 left (logical) by imm bits, subtract the result from ymm2, and store in ymm1
VPSUBSLL[B|W|D|Q] zmm1 {k1}{z}, zmm2, zmm3, imm ; shift zmm3 left (logical) by imm bits, subtract the result from zmm2, and store in zmm1

These instructions would combine a shift operation with accumulation (addition or subtraction). The motivation is that a lot of existing code shifts integers purely for the purpose of accumulation. Fusing two relatively simple operations into one would improve latency, and it would also reduce the number of constants required by SIMD code: in many cases extra registers must be allocated to hold constants that could easily be calculated from other constants by shifting (for example, all powers of 2 can be calculated on the fly by shifting 1 left).

; Shift 16-bit elements of zmm1 right by 8 bits and accumulate the result into zmm0 in the current AVX-512:
VPSRLW zmm9, zmm1, 8    ; Shift 16-bit elements of zmm1 right by 8 and store the result into a temporary zmm9
VPADDW zmm0, zmm0, zmm9 ; Accumulate the temporary result in zmm9 into zmm0

; Shift 16-bit elements of zmm1 right by 8 bits and accumulate the result into zmm0 with the proposed extension:
VPADDSRLW zmm0, zmm0, zmm1, 8 ; Shift 16-bit elements of zmm1 right by 8 and accumulate the result in zmm0
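
Expressed in C with existing intrinsics, the current two-instruction form looks like this (a sketch; the helper name is mine, as the proposed instruction has no intrinsic):

#include <immintrin.h>

// Today's two-instruction shift-then-accumulate that the proposed
// VPADDSRLW would fuse into one: acc += (x >> 8) per 16-bit lane.
static inline __m512i add_srl_epi16(__m512i acc, __m512i x) {
  return _mm512_add_epi16(acc, _mm512_srli_epi16(x, 8));
}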

The following code shows what a single ZMM register with all bits set (-1 for all integer widths) would make possible:

; zmm31 is a register holding an all-ones constant
VPTERNLOGD zmm31, zmm31, zmm31, 0xFF ; Create such a constant without loading it from memory

; With VPSUBSLL[B|W|D|Q] it's possible to add an arbitrary power of 2 using zmm31 as the only constant
; (note: since the constant is negative, subtraction turns it into an addition)
VPSUBB    zmm0, zmm1, zmm31    ; zmm0 = zmm1 + 1 (zmm31)      (8-bit int)
VPSUBSLLB zmm0, zmm1, zmm31, 1 ; zmm0 = zmm1 + 2 (zmm31 << 1) (8-bit int)
VPSUBSLLB zmm0, zmm1, zmm31, 2 ; zmm0 = zmm1 + 4 (zmm31 << 2) (8-bit int)
VPSUBSLLB zmm0, zmm1, zmm31, 3 ; zmm0 = zmm1 + 8 (zmm31 << 3) (8-bit int)

VPSUBW    zmm0, zmm1, zmm31    ; zmm0 = zmm1 + 1 (zmm31)      (16-bit int)
VPSUBSLLW zmm0, zmm1, zmm31, 1 ; zmm0 = zmm1 + 2 (zmm31 << 1) (16-bit int)
VPSUBSLLW zmm0, zmm1, zmm31, 2 ; zmm0 = zmm1 + 4 (zmm31 << 2) (16-bit int)
VPSUBSLLW zmm0, zmm1, zmm31, 3 ; zmm0 = zmm1 + 8 (zmm31 << 3) (16-bit int)

VPSUBD    zmm0, zmm1, zmm31    ; zmm0 = zmm1 + 1 (zmm31)      (32-bit int)
VPSUBSLLD zmm0, zmm1, zmm31, 1 ; zmm0 = zmm1 + 2 (zmm31 << 1) (32-bit int)
VPSUBSLLD zmm0, zmm1, zmm31, 2 ; zmm0 = zmm1 + 4 (zmm31 << 2) (32-bit int)
VPSUBSLLD zmm0, zmm1, zmm31, 3 ; zmm0 = zmm1 + 8 (zmm31 << 3) (32-bit int)

VPSUBQ    zmm0, zmm1, zmm31    ; zmm0 = zmm1 + 1 (zmm31)      (64-bit int)
VPSUBSLLQ zmm0, zmm1, zmm31, 1 ; zmm0 = zmm1 + 2 (zmm31 << 1) (64-bit int)
VPSUBSLLQ zmm0, zmm1, zmm31, 2 ; zmm0 = zmm1 + 4 (zmm31 << 2) (64-bit int)
VPSUBSLLQ zmm0, zmm1, zmm31, 3 ; zmm0 = zmm1 + 8 (zmm31 << 3) (64-bit int)

; With VPADDSRL[B|W|D|Q] it's possible to add 1, 3, 7, 15, 31, 63, 127, 255 - essentially any 2^n - 1
; constant, which is exactly what rounding needs when combined with an additional shift, or which can
; serve any other purpose.
VPADDSRLB zmm0, zmm1, zmm31, 7 ; zmm0 = zmm1 + 1   (zmm31 >> 7) (8-bit int)
VPADDSRLB zmm0, zmm1, zmm31, 6 ; zmm0 = zmm1 + 3   (zmm31 >> 6) (8-bit int)
VPADDSRLB zmm0, zmm1, zmm31, 5 ; zmm0 = zmm1 + 7   (zmm31 >> 5) (8-bit int)
VPADDSRLB zmm0, zmm1, zmm31, 4 ; zmm0 = zmm1 + 15  (zmm31 >> 4) (8-bit int)
VPADDSRLB zmm0, zmm1, zmm31, 3 ; zmm0 = zmm1 + 31  (zmm31 >> 3) (8-bit int)
VPADDSRLB zmm0, zmm1, zmm31, 2 ; zmm0 = zmm1 + 63  (zmm31 >> 2) (8-bit int)
VPADDSRLB zmm0, zmm1, zmm31, 1 ; zmm0 = zmm1 + 127 (zmm31 >> 1) (8-bit int)
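
A scalar sanity check of the arithmetic behind these tricks, per 8-bit lane (plain C, no intrinsics):

#include <assert.h>
#include <stdint.h>

int main(void) {
  uint8_t x = 100, ones = 0xFF; // `ones` plays the role of zmm31 (-1 per byte)
  // Subtracting a left-shifted -1 adds a power of two:
  assert((uint8_t)(x - (uint8_t)(ones << 3)) == (uint8_t)(x + 8)); // VPSUBSLLB ..., 3
  // Adding a logically right-shifted 0xFF adds 2^(8-n) - 1:
  assert((uint8_t)(x + (ones >> 5)) == (uint8_t)(x + 7));          // VPADDSRLB ..., 5
  return 0;
}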

Of course, any other constant that dominates in the code could be reused for further purposes in the same way via these shift-and-accumulate instructions.

The following code shows how to pack the low 8-bit halves of 16-bit elements from two registers into a single register using a left-shift-and-accumulate operation:

; Combine the low 8-bit half of each 16-bit element in zmm1 with the low 8-bit half of the
; corresponding element in zmm2, taking advantage of the fact that the high 8 bits of every
; zmm1 element are zero. Since the shifted operand contributes only zeros to the low byte and
; zmm1 contributes only zeros to the high byte, the addition cannot carry and acts exactly
; like a combining (OR) operation.
VPADDSLLW zmm0, zmm1, zmm2, 8

; The same trick works with wider element types:
VPADDSLLD zmm0, zmm1, zmm2, 16 ; combine low 16-bit halves of 32-bit elements
VPADDSLLQ zmm0, zmm1, zmm2, 32 ; combine low 32-bit halves of 64-bit elements
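
With current AVX-512 this trick still takes two instructions per combine; in C (a sketch, helper name mine):

#include <immintrin.h>

// dst = a + (b << 8) per 16-bit lane. When the high byte of every lane
// of `a` is zero the addition cannot carry, so it acts exactly like OR.
static inline __m512i pack_low_bytes(__m512i a, __m512i b) {
  return _mm512_add_epi16(a, _mm512_slli_epi16(b, 8));
}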

Conclusion (AVX512_VPACCSHIFT)

Shift-and-accumulate operations are available on ARM platforms (SSRA and USRA, for example) and they are very handy when writing SIMD code that works with integers. ARM provides more variety here (saturating and rounding shift-and-accumulate operations are available as well), but I have personally mostly used the most general forms, as presented here. Of course, having more variety at the ISA level would be great, but that would also mean introducing saturating and rounding shifts without accumulation.
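
For reference, the ARM counterpart in C (NEON intrinsics; the helper name is mine):

#include <arm_neon.h>

// USRA fuses a logical shift right with an addition in one instruction:
// acc += (x >> 4) per 16-bit lane.
static inline uint16x8_t add_usra_u16(uint16x8_t acc, uint16x8_t x) {
  return vsraq_n_u16(acc, x, 4); // USRA Vd.8H, Vn.8H, #4
}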

AVX512_VPHISTB (Special Use)

Computing a histogram:

VPHISTB zmm1, zmm2, imm ; calculate a histogram of the 6 LSBs of each input byte, predicated by imm

The VPHISTB instruction would provide a foundation for calculating histograms. It would always use ZMM input and output registers (no 128-bit or 256-bit forms). The operation would take the 6 LSBs of each byte of the input register (zmm2) and interpret them as a byte index into the output register, incrementing the byte at that index. To make the instruction practical, the immediate controls which values get counted: the 2 MSBs of each input byte are compared with the 2 LSBs of the immediate (the predicate), and the byte is counted only when they match. This means that only 4 instructions are required to calculate a full byte histogram of 64 input bytes.

The operation:

VPHISTB dst:zmm, src:zmm, imm:byte

predicate := imm.bits[1:0]; // 2 bits extracted from imm.
tmp_out   := zmm{0};        // temporary ZMM register, zero-initialized.

for (i := 0; i < 64; i++) {
  // Extract a single byte from the source register at index `i`.
  val := src.byte[i];
  msb := val.bits[7:6];
  lsb := val.bits[5:0];

  // If the predicate matches, the byte is counted.
  if (msb == predicate) {
    tmp_out.byte[lsb] += 1;
  }
}

// Store the calculated histogram into the destination register.
dst := tmp_out;
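
The same semantics as a scalar C reference model (a sketch of the proposal, not an existing instruction or intrinsic):

#include <stdint.h>
#include <string.h>

void vphistb_ref(uint8_t dst[64], const uint8_t src[64], uint8_t imm) {
  uint8_t predicate = imm & 0x3; // 2 LSBs of the immediate
  memset(dst, 0, 64);
  for (int i = 0; i < 64; i++) {
    uint8_t msb = src[i] >> 6;   // bits [7:6] select the value range
    uint8_t lsb = src[i] & 0x3F; // bits [5:0] index the counter byte
    if (msb == predicate)
      dst[lsb] += 1;
  }
}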

To calculate a histogram of 8-bit entities, VPHISTB has to be used 4 times:

; Calculate a histogram of the input ZMM4 and store the results into ZMM0-3:
VPHISTB zmm0, zmm4, 0b00 ; zmm0 = histogram of bytes in range [0..63]
VPHISTB zmm1, zmm4, 0b01 ; zmm1 = histogram of bytes in range [64..127]
VPHISTB zmm2, zmm4, 0b10 ; zmm2 = histogram of bytes in range [128..191]
VPHISTB zmm3, zmm4, 0b11 ; zmm3 = histogram of bytes in range [192..255]

To accumulate these values into 16-bit counters, they would need to be widened to 16 bits and then added. However, at least 3 rounds (192 input bytes) of VPHISTB results can be summed at the byte level without overflow (4 rounds are not safe: if all 256 input bytes had the same value, a byte counter would overflow). 16-bit accumulators are then sufficient for histograms of up to 65535 bytes before the intermediate results must be folded into 32-bit accumulators. This means that 65535 bytes could be processed without writing anything to memory, as 256 16-bit counters need only 8 ZMM registers.
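
A sketch of that widening step in C with existing intrinsics (assuming `hist` holds a byte-level histogram produced as described above; the helper name is mine):

#include <immintrin.h>

// Fold one byte-level histogram register into a pair of 16-bit
// accumulators covering the same 64 byte bins.
static inline void fold_hist16(__m512i hist, __m512i *acc_lo, __m512i *acc_hi) {
  __m256i lo = _mm512_castsi512_si256(hist);
  __m256i hi = _mm512_extracti64x4_epi64(hist, 1);
  *acc_lo = _mm512_add_epi16(*acc_lo, _mm512_cvtepu8_epi16(lo));
  *acc_hi = _mm512_add_epi16(*acc_hi, _mm512_cvtepu8_epi16(hi));
}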

Conclusion (AVX512_VPHISTB)

I'm not sure how feasible an instruction like this is, as it basically amounts to 64 one-bit additions into counters selected by 6-bit indices. However, histograms are used a lot and there are no good extensions for calculating them on x86. The most general-purpose approach (a simple loop with accumulation into memory) is still the gold standard - the loop can be unrolled for better performance, but it stays in the ballpark of 1-2 cycles per byte. The proposed extension could accelerate this by an order of magnitude.
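
For completeness, the scalar baseline mentioned above:

#include <stddef.h>
#include <stdint.h>

// The "gold standard": a plain loop accumulating into memory.
// `hist` is assumed to be zero-initialized by the caller.
void histogram_u8(uint32_t hist[256], const uint8_t *data, size_t n) {
  for (size_t i = 0; i < n; i++)
    hist[data[i]]++;
}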