BMI and BMI2 Instructions Usage

This article provides insight into using some of the instructions that are part of the BMI and BMI2 extensions.

RORX - A Hidden Gem (BMI2)

The RORX instruction was introduced by the BMI2 extension, and it's the only BMI/BMI2 instruction that accepts an immediate value. All other instructions in this category only work with registers or r/m operands, which is a bit annoying as many shifts actually need an immediate. In general, BMI2 shift and rotate instructions can be used to replace MOV followed by SHL, SHR, SAR, or ROR:

; Arithmetic and logical shifts (n must be in the CL register):
mov dst, src
sar/shr/shl dst, n

; Replaced by a single BMI2 instruction (n can be any register):
sarx/shrx/shlx dst, src, n

; Rotate right with immediate (with separate destination):
mov dst, src
ror dst, imm

; Replaced by RORX:
rorx dst, src, imm
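
For readers who work in C++ rather than assembly, the operations above might be written as in the following sketch. The helper names shr_var and rotr32 are illustrative and not part of any library. Whether a compiler turns them into SHRX or RORX depends on the compiler and its flags (for example, BMI2 code generation being enabled), so treat that mapping as an assumption rather than a guarantee.

```cpp
#include <cstdint>

// Hypothetical helper names, used only for illustration.

// Logical shift right by a run-time count; a candidate for SHRX.
// The count is masked to 5 bits, which matches what the hardware does
// for 32-bit operands and keeps the C++ shift well defined.
constexpr uint32_t shr_var(uint32_t x, uint32_t n) {
  return x >> (n & 31u);
}

// Rotate right by n bits; a candidate for ROR or RORX.
// Both shift counts are masked, so n == 0 does not shift by 32.
constexpr uint32_t rotr32(uint32_t x, uint32_t n) {
  return (x >> (n & 31u)) | (x << ((32u - n) & 31u));
}

static_assert(shr_var(0x80000000u, 31) == 1u, "top bit moves to bit 0");
static_assert(rotr32(0x12345678u, 8) == 0x78123456u, "low byte wraps to the top");
```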

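RORX rotates, so instead of shifting in zeros it brings the bits that were rotated out back in at the top. This is not a problem when we work with 32-bit values and use a 64-bit rotation: the rotated-out bits end up in the high 32 bits of the destination, which are either ignored or overwritten with zeros by the next 32-bit operation because of zero extension. The bits that move into the top of the low 32-bit half come from the high half of the source, so the source must hold a zero-extended 32-bit value, which is the case whenever it was produced by a 32-bit operation. This way, RORX can stand in for the missing SHRX with an immediate: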

; Logical shift right and addition, for example:
mov ebx, eax
shr ebx, imm
add ebx, eax         ; Or any other operation

; Can be rewritten to use RORX if dst is 32-bit:
rorx rbx, rax, imm   ; Shift by rotation
add ebx, eax         ; Op + Zero extends the result
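
The equivalence can be checked in portable C++ with a small model of the two instruction sequences. This is only a sketch: rotr64, shr_add_ref and shr_add_rot are hypothetical names, imm is assumed to be in the range 0 to 31, and the input is assumed to be zero-extended to 64 bits.

```cpp
#include <cstdint>

// Rotate a 64-bit value right by n bits (n is reduced modulo 64).
constexpr uint64_t rotr64(uint64_t x, unsigned n) {
  return (x >> (n & 63u)) | (x << ((64u - n) & 63u));
}

// Reference: mov ebx, eax / shr ebx, imm / add ebx, eax
constexpr uint32_t shr_add_ref(uint32_t x, unsigned imm) {
  return (x >> imm) + x;
}

// Rotation-based: rorx rbx, rax, imm / add ebx, eax
// The cast to uint32_t models the 32-bit add, which ignores the
// rotated-out bits sitting in the high half of the register.
constexpr uint32_t shr_add_rot(uint32_t x, unsigned imm) {
  return uint32_t(rotr64(uint64_t(x), imm)) + x;
}

static_assert(shr_add_rot(0xF0000001u, 4) == shr_add_ref(0xF0000001u, 4),
              "both sequences agree");
```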

Additionally, RORX can be used in the same way to decompose structs or bit fields that fit in a 32-bit or 64-bit value. Imagine the following struct in C/C++:

struct UInt32Pair {
  uint32_t x;
  uint32_t y;
};

If you have the pair in a single register it's possible to sum X and Y components with only two operations and get a 32-bit result:

; RAX contains the struct pair loaded as a 64-bit integer
mov rax, [some_address]

; RORX can be used to "extract" the higher 32 bits
rorx rbx, rax, 32

; And finally, add can be used to sum both 32-bit values in eax
; and ebx, which would produce a 32-bit zero extended output
add ebx, eax
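
The same idea can be sketched in C++. The function name sum_pair is hypothetical, and the code assumes the struct is exactly 8 bytes with no padding. Because both halves are added together, the result does not depend on which half holds x and which holds y.

```cpp
#include <cstdint>
#include <cstring>

struct UInt32Pair {
  uint32_t x;
  uint32_t y;
};

// Sums x and y by treating the pair as a single 64-bit integer.
inline uint32_t sum_pair(const UInt32Pair& p) {
  static_assert(sizeof(UInt32Pair) == sizeof(uint64_t),
                "the pair is assumed to occupy exactly 64 bits");
  uint64_t v;
  std::memcpy(&v, &p, sizeof v);       // mov rax, [some_address]
  uint64_t r = (v >> 32) | (v << 32);  // rorx rbx, rax, 32
  return uint32_t(r) + uint32_t(v);    // add ebx, eax
}
```

The memcpy is there to load the struct as an integer without violating aliasing rules; compilers typically reduce it to a plain 64-bit load.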

RORX can be used the same way to extract other bit quantities, when necessary.
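
As one possible illustration of other bit quantities, the sketch below extracts a 16-bit lane from a 64-bit value by rotating the lane down to bit 0 and truncating. The names rotr64 and lane16 and the four-lane layout are assumptions made for this example only.

```cpp
#include <cstdint>

// Rotate a 64-bit value right by n bits (n is reduced modulo 64).
constexpr uint64_t rotr64(uint64_t x, unsigned n) {
  return (x >> (n & 63u)) | (x << ((64u - n) & 63u));
}

// Returns 16-bit lane `index` (0 = least significant) of a packed value.
// The rotation brings the lane to the bottom; the cast drops the rest,
// much like using a narrower register after RORX.
constexpr uint16_t lane16(uint64_t packed, unsigned index) {
  return uint16_t(rotr64(packed, (index & 3u) * 16u));
}

static_assert(lane16(0x4444333322221111ull, 0) == 0x1111, "lane 0");
static_assert(lane16(0x4444333322221111ull, 2) == 0x3333, "lane 2");
```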

BZHI - LSB masks done fast

The BZHI (Zero High Bits Starting with Specified Bit Position) instruction can be used to create LSB bit masks, which have one or more consecutive bits set starting from the least significant bit. In the past this operation required multiple steps, and if we wanted to do this in C++ it would look like the following:

#include <stdint.h>

// Creates a bit mask where the N least significant bits are set
// (32-bit version):
//
// IMPORTANT: The problem with this implementation is that it's
// undefined behavior if nBits is 32 or greater, so it's not
// possible to use this function to create a bit mask where
// all bits are set.
uint32_t lsb_mask_v1(uint32_t nBits) {
  return (1u << nBits) - 1;
}

// To fix the undefined behavior it's possible to promote the
// operation to 64 bits. However, this is only efficient on a
// 64-bit machine, and if we wanted a 64-bit result there is no
// promotion to 128 bits available, so the following approach
// only works when the promotion is possible:
uint32_t lsb_mask_v2(uint32_t nBits) {
  return uint32_t(
    (uint64_t(1) << nBits) - 1
  );
}

// Alternative solutions are possible: we can start with all
// ones and shift the value right.
//
// IMPORTANT: This has the opposite problem to v1. It is
// undefined behavior if nBits is 0, as that would mean
// shifting by 32.
uint32_t lsb_mask_v3(uint32_t nBits) {
  return ~uint32_t(0) >> (32u - nBits);
}

// Again, the computation can be promoted to 64 bits.
uint32_t lsb_mask_v4(uint32_t nBits) {
  return uint32_t(
    uint64_t(~uint32_t(0)) >> (32u - nBits)
  );
}
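
For comparison, a variant that stays within 32 bits and is defined for every input needs an explicit check. The following is a minimal sketch, not one of the article's original variants: the name lsb_mask_checked is hypothetical, and saturating to an all-ones mask for nBits of 32 or more is a design choice made here.

```cpp
#include <cstdint>

// Hypothetical fifth variant: no 64-bit promotion and no undefined
// behavior, at the cost of a comparison and a select (or branch).
constexpr uint32_t lsb_mask_checked(uint32_t nBits) {
  return nBits >= 32u ? ~uint32_t(0) : (uint32_t(1) << nBits) - 1u;
}

static_assert(lsb_mask_checked(0) == 0u, "empty mask");
static_assert(lsb_mask_checked(5) == 0x1Fu, "five low bits set");
static_assert(lsb_mask_checked(32) == 0xFFFFFFFFu, "full mask");
```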

The input parameter nBits is often either a constant or comes from leading/trailing bit counting operations that calculate how many bits are zero, counting from either the LSB or the MSB. The BZHI instruction kills two birds with one stone here: it provides better performance because it reduces the number of instructions needed to compute the bit mask, and it removes the undefined behavior problem. So, what would an implementation that uses BZHI look like?

#include <stdint.h>
#include <x86intrin.h>

// Calculates an LSB bit mask by using the BZHI instruction for nBits from
// 0 to 32. This approach should expand to 2 instructions: one moving -1 to
// a register and a second using `BZHI` to zero the upper bits that are not
// part of the mask.
uint32_t lsb_mask_bzhi(uint32_t nBits) {
  return _bzhi_u32(~uint32_t(0), nBits);
}
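
To make the behavior at the boundaries concrete without requiring BMI2 hardware, here is a portable model of 32-bit BZHI. It is only a sketch: bzhi32_model and lsb_mask_model are hypothetical names. The model follows the documented rule that only the low 8 bits of the index are used and that an index of 32 or more leaves the source unchanged.

```cpp
#include <cstdint>

// Portable model of the 32-bit BZHI result (flags are not modeled).
constexpr uint32_t bzhi32_model(uint32_t src, uint32_t index) {
  return (index & 0xFFu) >= 32u
    ? src                                             // nothing is cleared
    : src & ((uint32_t(1) << (index & 0xFFu)) - 1u);  // keep the low bits
}

// Mirrors lsb_mask_bzhi(), but on top of the model.
constexpr uint32_t lsb_mask_model(uint32_t nBits) {
  return bzhi32_model(~uint32_t(0), nBits);
}

static_assert(lsb_mask_model(0) == 0u, "no bits set");
static_assert(lsb_mask_model(32) == 0xFFFFFFFFu, "all bits set");
```

One consequence of the 8-bit index is that an nBits value of 256 behaves like 0, so callers that may pass very large counts still need to clamp them.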

Interestingly, intrinsics are required in order to use BZHI, as it's difficult for a compiler to map existing code to this instruction. For example, the compiler would not use it to compile the previous lsb_mask_v? functions because their semantics don't match.