BMI and BMI2 Instructions Usage

This article provides insight into using some of the instructions that are part of the BMI and BMI2 extensions.

RORX - A Hidden Gem (BMI2)

The RORX instruction was introduced by the BMI2 extension, and it's the only BMI/BMI2 instruction that accepts an immediate value. All other instructions in this category only work with registers or r/m operands, which is a bit annoying, as many shifts actually need an immediate. In general, BMI2 shift and rotate instructions can be used to replace a MOV followed by SHL, SHR, SAR, or ROR:

; Arithmetic and logical shifts (a variable count n must be in the CL register):
mov dst, src
sar/shr/shl dst, n

; Replaced by a single BMI2 instruction (n can be any register):
sarx/shrx/shlx dst, src, n

; Rotate right with immediate (with separate destination):
mov dst, src
ror dst, imm

; Replaced by RORX:
rorx dst, src, imm

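RORX rotates, so instead of shifting in zeros it shifts in the bits that were shifted out. This is not a problem if we work with 32-bit values and use a 64-bit rotation. As long as the upper 32 bits of the source register are zero (which is the case after any 32-bit operation that writes it) and the rotation count is less than 32, the low 32 bits of the result equal the shifted value. The rotated-in bits end up in the high 32-bit part of the destination register, which will either not be used or be overwritten with zeros by the next 32-bit operation because of zero extension. This way, RORX can be used to stand in for the missing SHRX with an immediate: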

; Logical shift right and addition, for example:
mov ebx, eax
shr ebx, imm
add ebx, eax         ; Or any other operation

; Can be rewritten to use RORX if dst is 32-bit:
rorx rbx, rax, imm   ; Shift by rotation
add ebx, eax         ; Op + Zero extends the result
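
The equivalence of the two sequences above can be checked with a small C model. This is only a sketch: `rotr64`, `shr_add_via_rotate`, and `shr_add_reference` are hypothetical helper names, and the rotation is written with portable shifts rather than an actual RORX instruction. It assumes the 32-bit input is zero extended into the 64-bit register and that the count is less than 32.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of a 64-bit rotate right; the count is masked to 0..63. */
static uint64_t rotr64(uint64_t x, unsigned n) {
    n &= 63;
    return (x >> n) | (x << ((64 - n) & 63));
}

/* Models: rorx rbx, rax, n ; add ebx, eax */
static uint32_t shr_add_via_rotate(uint32_t eax, unsigned n) {
    assert(n < 32);                  /* the trick only works for counts below 32 */
    uint64_t rax = eax;              /* upper 32 bits of rax are zero */
    uint64_t rbx = rotr64(rax, n);   /* rotated-in bits land in the upper half */
    return (uint32_t)rbx + eax;      /* 32-bit add ignores the upper half */
}

/* Models: mov ebx, eax ; shr ebx, n ; add ebx, eax */
static uint32_t shr_add_reference(uint32_t eax, unsigned n) {
    assert(n < 32);
    return (eax >> n) + eax;
}
```

Both functions should return the same value for any 32-bit input and any count from 0 to 31. If the upper half of the source register held non-zero bits, those bits would rotate into the low half and the two results would differ.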

Additionally, RORX can be used in the same way to decompose structs or bit fields that fit in a 32-bit or 64-bit value. Imagine the following struct in C/C++:

struct UInt32Pair {
  uint32_t x;
  uint32_t y;
};

If you have the pair in a single register, it's possible to sum the x and y members with only two instructions and get a 32-bit result:

; RAX contains the struct pair loaded as a 64-bit integer
mov rax, [some_address]

; RORX can be used to "extract" the higher 32 bits
rorx rbx, rax, 32

; And finally, add can be used to sum both 32-bit values in eax
; and ebx, which would produce a 32-bit zero extended output
add ebx, eax
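
For comparison, the same sequence can be modeled in C. This is a sketch under stated assumptions: `rotr64` and `sum_pair` are hypothetical names, the rotation is expressed with portable shifts instead of RORX, and the struct is assumed to have no padding (checked at compile time).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct UInt32Pair {
    uint32_t x;
    uint32_t y;
};

static_assert(sizeof(struct UInt32Pair) == sizeof(uint64_t),
              "the pair must fit exactly in one 64-bit register");

/* Portable model of a 64-bit rotate right; the count is masked to 0..63. */
static uint64_t rotr64(uint64_t x, unsigned n) {
    n &= 63;
    return (x >> n) | (x << ((64 - n) & 63));
}

/* Sums both members of the pair, wrapping modulo 2^32. */
static uint32_t sum_pair(const struct UInt32Pair *p) {
    uint64_t rax;
    memcpy(&rax, p, sizeof rax);           /* mov rax, [some_address] */
    uint64_t rbx = rotr64(rax, 32);        /* rorx rbx, rax, 32       */
    return (uint32_t)rbx + (uint32_t)rax;  /* add ebx, eax            */
}
```

Because a rotation by 32 simply swaps the two halves, the result does not depend on which member ends up in the low half of the register.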

RORX can be used the same way to extract other bit quantities, when necessary.
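
As one illustration of extracting other bit quantities, the sketch below pulls a 16-bit field out of a 64-bit value by rotating it into the low bits and truncating. The names `rotr64` and `field16` are hypothetical, as is the layout of four packed 16-bit fields. In assembly, the truncation would need an extra zero-extending step after RORX (for example MOVZX), which is an assumption about how one might write it, not something the sequences above show.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of a 64-bit rotate right; the count is masked to 0..63. */
static uint64_t rotr64(uint64_t x, unsigned n) {
    n &= 63;
    return (x >> n) | (x << ((64 - n) & 63));
}

/* Returns the 16-bit field at position index (0 = lowest 16 bits). */
static uint16_t field16(uint64_t packed, unsigned index) {
    assert(index < 4);                             /* four 16-bit fields in 64 bits */
    return (uint16_t)rotr64(packed, 16 * index);   /* rotate, then keep the low 16 bits */
}
```

For instance, with the packed value `0x4444333322221111`, `field16(v, 2)` yields `0x3333`.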