This article provides insight into using some of the instructions that are part of the BMI and BMI2 extensions.
The RORX instruction was introduced by the BMI2 extension, and it is the only BMI/BMI2 instruction that accepts an immediate operand. All other instructions in this category work only with registers or r/m operands, which is a bit annoying because many shifts need an immediate count. In general, the BMI2 shift and rotate instructions can replace a MOV followed by SHL, SHR, SAR, or ROR:
; Arithmetic and logical shifts by a variable count (n must be in the CL register):
mov dst, src
sar/shr/shl dst, n
; Replaced by a single BMI2 instruction (n can be any register):
sarx/shrx/shlx dst, src, n
; Rotate right with immediate (with separate destination):
mov dst, src
ror dst, imm
; Replaced by RORX:
rorx dst, src, imm
RORX performs a rotation, so it does not shift in zeros; the bits rotated out at the bottom re-enter at the top. This is not a problem when working with 32-bit values and a 64-bit rotation, provided the upper 32 bits of the source register are zero (as they are after any 32-bit operation). The low 32 bits of the result then equal the logically shifted value, and the rotated-out bits land in the upper 32 bits of the destination, which are either never used or cleared by the next 32-bit operation through zero extension. This way, RORX can stand in for the missing immediate form of SHRX:
; Logical shift right and addition, for example:
mov ebx, eax
shr ebx, imm
add ebx, eax ; Or any other operation
; Can be rewritten to use RORX if dst is 32-bit and the upper half of rax is zero:
rorx rbx, rax, imm ; Shift by rotation (low 32 bits hold eax >> imm)
add ebx, eax ; Op + zero extension clears the rotated-out bits
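The equivalence between the rotation and the shift can be checked with a small C model. This is only a sketch: `ror64` and `shr32_via_ror64` are hypothetical helper names, the rotate is written in portable C rather than with a BMI2 intrinsic, and the model assumes the 32-bit value sits zero-extended in a 64-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of a 64-bit rotate right (what RORX computes for
   64-bit operands). The name is an assumption made for this sketch. */
uint64_t ror64(uint64_t value, unsigned count) {
    assert(count < 64);
    if (count == 0) return value;   /* avoid an undefined shift by 64 */
    return (value >> count) | (value << (64 - count));
}

/* Models `rorx rbx, rax, imm` followed by a 32-bit operation on ebx:
   the source is a 32-bit value zero-extended to 64 bits, and only the
   low 32 bits of the rotated result are kept. */
uint32_t shr32_via_ror64(uint32_t value, unsigned count) {
    assert(count < 32);
    uint64_t rotated = ror64((uint64_t)value, count);
    return (uint32_t)rotated;       /* same bits as value >> count */
}
```

For any 32-bit `value` and any `count` below 32, the truncated rotation matches `value >> count`, because the bits that wrap around end up in the upper half, which the truncation discards.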
Additionally, RORX can be used in the same way to decompose structs or bit fields that fit in a 32-bit or 64-bit value. Consider the following C/C++ struct:
struct UInt32Pair {
  uint32_t x;
  uint32_t y;
};
If the pair is held in a single 64-bit register, its x and y members can be summed into a 32-bit result with only two instructions:
; RAX contains the struct pair loaded as a 64-bit integer
mov rax, [some_address]
; RORX can be used to "extract" the higher 32 bits
rorx rbx, rax, 32
; Finally, ADD sums the 32-bit values in eax and ebx, producing
; a 32-bit result that is zero-extended into rbx
add ebx, eax
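The same sequence can be modelled in C to see why the result does not depend on which member ends up in which half. This is a sketch under assumptions: `pack_pair` and `sum_packed_pair` are hypothetical names, the struct is assumed to have no padding, and the rotate by 32 is spelled out with two shifts instead of an intrinsic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct UInt32Pair {
    uint32_t x;
    uint32_t y;
};

/* Models `mov rax, [some_address]`: load the whole struct as one
   64-bit integer. Assumes the struct is exactly 8 bytes. */
uint64_t pack_pair(const struct UInt32Pair *pair) {
    uint64_t packed;
    assert(sizeof *pair == sizeof packed);
    memcpy(&packed, pair, sizeof packed);
    return packed;
}

/* Models `rorx rbx, rax, 32` followed by `add ebx, eax`. */
uint32_t sum_packed_pair(uint64_t packed) {
    uint64_t rotated = (packed >> 32) | (packed << 32); /* swap halves */
    return (uint32_t)rotated + (uint32_t)packed;        /* wraps mod 2^32 */
}
```

Because the rotation by 32 simply swaps the two halves, the sum is the same on little-endian and big-endian layouts.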
RORX can be used the same way to extract other bit quantities when necessary.
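As an illustration of extracting other quantities, the sketch below models pulling one 16-bit lane out of a 64-bit value. The layout (four 16-bit lanes, lane 0 in bits 0..15) and the names `rotate_right_64` and `extract_lane16` are assumptions made for this example; in assembly the idea would correspond to a RORX by `16 * lane` followed by a zero-extending move of the low 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of a 64-bit rotate right, as computed by RORX. */
uint64_t rotate_right_64(uint64_t value, unsigned count) {
    assert(count < 64);
    if (count == 0) return value;   /* avoid an undefined shift by 64 */
    return (value >> count) | (value << (64 - count));
}

/* Hypothetical layout: four 16-bit lanes packed into one 64-bit value.
   The rotation brings the requested lane into bits 0..15 and the
   truncation keeps only that lane. */
uint16_t extract_lane16(uint64_t packed, unsigned lane) {
    assert(lane < 4);
    return (uint16_t)rotate_right_64(packed, 16 * lane);
}
```

The source value is left untouched, which mirrors the separate destination operand of RORX.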