Petr Kobalicek

Software engineer focusing on writing

high-performance code in C++ & Asm

high-performance code in C++ & Asm

This article provides alternative implementation of some instructions that were introduced by SSSE3, SSE4.1, and SSE4.2 extensions in a baseline SSE2. This code can be found in various SIMD layers that target multiple SIMD levels on X86 architectures and still provide SSE2 as a baseline for compatibility. In addition, some tricks shown here can be useful to solving other SIMD puzzles even when targeting more advanced SIMD extensions like AVX2 and AVX-512.

SSSE3 extensions introduced instructions for computing packed absolute values of 8-bit, 16-bit, and 32-bit integers. It's possible to implement `PABSW/PABSD`

in pure SSE2, and also a missing `PABSQ`

(packed absolute value of 64-bit integers), which was introduced pretty recently by AVX512-F.

Let's start with a straightforward implementation in C first:

```
inline uint32_t abs32(int32_t x) {
return x >= 0 ? uint32_t(x) : (uint32_t(0)-uint32_t(x));
}
```

Although it contains branches C++ compilers are usually able to recognize such code and create an optimized branch-less version of it either via masking or by taking advantage of how 2s complement negation works. The idea is to shift sign bit (which would be only set if the number is negative) to all other bits by arithmetic shift, to XOR it with the input, and next to subtract it from the input as well. This approach is essentially a conditional negation of a number. Considering that a `mask`

is all zeros (positive number) or all ones (negative number), the conditional/masked negations would be implemented as `(x ^ mask) - mask`

.

The C++ code can be directly translated to SSE2 for 16-bit and 32-bit integer sizes:

```
// SSE2 compatible PABSW implementation (unoptimized).
static inline __m128i sse2_pabsw(const __m128i& x) {
__m128i msk = _mm_srai_epi16(x, 15);
return _mm_sub_epi16(_mm_xor_si128(x, msk), msk);
}
// SSE2 compatible PABSD implementation.
static inline __m128i sse2_pabsd(const __m128i& x) {
__m128i msk = _mm_srai_epi32(x, 31);
return _mm_sub_epi32(_mm_xor_si128(x, msk), msk);
}
```

These were straightforward translations based on the initial C++ code. However, there is a better way of implementing `PABSW`

and there is also a way of implementing `PABSB`

without any shifts (because there is no packed shift that operates on 8-bit entities). Since absolute value could be also written as `max(x, -x)`

we can use packed min/max to implement `PABSB`

and `PABSW`

:

```
// SSE2 compatible PABSB implementation.
static inline __m128i sse2_pabsb(const __m128i& x) {
__m128i zero = _mm_setzero_si128();
return _mm_min_epu8(x, _mm_sub_epi8(zero, x));
}
// SSE2 compatible PABSW implementation.
static inline __m128i sse2_pabsw(const __m128i& x) {
__m128i zero = _mm_setzero_si128();
return _mm_max_epi16(x, _mm_sub_epi16(zero, x));
}
```

The PABSW implementation is straightforward and I have nothing to add, however, `PABSB`

implementation is interesting as it workarounds the missing `PMAXSB`

instruction (which was introduced in SSE4.1) by using `PMINUB`

instead, which works for us based on the knowledge about both inputs - selecting the minimum unsigned value is the same as selecting the maximum signed value in our case.

64-bit packed absolute value is trickier as there is no `PSRAQ`

instruction in SSE2 (`VPSRAQ`

was first introduced in AVX512-F), however, we can shuffle the input a bit and use `PSRAD`

again:

```
// SSE2 compatible PABSQ implementation.
static inline __m128i sse2_pabsq(const __m128i& x) {
__m128i msk = _mm_srai_epi32(_mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 1, 1)), 31);
return _mm_sub_epi64(_mm_xor_si128(x, msk), msk);
}
```

These instructions were introduced by SSE4.1 extensions, which adds the remaining min/max functions that were missing in the baseline SSE2, which only contains signed versions (`PMINSW`

and `PMAXSW`

).

The solution is to use `PSUBUSW`

(packed subtract with unsigned saturation) and then to subtract the result from the original value to get the packed unsigned minimum. This trick of course only works for packed uint16_t operations as SSE2 (and X86 SIMD in general) doesn't have instructions to perform packed saturated addition|subtraction of elements greater than 16 bits.

```
// SSE2 compatible PMINUW implementation.
static inline __m128i sse2_pminuw(const __m128i& x, const __m128i& y) {
return _mm_sub_epi16(x, _mm_subs_epu16(x, y));
}
```

Why it works? If we perform `a-b`

with unsigned saturation we get either zero, which means that `b`

is either equal or greater than `a`

, or some non-zero value, which means that `a`

is greater than `b`

, and the value is their difference (unsigned). Based on these we can subtract that value from the original `a`

and get an unsigned minimum:

```
// SSE2 compatible PMAXUW implementation.
static inline __m128i sse2_pmaxuw(const __m128i& x, const __m128i& y) {
return _mm_add_epi16(_mm_subs_epu16(x, y), y);
}
```

`VPSRAQ`

instruction was introduced by AVX-512. This means that both SSE2 and AVX2 code has to do without. However, it's easy to implement this instruction by first using logical shift right and then combining the result of that instruction with extracted signs from the input.

```
// SSE2 compatible VPSRAQ implementation.
template<uint8_t kN>
static inline __m128i sse2_psraq(const __m128i& a) {
__m128i highs = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 3, 1, 1));
__m128i signs = _mm_srai_epi32(highs, 31);
// Fast-path - if the request is to repeat all MSB bits, just return signs.
if (kN == 63u)
return signs;
// Generic approach - shift logical right 64-bit integers and combine with signs at MSB.
__m128i msk = _mm_slli_epi64(signs, (64 - kN) & 63);
return _mm_or_si128(msk, _mm_srli_epi64(a, kN));
}
```

`PCMPEQQ`

instruction was introduced by SSE4.1, however, it could be trivially emulated by using a 32-bit comparison, shuffle, and AND operation. The idea is to perform a 32-bit comparison and then swap 32-bit elements in each 64-bit lane so the comparisons can be combined together:

```
// SSE2 compatible PCMPEQQ implementation.
static inline __m128i sse2_pcmpeqq(const __m128i& a, const __m128i& b) {
__m128i x = _mm_cmpeq_epi32(a, b);
__m128i y = _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1));
return _mm_and_si128(x, y);
}
```

`PCMPGTQ`

instruction provides greater than operation of two signed 64-bit integers. It was introduced by SSE4.2, which was the last SSE extension before AVX. Emulating this instruction is tricky and relatively slow, because it requires at least two 32-bit comparisons and additional logic to shuffle the intermediate results so they can be combined. The implementation shown here follows the answer of the "implementing PCMPGTQ" question on stack overflow, because it provides a solution that is very hard to optimize further. The idea is that a 64-bit subtraction will create a mask that can be combined with 32-bit comparisons:

```
// SSE2 compatible PCMPGTQ implementation.
//
// Based on: https://stackoverflow.com/questions/65166174/how-to-simulate-pcmpgtq-on-sse2
static inline __m128i sse2_pcmpgtq(const __m128i& a, const __m128i& b) {
__m128i msk = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
msk = _mm_or_si128(msk, _mm_cmpgt_epi32(a, b));
return _mm_shuffle_epi32(msk, _MM_SHUFFLE(3, 3, 1, 1));
}
```

The baseline SSE2 instruction set only provides packed 16-bit lane-by-lane multiplication. It also provides other instructions that can be used to implement 32-bit lane-by-lane multiplication such as `PMULUDQ`

, which multiplies two unsigned 32-bit integers and produces a 64-bit result. This instruction can be used to also implement 64-bit lane-by-lane multiplication, both shown below:

```
// SSE2 compatible PMULLD implementation.
static inline __m128i sse2_pmulld(const __m128i& a, const __m128i& b) {
__m128i hi = _mm_mul_epu32(_mm_shuffle_epi32(a, _MM_SHUFFLE(3, 3, 1, 1)),
_mm_shuffle_epi32(b, _MM_SHUFFLE(3, 3, 1, 1)));
__m128i lo = _mm_mul_epu32(a, b);
__m128i result3120 =
_mm_castps_si128(
_mm_shuffle_ps(
_mm_castsi128_ps(lo),
_mm_castsi128_ps(hi),
_MM_SHUFFLE(2, 0, 2, 0)));
return _mm_shuffle_epi32(result3120, _MM_SHUFFLE(3, 1, 2, 0));
}
// SSE2 compatible PMULLQ implementation.
//
// (NOTE: This implementation is great when used with AVX2 code (256-bit vectors),
// but it's not so great when used with 128-bit code as using general purpose 64-bit
// IMUL should be just faster than using 3 SIMD multiplications).
static inline __m128i sse2_pmullq(const __m128i& a, const __m128i& b) {
__m128i al_bh = _mm_mul_epu32(a, _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 3, 1, 1)));
__m128i ah_bl = _mm_mul_epu32(b, _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 3, 1, 1)));
__m128i al_bl = _mm_mul_epu32(a, b);
__m128i prod1 = _mm_slli_epi64(_mm_add_epi64(al_bh, ah_bl), 32);
return _mm_add_epi64(al_bl, prod1);
}
```

`PALIGNR`

instruction was introduced by SSSE3. Regardless of the name, it essentially combines two 128-bit registers into a 256-bit integer, byte shifts it to the right, and then uses the low 128-bit part as a result. Fortunately, this instruction can be easily implemented by using the existing byte-shift instructions and then combining the result:

```
// SSE2 compatible PALIGNR implementation.
template<int kN>
static inline __m128i sse2_palignr(const __m128i& a, const __m128i& b) {
__m128i a_shifted = _mm_slli_si128(a, 16u - kN);
__m128i b_shifted = _mm_srli_si128(b, kN);
return _mm_or_si128(a_shifted, b_shifted);
}
```

Floating point rounding was always a painful job on X86 targets, because of the lacking support from HW point of view. Historically, X86 offered only instructions to round by using the current rounding mode, which is in most cases assumed to be round-to-even. Round-to-even is the best option for intermediate computations (all floating computations you do would be rounded this way by default), however, it's probably the least used "explicit" rounding mode. In C/C++ usually the most used rounding function is `floor()`

, `trunc()`

, and `ceil()`

.

The following C++ code provides rounding functions of a double precision floating point that use baseline SSE2 instructions:

```
#include <stdint.h>
#include <emmintrin.h>
// SSE2 compatible blend instruction.
static inline __m128d sse2_blend(const __m128d& x, const __m128d& y, const __m128d& mask) {
return _mm_or_pd(_mm_and_pd(y, mask),
_mm_andnot_pd(mask, x));
}
// SSE2 compatible ROUNDSD (nearby) implementation.
static inline double sse2_nearby(double x) {
__m128d src = _mm_set_sd(x);
__m128d maxn = _mm_set_sd(4503599627370496.0);
__m128d magic = _mm_set_sd(6755399441055744.0);
__m128d msk = _mm_cmpnlt_sd(src, maxn);
__m128d rnd = _mm_sub_sd(_mm_add_sd(src, magic), magic);
return _mm_cvtsd_f64(sse2_blend(rnd, src, msk));
}
// SSE2 compatible ROUNDSD (trunc) implementation.
static inline double sse2_trunc(double x) {
static const uint64_t kSepMask[1] = { 0x7FFFFFFFFFFFFFFFu };
__m128d src = _mm_set_sd(x);
__m128d sep = _mm_load_sd(reinterpret_cast<const double*>(kSepMask));
__m128d src_abs = _mm_and_pd(src, sep);
__m128d sign = _mm_andnot_pd(sep, src);
__m128d maxn = _mm_set_sd(4503599627370496.0);
__m128d magic = _mm_set_sd(6755399441055744.0);
__m128d msk = _mm_or_pd(_mm_cmpnlt_sd(src_abs, maxn), sign);
__m128d rnd = _mm_sub_sd(_mm_add_sd(src_abs, magic), magic);
__m128d maybeone = _mm_and_pd(_mm_cmplt_sd(src_abs, rnd), _mm_set_sd(1.0));
return _mm_cvtsd_f64(sse2_blend(_mm_sub_sd(rnd, maybeone), src, msk));
}
// SSE2 compatible ROUNDSD (floor) implementation.
static inline double sse2_floor(double x) {
__m128d src = _mm_set_sd(x);
__m128d maxn = _mm_set_sd(4503599627370496.0);
__m128d magic = _mm_set_sd(6755399441055744.0);
__m128d msk = _mm_cmpnlt_sd(src, maxn);
__m128d rnd = _mm_sub_sd(_mm_add_sd(src, magic), magic);
__m128d maybeone = _mm_and_pd(_mm_cmplt_sd(src, rnd), _mm_set_sd(1.0));
return _mm_cvtsd_f64(sse2_blend(_mm_sub_sd(rnd, maybeone), src, msk));
}
// SSE2 compatible ROUNDSD (ceil) implementation.
static inline double sse2_ceil(double x) {
__m128d src = _mm_set_sd(x);
__m128d maxn = _mm_set_sd(4503599627370496.0);
__m128d magic = _mm_set_sd(6755399441055744.0);
__m128d msk = _mm_cmpnlt_sd(src, maxn);
__m128d rnd = _mm_sub_sd(_mm_add_sd(src, magic), magic);
__m128d maybeone = _mm_and_pd(_mm_cmpnle_sd(src, rnd), _mm_set_sd(1.0));
return _mm_cvtsd_f64(sse2_blend(_mm_add_sd(rnd, maybeone), src, msk));
}
```

The code can actually be ported to use 2 double precision floating points instead of a scalar and thus to round 2 values at once.

Today's X86 CPUs will most likely all have SSSE3, SSE4.1, and SSE4.2 extensions, so you won't need these tricks today. However, they could still be handly if you target a baseline SSE2 ISA or some totally different ISA that provides unbalanced SIMD instructions. In addition, if you will find similar tricks in portable SIMD libraries, that all provide a good baseline of operations without having to worry whether it's native or emulated.