| Author: | Wojciech Muła | 
|---|---|
| Added on: | 2015-03-22 | 
| Updates: | 2020-01-11 | 
Contents
Intel's version of SIMD offers following 2-argument (binary) boolean functions: and, or, xor, and not. There isn't a single argument not, this function can be expressed with xor reg, ones, however it requires additional, pre-set register.
AVX512F will come with a very interesting instruction called vpternlog. There are two variants of the instruction operating on a packed 32-bit (vpternlogd) or a 64-bit vector (vpternlogq), however they do exactly the same thing — evaluate a 3-argument (ternary) boolean function on each bit of arguments, the function is given as a truth table.
The pattern of a truth table:
| inputs | result | ||
|---|---|---|---|
| A | B | C | |
| 0 | 0 | 0 | a | 
| 0 | 0 | 1 | b | 
| 0 | 1 | 0 | c | 
| 0 | 1 | 1 | d | 
| 1 | 0 | 0 | e | 
| 1 | 0 | 1 | f | 
| 1 | 1 | 0 | g | 
| 1 | 1 | 1 | h | 
A programmer supplies only the result column, i.e. defines values of bits a through h, this is a single 8-bit value.
Depending on function complexity, a single vpternlog instruction can replace from one up to eight SIMD instructions.
According to Agner Fog's documentation on SkylakeX vpternlog has 1 cycle latency and 0,5 cycle reciprocal throughput (there are two execution units able to handle the instruction). It's pretty fast, though.
Ternary logic function is available as the intrinsic function _mm512_ternarylogic_epi32(a, b, c, imm8), where the argument a carries most significant bits, and c least significant bits.
A ternary function may be seen as a select function, i.e. the most significant bit A select one of binary functions of B and C. The function for A=0 is described by bits a..d, the function for A=1 by bits e..h.
As @solardiz noted this function is available in OpenCL as bitselect(), AMD XOP has instruction VPCMOV, AltiVec has VSEL, NEON has VBSL, also several GPUs support such instruction.
Example A ? B : C.
| inputs | result | ||
|---|---|---|---|
| A | B | C | |
| 0 | 0 | 0 | 0 | 
| 0 | 0 | 1 | 1 | 
| 0 | 1 | 0 | 0 | 
| 0 | 1 | 1 | 1 | 
| 1 | 0 | 0 | 0 | 
| 1 | 0 | 1 | 0 | 
| 1 | 1 | 0 | 1 | 
| 1 | 1 | 1 | 1 | 
| constant | 0xca | ||
The assembly code using binary functions:
; A - xmm0 and result ; B - xmm1 ; C - xmm2 pand xmm1, xmm0 pandn xmm0, xmm2 por xmm0, xmm1
Let see for example function (A or not B) and C. The truth table:
| inputs | result | ||
|---|---|---|---|
| A | B | C | |
| 0 | 0 | 0 | 0 | 
| 0 | 0 | 1 | 1 | 
| 0 | 1 | 0 | 0 | 
| 0 | 1 | 1 | 0 | 
| 1 | 0 | 0 | 0 | 
| 1 | 0 | 1 | 1 | 
| 1 | 1 | 0 | 0 | 
| 1 | 1 | 1 | 1 | 
To express the function using standard SSE instructions we need three operations:
; A - xmm0 and result ; B - xmm1 ; C - xmm2 pcmpeqb xmm4, xmm4 // constant, could exist earlier pxor xmm0, xmm4 // A1 := not A por xmm0, xmm1 // AB := A or B1 pand xmm0, xmm2 // result := AB and C
With AVX512 it would be very simple:
; a - zmm0 and result ; b - zmm1 ; c - zmm2 vpternlog zmm0, zmm1, zmm2, 0xa2 // 0xa2 = 0b10100010
| inputs | OR | AND | XOR | ||
|---|---|---|---|---|---|
| A | B | C | |||
| 0 | 0 | 0 | 0 | 0 | 0 | 
| 0 | 0 | 1 | 1 | 0 | 1 | 
| 0 | 1 | 0 | 1 | 0 | 1 | 
| 0 | 1 | 1 | 1 | 0 | 0 | 
| 1 | 0 | 0 | 1 | 0 | 1 | 
| 1 | 0 | 1 | 1 | 0 | 0 | 
| 1 | 1 | 0 | 1 | 0 | 0 | 
| 1 | 1 | 1 | 1 | 1 | 1 | 
| constant | 0xfe | 0x80 | 0x96 | ||
The assembly code of or all using binary functions:
; A - xmm0 and result ; B - xmm1 ; C - xmm2 por xmm0, xmm1 por xmm0, xmm2
Function is true when only one bit is set.
| inputs | result | ||
|---|---|---|---|
| C | B | A | |
| 0 | 0 | 0 | 0 | 
| 0 | 0 | 1 | 1 | 
| 0 | 1 | 0 | 1 | 
| 0 | 1 | 1 | 0 | 
| 1 | 0 | 0 | 1 | 
| 1 | 0 | 1 | 0 | 
| 1 | 1 | 0 | 0 | 
| 1 | 1 | 1 | 0 | 
| constant | 0x16 | ||
The C code using binary functions:
// t1 = ~(A | B) & C __m512i t1 = _mm512i_andnot_si512(_mm512_or_si512(A, B), C); // t2 = ~C & (A ^ B) __m512i t2 = _mm512i_andnot_si512(C, _mm512_xor_si512(A, B)); __m512i result = _mm512_or_si512(t1, t2);
The procedure uses five instructions.
Function is true exactly two bits are set.
| inputs | result | ||
|---|---|---|---|
| C | B | A | |
| 0 | 0 | 0 | 0 | 
| 0 | 0 | 1 | 0 | 
| 0 | 1 | 0 | 0 | 
| 0 | 1 | 1 | 1 | 
| 1 | 0 | 0 | 0 | 
| 1 | 0 | 1 | 1 | 
| 1 | 1 | 0 | 1 | 
| 1 | 1 | 1 | 0 | 
| constant | 0x68 | ||
The C code using binary functions:
// t1 = ~C & (A & B) __m512i t1 = _mm512i_andnot_si512(C, _mm512_and_si512(A, B)); // t2 = C & (A ^ B) __m512i t2 = _mm512i_and_si512(C, _mm512_xor_si512(A, B)); __m512i result = _mm512_or_si512(t1, t2);
The procedure uses five instructions.
Popular hashing algorithms use several three-argument logic functions in their main loops. Although the functions might seem complicated, they are expressed with single vpternlog instruction.
MD5 hashing defines four state changing functions (the order of variables from the most significant bit: x, y, z).
| expression | vpternlogd const | 
|---|---|
| (x and y) or (not x and z) | 0xca | 
| (x and z) or (y and not z) | 0xe4 | 
| x xor y xor z | 0x96 | 
| y xor (x or not z) | 0x39 | 
Similarly, SHA-1 uses three ternary logic functions in its main loop (the order of variables from the most significant bit: b, c, d).
| expression | vpternlogd const | 
|---|---|
| (b and c) or (not b and d) | 0xca | 
| b xor c xor d | 0x96 | 
| (b and c) or (b and d) or (c and d) | 0xe8 | 
Also SHA-2 uses ternary logic functions in its main loop. I omitted a three-argument xor, which is shown above (the order of variables from the most significant bit: e, f, g; and a, b, c).
| expression | vpternlogd const | 
|---|---|
| (e and f) xor (not e and g) | 0xca | 
| (a and b) xor (a and c) xor (b and c) | 0xe8 | 
The crucial function of Harley-Seal population count algorithm is carry-save adder. That function calculates two bits of 3-argument sum, i.e. the digit and the carry flag.
| inputs | outputs | |||
|---|---|---|---|---|
| C | B | A | digit | carry | 
| 0 | 0 | 0 | 0 | 0 | 
| 0 | 0 | 1 | 1 | 0 | 
| 0 | 1 | 0 | 1 | 0 | 
| 0 | 1 | 1 | 0 | 1 | 
| 1 | 0 | 0 | 1 | 0 | 
| 1 | 0 | 1 | 0 | 1 | 
| 1 | 1 | 0 | 0 | 1 | 
| 1 | 1 | 1 | 1 | 1 | 
The fastest implementation of above functions contains five logic instructions:
tmp = a ^ b carry = (a & b) | (tmp & c) digit = tmp ^ c;
With help of vpternlogd this can be calculated as:
l = _mm512_ternarylogic_epi32(c, b, a, 0x96); // 1001_0110 h = _mm512_ternarylogic_epi32(c, b, a, 0xe8); // 1110_1000