AVX512 mask registers support in compilers

Author:Wojciech Muła
Added on:2018-05-18
Update on:2018-05-22 (I was utterly wrong)

Introduction

AVX512 introduced the set of 64-bit mask registers, called in assembler k0 ... k7. A mask can be used to:

The latter is also useful, as there is instruction ktest that updates the flags register, EFLAGS. Prior to AVX512 an extra instruction — like pmovmskb or ptest (SSE 4.1) — has to be used in order to alter control flow based on vectors content.

There are four variants of ktest kx, ky that operates on 8, 16, 32 or 64 bits of mask registers, but basically they perform the same operation:

ZF := (kx AND ky) == 0
CF := (kx AND NOT ky) == 0

2018-05-22 update: unfortunately the instruction is not available in AVX512F; 8- and 16-bit variants are available in AVX512DQ, 32- and 64-bit in AVX512BW.

Problem

I wanted to test if any element in vector of 32-bit integers is non-zero. The intrinsics C code for this:

int anynonzero_epi32(__m512i x) {
    const __m512i   zero = _mm512_setzero_si512();
    const __mmask16 mask = _mm512_cmpneq_epi32_mask(x, zero);
    return mask != 0;
}

Below is the assembly code, which employs ktest:

vpxor      %xmm1, %xmm1, %xmm1      # xmm1 := 0
vpcmpneqd  %zmm1, %zmm0, %k1        # k1   := (xmm0 != xmm1)
xor        %eax, %eax               # eax  := 0
ktestw     %k1, %k1                 # ZF   := (k1 == 0) --- 1 if all are zero
setne      %al                      # eax  := not ZF

Compiler output update

When I compiled the above program with wrong flag -mavx512f obviously none of GCC 8.1 and Clang 6.0.0 emit ktest; they generated code like this:

vpxor xmm1, xmm1, xmm1
vpcmpd k1, zmm0, zmm1, 4
kmovw eax, k1
test ax, ax
setne al
movzx eax, al

With the proper flag -mavx512dq both GCC and Clang emit ktestw. GCC 7.3.0 and Clang 3.9.8 were the last versions that didn't support this instruction.

I still can't force ICC to generate ktestw; with argument -xCORE-AVX512 the version 18.0 emits:

xor eax, eax
vptestmd k0, zmm0, zmm0
kmovw edx, k0
test edx, edx
seta al

ICC does one thing better — it replaces the pair vpxor/vpcmpneqd with vptestmd instruction. I filled GCC bug to add this optimization.