SSE: modify 32bpp images with lookup tables

Author: Wojciech Muła
Added on: 2008-06-01
Updated on: 2016-03-04 (+link to github)

32bpp pixels have four components: red, green, blue and an alpha channel. One lookup table per component is needed; each table element is 4 bytes wide, so the looked-up values can be combined with a simple bitwise or:

transformed_pixel := LUT_R[R] or LUT_G[G] or LUT_B[B] or LUT_A[A]

Or without alpha channel:

transformed_pixel := LUT_R[R] or LUT_G[G] or LUT_B[B]
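The formula can be sketched in plain C as below. This is only an illustration, not the code from the sample program: the names pixel_luts, init_invert_luts, transform_pixel and invert_demo are hypothetical, the inverting tables are just an example transformation, and the byte layout (R in bits 0-7, G in 8-15, B in 16-23, A in 24-31) is an assumption that matches the register usage in the x86 listing.

```c
#include <assert.h>
#include <stdint.h>

/* One 256-entry table per channel. Every entry is a full 32-bit value
   that already sits at its final position inside the output pixel,
   so the four lookups can be merged with a bitwise or. */
typedef struct {
    uint32_t r[256], g[256], b[256], a[256];
} pixel_luts;   /* hypothetical name */

/* Example tables (an assumption for illustration): invert R, G and B,
   keep alpha unchanged. */
void init_invert_luts(pixel_luts *t)
{
    for (uint32_t i = 0; i < 256; i++) {
        t->r[i] = (255 - i);
        t->g[i] = (255 - i) << 8;
        t->b[i] = (255 - i) << 16;
        t->a[i] = i << 24;
    }
}

/* transformed_pixel := LUT_R[R] or LUT_G[G] or LUT_B[B] or LUT_A[A] */
uint32_t transform_pixel(const pixel_luts *t, uint32_t pixel)
{
    return t->r[pixel & 0xff]
         | t->g[(pixel >> 8) & 0xff]
         | t->b[(pixel >> 16) & 0xff]
         | t->a[pixel >> 24];
}

/* Usage: invert the colour channels of a single pixel. */
void invert_demo(void)
{
    pixel_luts t;
    init_invert_luts(&t);
    /* R=0x30, G=0x20, B=0x10 are inverted, A=0x80 is kept */
    assert(transform_pixel(&t, 0x80102030u) == 0x80EFDFCFu);
}
```

Because each table entry is pre-shifted, the per-pixel work is four loads and three ors, with no shifts on the output side.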

I did some tests with SSE2 and SSE4 instructions, used to minimize the number of memory references: a single XMM load reads 16 bytes. The main problem is how to extract bytes or double words from a selected position of an XMM register.

x86 code

The x86 code is a base for further improvements. Once a pixel is loaded into a general-purpose register, the following code extracts all RGBA components and performs the lookups:

movl  (%%esi), %%eax    ; eax - pixel

movzbl  %%al, %%ebx     ; R
movzbl  %%ah, %%ecx     ; G
shrl     $16, %%eax
movzbl  %%al, %%edx     ; B
movzbl  %%ah, %%eax     ; A

movl    LUT_R(,%%ebx,4), %%ebx
orl     LUT_G(,%%ecx,4), %%ebx
orl     LUT_B(,%%edx,4), %%ebx
orl     LUT_A(,%%eax,4), %%ebx ; ebx - transformed_pixel

movl    %%ebx, (%%edi)

Code that works with RGB pixels is of course shorter:

movl  (%%esi), %%eax    ; eax - pixel

movzbl  %%al, %%ebx     ; R
movzbl  %%ah, %%ecx     ; G
shrl     $16, %%eax
movzbl  %%al, %%edx     ; B

movl    LUT_R(,%%ebx,4), %%ebx
orl     LUT_G(,%%ecx,4), %%ebx
orl     LUT_B(,%%edx,4), %%ebx ; ebx - transformed_pixel

movl    %%ebx, (%%edi)

SSE2 code

The SSE2 code uses the same scheme as the x86 code, but it fetches 4 pixels at once and loads eax from an XMM register with a MOVD instruction. Since MOVD moves only the lowest dword, the remaining dwords have to be brought to that position first; the PSHUFD instruction is used to do this.
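A minimal sketch of this scheme with C intrinsics follows. It is not the assembly from the sample program: the function name transform_sse2 and its parameters are hypothetical, the count is assumed to be a multiple of 4, and the exact PSHUFD pattern (a rotation by one dword) is my choice of illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Transform `count` pixels, four per iteration.
   lut_r..lut_a are the 256-entry per-channel tables (hypothetical names). */
void transform_sse2(const uint32_t *src, uint32_t *dst, size_t count,
                    const uint32_t *lut_r, const uint32_t *lut_g,
                    const uint32_t *lut_b, const uint32_t *lut_a)
{
    assert(count % 4 == 0);  /* assumption: no tail handling in this sketch */

    for (size_t i = 0; i < count; i += 4) {
        /* a single 16-byte load fetches four pixels */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));

        for (int k = 0; k < 4; k++) {
            /* MOVD: copy the lowest dword to a general-purpose register */
            uint32_t p = (uint32_t)_mm_cvtsi128_si32(v);

            dst[i + k] = lut_r[p & 0xff]
                       | lut_g[(p >> 8) & 0xff]
                       | lut_b[(p >> 16) & 0xff]
                       | lut_a[p >> 24];

            /* PSHUFD: rotate dwords so the next pixel becomes the lowest */
            v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 3, 2, 1));
        }
    }
}
```

Only the pixel fetch goes through an XMM register here; the table lookups and the store remain scalar, as in the x86 code.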

SSE4 code

SSE4 (SSE4.1) introduced the instructions PEXTRB, PEXTRD and PEXTRQ. The element index is an immediate encoded in the instruction, the destination is a register or a memory location, and the extracted byte/dword/qword is zero-extended. The opposite operation is performed by the PINSRx instructions. These instructions seem perfect: they do exactly what an SSE-assisted lookup needs.

PEXTRx/PINSRx have a throughput of one cycle, but their latency is long: five cycles. I think it is possible to hide this latency, but not in 32-bit code, where just 5 registers are free, because 3 are taken by the two pointers and the loop counter; 64-bit mode gives 8 extra registers.
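One way to use these instructions can be sketched with intrinsics: _mm_extract_epi8 maps to PEXTRB and _mm_insert_epi32 to PINSRD. This is an assumption about how such a loop could look, not the code that was measured; transform_sse4 and LOOKUP_PIXEL are hypothetical names, the count is assumed to be a multiple of 4, and the target attribute is a GCC/Clang feature used here so the sketch builds without -msse4.1.

```c
#include <assert.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics (PEXTRB, PINSRD) */
#include <stddef.h>
#include <stdint.h>

/* Look up pixel k (0..3) of `in` and insert the result as dword k of `out`.
   A macro is used because both intrinsics need the element index to be
   a compile-time constant. */
#define LOOKUP_PIXEL(out, in, k)                                   \
    _mm_insert_epi32((out),                                        \
        (int)(lut_r[_mm_extract_epi8((in), 4 * (k) + 0)] |         \
              lut_g[_mm_extract_epi8((in), 4 * (k) + 1)] |         \
              lut_b[_mm_extract_epi8((in), 4 * (k) + 2)] |         \
              lut_a[_mm_extract_epi8((in), 4 * (k) + 3)]), (k))

#if defined(__GNUC__) && !defined(__SSE4_1__)
__attribute__((target("sse4.1")))
#endif
void transform_sse4(const uint32_t *src, uint32_t *dst, size_t count,
                    const uint32_t *lut_r, const uint32_t *lut_g,
                    const uint32_t *lut_b, const uint32_t *lut_a)
{
    assert(count % 4 == 0);  /* assumption: no tail handling in this sketch */

    for (size_t i = 0; i < count; i += 4) {
        __m128i in  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i out = _mm_setzero_si128();

        /* PEXTRB zero-extends each byte, so it can index a table directly */
        out = LOOKUP_PIXEL(out, in, 0);
        out = LOOKUP_PIXEL(out, in, 1);
        out = LOOKUP_PIXEL(out, in, 2);
        out = LOOKUP_PIXEL(out, in, 3);

        /* a single 16-byte store writes four transformed pixels */
        _mm_storeu_si128((__m128i *)(dst + i), out);
    }
}
```

The four lookups are independent of each other, so in principle they can overlap, but each pixel needs four index registers at once, which is where the register shortage in 32-bit code shows up.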

Tests results

Tests were run on a Core 2 Duo @ 2.6GHz under Linux. A 1024 x 768 image was transformed 1000 times, and each test was run 10 times.

The sample program is available on GitHub and was compiled with the following options:

gcc -O3 lookup_32bpp.c -o test_rgb
gcc -O3 -DRGBA lookup_32bpp.c -o test_rgba

The function naive is a plain C implementation. GCC generated code very similar to the x86 code presented above, but added some extra instructions that slowed down the whole procedure.

The other functions are the ones described earlier.

RGBA pixels

The best variant (SSE2) is about 1.3 times faster than the naive one.

function   time [s]   speedup
naive      2.26       100%
x86        1.90       119%
SSE2       1.76       128%
SSE4       1.89       120%

RGB pixels

No observable gain.

function   time [s]   speedup
naive      1.55       100%
x86        1.57       98%
SSE2       1.53       101%
SSE4       1.54       100%