Speeding up multiple vector operations using SIMD

Author:	Wojciech Muła
Added on:	2018-11-14

Contents

Introduction
Baseline
Optimization
- Distance function
- Multiple distances at once
Evaluation
Conclusions
Source code

Introduction

One step of k-means algorithm is calculating the distance between all centroids and all samples. Then centroids are recalculated and samples re-assigned. Centroids and also samples are vectors of fixed size.

I was curious how SIMD might help in this task (or similar ones).

Baseline

C++ code that shows the idea of this step:

// M - vector size
// K - number of centroids
// N - number of samples

// ...
using Vector = float[M];
// ...
Vector centroids[K];
float  distances[K];
Vector samples[N];
// ...

for (int i=0; i < K; i++) {
    for (int j=0; j < N; j++) {
        distances[j] = calculate_distance(M, centroids[i], samples[j]);
    }

    // somehow consume 'distances' vector
}

The distance function is often the sum of squares of differences:

float scalar_dist_vectors(size_t n, float* a, float* b) {

    float ret = 0.0f;

    for (size_t i=0; i < n; i++) {
        const float d = a[i] - b[i];
        ret += d*d;
    }

    return ret;
}

Optimization

Distance function

Since we operates on vectors, the most obvious use of SIMD instructions is to vectorize the distance procedure calculate_distance.

A decent compiler is able to auto-vectorize such a simple function; GCC and clang require flag -ffast-math to enable this. Here is the assembly code generated by GCC 7.3.1:

.L7:
        movaps  (%r10,%rax), %xmm1
        addl    $1, %r8d
        movups  (%rcx,%rax), %xmm3
        addq    $16, %rax
        cmpl    %r11d, %r8d
        subps   %xmm3, %xmm1
        mulps   %xmm1, %xmm1
        addps   %xmm1, %xmm2
        jb      .L7

Below is a hand-coded version, without processing vectors' tail.

float sse_dist_vector(size_t n, float* a, float* b) {

    __m128 ret = _mm_setzero_ps();

    for (size_t i=0; i < n; i += 4) {

        const __m128 ai = _mm_loadu_ps(a + i);
        const __m128 bi = _mm_loadu_ps(b + i);

        const __m128 d  = _mm_sub_ps(ai, bi);
        const __m128 d2 = _mm_mul_ps(d, d);

        ret = _mm_add_ps(ret, d2);
    }

    float v[4];
    _mm_storeu_ps(v, ret);

    return v[0] + v[1] + v[2] + v[3];
}

Multiple distances at once

In the inner loop the expression centroid[i] is constant. We can use this fact and calculate multiple distances at once, minimizing number of memory fetches from centroids array and also utilizing all SIMD resources.

So the algorithm becomes:

for (int i=0; i < K; i++) {
    calculate_multiple_distances(centroid[i], samples, distances);
}

And actual implementation of calculate_multiple_distances, which processes four vectors at once:

void sse_dist_vector_many(size_t n, float const_vector[], size_t k, float* in[], float out[]) {

    for (size_t j=0; j < k; j += 4) {

        float* in0 = in[j + 0]; // Note: factoring out these pointer gave pretty
        float* in1 = in[j + 1]; //       nice boost, as GCC 7.3.1 couldn't handle
        float* in2 = in[j + 2]; //       well complex addresses like `in[j + 0][i]`
        float* in3 = in[j + 3];

        __m128 out0 = _mm_setzero_ps();
        __m128 out1 = _mm_setzero_ps();
        __m128 out2 = _mm_setzero_ps();
        __m128 out3 = _mm_setzero_ps();

        for (size_t i=0; i < n; i += 4) {
            const __m128 vc = _mm_loadu_ps(const_vector + i);

            const __m128 x0 = _mm_loadu_ps(in0 + i);
            const __m128 x1 = _mm_loadu_ps(in1 + i);
            const __m128 x2 = _mm_loadu_ps(in2 + i);
            const __m128 x3 = _mm_loadu_ps(in3 + i);

            const __m128 d0   = _mm_sub_ps(x0, vc);
            const __m128 d0_2 = _mm_mul_ps(d0, d0);
            const __m128 d1   = _mm_sub_ps(x1, vc);
            const __m128 d1_2 = _mm_mul_ps(d1, d1);
            const __m128 d2   = _mm_sub_ps(x2, vc);
            const __m128 d2_2 = _mm_mul_ps(d2, d2);
            const __m128 d3   = _mm_sub_ps(x3, vc);
            const __m128 d3_2 = _mm_mul_ps(d3, d3);

            out0 = _mm_add_ps(out0, d0_2);
            out1 = _mm_add_ps(out1, d1_2);
            out2 = _mm_add_ps(out2, d2_2);
            out3 = _mm_add_ps(out3, d3_2);
        }

        float v[4];

        _mm_storeu_ps(v, out0);
        out[j + 0] = v[0] + v[1] + v[2] + v[3];

        _mm_storeu_ps(v, out1);
        out[j + 1] = v[0] + v[1] + v[2] + v[3];

        _mm_storeu_ps(v, out2);
        out[j + 2] = v[0] + v[1] + v[2] + v[3];

        _mm_storeu_ps(v, out3);
        out[j + 3] = v[0] + v[1] + v[2] + v[3];
    }
}

While in a naive algorithm we fetch numbers from the centroids table K * N times, here we do K * N/4, just 25%.

Evaluation

Tested procedures
scalar	baseline
SSE	hand-written distance function
SSE (custom)	multiple distances at once

CPU: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz

Compiler: GCC version 7.3.0 (Debian 7.3.0-16)

procedure	8 vectors		16 vectors		32 vectors		64 vectors		128 vectors		256 vectors
	avg cycles	speedup	avg cycles	speedup	avg cycles	speedup	avg cycles	speedup	avg cycles	speedup	avg cycles	speedup
vector size 64
scalar	7.722	0.77	17.460	0.67	29.995	0.78	59.743	0.78	125.639	0.82	243.951	0.80
SSE	5.918	1.00	11.660	1.00	23.435	1.00	46.609	1.00	103.063	1.00	195.983	1.00
SSE (custom)	4.316	1.37	8.262	1.41	16.068	1.46	31.796	1.47	64.827	1.59	131.814	1.49
vector size 128
scalar	6.324	0.81	12.562	0.80	27.698	0.74	56.080	0.77	112.804	0.78	205.904	0.86
SSE	5.123	1.00	10.104	1.00	20.384	1.00	43.358	1.00	87.848	1.00	176.197	1.00
SSE (custom)	4.071	1.26	7.378	1.37	14.630	1.39	29.439	1.47	59.328	1.48	118.533	1.49
vector size 256
scalar	5.673	0.86	11.791	0.84	24.314	1.26	49.156	0.84	98.680	0.89	199.404	0.87
SSE	4.870	1.00	9.846	1.00	30.579	1.00	41.315	1.00	88.000	1.00	172.773	1.00
SSE (custom)	3.577	1.36	6.997	1.41	13.973	2.19	28.017	1.47	56.665	1.55	122.198	1.41
vector size 512
scalar	5.512	0.95	11.288	0.94	23.021	0.91	46.159	0.92	96.044	0.90	193.548	0.93
SSE	5.263	1.00	10.556	1.00	21.044	1.00	42.326	1.00	86.618	1.00	179.824	1.00
SSE (custom)	3.425	1.54	6.830	1.55	13.708	1.54	27.406	1.54	58.562	1.48	123.773	1.45
vector size 1024
scalar	5.537	0.93	11.006	0.93	22.110	0.95	46.855	0.90	93.629	0.92	186.649	0.92
SSE	5.148	1.00	10.225	1.00	20.995	1.00	42.357	1.00	85.820	1.00	172.297	1.00
SSE (custom)	6.028	0.85	6.840	1.49	13.595	1.54	28.767	1.47	58.689	1.46	118.137	1.46
vector size 4096
scalar	7.279	0.93	11.265	0.95	22.857	0.95	45.792	0.93	99.772	0.95	359.419	0.98
SSE	6.805	1.00	10.667	1.00	21.734	1.00	42.647	1.00	94.667	1.00	352.802	1.00
SSE (custom)	4.504	1.51	7.500	1.42	14.748	1.47	29.902	1.43	64.656	1.46	302.095	1.17
vector size 8192
scalar	8.293	0.98	13.242	0.94	24.831	0.98	53.673	0.99	180.724	0.99	408.179	1.00
SSE	8.160	1.00	12.426	1.00	24.274	1.00	52.892	1.00	178.407	1.00	406.330	1.00
SSE (custom)	4.751	1.72	7.466	1.66	15.167	1.60	32.015	1.65	145.880	1.22	358.551	1.13

Conclusions

Results surprised me. I didn't expect that such a simple reorganization of computations might be beneficial. In this case we have speedups ranging from 1.10 to 1.60. Which is pretty nice, especially in context of k-means where this operation is the most time consuming.

Source code

Sample code is available on github.