Vectors and SIMD

1. Roofline Model

Arithmetic Intensity is defined as the ratio of floating-point operations to bytes of memory accessed. Sparse matrix kernels (SpMV) have a low arithmetic intensity, whereas dense matrix kernels (DGEMM, BLAS3) have a high arithmetic intensity.

The roofline model helps visualize the performance limits of a system based on its peak computational throughput and memory bandwidth. Plotting attainable performance against arithmetic intensity gives two main regions: a memory-bound region, where performance is capped by bandwidth times intensity, and a compute-bound region, where performance is capped by peak throughput.

2. Vector Instruction Sets

Intel AVX-512:

AVX-512 Vector Addition Example

vaddps - Vector Add Packed Single-Precision Floating-Point Values

vaddps zmm1 {k1}{z}, zmm2, zmm3

This instruction adds packed single-precision floating-point values from zmm2 and zmm3, storing the result in zmm1. The operation is masked by k1, meaning only the elements corresponding to set bits in k1 are updated. The {z} suffix indicates that elements not selected by the mask will be zeroed in the destination register.


CPU vendors may provide header files (e.g., immintrin.h) to facilitate the use of these instructions in C/C++ code.

3. Working with a Compiler

Take:

```c
// A, B, C: global float arrays of length 1024
void add() {
  for (int i = 0; i < 1024; i++)
    C[i] = A[i] + B[i];
}
```

This will vectorise! However, doing:

```c
void add(int N) {
  for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];
}
```

will not fully vectorise, because the compiler cannot guarantee that N is a multiple of the vector length. It emits a scalar remainder loop alongside the vectorised loop to handle the leftover iterations. And if:

```c
void add(int N, float *__restrict__ A,
         float *__restrict__ B,
         float *__restrict__ C) {
  for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];
}
```

With plain pointers, the compiler cannot prove that they don't overlap (alias). Using __restrict__ promises the compiler that they do not, allowing for better optimisation. Without it, the compiler would have to emit extra runtime overlap checks, or fall back to scalar code, to handle potential aliasing. (Separately, the compiler also cannot assume the pointers are aligned to the vector width, which may cost peel loops or unaligned loads.)

If the compiler does not vectorise a loop, you can use pragmas to force it:

```c
void add(float *c, float *a, float *b, int N) {
  #pragma ivdep  // Ignore vector dependencies (assume no loop-carried dependencies)
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
}
```

We can also force vector instructions with OpenMP pragmas:

```c
void add(float *c, float *a, float *b, int N) {
  #pragma omp simd
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
}

// Or, function-wide:
#pragma omp declare simd
void add(float *c, float *a, float *b) {
  *c = *a + *b;
}
```

We could also force vector instructions with SIMD intrinsics:

```c
#include <xmmintrin.h>  // SSE intrinsics

// Assumes N is a multiple of 4 and a, b, c are 16-byte aligned
void add(float *c, float *a, float *b, int N) {
  __m128 *pSrc1 = (__m128 *)a;
  __m128 *pSrc2 = (__m128 *)b;
  __m128 *pDest = (__m128 *)c;
  for (int i = 0; i < N / 4; i++)
    *pDest++ = _mm_add_ps(*pSrc1++, *pSrc2++);
}
```

3.1 SIMT

Single Instruction Multiple Threads (SIMT) is a parallel computing architecture in which multiple threads execute the same instruction simultaneously on different data. Conceptually, each vector lane behaves like a thread, which lets lanes follow different control-flow paths via masking, for example:

```c
#pragma omp simd
for (int i = 0; i < N; i++) {
  if (...) {...} else {...}  // lanes may diverge
  for (...) {...}
  while (...) {...}
  foo(...);
}
```

3.2 Indirection

If we have:

```c
// A, B, C: global float arrays; ind: global index array
void add() {
  for (int i = 0; i < 1024; i++)
    C[i] = A[i] + B[ind[i]];
}
```

Then we need to use a gather instruction to load B[ind[i]]. This is less efficient than a normal load, as the data may not be contiguous in memory, leading to cache misses.

3.3 Predication

Predication allows conditional execution of instructions based on a mask. For example:

```c
void add() {
  for (int i = 0; i < 1024; i++) {
    if (A[i] > 0)
      C[i] = A[i] + B[i];
  }
}
```

This will now use the predicate registers to only perform the addition where A[i] > 0, avoiding branches that can disrupt vectorisation.

4. Executing Vector Instructions

We may need to break the vector operations into chunks that fit the hardware-supported vector length:

TODO: Better notes on this section
