Vectors and SIMD
1. Roofline Model
Arithmetic Intensity (AI) is defined as the ratio of floating-point operations performed to bytes of memory accessed (flop/byte). Sparse matrix kernels (SpMV) have a low arithmetic intensity, whereas dense matrix kernels (DGEMM, BLAS3) have a high arithmetic intensity.
The roofline model helps visualize the performance limits of a system based on its peak computational throughput and memory bandwidth. The model consists of two main regions:
- Memory-Bound Region: Performance is limited by memory bandwidth. This occurs when the arithmetic intensity is low.
- Compute-Bound Region: Performance is limited by the processor's peak computational throughput. This occurs when the arithmetic intensity is high.
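The two regions meet at the ridge point, where both limits coincide: attainable FLOP/s = min(peak FLOP/s, AI × peak memory bandwidth). As a worked example, the vector add C[i] = A[i] + B[i] used throughout these notes performs 1 flop per 12 bytes moved (two 4-byte loads and one 4-byte store), so AI = 1/12 ≈ 0.08 flop/byte, placing it firmly in the memory-bound region on any modern machine.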
2. Vector Instruction Sets
Intel AVX-512:
- 32 vector registers (ZMM0-ZMM31), each 512 bits wide, allowing execution of instructions in parallel lanes (e.g., 16 single-precision floats per instruction).
- 8 64-bit predicate registers (K0-K7) for masking operations. (K0 is special and cannot be used for masking.)
- Rich set of instructions for operating on 512-bit operands.
AVX-512 Vector Addition Example
vaddps - Vector Add Packed Single-Precision Floating-Point Values

```
vaddps zmm1 {k1}{z}, zmm2, zmm3
```

This instruction adds packed single-precision floating-point values from zmm2 and zmm3, storing the result in zmm1. The operation is masked by k1, meaning only the elements corresponding to set bits in k1 are updated. The {z} suffix indicates that elements not selected by the mask will be zeroed in the destination register.
CPU vendors may provide header files (e.g., immintrin.h) to facilitate the use of these instructions in C/C++ code.
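As a minimal sketch, the masked, zeroing vaddps above can be written with AVX-512 intrinsics (the function name here is illustrative; requires a compiler targeting AVX-512F):

```c
#include <immintrin.h>

// Masked, zeroing add: corresponds to vaddps zmm {k}{z}, zmm, zmm.
// Lanes with a set bit in k receive a + b; all other lanes are zeroed.
__m512 masked_add(__mmask16 k, __m512 a, __m512 b) {
    return _mm512_maskz_add_ps(k, a, b);
}
```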
3. Working with a Compiler
Take:
```c
float A[1024], B[1024], C[1024];  // global arrays assumed by these examples

void add() {
    for (int i = 0; i < 1024; i++)
        C[i] = A[i] + B[i];
}
```
This will vectorise! However, doing:
```c
void add(int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```
Will not fully vectorise, as the compiler cannot guarantee that N is a multiple of the vector length. It will emit a scalar remainder loop alongside the vectorised loop to handle the leftover iterations (a sketch of this strip-mined form appears after the __restrict__ discussion below). And if:
```c
void add(int N, float *__restrict__ A,
         float *__restrict__ B,
         float *__restrict__ C) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```
With raw pointer arguments, the compiler doesn't know how the pointers are aligned, nor whether they overlap (alias). Using __restrict__ tells the compiler that the pointers do not overlap (it says nothing about alignment), allowing for better optimisation. Without it, the compiler would have to emit extra runtime checks, or a scalar fallback path, to handle potential overlap.
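As a rough sketch (illustrative, not actual compiler output) of the strip-mined form the compiler generates for unknown N, assuming an 8-wide vector unit:

```c
void add(int N, float *__restrict__ A, float *__restrict__ B,
         float *__restrict__ C) {
    int i = 0;
    for (; i + 8 <= N; i += 8)        // vectorised main loop
        for (int j = 0; j < 8; j++)   // stands in for one 8-wide vector add
            C[i + j] = A[i + j] + B[i + j];
    for (; i < N; i++)                // scalar remainder loop (< 8 iterations)
        C[i] = A[i] + B[i];
}
```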
If the compiler does not vectorise a loop on its own, you can use pragmas to assert that it is safe to do so:
```c
void add(float *c, float *a, float *b, int N) {
    #pragma ivdep  // Ignore vector dependencies (assume no loop-carried dependencies)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
```
We can also force vector instructions with OpenMP pragmas:
```c
void add(float *c, float *a, float *b, int N) {
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

// Or, function-wide:
#pragma omp declare simd
void add(float *c, float *a, float *b, int N) {
    *c = *a + *b;
}
```
We could also force vector instructions with SIMD intrinsics:
```c
#include <immintrin.h>  // SSE/AVX intrinsics

void add(float *c, float *a, float *b, int N) {
    // Note: casting to __m128* assumes 16-byte-aligned pointers
    __m128 *pSrc1 = (__m128 *) a;
    __m128 *pSrc2 = (__m128 *) b;
    __m128 *pDest = (__m128 *) c;
    for (int i = 0; i < N / 4; i++)  // 4 floats per 128-bit SSE vector
        *pDest++ = _mm_add_ps(*pSrc1++, *pSrc2++);
}
```
3.1 SIMT
Single Instruction, Multiple Threads (SIMT) is a parallel execution model in which multiple threads execute the same instruction simultaneously on different data. Each vector lane acts as a thread, so the loop body may contain arbitrary control flow, with divergent paths handled by masking. For example:
```c
#pragma omp simd
for (int i = 0; i < N; i++) {
    if (...) {...} else {...}
    for (...) {...}
    while (...) {...}
    foo(...);
}
```
3.2 Indirection
If we have:
```c
void add() {
    for (int i = 0; i < 1024; i++)
        C[i] = A[i] + B[ind[i]];
}
```
Then we need to use a gather instruction to load B[ind[i]]. This is less efficient than a normal load, as the data may not be contiguous in memory, leading to cache misses.
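A minimal sketch of this loop written with AVX-512 gather intrinsics (the function name and signature are illustrative; assumes the trip count is a multiple of 16 and the arrays do not overlap):

```c
#include <immintrin.h>

void add_gather(float *C, const float *A, const float *B, const int *ind) {
    for (int i = 0; i < 1024; i += 16) {
        __m512i vidx = _mm512_loadu_si512(ind + i);      // load 16 indices
        __m512  vb   = _mm512_i32gather_ps(vidx, B, 4);  // gather B[ind[i]]; scale 4 = sizeof(float)
        __m512  va   = _mm512_loadu_ps(A + i);           // contiguous load
        _mm512_storeu_ps(C + i, _mm512_add_ps(va, vb));
    }
}
```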
3.3 Predication
Predication allows conditional execution of instructions based on a mask. For example:
```c
void add() {
    for (int i = 0; i < 1024; i++) {
        if (A[i] > 0)
            C[i] = A[i] + B[i];
    }
}
```
This will now use the predicate registers to only perform the addition where A[i] > 0, avoiding branches that can disrupt vectorisation.
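A hedged sketch of this predicated form using AVX-512 intrinsics (illustrative, not actual compiler output):

```c
#include <immintrin.h>

void add_predicated(float *C, const float *A, const float *B) {
    for (int i = 0; i < 1024; i += 16) {
        __m512 va = _mm512_loadu_ps(A + i);
        __m512 vb = _mm512_loadu_ps(B + i);
        // One mask bit per lane, set where A[i] > 0
        __mmask16 k = _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_GT_OQ);
        // Store only the lanes selected by the mask; other lanes of C are untouched
        _mm512_mask_storeu_ps(C + i, k, _mm512_add_ps(va, vb));
    }
}
```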
4. Executing Vector Instructions
We may need to break the vector operations into chunks that fit the hardware-supported vector length:
- Vector pipelining executes vector instructions serially on a simple, static pipeline. Each word is forwarded to the next functional unit (FU) as soon as it is ready; the FUs form a long pipelined chain.
- UOP decomposition breaks complex vector instructions down into multiple micro-operations (uops) that can be executed in parallel across multiple execution units. This is used when the architectural vector length exceeds the hardware vector length: each N-wide vector instruction is split into N/L uops operating on L-wide vectors, where L is the hardware vector length (for example, a 512-bit operation on a core with 256-bit execution units becomes two 256-bit uops). The uops are committed together.
TODO: Better notes on this section