Vectors and SIMD
1. Roofline Model
Arithmetic Intensity (AI) is defined as the ratio of floating-point operations performed to bytes of memory accessed (flop/byte). Sparse matrix kernels (SpMV) have a low arithmetic intensity, whereas dense matrix kernels (DGEMM, BLAS3) have a high arithmetic intensity.
The roofline model helps visualize the performance limits of a system based on its peak computational throughput and memory bandwidth. The model consists of two main regions:
- Memory-Bound Region: Performance is limited by memory bandwidth. This occurs when the arithmetic intensity is low.
- Compute-Bound Region: Performance is limited by the processor's peak computational throughput. This occurs when the arithmetic intensity is high.
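The two regions meet at the ridge point, where both limits coincide: attainable FLOP/s = min(peak FLOP/s, AI × peak memory bandwidth). As a worked example, the vector add C[i] = A[i] + B[i] used throughout these notes performs 1 flop per 12 bytes moved (two 4-byte loads and one 4-byte store), so AI = 1/12 ≈ 0.08 flop/byte, placing it firmly in the memory-bound region on any modern machine.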
2. Vector Instruction Sets
Intel AVX-512:
- 32 vector registers (ZMM0-ZMM31), each 512 bits wide, allowing execution of instructions in parallel lanes (e.g., 16 single-precision floats per instruction).
- 8 64-bit predicate registers (K0-K7) for masking operations. (K0 is special and cannot be used for masking.)
- Rich set of instructions for operating on 512-bit operands.
AVX-512 Vector Addition Example
vaddps - Vector Add Packed Single-Precision Floating-Point Values

```
vaddps zmm1 {k1}{z}, zmm2, zmm3
```

This instruction adds packed single-precision floating-point values from zmm2 and zmm3, storing the result in zmm1. The operation is masked by k1, meaning only the elements corresponding to set bits in k1 are updated. The {z} suffix indicates that elements not selected by the mask will be zeroed in the destination register.
CPU vendors may provide header files (e.g., immintrin.h) to facilitate the use of these instructions in C/C++ code.
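As a minimal sketch, the masked, zeroing vaddps above can be written with AVX-512 intrinsics (the function name here is illustrative; requires a compiler targeting AVX-512F):

```c
#include <immintrin.h>

// Masked, zeroing add: corresponds to vaddps zmm {k}{z}, zmm, zmm.
// Lanes with a set bit in k receive a + b; all other lanes are zeroed.
__m512 masked_add(__mmask16 k, __m512 a, __m512 b) {
    return _mm512_maskz_add_ps(k, a, b);
}
```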
3. Working with a Compiler
Take:
```c
float A[1024], B[1024], C[1024];  // global arrays assumed by these examples

void add() {
    for (int i = 0; i < 1024; i++)
        C[i] = A[i] + B[i];
}
```
This will vectorise! However, doing:
```c
void add(int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```
Will not fully vectorise, as the compiler cannot guarantee that N is a multiple of the vector length. It will emit a scalar remainder loop alongside the vectorised loop to handle the leftover iterations (a sketch of this strip-mined form appears after the __restrict__ discussion below). And if:
```c
void add(int N, float *__restrict__ A,
         float *__restrict__ B,
         float *__restrict__ C) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```
With raw pointer arguments, the compiler doesn't know how the pointers are aligned, nor whether they overlap (alias). Using __restrict__ tells the compiler that the pointers do not overlap (it says nothing about alignment), allowing for better optimisation. Without it, the compiler would have to emit extra runtime checks, or a scalar fallback path, to handle potential overlap.
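As a rough sketch (illustrative, not actual compiler output) of the strip-mined form the compiler generates for unknown N, assuming an 8-wide vector unit:

```c
void add(int N, float *__restrict__ A, float *__restrict__ B,
         float *__restrict__ C) {
    int i = 0;
    for (; i + 8 <= N; i += 8)        // vectorised main loop
        for (int j = 0; j < 8; j++)   // stands in for one 8-wide vector add
            C[i + j] = A[i + j] + B[i + j];
    for (; i < N; i++)                // scalar remainder loop (< 8 iterations)
        C[i] = A[i] + B[i];
}
```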
If the compiler does not vectorise a loop on its own, you can use pragmas to assert that it is safe to do so:
```c
void add(float *c, float *a, float *b, int N) {
    #pragma ivdep  // Ignore vector dependencies (assume no loop-carried dependencies)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
```
We can also force vector instructions with OpenMP pragmas:
```c
void add(float *c, float *a, float *b, int N) {
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

// Or, function-wide:
#pragma omp declare simd
void add(float *c, float *a, float *b, int N) {
    *c = *a + *b;
}
```
We could also force vector instructions with SIMD intrinsics:
```c
#include <immintrin.h>  // SSE/AVX intrinsics

void add(float *c, float *a, float *b, int N) {
    // Note: casting to __m128* assumes 16-byte-aligned pointers
    __m128 *pSrc1 = (__m128 *) a;
    __m128 *pSrc2 = (__m128 *) b;
    __m128 *pDest = (__m128 *) c;
    for (int i = 0; i < N / 4; i++)  // 4 floats per 128-bit SSE vector
        *pDest++ = _mm_add_ps(*pSrc1++, *pSrc2++);
}
```
3.1 SIMT
Single Instruction, Multiple Threads (SIMT) is a parallel execution model in which multiple threads execute the same instruction simultaneously on different data. Each vector lane acts as a thread, so the loop body may contain arbitrary control flow, with divergent paths handled by masking. For example:
```c
#pragma omp simd
for (int i = 0; i < N; i++) {
    if (...) {...} else {...}
    for (...) {...}
    while (...) {...}
    foo(...);
}
```
3.2 Indirection
If we have:
```c
void add() {
    for (int i = 0; i < 1024; i++)
        C[i] = A[i] + B[ind[i]];
}
```
Then we need to use a gather instruction to load B[ind[i]]. This is less efficient than a normal load, as the data may not be contiguous in memory, leading to cache misses.
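A minimal sketch of this loop written with AVX-512 gather intrinsics (the function name and signature are illustrative; assumes the trip count is a multiple of 16 and the arrays do not overlap):

```c
#include <immintrin.h>

void add_gather(float *C, const float *A, const float *B, const int *ind) {
    for (int i = 0; i < 1024; i += 16) {
        __m512i vidx = _mm512_loadu_si512(ind + i);      // load 16 indices
        __m512  vb   = _mm512_i32gather_ps(vidx, B, 4);  // gather B[ind[i]]; scale 4 = sizeof(float)
        __m512  va   = _mm512_loadu_ps(A + i);           // contiguous load
        _mm512_storeu_ps(C + i, _mm512_add_ps(va, vb));
    }
}
```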
3.3 Predication
Predication allows conditional execution of instructions based on a mask. For example:
```c
void add() {
    for (int i = 0; i < 1024; i++) {
        if (A[i] > 0)
            C[i] = A[i] + B[i];
    }
}
```
This will now use the predicate registers to only perform the addition where A[i] > 0, avoiding branches that can disrupt vectorisation.
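A hedged sketch of this predicated form using AVX-512 intrinsics (illustrative, not actual compiler output):

```c
#include <immintrin.h>

void add_predicated(float *C, const float *A, const float *B) {
    for (int i = 0; i < 1024; i += 16) {
        __m512 va = _mm512_loadu_ps(A + i);
        __m512 vb = _mm512_loadu_ps(B + i);
        // One mask bit per lane, set where A[i] > 0
        __mmask16 k = _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_GT_OQ);
        // Store only the lanes selected by the mask; other lanes of C are untouched
        _mm512_mask_storeu_ps(C + i, k, _mm512_add_ps(va, vb));
    }
}
```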
4. Executing Vector Instructions
We may need to break the vector operations into chunks that fit the hardware-supported vector length:
- Vector pipelining executes vector instructions serially on a simple, static pipeline. Each word is forwarded to the next functional unit (FU) as soon as it is ready; the FUs form a long pipelined chain.
- UOP decomposition breaks complex vector instructions down into multiple micro-operations (uops) that can be executed in parallel across multiple execution units. This is used when the architectural vector length exceeds the hardware vector length: each N-wide vector instruction is split into N/L uops operating on L-wide vectors, where L is the hardware vector length (for example, a 512-bit operation on a core with 256-bit execution units becomes two 256-bit uops). The uops are committed together.
TODO: Better notes on this section