GPUs
1. Motivation
If a workload contains thousands of threads, we never speculate (there is always work to be done). Now control is at a premium: we need to lower the cost of spawning threads as much as possible. GPUs have:
- Many simple cores.
- Many functional units, implementing a SIMD model.
- Much less cache per core; memory latency is instead hidden by thousands of threads and fast context switching.
- No sophisticated branch predictors.
2. Architecture
GPUs contain many fetch-execute processors (streaming multiprocessors, SMs). Each one uses fine-grained multithreading (FGMT) to run many warps per SM.
Warps vs Threads
A warp on a GPU is like a thread on a CPU.
A thread on a GPU is like a lane in SIMD.
Inside each SM there is:
- A Multithread Issuer (MT) which selects which warp to issue from in each cycle (FGMT).
- An explicitly-programmed scratchpad memory - different warps on the same SM can share data via this memory.
- An L1 Cache, with no cache coherency protocol.
Each chip has multiple DRAM channels, each of which has its own L2 Cache (so no cache coherency protocol is needed between SMs).
Single Instruction Multiple Threads (SIMT) is the execution model used by GPUs. Each lane is a thread.
3. CUDA
CUDA is a C extension for programming serial CPU code and parallel GPU code. A GPU kernel is a C function; each thread executes an instance of the kernel code.
- A group of threads form a thread block (can be 1D, 2D, or 3D).
- Thread blocks are grouped into a grid (can be 1D, 2D, or 3D).
- Threads in the same block can share data via shared memory (scratchpad memory).
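As a sketch of block-level sharing (an illustrative kernel, not from the notes; it assumes the array length is a multiple of the block size, 256 here): each thread stages one element in shared memory, the block synchronises, and every thread then reads a value written by a different thread.

```c
__global__ void reverse_block(float *d) {
    __shared__ float s[256];        // per-block scratchpad storage on the SM
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    s[t] = d[i];                    // each thread writes one element
    __syncthreads();                // wait until every thread in the block has written
    d[i] = s[blockDim.x - 1 - t];   // read an element written by another thread
}
```

Launched as reverse_block<<<n / 256, 256>>>(d), each block reverses its own 256-element chunk of d.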
Example: DAXPY
```c
__global__ void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    int N = 1 << 20;
    double *x, *y;
    cudaMallocManaged(&x, N * sizeof(double));  // unified memory, visible to CPU and GPU
    cudaMallocManaged(&y, N * sizeof(double));
    int blockDim = 256;           // Num threads per block
    int gridDim = N / blockDim;   // Num blocks in grid
    daxpy<<<gridDim, blockDim>>>(N, 2.0, x, y);
    cudaDeviceSynchronize();      // wait for the kernel to finish
}
```
The kernel invocation launches gridDim * blockDim threads, a total of N threads. Then:
- Each SM will contain some number of blocks. The GPU will attempt to distribute them equally.
- In each SM, each warp executes 32 threads in SIMD fashion. Each thread executes an instance of the kernel.
- Each SM is shared by many warps, and the MT will select which warp to issue in each cycle.
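A common refinement (a sketch, not part of the notes above): when N is not a multiple of the block size, the grid dimension is rounded up with a ceiling division, and the if (i < n) guard in the kernel then disables the surplus threads of the last block.

```c
int blockDim = 256;
int gridDim = (N + blockDim - 1) / blockDim;  // ceiling division: enough blocks to cover all N elements
daxpy<<<gridDim, blockDim>>>(N, 2.0, x, y);   // the if (i < n) guard masks off the extra threads
```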
4. Branch Divergence
Within a warp, threads either all take the same path (good!) or diverge. Then:
- A warp serially executes each branch path, disabling threads that are not on that path (using predication).
- When all paths are executed, the warp reconverges.
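As an illustration (a hypothetical kernel, not from the notes): threads in the same warp branch on the parity of their index, so the warp executes the even path and then the odd path, each time with the lanes on the other path predicated off.

```c
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)              // even lanes take this path first...
            out[i] = in[i] * 2.0f;
        else                         // ...then the odd lanes take this one
            out[i] = in[i] + 1.0f;
        // once both paths have executed, the warp reconverges here
    }
}
```

A warp in which all threads take the same side of the branch would not diverge at all, which is why branch coherence across adjacent threads matters (see the table below).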
5. SIMT vs SIMD
| SIMT | SIMD |
|---|---|
| One thread per lane. | Each thread may include SIMD instructions. |
| Adjacent threads access adjacent data for spatial locality. | Adjacent loop iterations access adjacent data for spatial locality. |
| A load instruction can result in a different address being accessed by each lane. | A SIMD vector load accesses adjacent locations. |
| Coalesced loads with adjacent accesses are very fast. | Gather instructions can fetch from a different address per lane, but are often serialised. |
| Branch coherence (adjacent threads ideally branch the same way). | Branch predictability (individual branches are mostly taken / not taken). |
For example, in C with OpenMP we can write the following, with very good spatial locality:
```c
void add(float *c, float *a, float *b, int N) {
    for (int i = 0; i < N; i++) {
        #pragma omp simd
        for (int j = 0; j < N; j++)
            c[i*N + j] = a[i*N + j] + b[i*N + j];  // adjacent iterations access adjacent addresses
    }
}
```
But an equivalent GPU kernel where each thread handles one row would have terrible spatial locality: at each step of the loop, adjacent threads access elements N floats apart (adjacent rows of the same column), so the loads cannot be coalesced:
```c
__global__ void add(float *c, float *a, float *b, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    for (int j = 0; j < N; j++)
        c[i*N + j] = a[i*N + j] + b[i*N + j];       // threads in a warp access addresses N apart
}
```
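For contrast, a sketch of the layout that is usually used instead (not part of the original notes): one thread per element, so consecutive threads in a warp touch consecutive addresses and the loads coalesce.

```c
__global__ void add_coalesced(float *c, float *a, float *b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (idx < N * N)
        c[idx] = a[idx] + b[idx];                     // adjacent threads access adjacent addresses
}
```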