Multicore & Cache Coherency

1. Power

Power is the critical constraint on processor performance scaling:

Dennard Scaling states that as transistors get smaller, supply voltage and current scale down with them, so power density stays constant and each transistor switches with less dynamic power. However, static leakage current no longer scales down and now makes up a significant share of power consumption, so Dennard Scaling no longer holds.

It is much more efficient to have lots of parallel units at a low clock rate and low voltage!
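
As a rough sketch of why, using the standard dynamic-power model (the factor-of-two figures below are illustrative, not from the notes):

P_dyn ≈ α · C · V² · f   (activity factor × switched capacitance × voltage² × frequency)

Halving both the supply voltage and the clock frequency cuts dynamic power by roughly 2 × 2² = 8×, so two such units running in parallel can deliver the original throughput at around a quarter of the power.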

To reduce power, we could:

  • Lower the clock frequency.
  • Lower the supply voltage (which usually also requires lowering the frequency).
  • Switch off idle units (clock gating and power gating).
  • Replace one fast, complex core with many simpler cores running in parallel.

2. Programming Models

Take example code:

// 1. Rows in parallel
for (i = 1; i < N; i++) {             // the i-loop carries a dependence, so it stays sequential
    parallel_for(j = 0; j < M; j++) {
        A[i][j] = A[i-1][j] + A[i][j];
    }
}

// 2. Columns in parallel
parallel_for(i = 0; i < N; i++) {     // the j-loop carries a dependence, so parallelise over i
    for (j = 1; j < M; j++) {
        A[i][j] = A[i][j-1] + A[i][j];
    }
}

With distributed memory, each loop nest favours a different data distribution (the first by columns, the second by rows), so we would use message passing to transpose (redistribute) the array in between.

2.1 OpenMP

OpenMP is a shared-memory programming model using compiler directives to specify parallel regions. For example:

for (i = 1; i < N; i++) {
    #pragma omp parallel for
    for (j = 0; j < M; j++) {
        A[i][j] = A[i-1][j] + A[i][j];
    }
}

#pragma omp parallel for
for (i = 0; i < N; i++) {
    for (j = 1; j < M; j++) {
        A[i][j] = A[i][j-1] + A[i][j];
    }
}

2.2 Manual Shared Memory Parallelism

We could also manually create threads and manage shared memory. In each thread:

if (thread_id == 0) i = 0;                // one thread initialises the shared counter
barrier();                                // make sure i is set
while (true) {
    int local_i = fetch_and_add(&i, 1);   // atomically claim the next index
    if (local_i >= N) break;
    C[local_i] = A[local_i] + B[local_i];
}
barrier();                                // make sure all threads are done
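
A minimal runnable sketch of the same pattern, assuming C11 atomics and POSIX threads (the names worker, NTHREADS and the barrier usage are illustrative, not from the notes):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1000
#define NTHREADS 4

static double A[N], B[N], C[N];
static atomic_int next_index;             /* shared work counter */
static pthread_barrier_t bar;

static void *worker(void *arg) {
    (void)arg;
    pthread_barrier_wait(&bar);           /* corresponds to the first barrier above */
    while (1) {
        int i = atomic_fetch_add(&next_index, 1);   /* claim the next element */
        if (i >= N) break;
        C[i] = A[i] + B[i];
    }
    pthread_barrier_wait(&bar);           /* wait until every thread has finished */
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }
    atomic_store(&next_index, 0);
    pthread_barrier_init(&bar, NULL, NTHREADS);
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    pthread_barrier_destroy(&bar);
    printf("C[N-1] = %f\n", C[N-1]);
    return 0;
}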

2.3 MPI

Message Passing Interface (MPI) is a standard API for parallel programming using message passing.

MPI Interface

  • MPI_Init: Initialize MPI environment.
  • MPI_Comm_size: Get number of processes.
  • MPI_Comm_rank: Get rank of the current process.
  • MPI_Send: Send message to another process.
  • MPI_Recv: Receive message from another process.
  • MPI_Finalize: Terminate MPI environment.
  • MPI_Bcast: Broadcast message to all processes.
  • MPI_Reduce: Reduce values from all processes.
  • MPI_Allreduce: Reduce values from all processes and distribute the result to all processes.
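
A minimal sketch of how these calls fit together, assuming at least two processes where rank 0 sends one integer to rank 1 (the variable names and tag are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int size, rank, value;
    MPI_Init(&argc, &argv);                       /* initialise the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &size);         /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* rank of the current process */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();                               /* terminate the MPI environment */
    return 0;
}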

For example, a stencil loop (essentially a convolution):

for (int i = 1; i < N-1; i++) {
    for (int j = 1; j < M-1; j++) {
        B[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
    }
}

With OpenMP, we could do:

#pragma omp parallel for collapse(2)
for (int i = 1; i < N-1; i++) {
    for (int j = 1; j < M-1; j++) {     // i and j are declared in the loops, so they are private automatically
        B[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
    }
}

With MPI, we can divide the grid, giving each process a subgrid with halo regions:

! Compute number of processes and my rank
CALL MPI_COMM_SIZE(comm, p, ierr)
CALL MPI_COMM_RANK(comm, myrank, ierr)

! Compute size of local block
m = n / p
IF (myrank.LT.p-1) THEN
    lm = m + 1
END IF

! Find neighbours
IF (myrank.EQ.0) THEN
    left = MPI_PROC_NULL
ELSE
    left = myrank - 1
END IF
IF (myrank.EQ.p-1) THEN
    right = MPI_PROC_NULL
ELSE
    right = myrank + 1
END IF

! Allocate local arrays (A has halo columns 0 and m+1)
ALLOCATE(A(0:n+1,0:m+1), B(n,m))

! Main loop
DO WHILE (.NOT.converged)
    ! Compute boundary columns first
    DO i = 1, n
        B(i, 1) = 0.25 * (A(i-1, 1) + A(i+1, 1) + A(i, 0) + A(i, 2))
        B(i, m) = 0.25 * (A(i-1, m) + A(i+1, m) + A(i, m-1) + A(i, m+1))
    END DO

    ! Communicate: send our boundary columns, receive neighbours' into A's halo columns
    CALL MPI_ISEND(B(1,1), n, MPI_REAL, left, tag, comm, req(1), ierr)
    CALL MPI_ISEND(B(1,m), n, MPI_REAL, right, tag, comm, req(2), ierr)
    CALL MPI_IRECV(A(1,0), n, MPI_REAL, left, tag, comm, req(3), ierr)
    CALL MPI_IRECV(A(1,m+1), n, MPI_REAL, right, tag, comm, req(4), ierr)

    ! Compute interior columns while communication is in flight
    DO j = 2, m-1
        DO i = 1, n
            B(i, j) = 0.25 * (A(i-1, j) + A(i+1, j) + A(i, j-1) + A(i, j+1))
        END DO
    END DO
    DO j = 1, m
        DO i = 1, n
            A(i, j) = B(i, j)
        END DO
    END DO

    ! Complete communication
    DO i = 1, 4
        CALL MPI_WAIT(req(i), status(1, i), ierr)
    END DO
END DO

2.4 MPI vs OpenMP

OpenMP is easy (it hides communication), but unintended data sharing can lead to bugs.

MPI is explicit (you manage communication) but leads to more complex code. It requires more copying of data, but it is easier to see how to reduce communication.

3. Snooping Cache Coherency

The cache coherency problem occurs when multiple cores have local caches. If one core updates a memory location, other cores may have stale copies in their caches. We need to know where to find the most recent data, and when data is stale.

The goal is sequential consistency - the result of execution is the same as if operations were executed in some sequential order, and operations of each individual processor appear in this sequence in the order issued by that processor.

We could broadcast every store - but this introduces problems with multiword cache lines. Also, do we really need to broadcast every store?

Instead, we could invalidate other caches when a store occurs, forcing other cores to suffer a read miss and fetch the updated data from memory. Here, a snooping cache controller sits between a core's cache and the bus, monitoring all bus transactions and checking them against the tags of its cache.

3.1 Berkeley Protocol

When a store to a cache line occurs, broadcast an invalidation on the bus unless the cache line is exclusively owned (Dirty). Each cache line can be in one of four states:

  • Invalid: the line holds no valid data.
  • Valid: the line is clean and may also be present in other caches.
  • Shared Dirty: the line has been modified, other copies may exist, and this cache owns it (it must supply the data and eventually write it back).
  • Dirty: the line has been modified and this cache holds the only copy.

On a read miss:

  1. Broadcast request on the bus.
  2. If another cache holds the line in the Dirty or Shared Dirty state, it supplies the data and sets its state to Shared Dirty, and our cache line becomes Valid.
  3. Otherwise, data is fetched from main memory, and our cache line becomes Valid.

On a write hit, if the line is Valid or Shared Dirty, an invalidation is broadcast and the local state is set to Dirty.

On a write miss:

  1. Broadcast request on the bus.
  2. If another cache holds the line in the Dirty or Shared Dirty state, it supplies the data and sets its state to Shared Dirty.
  3. Otherwise, the data is fetched from main memory.
  4. All other caches then Invalidate their copies, and our cache line becomes Dirty.
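
A rough sketch of these transitions as a C state machine (the enum and function names are illustrative; write-backs, bus arbitration and the actual data movement are omitted):

typedef enum { INVALID, VALID, SHARED_DIRTY, DIRTY } LineState;

/* Transition for the local processor's own accesses. */
LineState on_processor_access(LineState s, int is_write, int hit) {
    if (!is_write)
        return hit ? s : VALID;     /* read miss: request on the bus, end up Valid */
    if (hit && (s == VALID || s == SHARED_DIRTY))
        return DIRTY;               /* write hit on a possibly-shared line: broadcast invalidation */
    if (!hit)
        return DIRTY;               /* write miss: fetch line, invalidate other copies, own it */
    return s;                       /* write hit while already Dirty: no bus traffic needed */
}

/* Transition when the snooping controller observes another cache's request. */
LineState on_bus_request(LineState s, int other_is_write) {
    if (other_is_write)
        return INVALID;             /* our copy is stale after their write */
    if (s == DIRTY || s == SHARED_DIRTY)
        return SHARED_DIRTY;        /* we supply the data and keep ownership */
    return s;                       /* Valid or Invalid copies are unaffected by a read */
}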

3.2 Implementing Snooping

Since every bus transaction checks the cache tags, there could be contention between bus (snoop) and CPU accesses. To avoid this, we can:

  • Duplicate the tag array, so snoop lookups and processor lookups proceed in parallel on separate copies.
  • Use a multi-level cache with inclusion, so that snoops are filtered by the lower-level cache and only disturb the L1 when it might actually hold the line.

4. Memory Models

We need to know when it is safe for different processors to use shared data. We need an uninterruptible (atomic) primitive to fetch and update a memory location, on which to build locks, barriers, etc. Common primitives include test-and-set, fetch-and-increment and atomic exchange.
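
For illustration, C11's <stdatomic.h> exposes such primitives directly (a sketch, not from the notes):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int counter = 0;
atomic_flag flag = ATOMIC_FLAG_INIT;
atomic_int slot = 0;

void examples(void) {
    /* fetch-and-increment: atomically adds 1 and returns the old value */
    int old = atomic_fetch_add(&counter, 1);

    /* test-and-set: atomically sets the flag and reports whether it was already set */
    bool was_set = atomic_flag_test_and_set(&flag);

    /* atomic exchange: atomically swaps in a new value, returning the old one */
    int previous = atomic_exchange(&slot, 42);

    (void)old; (void)was_set; (void)previous;
}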

4.1 Atomics

However, it is hard to perform two memory accesses in a single instruction, so we split the operation into a pair: load-linked and store-conditional. Load-linked reads a value from an address, and a subsequent store-conditional to the same address succeeds only if the location has not been written in the meantime, returning a flag so the pair can be retried in a loop to build atomic read-modify-write operations.

4.2 Locks

Using these, we can build spin locks:

lock:
    LI   R2, #1        ; Load 1 into R2
    EXCH R2, 0(R1)     ; Exchange R2 with memory at R1
    BNEZ R2, lock      ; If previous value != 0, try again

However, this generates lots of bus traffic (every attempt is a write that invalidates other copies). Instead, we can spin on a plain read first:

try:
    LI   R2, #1        ; Load 1 into R2
lock:
    LW   R3, 0(R1)     ; Load lock value
    BNEZ R3, lock      ; If lock != 0, try again
    EXCH R2, 0(R1)     ; Exchange R2 with memory at R1
    BNEZ R2, try       ; If previous value != 0, try again

The order in which this spin lock is acquired is undefined; if we want it to be deterministic (first come, first served), we can use a ticket lock:

void tlock_init(int *next_ticket, int *now_serving) {
    *now_serving = *next_ticket = 0;
}

void tlock_acquire(int *next_ticket, int *now_serving) {
    int my_ticket = fetch_and_add(next_ticket, 1);   /* take the next ticket */
    while (*now_serving != my_ticket);               /* spin until it is our turn */
}

void tlock_release(int *now_serving) {
    fetch_and_add(now_serving, 1);                   /* serve the next ticket */
}

We want scalable locks, where the number of cache misses per acquisition is constant.

4.3 Consistency Models

Sequential Consistency means the result of execution is the same as if operations were executed in some sequential order. A straightforward way to enforce it is to delay every memory access until all invalidations from earlier accesses are complete; in reality, this is expensive.

Instead, weak consistency is used, with specific fencing instructions. Most programs are explicitly synchronised, so we only need to ensure consistency at synchronisation points.
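
As an illustration of synchronising only at specific points, here is a sketch using C11 release/acquire operations on a hypothetical producer/consumer flag (not from the notes):

#include <stdatomic.h>

int data;                     /* ordinary shared data */
atomic_int ready = 0;         /* synchronisation flag */

void producer(void) {
    data = 42;                                                /* plain write */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* publish: orders the write to data before the flag */
}

void consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                     /* wait for the flag */
    /* the acquire load orders this read after the producer's write to data */
    int x = data;
    (void)x;
}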

5. Interconnects

Snooping cache coherency protocols rely on a bus, which inevitably becomes a bottleneck as the number of cores increases. To scale, we can distribute DRAM around the system, giving a Non-Uniform Memory Access (NUMA) architecture.

With an interconnect network, each node has its own memory. Each node has a directory that tracks the state of every block in every cache. The directory could:

  • Be held centrally, or distributed so that each node keeps the entries for the blocks whose home is its own memory.
  • Record the sharers of each block as a full bit vector, or as a linked list threaded through the sharing caches.

The directory allows us to find the most recent copy of the data (often by following a linked-list structure through the sharers), but it can itself become a major bottleneck.
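
A sketch of what a bit-vector style directory entry might look like, assuming a small fixed number of nodes (the names, states and sizes are illustrative assumptions):

#include <stdint.h>

#define MAX_NODES 64

typedef enum {
    UNCACHED,     /* no cache holds the block */
    SHARED,       /* one or more caches hold a clean copy */
    MODIFIED      /* exactly one cache holds a dirty copy */
} DirState;

typedef struct {
    DirState state;
    uint64_t sharers;      /* bit i set => node i has a copy */
} DirEntry;

/* Record that a node has obtained a read copy of the block. */
void dir_add_sharer(DirEntry *e, int node) {
    e->sharers |= (uint64_t)1 << node;
    e->state = SHARED;
}

/* Record that a node has obtained exclusive (write) ownership. */
void dir_set_owner(DirEntry *e, int node) {
    /* invalidations would be sent to every other node in e->sharers here */
    e->sharers = (uint64_t)1 << node;
    e->state = MODIFIED;
}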
