Multicore & Cache Coherency

1. Power

Power is the critical constraint on processor performance scaling:

Dennard Scaling states that as transistors get smaller, supply voltage and current scale down with them, so power density stays constant and each transistor switches with less dynamic power. However, static leakage current no longer scales down and now makes up a significant share of power consumption, so Dennard Scaling no longer holds.

It is much more efficient to have lots of parallel units at a low clock rate and low voltage!
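
As a rough sketch of why, using the standard dynamic-power model (the factor-of-two figures below are illustrative, not from the notes):

P_dyn ≈ α · C · V² · f   (activity factor × switched capacitance × voltage² × frequency)

Halving both the supply voltage and the clock frequency cuts dynamic power by roughly 2 × 2² = 8×, so two such units running in parallel can deliver the original throughput at around a quarter of the power.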

To reduce power, we could:

  • Lower the clock frequency.
  • Lower the supply voltage (which usually also requires lowering the frequency).
  • Switch off idle units (clock gating and power gating).
  • Replace one fast, complex core with many simpler cores running in parallel.

2. Programming Models

Take example code:

// 1. Rows in parallel
for (i = 1; i < N; i++) {             // the i-loop carries a dependence, so it stays sequential
    parallel_for(j = 0; j < M; j++) {
        A[i][j] = A[i-1][j] + A[i][j];
    }
}

// 2. Columns in parallel
parallel_for(i = 0; i < N; i++) {     // the j-loop carries a dependence, so parallelise over i
    for (j = 1; j < M; j++) {
        A[i][j] = A[i][j-1] + A[i][j];
    }
}

With distributed memory, each loop nest favours a different data distribution (the first by columns, the second by rows), so we would use message passing to transpose (redistribute) the array in between.

2.1 OpenMP

OpenMP is a shared-memory programming model using compiler directives to specify parallel regions. For example:

for (i = 1; i < N; i++) {
    #pragma omp parallel for
    for (j = 0; j < M; j++) {
        A[i][j] = A[i-1][j] + A[i][j];
    }
}

#pragma omp parallel for
for (i = 0; i < N; i++) {
    for (j = 1; j < M; j++) {
        A[i][j] = A[i][j-1] + A[i][j];
    }
}

2.2 Manual Shared Memory Parallelism

We could also manually create threads and manage shared memory. In each thread:

if (thread_id == 0) i = 0;                // one thread initialises the shared counter
barrier();                                // make sure i is set
while (true) {
    int local_i = fetch_and_add(&i, 1);   // atomically claim the next index
    if (local_i >= N) break;
    C[local_i] = A[local_i] + B[local_i];
}
barrier();                                // make sure all threads are done
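
A minimal runnable sketch of the same pattern, assuming C11 atomics and POSIX threads (the names worker, NTHREADS and the barrier usage are illustrative, not from the notes):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1000
#define NTHREADS 4

static double A[N], B[N], C[N];
static atomic_int next_index;             /* shared work counter */
static pthread_barrier_t bar;

static void *worker(void *arg) {
    (void)arg;
    pthread_barrier_wait(&bar);           /* corresponds to the first barrier above */
    while (1) {
        int i = atomic_fetch_add(&next_index, 1);   /* claim the next element */
        if (i >= N) break;
        C[i] = A[i] + B[i];
    }
    pthread_barrier_wait(&bar);           /* wait until every thread has finished */
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }
    atomic_store(&next_index, 0);
    pthread_barrier_init(&bar, NULL, NTHREADS);
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    pthread_barrier_destroy(&bar);
    printf("C[N-1] = %f\n", C[N-1]);
    return 0;
}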

2.3 MPI

Message Passing Interface (MPI) is a standard API for parallel programming using message passing.

MPI Interface

  • MPI_Init: Initialize MPI environment.
  • MPI_Comm_size: Get number of processes.
  • MPI_Comm_rank: Get rank of the current process.
  • MPI_Send: Send message to another process.
  • MPI_Recv: Receive message from another process.
  • MPI_Finalize: Terminate MPI environment.
  • MPI_Bcast: Broadcast message to all processes.
  • MPI_Reduce: Reduce values from all processes.
  • MPI_Allreduce: Reduce values from all processes and distribute the result to all processes.
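
A minimal sketch of how these calls fit together, assuming at least two processes where rank 0 sends one integer to rank 1 (the variable names and tag are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int size, rank, value;
    MPI_Init(&argc, &argv);                       /* initialise the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &size);         /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* rank of the current process */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();                               /* terminate the MPI environment */
    return 0;
}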

For example, a stencil loop (essentially a convolution):

for (int i = 1; i < N-1; i++) {
    for (int j = 1; j < M-1; j++) {
        B[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
    }
}

With OpenMP, we could do:

#pragma omp parallel for collapse(2)
for (int i = 1; i < N-1; i++) {
    for (int j = 1; j < M-1; j++) {     // i and j are declared in the loops, so they are private automatically
        B[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
    }
}

With MPI, we can divide the grid, giving each process a subgrid with halo regions:

! Compute number of processes and my rank
CALL MPI_COMM_SIZE(comm, p, ierr)
CALL MPI_COMM_RANK(comm, myrank, ierr)

! Compute size of local block
m = n / p
IF (myrank.LT.p-1) THEN
    lm = m + 1
END IF

! Find neighbours
IF (myrank.EQ.0) THEN
    left = MPI_PROC_NULL
ELSE
    left = myrank - 1
END IF
IF (myrank.EQ.p-1) THEN
    right = MPI_PROC_NULL
ELSE
    right = myrank + 1
END IF

! Allocate local arrays (A has halo columns 0 and m+1)
ALLOCATE(A(0:n+1,0:m+1), B(n,m))

! Main loop
DO WHILE (.NOT.converged)
    ! Compute boundary columns first
    DO i = 1, n
        B(i, 1) = 0.25 * (A(i-1, 1) + A(i+1, 1) + A(i, 0) + A(i, 2))
        B(i, m) = 0.25 * (A(i-1, m) + A(i+1, m) + A(i, m-1) + A(i, m+1))
    END DO

    ! Communicate: send our boundary columns, receive neighbours' into A's halo columns
    CALL MPI_ISEND(B(1,1), n, MPI_REAL, left, tag, comm, req(1), ierr)
    CALL MPI_ISEND(B(1,m), n, MPI_REAL, right, tag, comm, req(2), ierr)
    CALL MPI_IRECV(A(1,0), n, MPI_REAL, left, tag, comm, req(3), ierr)
    CALL MPI_IRECV(A(1,m+1), n, MPI_REAL, right, tag, comm, req(4), ierr)

    ! Compute interior columns while communication is in flight
    DO j = 2, m-1
        DO i = 1, n
            B(i, j) = 0.25 * (A(i-1, j) + A(i+1, j) + A(i, j-1) + A(i, j+1))
        END DO
    END DO
    DO j = 1, m
        DO i = 1, n
            A(i, j) = B(i, j)
        END DO
    END DO

    ! Complete communication
    DO i = 1, 4
        CALL MPI_WAIT(req(i), status(1, i), ierr)
    END DO
END DO

2.4 MPI vs OpenMP

OpenMP is easy (it hides communication), but unintended data sharing can lead to bugs.

MPI is explicit (you manage communication) but leads to more complex code. It requires more copying of data, but it is easier to see how to reduce communication.

3. Snooping Cache Coherency

The cache coherency problem occurs when multiple cores have local caches. If one core updates a memory location, other cores may have stale copies in their caches. We need to know where to find the most recent data, and when data is stale.

The goal is sequential consistency - the result of execution is the same as if operations were executed in some sequential order, and operations of each individual processor appear in this sequence in the order issued by that processor.

We could broadcast every store - but this introduces problems with multiword cache lines. Also, do we really need to broadcast every store?

Instead, we could invalidate other caches when a store occurs, forcing other cores to suffer a read miss and fetch the updated data from memory. Here, a snooping cache controller sits between a core's cache and the bus, monitoring all bus transactions and checking them against the tags of its cache.

3.1 Berkeley Protocol

When a store to a cache line occurs, broadcast an invalidation on the bus unless the cache line is exclusively owned (Dirty). Each cache line can be in one of four states:

  • Invalid: the line holds no valid data.
  • Valid: the line is clean and may also be present in other caches.
  • Shared Dirty: the line has been modified, other copies may exist, and this cache owns it (it must supply the data and eventually write it back).
  • Dirty: the line has been modified and this cache holds the only copy.

On a read miss:

  1. Broadcast request on the bus.
  2. If another cache holds the line in the Dirty or Shared Dirty state, it supplies the data and sets its state to Shared Dirty, and our cache line becomes Valid.
  3. Otherwise, data is fetched from main memory, and our cache line becomes Valid.

On a write hit, if the line is Valid or Shared Dirty, an invalidation is broadcast and the local state is set to Dirty.

On a write miss:

  1. Broadcast request on the bus.
  2. If another cache holds the line in the Dirty or Shared Dirty state, it supplies the data and sets its state to Shared Dirty.
  3. Otherwise, the data is fetched from main memory.
  4. All other caches then Invalidate their copies, and our cache line becomes Dirty.
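
A rough sketch of these transitions as a C state machine (the enum and function names are illustrative; write-backs, bus arbitration and the actual data movement are omitted):

typedef enum { INVALID, VALID, SHARED_DIRTY, DIRTY } LineState;

/* Transition for the local processor's own accesses. */
LineState on_processor_access(LineState s, int is_write, int hit) {
    if (!is_write)
        return hit ? s : VALID;     /* read miss: request on the bus, end up Valid */
    if (hit && (s == VALID || s == SHARED_DIRTY))
        return DIRTY;               /* write hit on a possibly-shared line: broadcast invalidation */
    if (!hit)
        return DIRTY;               /* write miss: fetch line, invalidate other copies, own it */
    return s;                       /* write hit while already Dirty: no bus traffic needed */
}

/* Transition when the snooping controller observes another cache's request. */
LineState on_bus_request(LineState s, int other_is_write) {
    if (other_is_write)
        return INVALID;             /* our copy is stale after their write */
    if (s == DIRTY || s == SHARED_DIRTY)
        return SHARED_DIRTY;        /* we supply the data and keep ownership */
    return s;                       /* Valid or Invalid copies are unaffected by a read */
}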

3.2 Implementing Snooping

Since every bus transaction checks the cache tags, there could be contention between bus (snoop) and CPU accesses. To avoid this, we can:

  • Duplicate the tag array, so snoop lookups and processor lookups proceed in parallel on separate copies.
  • Use a multi-level cache with inclusion, so that snoops are filtered by the lower-level cache and only disturb the L1 when it might actually hold the line.

4. Memory Models

We need to know when it is safe for different processors to use shared data. We need an uninterruptible (atomic) primitive to fetch and update a memory location, on which to build locks, barriers, etc. Common primitives include test-and-set, fetch-and-increment and atomic exchange.
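
For illustration, C11's <stdatomic.h> exposes such primitives directly (a sketch, not from the notes):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int counter = 0;
atomic_flag flag = ATOMIC_FLAG_INIT;
atomic_int slot = 0;

void examples(void) {
    /* fetch-and-increment: atomically adds 1 and returns the old value */
    int old = atomic_fetch_add(&counter, 1);

    /* test-and-set: atomically sets the flag and reports whether it was already set */
    bool was_set = atomic_flag_test_and_set(&flag);

    /* atomic exchange: atomically swaps in a new value, returning the old one */
    int previous = atomic_exchange(&slot, 42);

    (void)old; (void)was_set; (void)previous;
}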

4.1 Atomics

However, it is hard to perform two memory accesses in a single instruction, so we split the operation into a pair: load-linked and store-conditional. Load-linked reads a value from an address, and a subsequent store-conditional to the same address succeeds only if the location has not been written in the meantime, returning a flag so the pair can be retried in a loop to build atomic read-modify-write operations.

4.2 Locks

Using these, we can build spin locks:

lock:
    LI   R2, #1        ; Load 1 into R2
    EXCH R2, 0(R1)     ; Exchange R2 with memory at R1
    BNEZ R2, lock      ; If previous value != 0, try again

However, this generates lots of bus traffic (every attempt is a write that invalidates other copies). Instead, we can spin on a plain read first:

try:
    LI   R2, #1        ; Load 1 into R2
lock:
    LW   R3, 0(R1)     ; Load lock value
    BNEZ R3, lock      ; If lock != 0, try again
    EXCH R2, 0(R1)     ; Exchange R2 with memory at R1
    BNEZ R2, try       ; If previous value != 0, try again

The order in which this spin lock is acquired is undefined; if we want it to be deterministic (first come, first served), we can use a ticket lock:

void tlock_init(int *next_ticket, int *now_serving) {
    *now_serving = *next_ticket = 0;
}

void tlock_acquire(int *next_ticket, int *now_serving) {
    int my_ticket = fetch_and_add(next_ticket, 1);   /* take the next ticket */
    while (*now_serving != my_ticket);               /* spin until it is our turn */
}

void tlock_release(int *now_serving) {
    fetch_and_add(now_serving, 1);                   /* serve the next ticket */
}

We want scalable locks, where the number of cache misses per acquisition is constant.

4.3 Consistency Models

Sequential Consistency means the result of execution is the same as if operations were executed in some sequential order. A straightforward way to enforce it is to delay every memory access until all invalidations from earlier accesses are complete; in reality, this is expensive.

Instead, weak consistency is used, with specific fencing instructions. Most programs are explicitly synchronised, so we only need to ensure consistency at synchronisation points.
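
As an illustration of synchronising only at specific points, here is a sketch using C11 release/acquire operations on a hypothetical producer/consumer flag (not from the notes):

#include <stdatomic.h>

int data;                     /* ordinary shared data */
atomic_int ready = 0;         /* synchronisation flag */

void producer(void) {
    data = 42;                                                /* plain write */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* publish: orders the write to data before the flag */
}

void consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                     /* wait for the flag */
    /* the acquire load orders this read after the producer's write to data */
    int x = data;
    (void)x;
}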

5. Interconnects

Snooping cache coherency protocols rely on a bus, which inevitably becomes a bottleneck as the number of cores increases. To scale, we can distribute DRAM around the system, giving a Non-Uniform Memory Access (NUMA) architecture.

With an interconnect network, each node has its own memory. Each node has a directory that tracks the state of every block in every cache. The directory could:

  • Be held centrally, or distributed so that each node keeps the entries for the blocks whose home is its own memory.
  • Record the sharers of each block as a full bit vector, or as a linked list threaded through the sharing caches.

The directory allows us to find the most recent copy of the data (often by following a linked-list structure through the sharers), but it can itself become a major bottleneck.
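
A sketch of what a bit-vector style directory entry might look like, assuming a small fixed number of nodes (the names, states and sizes are illustrative assumptions):

#include <stdint.h>

#define MAX_NODES 64

typedef enum {
    UNCACHED,     /* no cache holds the block */
    SHARED,       /* one or more caches hold a clean copy */
    MODIFIED      /* exactly one cache holds a dirty copy */
} DirState;

typedef struct {
    DirState state;
    uint64_t sharers;      /* bit i set => node i has a copy */
} DirEntry;

/* Record that a node has obtained a read copy of the block. */
void dir_add_sharer(DirEntry *e, int node) {
    e->sharers |= (uint64_t)1 << node;
    e->state = SHARED;
}

/* Record that a node has obtained exclusive (write) ownership. */
void dir_set_owner(DirEntry *e, int node) {
    /* invalidations would be sent to every other node in e->sharers here */
    e->sharers = (uint64_t)1 << node;
    e->state = MODIFIED;
}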
