Pipelining & Caching

1. Pipelining

To demonstrate pipelining, we will introduce the MIPS instruction set and datapath.

1.1 MIPS Datapath

The MIPS instruction set is a simple instruction set with a few kinds of basic instructions: register-register arithmetic/logical operations, operations with an immediate operand, loads, stores, and branches.

Here Rs1 and Rs2 name the source registers, Rd the destination register, Op the operation to perform, and Imm an immediate value (constant).

Our goal is to create a machine to execute these instructions - a universal fetch-decode-execute machine. Some basic pseudocode for executing one instruction:

    // Fetch
    instr = memory[pc]; pc += 4;

    // Decode
    rs1 = registers[instr.Rs1];
    rs2 = registers[instr.Rs2];
    imm = sign_extend(instr.Imm);

    // Execute
    op1 = instr.Op == BRANCH ? pc : rs1;
    op2 = is_immediate(instr.Op) ? imm : rs2;
    res = ALU(instr.Op, op1, op2);

    // Memory
    switch (instr.Op) {
        case BRANCH: if (rs1 == 0) pc += imm * 4; break;
        case STORE:  memory[res] = rs1; break;
        case LOAD:   lmd = memory[res]; break;
    }

    // Writeback
    registers[instr.Rd] = instr.Op == LOAD ? lmd : res;

1.2 Pipelining the Datapath

The pseudocode above shows the five stages of the MIPS datapath: Fetch, Decode, Execute, Memory, and Writeback. This sequence of stages can be pipelined: we introduce buffer registers in between each pair of stages to hold the intermediate values, which allows multiple instructions to be in different stages of execution at the same time.

Each clock cycle, each stage does its work and passes its results to the next stage. After an initial latency of five cycles to fill the pipeline, one instruction completes every cycle, giving a throughput of one instruction per cycle. Pipelining also lets us increase the clock rate, which is now determined by the slowest pipeline stage rather than by the whole datapath.

Control signals are also passed along the pipeline to ensure that each stage knows what to do with the data it receives.
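
As a rough sketch of what these buffer registers hold (the struct and field names below are illustrative, not taken from the notes), each stage latches the values and control signals that the following stage will need:

    #include <stdint.h>

    /* Illustrative pipeline registers for the five-stage datapath. Each struct
       is latched at the end of a cycle and read by the next stage a cycle later. */

    typedef struct {            /* Fetch -> Decode */
        uint32_t instr;
        uint32_t pc_plus_4;
    } IF_ID;

    typedef struct {            /* Decode -> Execute */
        uint32_t rs1_val, rs2_val;
        int32_t  imm;
        uint8_t  rd;            /* destination register number */
        uint8_t  op;            /* decoded operation / control */
        uint32_t pc_plus_4;
    } ID_EX;

    typedef struct {            /* Execute -> Memory */
        uint32_t alu_result;
        uint32_t store_val;     /* value to write on a STORE */
        uint8_t  rd;
        uint8_t  op;
    } EX_MEM;

    typedef struct {            /* Memory -> Writeback */
        uint32_t alu_result;
        uint32_t load_val;      /* value loaded on a LOAD */
        uint8_t  rd;
        uint8_t  op;
    } MEM_WB;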

We can think about the pipeline as a sequence of instructions in different stages:

|         | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Instr 1 | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |         |
| Instr 2 |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |
| Instr 3 |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| Instr 4 |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |
| Instr 5 |         |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |

Although the latency of an individual instruction is unchanged, throughput is increased by up to the number of pipeline stages. The rate is limited by the slowest stage, the speedup is reduced by unbalanced stage lengths, and the time to fill and drain the pipeline reduces it further.
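
A small back-of-the-envelope calculation makes these effects concrete (the stage latencies below are made-up numbers): the clock period is set by the slowest stage, and the fill time means the ideal five-times speedup is not reached.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical stage latencies in ns: Fetch, Decode, Execute, Memory, Writeback. */
        double stage_ns[5] = {2.0, 1.0, 2.0, 3.0, 1.0};
        int k = 5;              /* pipeline stages */
        long n = 1000;          /* instructions    */

        double unpipelined_cycle = 0.0, slowest = 0.0;
        for (int i = 0; i < k; i++) {
            unpipelined_cycle += stage_ns[i];
            if (stage_ns[i] > slowest) slowest = stage_ns[i];
        }

        /* Unpipelined: every instruction takes the sum of all stage latencies.
           Pipelined: the clock period is the slowest stage; the first instruction
           takes k cycles (fill), then one instruction completes per cycle. */
        double t_unpipelined = n * unpipelined_cycle;
        double t_pipelined   = (k + (n - 1)) * slowest;

        printf("unpipelined: %.0f ns, pipelined: %.0f ns, speedup: %.2fx\n",
               t_unpipelined, t_pipelined, t_unpipelined / t_pipelined);
        return 0;
    }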

1.3 Hazards

Hazards prevent an instruction from executing during its designated clock cycle. They come in three kinds: structural hazards, data hazards, and control hazards.

1.3.1 Structural Hazards & Bubbles

In the following table we can see a structural hazard: with a single-ported memory, an instruction fetch and a data memory access land in the same cycle (cycles 4 and 5, marked with *):

|         | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Instr 1 | Fetch   | Decode  | Execute | Memory* | Writebk |         |         |         |         |
| Instr 2 |         | Fetch   | Decode  | Execute | Memory* | Writebk |         |         |         |
| Instr 3 |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| Instr 4 |         |         |         | Fetch*  | Decode  | Execute | Memory  | Writebk |         |
| Instr 5 |         |         |         |         | Fetch*  | Decode  | Execute | Memory  | Writebk |

To fix a structural hazard, we have to introduce a pipeline bubble, delaying an instruction so the accesses no longer overlap. However, as the table below shows, this causes a chain reaction of stalls in the following instructions, which hurts performance.

|         | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Instr 1 | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |         |
| Instr 2 |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |
| Instr 3 |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| STALL   |         |         |         | -       | -       | -       | -       | -       |         |
| Instr 4 |         |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |

Instead, we can introduce an instruction buffer that prefetches a block of instructions ahead of time, so that instruction fetches and data accesses rarely collide and pipeline bubbles are avoided, as sketched below.
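
A minimal sketch of such an instruction buffer, modelled as a small FIFO between instruction memory and the decode stage (the depth and the interface are assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define IBUF_SIZE 8   /* assumed buffer depth */

    typedef struct {
        uint32_t entries[IBUF_SIZE];
        int head, tail, count;
    } InstrBuffer;

    /* Fetch ahead whenever instruction memory is free (e.g. while a load or
       store is using the single memory port), so decode rarely has to stall. */
    bool ibuf_push(InstrBuffer *b, uint32_t instr) {
        if (b->count == IBUF_SIZE) return false;   /* buffer full */
        b->entries[b->tail] = instr;
        b->tail = (b->tail + 1) % IBUF_SIZE;
        b->count++;
        return true;
    }

    /* Decode pulls its next instruction from the buffer instead of memory. */
    bool ibuf_pop(InstrBuffer *b, uint32_t *instr) {
        if (b->count == 0) return false;           /* nothing prefetched */
        *instr = b->entries[b->head];
        b->head = (b->head + 1) % IBUF_SIZE;
        b->count--;
        return true;
    }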

1.3.2 Data Hazards & Forwarding

|                     | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|---------|
| Instr 1 (producer)  | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |
| Instr 2 (dependent) |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| Instr 3 (dependent) |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |
| Instr 4 (dependent) |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |

Here, we have a data hazard: the following instructions depend on the result of the first instruction, which has not yet been written back when they read their registers in Decode. This causes stalls in the pipeline.

This is partially resolved by writing the register file in the first half of the cycle and reading it in the second half, so an instruction whose Decode coincides with the producer's Writeback already sees the new value. However, this does not help the instructions in between, which read the register before it has been written.

To better solve this, we can use forwarding: we take a result from a later stage in the pipeline and route it back to an earlier stage that needs it. This is possible because the ALU computes the result before the next instruction's Execute stage needs it. It requires a hardware change: extra bypass paths from the EX/MEM and MEM/WB pipeline registers back to the ALU inputs, with multiplexers and comparison logic to select them.
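
A sketch of the forwarding selection for one ALU operand (the struct and field names are illustrative; register 0 is assumed to be hardwired to zero, as in MIPS):

    #include <stdint.h>
    #include <stdbool.h>

    /* Minimal view of the EX/MEM and MEM/WB pipeline registers: which register
       (if any) the in-flight instruction will write, and the value it produced. */
    typedef struct { bool writes_reg; uint8_t rd; uint32_t value; } FwdSource;

    /* Select the value for one ALU source operand. Prefer the younger producer
       (EX/MEM, one instruction ahead) over the older one (MEM/WB), and fall back
       to the value read from the register file in Decode. */
    uint32_t forward_operand(uint8_t src_reg, uint32_t regfile_val,
                             FwdSource ex_mem, FwdSource mem_wb) {
        if (src_reg != 0 && ex_mem.writes_reg && ex_mem.rd == src_reg)
            return ex_mem.value;      /* result computed last cycle */
        if (src_reg != 0 && mem_wb.writes_reg && mem_wb.rd == src_reg)
            return mem_wb.value;      /* result about to be written back */
        return regfile_val;           /* no hazard: use the register file */
    }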

Forwarding quickly gets complicated with multiple ALUs, since the forwarding logic effectively has to construct a dataflow graph dynamically. Moreover, forwarding doesn't solve all data hazards:

|                     | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|---------|
| Load (producer)     | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |
| Instr 2 (dependent) |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| Instr 3 (dependent) |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |
| Instr 4 (dependent) |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |

This time the value is produced by a load, so it is only available at the end of the Memory stage and cannot be forwarded back in time to the Execute stage of the very next instruction. We therefore have to introduce a load-to-use stall, which is detected at decode time (a sketch of this check follows the table):

|                     | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Load (producer)     | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |         |
| Instr 2 (dependent) |         | Fetch   | Decode  | -       | Execute | Memory  | Writebk |         |         |
| Instr 3             |         |         | Fetch   | -       | Decode  | Execute | Memory  | Writebk |         |
| Instr 4             |         |         |         | -       | Fetch   | Decode  | Execute | Memory  | Writebk |
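
A sketch of that decode-time check (often called a load-use interlock; the names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    /* Stall if the instruction currently in Execute is a load whose destination
       register is one of the source registers of the instruction in Decode. */
    bool must_stall(bool ex_is_load, uint8_t ex_rd,
                    uint8_t id_rs1, uint8_t id_rs2) {
        return ex_is_load && ex_rd != 0 &&
               (ex_rd == id_rs1 || ex_rd == id_rs2);
    }

    /* When must_stall() is true, Fetch and Decode hold their contents for one
       cycle and a bubble (no-op) is injected into Execute; the loaded value can
       then be forwarded from the Memory stage in the following cycle. */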

The lost cycle itself cannot be removed by the hardware of this simple pipeline, but compilers often hide it by scheduling the load earlier, so that an independent instruction fills the slot after the load.

1.3.3 Control Hazards

Control hazards occur when the pipeline continues fetching after a branch whose outcome is not yet known, or predicts that outcome wrongly. We risk a two-cycle stall, as the branch outcome is only discovered at the ALU (Execute) stage: the instructions fetched in cycles 2 and 3 (Instr 2 and Instr 3 below) must be squashed if the branch goes the other way.

|         | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Branch  | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |         |
| Instr 2 |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |
| Instr 3 |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| Instr 4 |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |
| Instr 5 |         |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |

To minimise the stall we could resolve the branch in the Decode stage instead, resulting in only a one-cycle stall. Alternatively, we can guess the outcome early with branch prediction, keep fetching along the predicted path, and flush the wrongly fetched instructions whenever a prediction turns out to be wrong.
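
One common prediction scheme, shown here only as an illustration (the table size is an assumption), is a table of 2-bit saturating counters indexed by the branch's PC:

    #include <stdint.h>
    #include <stdbool.h>

    #define BHT_ENTRIES 1024   /* assumed table size (power of two) */

    /* 2-bit saturating counters: 0,1 predict not taken; 2,3 predict taken. */
    static uint8_t bht[BHT_ENTRIES];

    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);   /* drop byte offset, then mask */
    }

    bool predict_taken(uint32_t pc) {
        return bht[bht_index(pc)] >= 2;
    }

    /* Called once the branch actually resolves in the Execute stage. */
    void update_predictor(uint32_t pc, bool taken) {
        uint8_t *ctr = &bht[bht_index(pc)];
        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
    }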

1.3.4 Simultaneous Multithreading

If we maintain two program counters (and two sets of architectural registers) and execute two threads on the same core, stall cycles in one thread can be filled with useful instructions from the other, hiding most stalls. This is called Simultaneous Multithreading (SMT), and is used in modern CPUs (e.g. Intel's Hyper-Threading).

1.4 Clock Speed

We face a fundamental choice between more pipeline stages with less work per stage and a higher clock rate, or fewer stages doing more work per stage at a lower clock rate.

Often, main memory access is the slowest pipeline stage.

2. Caches

Since memory is much slower than the CPU, we use a memory hierarchy to bridge the speed gap: the most frequently used data is kept in smaller, faster memories, called caches. This works because of the principle of locality: programs access a relatively small portion of their address space at any instant in time. There are two types of locality: temporal locality (a recently accessed item is likely to be accessed again soon) and spatial locality (items near a recently accessed item are likely to be accessed soon).
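
For example, the order in which a 2D array is traversed changes how much spatial locality is available (a sketch; the array size is arbitrary):

    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    /* Row-major traversal: consecutive accesses touch neighbouring addresses,
       so most of them hit in a cache line that has already been loaded. */
    double sum_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-major traversal: successive accesses are N * sizeof(double) bytes
       apart, so spatial locality is lost and many more accesses miss. */
    double sum_cols(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_rows(), sum_cols());
        return 0;
    }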

2.1 Direct Mapped Cache

A direct mapped cache of C bytes with a block size of B bytes has C/B cache lines, each storing one B-byte block. Each memory address is split into three fields: a block offset (the low bits, selecting a byte within the block), an index (the next bits, selecting a cache line), and a tag (the remaining high bits, identifying which block is stored in that line).

In general, for a direct mapped cache of capacity C bytes, block size B bytes, and address size A bits:

    offset bits = log2(B),  index bits = log2(C / B),  tag bits = A - log2(C)

To access memory: use the index bits to select a cache line, then compare the stored tag (if the line's valid bit is set) with the tag bits of the address. On a match the access is a hit and the offset selects the byte within the block; otherwise it is a miss and the block is fetched from the next level of the hierarchy.
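
A sketch of the address decomposition for an assumed 1 KiB direct mapped cache with 16-byte blocks (both numbers are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define CAPACITY   1024u   /* C: cache capacity in bytes (assumed) */
    #define BLOCK_SIZE 16u     /* B: block size in bytes (assumed)     */
    #define NUM_LINES  (CAPACITY / BLOCK_SIZE)

    int main(void) {
        uint32_t addr = 0x12345678u;

        uint32_t offset = addr % BLOCK_SIZE;               /* low  log2(B)   bits */
        uint32_t index  = (addr / BLOCK_SIZE) % NUM_LINES; /* next log2(C/B) bits */
        uint32_t tag    = addr / CAPACITY;                 /* remaining high bits */

        printf("tag=%x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }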

This is a bad design, because any two addresses that are a multiple of the cache capacity apart have identical index bits and therefore map to the same cache line, causing collisions. Even simple programs can then spend almost all of their accesses missing the cache:

    int a[256];
    int b[256];
    int r = 0;
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 64; j++) {
            // If a[j] and b[j] map to the same cache line (the arrays are a
            // multiple of the cache capacity apart), each access evicts the
            // other, causing many cache misses.
            r += a[j] * b[j];
        }
    }

This effect is called an associativity conflict (a conflict miss). To avoid it, we can use a set associative cache.

2.2 Set Associative Cache

An N-way set associative cache consists of N direct mapped caches (called ways) operating in parallel. To access memory, the index bits select one set (one line per way), the tags of all N lines in that set are compared against the address tag in parallel, and the access hits if any valid line matches.
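
A sketch of the lookup in a small set associative cache (the number of ways, the number of sets, and the block size are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_WAYS   4u
    #define NUM_SETS   64u
    #define BLOCK_SIZE 16u

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    } Line;

    Line cache[NUM_SETS][NUM_WAYS];

    /* Returns true on a hit and sets *way to the matching way. All ways of the
       selected set are checked in parallel in hardware; here it is a loop. */
    bool lookup(uint32_t addr, unsigned *way) {
        uint32_t set = (addr / BLOCK_SIZE) % NUM_SETS;
        uint32_t tag = addr / (BLOCK_SIZE * NUM_SETS);
        for (unsigned w = 0; w < NUM_WAYS; w++) {
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                *way = w;
                return true;
            }
        }
        return false;   /* miss: one way of this set must be chosen for eviction */
    }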

However, this comes at a disadvantage: every access needs N tag comparators and a multiplexer to select the matching way (slower and less energy efficient), and on a miss a replacement policy must choose which way to evict.

2.3 Issues in Cache Architecture

Block Placement is the question of where a block may be placed in the cache. The choice of placement significantly affects cache performance, as it determines how data is organised and accessed. In an N-way set associative cache, a block can be placed in any of the N lines within the set determined by the cache index. More associativity means more comparators (slower, less energy efficient), but a better hit rate and reduced sensitivity to the storage layout of the data. In a fully associative cache, any block can be placed anywhere in the cache.

Block Identification - part of the cache stores the data, and another part stores tags, which identify which block of memory is currently held in a particular cache line. As associativity increases there are fewer sets, so fewer index bits are needed and each tag becomes slightly wider. Increasing the block size reduces the number of cache lines, and with it the total amount of tag storage.

Block Replacement - in a direct mapped cache there is no choice of which block to evict. With associativity there are many possible replacement policies (LRU, random, etc.). The ideal policy evicts the block that will be re-used furthest in the future. LRU can behave pathologically badly (for example when looping over a working set slightly larger than the cache), and beats random replacement only in small caches.
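
A sketch of exact LRU bookkeeping for one set (the age counters are purely illustrative; real caches usually use cheaper approximations such as pseudo-LRU):

    #include <stdint.h>

    #define NUM_WAYS 4u

    /* age[w] counts how long ago way w was last used; the victim is the way
       with the largest age. */
    typedef struct { uint32_t age[NUM_WAYS]; } LruSet;

    /* Call on every access that hits (or fills) way used_way of this set. */
    void lru_touch(LruSet *s, unsigned used_way) {
        for (unsigned w = 0; w < NUM_WAYS; w++) s->age[w]++;
        s->age[used_way] = 0;
    }

    /* Choose the least recently used way as the eviction victim. */
    unsigned lru_victim(const LruSet *s) {
        unsigned victim = 0;
        for (unsigned w = 1; w < NUM_WAYS; w++)
            if (s->age[w] > s->age[victim]) victim = w;
        return victim;
    }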

Write Strategy - with write-through, information is written to both the cache block and lower-level memory. This needs write buffers so the processor does not wait for the lower level, and it keeps lower-level memory up to date, so a read miss never results in a write; it does not solve cache coherence by itself, though. With write-back, a modified cache block is only written to lower-level memory when it is replaced. This needs a dirty bit per block; it can absorb repeated writes to the same block, but requires more complex hardware.
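
A sketch contrasting the two strategies for a single cache line (the field names are illustrative, and the lower level is modelled as a plain variable):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool     valid;
        bool     dirty;      /* only needed for write-back */
        uint32_t tag;
        uint32_t data;       /* one word stands in for the whole block */
    } CacheLine;

    /* Write-through: every store updates both the cache line and the next
       level (in practice via a write buffer so the CPU does not wait). */
    void write_through(CacheLine *line, uint32_t value, uint32_t *lower_level) {
        line->data = value;
        *lower_level = value;
    }

    /* Write-back: a store only marks the line dirty; repeated writes to the
       same block are absorbed by the cache. */
    void write_back(CacheLine *line, uint32_t value) {
        line->data = value;
        line->dirty = true;
    }

    /* On eviction, a write-back cache must flush a dirty line to the next level. */
    void evict(CacheLine *line, uint32_t *lower_level) {
        if (line->dirty) *lower_level = line->data;
        line->valid = false;
        line->dirty = false;
    }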

3. Turing Tax

The Turing tax is the overhead a computer pays for being universal (able to run any program rather than one specific one). A fundamental pillar of computer architecture is reducing the Turing tax. A single-purpose machine pays no such tax: its hardware implements exactly one computation.
