Pipelining & Caching

1. Pipelining

To demonstrate pipelining, we will introduce the MIPS instruction set and datapath.

1.1 MIPS Datapath

The MIPS instruction set is a simple instruction set with a few kinds of basic instructions: register-register arithmetic/logical operations, operations with an immediate operand, loads, stores, and branches.

Here Rs1 and Rs2 name the source registers, Rd the destination register, Op the operation to perform, and Imm an immediate value (constant).

Our goal is to create a machine to execute these instructions - a universal fetch-decode-execute machine. Some basic pseudocode for executing one instruction:

    // Fetch
    instr = memory[pc]; pc += 4;

    // Decode
    rs1 = registers[instr.Rs1];
    rs2 = registers[instr.Rs2];
    imm = sign_extend(instr.Imm);

    // Execute
    op1 = instr.Op == BRANCH ? pc : rs1;
    op2 = is_immediate(instr.Op) ? imm : rs2;
    res = ALU(instr.Op, op1, op2);

    // Memory
    switch (instr.Op) {
        case BRANCH: if (rs1 == 0) pc += imm * 4; break;
        case STORE:  memory[res] = rs1; break;
        case LOAD:   lmd = memory[res]; break;
    }

    // Writeback
    registers[instr.Rd] = instr.Op == LOAD ? lmd : res;

1.2 Pipelining the Datapath

The pseudocode above shows the five stages of the MIPS datapath: Fetch, Decode, Execute, Memory, and Writeback. This sequence of stages can be pipelined: we introduce buffer registers in between each pair of stages to hold the intermediate values, which allows multiple instructions to be in different stages of execution at the same time.

Each clock cycle, each stage does its work and passes its results to the next stage. After an initial latency of five cycles to fill the pipeline, one instruction completes every cycle, giving a throughput of one instruction per cycle. Pipelining also lets us increase the clock rate, which is now determined by the slowest pipeline stage rather than by the whole datapath.

Control signals are also passed along the pipeline to ensure that each stage knows what to do with the data it receives.
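
As a rough sketch of what these buffer registers hold (the struct and field names below are illustrative, not taken from the notes), each stage latches the values and control signals that the following stage will need:

    #include <stdint.h>

    /* Illustrative pipeline registers for the five-stage datapath. Each struct
       is latched at the end of a cycle and read by the next stage a cycle later. */

    typedef struct {            /* Fetch -> Decode */
        uint32_t instr;
        uint32_t pc_plus_4;
    } IF_ID;

    typedef struct {            /* Decode -> Execute */
        uint32_t rs1_val, rs2_val;
        int32_t  imm;
        uint8_t  rd;            /* destination register number */
        uint8_t  op;            /* decoded operation / control */
        uint32_t pc_plus_4;
    } ID_EX;

    typedef struct {            /* Execute -> Memory */
        uint32_t alu_result;
        uint32_t store_val;     /* value to write on a STORE */
        uint8_t  rd;
        uint8_t  op;
    } EX_MEM;

    typedef struct {            /* Memory -> Writeback */
        uint32_t alu_result;
        uint32_t load_val;      /* value loaded on a LOAD */
        uint8_t  rd;
        uint8_t  op;
    } MEM_WB;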

We can think about the pipeline as a sequence of instructions in different stages:

|         | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Instr 1 | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |         |
| Instr 2 |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |
| Instr 3 |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| Instr 4 |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |
| Instr 5 |         |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |

Although the latency of an individual instruction is unchanged, throughput is increased by up to the number of pipeline stages. The rate is limited by the slowest stage, the speedup is reduced by unbalanced stage lengths, and the time to fill and drain the pipeline reduces it further.
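
A small back-of-the-envelope calculation makes these effects concrete (the stage latencies below are made-up numbers): the clock period is set by the slowest stage, and the fill time means the ideal five-times speedup is not reached.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical stage latencies in ns: Fetch, Decode, Execute, Memory, Writeback. */
        double stage_ns[5] = {2.0, 1.0, 2.0, 3.0, 1.0};
        int k = 5;              /* pipeline stages */
        long n = 1000;          /* instructions    */

        double unpipelined_cycle = 0.0, slowest = 0.0;
        for (int i = 0; i < k; i++) {
            unpipelined_cycle += stage_ns[i];
            if (stage_ns[i] > slowest) slowest = stage_ns[i];
        }

        /* Unpipelined: every instruction takes the sum of all stage latencies.
           Pipelined: the clock period is the slowest stage; the first instruction
           takes k cycles (fill), then one instruction completes per cycle. */
        double t_unpipelined = n * unpipelined_cycle;
        double t_pipelined   = (k + (n - 1)) * slowest;

        printf("unpipelined: %.0f ns, pipelined: %.0f ns, speedup: %.2fx\n",
               t_unpipelined, t_pipelined, t_unpipelined / t_pipelined);
        return 0;
    }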

1.3 Hazards

Hazards prevent an instruction from executing during its designated clock cycle. They come in three kinds: structural hazards, data hazards, and control hazards.

1.3.1 Structural Hazards & Bubbles

In the following table we can see a structural hazard: with a single-ported memory, an instruction fetch and a data memory access land in the same cycle (cycles 4 and 5, marked with *):

|         | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Instr 1 | Fetch   | Decode  | Execute | Memory* | Writebk |         |         |         |         |
| Instr 2 |         | Fetch   | Decode  | Execute | Memory* | Writebk |         |         |         |
| Instr 3 |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| Instr 4 |         |         |         | Fetch*  | Decode  | Execute | Memory  | Writebk |         |
| Instr 5 |         |         |         |         | Fetch*  | Decode  | Execute | Memory  | Writebk |

To fix a structural hazard, we have to introduce a pipeline bubble, delaying an instruction so the accesses no longer overlap. However, as the table below shows, this causes a chain reaction of stalls in the following instructions, which hurts performance.

|         | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Instr 1 | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |         |
| Instr 2 |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |
| Instr 3 |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| STALL   |         |         |         | -       | -       | -       | -       | -       |         |
| Instr 4 |         |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |

Instead, we can introduce an instruction buffer that prefetches a block of instructions ahead of time, so that instruction fetches and data accesses rarely collide and pipeline bubbles are avoided, as sketched below.
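
A minimal sketch of such an instruction buffer, modelled as a small FIFO between instruction memory and the decode stage (the depth and the interface are assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define IBUF_SIZE 8   /* assumed buffer depth */

    typedef struct {
        uint32_t entries[IBUF_SIZE];
        int head, tail, count;
    } InstrBuffer;

    /* Fetch ahead whenever instruction memory is free (e.g. while a load or
       store is using the single memory port), so decode rarely has to stall. */
    bool ibuf_push(InstrBuffer *b, uint32_t instr) {
        if (b->count == IBUF_SIZE) return false;   /* buffer full */
        b->entries[b->tail] = instr;
        b->tail = (b->tail + 1) % IBUF_SIZE;
        b->count++;
        return true;
    }

    /* Decode pulls its next instruction from the buffer instead of memory. */
    bool ibuf_pop(InstrBuffer *b, uint32_t *instr) {
        if (b->count == 0) return false;           /* nothing prefetched */
        *instr = b->entries[b->head];
        b->head = (b->head + 1) % IBUF_SIZE;
        b->count--;
        return true;
    }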

1.3.2 Data Hazards & Forwarding

|                     | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|---------|
| Instr 1 (producer)  | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |
| Instr 2 (dependent) |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| Instr 3 (dependent) |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |
| Instr 4 (dependent) |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |

Here, we have a data hazard: the following instructions depend on the result of the first instruction, which has not yet been written back when they read their registers in Decode. This causes stalls in the pipeline.

This is partially resolved by writing the register file in the first half of the cycle and reading it in the second half, so an instruction whose Decode coincides with the producer's Writeback already sees the new value. However, this does not help the instructions in between, which read the register before it has been written.

To better solve this, we can use forwarding: we take a result from a later stage in the pipeline and route it back to an earlier stage that needs it. This is possible because the ALU computes the result before the next instruction's Execute stage needs it. It requires a hardware change: extra bypass paths from the EX/MEM and MEM/WB pipeline registers back to the ALU inputs, with multiplexers and comparison logic to select them.
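
A sketch of the forwarding selection for one ALU operand (the struct and field names are illustrative; register 0 is assumed to be hardwired to zero, as in MIPS):

    #include <stdint.h>
    #include <stdbool.h>

    /* Minimal view of the EX/MEM and MEM/WB pipeline registers: which register
       (if any) the in-flight instruction will write, and the value it produced. */
    typedef struct { bool writes_reg; uint8_t rd; uint32_t value; } FwdSource;

    /* Select the value for one ALU source operand. Prefer the younger producer
       (EX/MEM, one instruction ahead) over the older one (MEM/WB), and fall back
       to the value read from the register file in Decode. */
    uint32_t forward_operand(uint8_t src_reg, uint32_t regfile_val,
                             FwdSource ex_mem, FwdSource mem_wb) {
        if (src_reg != 0 && ex_mem.writes_reg && ex_mem.rd == src_reg)
            return ex_mem.value;      /* result computed last cycle */
        if (src_reg != 0 && mem_wb.writes_reg && mem_wb.rd == src_reg)
            return mem_wb.value;      /* result about to be written back */
        return regfile_val;           /* no hazard: use the register file */
    }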

Forwarding quickly gets complicated with multiple ALUs, since the forwarding logic effectively has to construct a dataflow graph dynamically. Moreover, forwarding doesn't solve all data hazards:

|                     | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|---------|
| Load (producer)     | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |
| Instr 2 (dependent) |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| Instr 3 (dependent) |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |
| Instr 4 (dependent) |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |

This time the value is produced by a load, so it is only available at the end of the Memory stage and cannot be forwarded back in time to the Execute stage of the very next instruction. We therefore have to introduce a load-to-use stall, which is detected at decode time (a sketch of this check follows the table):

|                     | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Load (producer)     | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |         |
| Instr 2 (dependent) |         | Fetch   | Decode  | -       | Execute | Memory  | Writebk |         |         |
| Instr 3             |         |         | Fetch   | -       | Decode  | Execute | Memory  | Writebk |         |
| Instr 4             |         |         |         | -       | Fetch   | Decode  | Execute | Memory  | Writebk |
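
A sketch of that decode-time check (often called a load-use interlock; the names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    /* Stall if the instruction currently in Execute is a load whose destination
       register is one of the source registers of the instruction in Decode. */
    bool must_stall(bool ex_is_load, uint8_t ex_rd,
                    uint8_t id_rs1, uint8_t id_rs2) {
        return ex_is_load && ex_rd != 0 &&
               (ex_rd == id_rs1 || ex_rd == id_rs2);
    }

    /* When must_stall() is true, Fetch and Decode hold their contents for one
       cycle and a bubble (no-op) is injected into Execute; the loaded value can
       then be forwarded from the Memory stage in the following cycle. */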

The lost cycle itself cannot be removed by the hardware of this simple pipeline, but compilers often hide it by scheduling the load earlier, so that an independent instruction fills the slot after the load.

1.3.3 Control Hazards

Control hazards occur when the pipeline continues fetching after a branch whose outcome is not yet known, or predicts that outcome wrongly. We risk a two-cycle stall, as the branch outcome is only discovered at the ALU (Execute) stage: the instructions fetched in cycles 2 and 3 (Instr 2 and Instr 3 below) must be squashed if the branch goes the other way.

|         | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Branch  | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |         |
| Instr 2 |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |         |
| Instr 3 |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |         |
| Instr 4 |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |         |
| Instr 5 |         |         |         |         | Fetch   | Decode  | Execute | Memory  | Writebk |

To minimise the stall we could resolve the branch in the Decode stage instead, resulting in only a one-cycle stall. Alternatively, we can guess the outcome early with branch prediction, keep fetching along the predicted path, and flush the wrongly fetched instructions whenever a prediction turns out to be wrong.
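
One common prediction scheme, shown here only as an illustration (the table size is an assumption), is a table of 2-bit saturating counters indexed by the branch's PC:

    #include <stdint.h>
    #include <stdbool.h>

    #define BHT_ENTRIES 1024   /* assumed table size (power of two) */

    /* 2-bit saturating counters: 0,1 predict not taken; 2,3 predict taken. */
    static uint8_t bht[BHT_ENTRIES];

    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);   /* drop byte offset, then mask */
    }

    bool predict_taken(uint32_t pc) {
        return bht[bht_index(pc)] >= 2;
    }

    /* Called once the branch actually resolves in the Execute stage. */
    void update_predictor(uint32_t pc, bool taken) {
        uint8_t *ctr = &bht[bht_index(pc)];
        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
    }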

1.3.4 Simultaneous Multithreading

If we maintain two program counters (and two sets of architectural registers) and execute two threads on the same core, stall cycles in one thread can be filled with useful instructions from the other, hiding most stalls. This is called Simultaneous Multithreading (SMT), and is used in modern CPUs (e.g. Intel's Hyper-Threading).

1.4 Clock Speed

We face a fundamental choice between more pipeline stages with less work per stage and a higher clock rate, or fewer stages doing more work per stage at a lower clock rate.

Often, main memory access is the slowest pipeline stage.

2. Caches

Since memory is much slower than the CPU, we use a memory hierarchy to bridge the speed gap: the most frequently used data is kept in smaller, faster memories, called caches. This works because of the principle of locality: programs access a relatively small portion of their address space at any instant in time. There are two types of locality: temporal locality (a recently accessed item is likely to be accessed again soon) and spatial locality (items near a recently accessed item are likely to be accessed soon).
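
For example, the order in which a 2D array is traversed changes how much spatial locality is available (a sketch; the array size is arbitrary):

    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    /* Row-major traversal: consecutive accesses touch neighbouring addresses,
       so most of them hit in a cache line that has already been loaded. */
    double sum_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-major traversal: successive accesses are N * sizeof(double) bytes
       apart, so spatial locality is lost and many more accesses miss. */
    double sum_cols(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_rows(), sum_cols());
        return 0;
    }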

2.1 Direct Mapped Cache

A direct mapped cache of C bytes with a block size of B bytes has C/B cache lines, each storing one B-byte block. Each memory address is split into three fields: a block offset (the low bits, selecting a byte within the block), an index (the next bits, selecting a cache line), and a tag (the remaining high bits, identifying which block is stored in that line).

In general, for a direct mapped cache of capacity C bytes, block size B bytes, and address size A bits:

    offset bits = log2(B),  index bits = log2(C / B),  tag bits = A - log2(C)

To access memory: use the index bits to select a cache line, then compare the stored tag (if the line's valid bit is set) with the tag bits of the address. On a match the access is a hit and the offset selects the byte within the block; otherwise it is a miss and the block is fetched from the next level of the hierarchy.
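
A sketch of the address decomposition for an assumed 1 KiB direct mapped cache with 16-byte blocks (both numbers are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define CAPACITY   1024u   /* C: cache capacity in bytes (assumed) */
    #define BLOCK_SIZE 16u     /* B: block size in bytes (assumed)     */
    #define NUM_LINES  (CAPACITY / BLOCK_SIZE)

    int main(void) {
        uint32_t addr = 0x12345678u;

        uint32_t offset = addr % BLOCK_SIZE;               /* low  log2(B)   bits */
        uint32_t index  = (addr / BLOCK_SIZE) % NUM_LINES; /* next log2(C/B) bits */
        uint32_t tag    = addr / CAPACITY;                 /* remaining high bits */

        printf("tag=%x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }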

This is a bad design, because any two addresses that are a multiple of the cache capacity apart have identical index bits and therefore map to the same cache line, causing collisions. Even simple programs can then spend almost all of their accesses missing the cache:

    int a[256];
    int b[256];
    int r = 0;
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 64; j++) {
            // If a[j] and b[j] map to the same cache line (the arrays are a
            // multiple of the cache capacity apart), each access evicts the
            // other, causing many cache misses.
            r += a[j] * b[j];
        }
    }

This effect is called an associativity conflict (a conflict miss). To avoid it, we can use a set associative cache.

2.2 Set Associative Cache

An N-way set associative cache consists of N direct mapped caches (called ways) operating in parallel. To access memory, the index bits select one set (one line per way), the tags of all N lines in that set are compared against the address tag in parallel, and the access hits if any valid line matches.
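
A sketch of the lookup in a small set associative cache (the number of ways, the number of sets, and the block size are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_WAYS   4u
    #define NUM_SETS   64u
    #define BLOCK_SIZE 16u

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    } Line;

    Line cache[NUM_SETS][NUM_WAYS];

    /* Returns true on a hit and sets *way to the matching way. All ways of the
       selected set are checked in parallel in hardware; here it is a loop. */
    bool lookup(uint32_t addr, unsigned *way) {
        uint32_t set = (addr / BLOCK_SIZE) % NUM_SETS;
        uint32_t tag = addr / (BLOCK_SIZE * NUM_SETS);
        for (unsigned w = 0; w < NUM_WAYS; w++) {
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                *way = w;
                return true;
            }
        }
        return false;   /* miss: one way of this set must be chosen for eviction */
    }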

However, this comes at a disadvantage: every access needs N tag comparators and a multiplexer to select the matching way (slower and less energy efficient), and on a miss a replacement policy must choose which way to evict.

2.3 Issues in Cache Architecture

Block Placement is the question of where a block may be placed in the cache. The choice of placement significantly affects cache performance, as it determines how data is organised and accessed. In an N-way set associative cache, a block can be placed in any of the N lines within the set determined by the cache index. More associativity means more comparators (slower, less energy efficient), but a better hit rate and reduced sensitivity to the storage layout of the data. In a fully associative cache, any block can be placed anywhere in the cache.

Block Identification - part of the cache stores the data, and another part stores tags, which identify which block of memory is currently held in a particular cache line. As associativity increases there are fewer sets, so fewer index bits are needed and each tag becomes slightly wider. Increasing the block size reduces the number of cache lines, and with it the total amount of tag storage.

Block Replacement - in a direct mapped cache there is no choice of which block to evict. With associativity there are many possible replacement policies (LRU, random, etc.). The ideal policy evicts the block that will be re-used furthest in the future. LRU can behave pathologically badly (for example when looping over a working set slightly larger than the cache), and beats random replacement only in small caches.
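
A sketch of exact LRU bookkeeping for one set (the age counters are purely illustrative; real caches usually use cheaper approximations such as pseudo-LRU):

    #include <stdint.h>

    #define NUM_WAYS 4u

    /* age[w] counts how long ago way w was last used; the victim is the way
       with the largest age. */
    typedef struct { uint32_t age[NUM_WAYS]; } LruSet;

    /* Call on every access that hits (or fills) way used_way of this set. */
    void lru_touch(LruSet *s, unsigned used_way) {
        for (unsigned w = 0; w < NUM_WAYS; w++) s->age[w]++;
        s->age[used_way] = 0;
    }

    /* Choose the least recently used way as the eviction victim. */
    unsigned lru_victim(const LruSet *s) {
        unsigned victim = 0;
        for (unsigned w = 1; w < NUM_WAYS; w++)
            if (s->age[w] > s->age[victim]) victim = w;
        return victim;
    }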

Write Strategy - with write-through, information is written to both the cache block and lower-level memory. This needs write buffers so the processor does not wait for the lower level, and it keeps lower-level memory up to date, so a read miss never results in a write; it does not solve cache coherence by itself, though. With write-back, a modified cache block is only written to lower-level memory when it is replaced. This needs a dirty bit per block; it can absorb repeated writes to the same block, but requires more complex hardware.
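
A sketch contrasting the two strategies for a single cache line (the field names are illustrative, and the lower level is modelled as a plain variable):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool     valid;
        bool     dirty;      /* only needed for write-back */
        uint32_t tag;
        uint32_t data;       /* one word stands in for the whole block */
    } CacheLine;

    /* Write-through: every store updates both the cache line and the next
       level (in practice via a write buffer so the CPU does not wait). */
    void write_through(CacheLine *line, uint32_t value, uint32_t *lower_level) {
        line->data = value;
        *lower_level = value;
    }

    /* Write-back: a store only marks the line dirty; repeated writes to the
       same block are absorbed by the cache. */
    void write_back(CacheLine *line, uint32_t value) {
        line->data = value;
        line->dirty = true;
    }

    /* On eviction, a write-back cache must flush a dirty line to the next level. */
    void evict(CacheLine *line, uint32_t *lower_level) {
        if (line->dirty) *lower_level = line->data;
        line->valid = false;
        line->dirty = false;
    }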

3. Turing Tax

The Turing tax is the overhead a computer pays for being universal (able to run any program rather than one specific one). A fundamental pillar of computer architecture is reducing the Turing tax. A single-purpose machine pays no such tax: its hardware implements exactly one computation.
