Branch Prediction
1. Avoiding Branch Prediction
Control hazards are a problem in any pipelined architecture. Branches are very common. Speculative dynamic instruction scheduling enables speculating many instructions, so we need good branch prediction to avoid flushing the pipeline too often. We could try avoiding branch prediction by:
- Hyperthreading enough to avoid stalls. (No branch prediction needed)
- Predication use a p-register to indicate whether to execute an instruction. This is part of the instruction encoding. This converts a control hazard into a data dependency.
- Delayed branch - the instruction after a branch is always executed, regardless of whether the branch is taken. The compiler fills this delay slot with useful work (usually
of the time). This can be improved with cancelling branches - where the delay slot is executed with disabled writeback, and if the branch is taken/not taken, the writeback is enabled. This produces likely-taken and likely-not-taken branches, with different speculative behaviours.
2. Branch Prediction
A branch predictor fetches the predicted next instruction without stalls, before the preceeding instruction has been decoded. There are two types of branch predictors:
- Direction Predictor - predicts conditional branches as taken/not taken.
- Target Predictor - predicts the target address of a branch instruction.
3. Branch Direction Prediction
3.1 Branch History Table
A branch history table uses lower bits of the PC to index into a table of 1-bit values. indicating whether the branch was taken/not taken last time. This is a 1-bit predictor. However:
- We don't do an address check, so aliasing is possible (two branches map to the same entry).
- A misprediction occurs twice in a loop (once when entering, once when exiting).
Instead, we can use 2-bits. Now, if we encounter a not-taken branch, we increment. If we encounter a taken branch, we decrement. When the value is 0 or 1, we predict not taken. When the value is 2 or 3, we predict taken. This adds hysteresis to the prediction, making it more resilient to noise and improving accuracy. This works much better for nested loops. This generalises to an n-bit saturating up-down counter.
Most branches are highly biased. This scheme works badly for branches that arent.
3.2 G Selector
The bimodal predictor shown above uses local history to predict branching. However, we also know that many sequenced branching (e.g. many if statements) are usually correlated. A global history is the taken/not-taken history of
In a G-Selector predictor, we use the BHR to index into a table of 2-bit counters. This works because certain patterns of branching are correlated. For example, in nested loops, the inner loop branches are correlated with the outer loop branches.
An
4. Branch Target Prediction
A branch target buffer maps branch program counters to their predicted next program counter. It is indexed by lower bits of the PC, and tagged with higher bits of the PC to avoid aliasing. Each entry contains:
- Branch program counter tag
- Predicted target address
- Extra prediction state bits
This buffer is accessed in parallel with the instruction cache in the fetch stage. It is only updated when taken branches are committed. In the decode stage, we check if the prediction was correct. If not, we override the program counter for the fetch stage and squash (disable) the memory and writeback stages of the mispredicted instructions.
We can combine direction and target prediction by only overriding the program counter if the branch is predicted taken.
Often, good prediction algorithms are slow. We could use a simple branch prediction in 1 cycle, and a more complex one to re-steer it in later cycles.
4.1 Return Addresses
A function may be called from different places, so we must store the return address of each call. A branch target buffer cannot do this, as it only stores one target per branch instruction.
- On some architectures (e.g. x86)
jsr(Jump to SubRoutine) sstores the return address on a dedicated return address stack. - In MIPS,
jal(Jump And Link) stores the return address in a register ($ra). Up to the compiler to save$rawhen nesting function calls.
A stack structure should make it very easy to predict return addresses. This attempts to mirror a program's call-return stack, updated when call and return functions are executed (after misprediction detection). The top of the stack is used as the predicted return address. This could go wrong by:
- Call stack is deeper than return address stack, so we would have to use the branch target buffer, which may be wrong.
- The return address was overwritten (e.g. by
jal), so the return address stack is now wrong. - The stack pointer was changed (e.g. by
pop/push), so the return address stack is now wrong. This could happen by switching threads.