Branch Prediction

1. Avoiding Branch Prediction

Control hazards are a problem in any pipelined architecture. Branches are very common. Speculative dynamic instruction scheduling enables speculating many instructions, so we need good branch prediction to avoid flushing the pipeline too often. We could try avoiding branch prediction by:

2. Branch Prediction

A branch predictor fetches the predicted next instruction without stalls, before the preceeding instruction has been decoded. There are two types of branch predictors:

3. Branch Direction Prediction

3.1 Branch History Table

A branch history table uses lower bits of the PC to index into a table of 1-bit values. indicating whether the branch was taken/not taken last time. This is a 1-bit predictor. However:

Instead, we can use 2-bits. Now, if we encounter a not-taken branch, we increment. If we encounter a taken branch, we decrement. When the value is 0 or 1, we predict not taken. When the value is 2 or 3, we predict taken. This adds hysteresis to the prediction, making it more resilient to noise and improving accuracy. This works much better for nested loops. This generalises to an n-bit saturating up-down counter.

Most branches are highly biased. This scheme works badly for branches that arent.

3.2 G Selector

The bimodal predictor shown above uses local history to predict branching. However, we also know that many sequenced branching (e.g. many if statements) are usually correlated. A global history is the taken/not-taken history of previous branches using an bit Branch History Register (BHR).

In a G-Selector predictor, we use the BHR to index into a table of 2-bit counters. This works because certain patterns of branching are correlated. For example, in nested loops, the inner loop branches are correlated with the outer loop branches.

An G-Selector uses has -bit BHTs.

4. Branch Target Prediction

A branch target buffer maps branch program counters to their predicted next program counter. It is indexed by lower bits of the PC, and tagged with higher bits of the PC to avoid aliasing. Each entry contains:

This buffer is accessed in parallel with the instruction cache in the fetch stage. It is only updated when taken branches are committed. In the decode stage, we check if the prediction was correct. If not, we override the program counter for the fetch stage and squash (disable) the memory and writeback stages of the mispredicted instructions.

We can combine direction and target prediction by only overriding the program counter if the branch is predicted taken.

Often, good prediction algorithms are slow. We could use a simple branch prediction in 1 cycle, and a more complex one to re-steer it in later cycles.

4.1 Return Addresses

A function may be called from different places, so we must store the return address of each call. A branch target buffer cannot do this, as it only stores one target per branch instruction.

A stack structure should make it very easy to predict return addresses. This attempts to mirror a program's call-return stack, updated when call and return functions are executed (after misprediction detection). The top of the stack is used as the predicted return address. This could go wrong by:

Back to Home