Dynamic Scheduling
1. Instruction Parallelism
We want to allow instructions behind a stall to proceed. This enables out-of-order execution and allows out-of-order completion. We need to distinguish between when an instruction is issued, begins exeuction and completes execution. In a dynamically scheduled pipeline, all instructions are issued in-order.
Execution order is constrained by data dependencies, name dependencies and control dependencies. Instruction
- Read After Write (RAW) Hazard:
reads a location that writes to. This is a true dependency. (Data dependency) - Write After Read (WAR) Hazard:
writes to a location that reads from. This is an anti-dependency. (Name dependency) - Write After Write (WAW) Hazard:
writes to a location that writes to. This is an output dependency. (Name dependency)
| RAW Hazard | WAR Hazard | WAW Hazard | |
|---|---|---|---|
2. Tomasulo
To enable dynamic scheduling, we need to:
- Change instruction decode step to instruction issue. This decodes, then checks for structural hazards.
- Introduce read operands stage before execute. This waits until there are no data hazards, then reads the operands.
In Tomasulo's design, registers each have a tag that indicates which unit will produce the value. If the tag is empty, the register contains the most recent value. If not, it is waiting for a value to be produced. Each unit is given a reservation station which waits for correct operands to be available and then executes the instruction. When an instruction is issued:
- ISSUE: An idle reservation station is selected. If the operands are stored at valid registers (no tag), set them. Set the destination register's tag to the reservation station's ID. Overwrite if necessary.
- EXECUTE: If an operand becomes available on the control bus with a matching tag, set it. When all the operands are available, execute the instruction.
- WRITE RESULT: When the result is available, broadcast it and the reservation station's ID on the control bus. This will update the registers and waiting reservation stations.
We have two databuses:
- Normal DB: Delivering operands, tags and instructions at the issue stage.
- Common DB: Delivering results and tags at the write result stage.
This design is quite complicated, having many associative stores at high speed. Performance is limited by the common DB, due to its high wiring density. Additionally, we can only handle one completion per cycle. Additionally, it has non precise interrupts.
Power of Tomasulo
Tomasulo's scheme can even execute loop iterations in parallel. It effectively implements register renaming with the CDB! It is building the data flow dependency graph on the fly.
2.1 Reorder Buffer
However, Tomasulo's scheme doesn't handle exceptions such as page faults & divide by zero (we don't know when the program errored). We also can't rollback after a branch misprediction, as instructions may have already written their results. To fix these, we can add a commit stage to the Tomasulo pipeline, and maintain a reorder buffer:
- During issue, we also allocate an entry in the reorder buffer (ROB). We give one entry per in-flight instruction. This holds the destination register, and either the value (if ready) or a tag (if not ready). The reorder buffer is a queue to maintain program order.
- During writeback, we also update the ROB entry with a matching tag.
- During commit, we process the ROB entries in order. Each instruction waits its turn and commits when operands are completed. Committed registers (duplicates of the original registers) are updated with values from the ROB.
During commit, if the prediction does not match the actual result in ROB, we flush the ROB, reset issue-side registers to committed registers, and restart fetching from the correct address. However:
- We do not consider the misprediction until the instruction reaches the head of the ROB.
- We cannot commit more than one instruction per cycle.
2.2 Speculative Stores & Loads
We need to make sure stores aren't sent to memory until the instruction is committed. But, we need to make sure loads on the same location still work.
Store to Load Forwarding is when we have an uncommitted store followed by a load to the same address. The load can get the value from the store buffer instead of memory. The problem is that real computers use computed addresses which means we don't know an address until execution. We could speculate that the load and store addresses are different, and if they turn out to be the same, we flush and restart.
2.3 Register Update Unit
Register Update Units (RUUs) combine instruction tracking and register updates in one structure for out-of-order execution. Reorder Buffers (ROBs), by contrast, store results separately to ensure in-order retirement and precise exceptions. RUUs integrate more functions, while ROBs simplify commit handling.
In ROB, registers and ROB entries have a tag, so every register, ROB entry and reservation station needs a comparator to monitor the databus. In RUU, the tags are the ROB entry numbers, so the ROB is indexed by the tag on the databus.