Lecture from: 06.04.2023 | Video: YT
Today, we continue our discussion of Pipelined Processor Design. We delve deeper into handling the critical issues that arise in pipelined processors, particularly data and control dependencies, and introduce precise exception handling, which is essential for maintaining correct program semantics when instructions complete out of program order.
Review: Pipelining Challenges
Let’s quickly review the core concepts and challenges from our previous discussion on pipelining.
We saw that a pipelined processor divides the instruction execution into stages (Fetch, Decode, Execute, Memory, Writeback), allowing multiple instructions to be processed concurrently in different stages. Pipeline registers are placed between stages to isolate them and pass data and control signals.
However, this leads to several issues that prevent the pipeline from achieving its ideal speedup (equal to the number of stages):
- Stall: A condition where part of the pipeline stops moving: instructions in some stages are temporarily held in place while the stages after them continue to drain.
- Resource Contention: When multiple instructions in different stages require the same hardware resource concurrently (e.g., two instructions needing to write to the same register file port in the same cycle).
- Dependencies: When an instruction needs a result or information that hasn’t been produced by a preceding instruction yet. This includes Data Dependencies and Control Dependencies.
- Long-Latency Operations: When an operation in a pipeline stage (like a memory access or a complex arithmetic operation) takes longer than the pipeline’s clock cycle.
Review: Data Dependence Types
Recall the different types of data dependencies:
- Flow Dependence (Read-After-Write - RAW): A true dependence where a later instruction needs a value written by an earlier instruction. This must be obeyed for correctness.
- Anti Dependence (Write-After-Read - WAR): A later instruction writes a register that an earlier instruction reads. A false dependence.
- Output Dependence (Write-After-Write - WAW): A later instruction writes a register that an earlier instruction also writes. Another false dependence.
Flow dependencies (RAW) are the most critical type to handle for correctness in pipelined processors. Anti and Output dependencies, being false dependencies arising from limited register names, can often be handled with techniques like Register Renaming (discussed later).
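To make these categories concrete, here is a minimal sketch (not from the lecture; modeling an instruction as a destination register plus a list of source registers is an illustrative assumption) that classifies the dependencies between an earlier and a later instruction:

```python
# Classify the register dependencies between an earlier and a later
# instruction, each modeled as a destination register and a list of
# source registers. Register names like "r1" are illustrative.

def classify(earlier_dst, earlier_srcs, later_dst, later_srcs):
    deps = []
    if earlier_dst in later_srcs:
        deps.append("RAW (flow)")    # later reads what earlier wrote
    if later_dst in earlier_srcs:
        deps.append("WAR (anti)")    # later overwrites an earlier source
    if later_dst == earlier_dst:
        deps.append("WAW (output)")  # both write the same register
    return deps

# i1: r1 = r2 + r3, followed by i2: r2 = r1 + r4
print(classify("r1", ["r2", "r3"], "r2", ["r1", "r4"]))
# -> ['RAW (flow)', 'WAR (anti)']: one true dependence, one false one
```

Only the RAW entry forces an ordering; the WAR entry would disappear if i2 wrote its result to a different (renamed) register.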
How to Implement Stalling
When a dependence or resource contention prevents an instruction from proceeding, the pipeline must stall.
- Mechanism: To stall an instruction in a particular stage (e.g., Decode), you disable the write enable for the pipeline register before that stage (e.g., the IF/ID register). This prevents the stalled instruction from moving forward and also prevents new instructions from being fetched into the pipeline (by disabling the PC update).
- Bubbles: To keep the stages downstream of the stalled instruction doing something harmless, "invalid" instructions, or "bubbles", are inserted into the stage following the stalled instruction. This is typically done by clearing the pipeline register after the stalled stage (e.g., clearing the ID/EX register when the instruction in the ID stage stalls). A software sketch of this detect, freeze, and bubble sequence follows this list.
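As a minimal sketch (the stall condition is the classic load-use check; the field names mem_read, rs, and rt are assumptions, and a real hazard-detection unit is combinational logic rather than Python), one cycle of stall handling might look like this:

```python
# Load-use hazard detection, modeled in software. "id_ex" describes the
# instruction currently in Execute, "if_id" the one in Decode.

def should_stall(id_ex, if_id):
    # Stall when EX holds a load whose destination register (rt) is a
    # source register (rs or rt) of the instruction sitting in Decode.
    return id_ex["mem_read"] and id_ex["rt"] in (if_id["rs"], if_id["rt"])

def step(pipeline):
    if should_stall(pipeline["id_ex"], pipeline["if_id"]):
        # Freeze: PC and IF/ID keep their values (write enables disabled).
        # Bubble: clear ID/EX so a harmless NOP flows down the pipeline.
        pipeline["id_ex"] = {"mem_read": False, "rt": None}
    else:
        pipeline["id_ex"] = pipeline["if_id"]  # normal advance (simplified)

# lw r2, 0(r1) in EX, add r4, r2, r3 in ID: the dependent add must stall.
print(should_stall({"mem_read": True, "rt": 2}, {"rs": 2, "rt": 3}))  # True
```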
Handling Dependencies
Handling Data Dependencies (RAW): Stall and Forwarding
As we saw, simply stalling for every RAW dependency can be very inefficient, especially if the result is available early.
- The Problem: A consumer instruction needs a value produced by a producer instruction. The producer writes the value, but it takes several cycles for that value to be written back to the register file. If the consumer needs the value before it’s written back, a stall is necessary.
- The Observation: The result value is often available before it’s written back to the register file. It might be available at the output of the ALU (EX stage) or the output of the Data Memory (MEM stage).
- The Idea: Data Forwarding (Bypassing): Add data forwarding paths (buses or wires) directly from the outputs of the stages where a result becomes available to the inputs of the stages where it is needed. Dependence detection logic identifies the dependency and controls multiplexers at the input of the dependent stage to select the value from the appropriate forwarding path instead of the (stale) value from the register file. A minimal sketch of this selection logic appears after this list.
- Forwarding Is Not Always Possible (Load-Use, or MEM-EX, Hazard): While forwarding resolves many RAW dependencies, sometimes the value is simply not produced early enough to be forwarded without violating the clock cycle time. The classic example is a Load Word (LW) instruction followed immediately by an instruction that needs the loaded data. LW reads data memory in the MEM stage, so the data becomes available only toward the end of that stage. But the dependent instruction sits in the EX stage during that very cycle and needs the value at the start of it. Forwarding from the MEM output back to the EX input within the same cycle would require the value before it exists, or a combinational path long enough to lengthen the clock cycle and break the critical path design principle. In this case, even with forwarding, a 1-cycle stall is necessary; after the stall, the loaded value can be forwarded from the MEM/WB boundary to the EX stage.
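A minimal sketch of the forwarding unit's priority selection for one ALU operand (the stage-register field names reg_write, dst, alu_out, and result are illustrative assumptions, not the lecture's signal names):

```python
# Choose where one ALU source operand comes from: the EX/MEM pipeline
# register (youngest producer), the MEM/WB register, or the register file.

def select_operand(src_reg, regfile_value, ex_mem, mem_wb):
    # Priority matters: if both in-flight instructions write src_reg,
    # the younger result in EX/MEM is the correct one to forward.
    if ex_mem["reg_write"] and ex_mem["dst"] == src_reg and src_reg != 0:
        return ex_mem["alu_out"]      # forward ALU result from end of EX
    if mem_wb["reg_write"] and mem_wb["dst"] == src_reg and src_reg != 0:
        return mem_wb["result"]       # forward from end of MEM/writeback
    return regfile_value              # no in-flight producer: register file

# Two in-flight writers of r1: the younger EX/MEM value must win.
print(select_operand(1, 0xAAAA,
                     {"reg_write": True, "dst": 1, "alu_out": 0xBBBB},
                     {"reg_write": True, "dst": 1, "result": 0xCCCC}))
# -> 0xBBBB
```

The src_reg != 0 checks mirror the MIPS convention that register 0 is hardwired to zero and must never be forwarded.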
Handling Control Dependencies (Branch Prediction)
Control dependencies, primarily branches and jumps, pose a significant challenge. A conditional branch’s outcome (taken or not taken) determines the address of the next instruction to fetch. However, the Fetch stage needs to fetch an instruction every cycle to keep the pipeline full.
- The Problem: The branch outcome is typically determined late in the pipeline (e.g., in the EX stage after the comparison). By the time the outcome is known, several instructions from the predicted path have already been fetched into the pipeline. If the prediction was wrong, these instructions are incorrect and must be discarded (flushed).
- Branch Misprediction Penalty: Flushing instructions results in wasted work and empty slots (bubbles) in the pipeline, which is the performance penalty paid for incorrect branch prediction.
- Reducing Penalty: Early Branch Resolution: One way to reduce the misprediction penalty is to resolve the branch outcome earlier in the pipeline (e.g., in the ID stage). This requires adding hardware to the earlier stage to perform the branch comparison and target address calculation. While this reduces the penalty (fewer instructions are fetched down the wrong path before the correction), it adds complexity and potentially lengthens the clock cycle if the branch resolution logic becomes the new critical path in that earlier stage.
- Improving Accuracy: Smarter Branch Prediction: The frequency of mispredictions can be reduced with more sophisticated branch prediction. Instead of a simple "always not taken" policy, processors use hardware structures that record past branch behavior and use that history to predict future outcomes. These predictors can become quite complex, some even employing machine-learning-style algorithms such as perceptron predictors. Higher prediction accuracy means fewer flushes and better performance. A sketch of one classic history-based predictor follows this list.
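One classic history-based scheme is a table of 2-bit saturating counters indexed by the branch PC. The sketch below (a simplified, tagless, direct-mapped table; the table size is an illustrative choice) shows the idea:

```python
# 2-bit saturating-counter branch predictor. Counter states:
# 0 = strongly not taken, 1 = weakly not taken,
# 2 = weakly taken,       3 = strongly taken.

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [1] * entries      # start weakly not-taken
        self.mask = entries - 1         # entries must be a power of two

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2   # True means "taken"

    def update(self, pc, taken):
        # Saturation means one misprediction does not flip a "strong"
        # state, which protects loop branches from their single exit.
        i = pc & self.mask
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

p = TwoBitPredictor()
for outcome in [True, True, True, False, True]:   # loop-like behavior
    print(p.predict(0x400), end=" ")              # False True True True True
    p.update(0x400, outcome)
```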
Pipelined Performance Example (Review)
Let’s review the performance analysis we did for a specific 5-stage pipelined MIPS processor incorporating forwarding, stalls, and early branch resolution with a probabilistic misprediction rate.
- Assumptions: We used a program profile (instruction mix) and component delay assumptions. We modeled 1-cycle stalls for MEM-EX hazards and a 2-cycle flush penalty for branch mispredictions (25% misprediction rate assumed for branches).
- Clock Cycle Time (T_c): Based on our component delays and the pipeline stage delays (including the logic for forwarding, early branch resolution, etc.), the critical path set the clock cycle time to 550 ps. This is much shorter than the single-cycle T_c (925 ps), but notably more than one-fifth of it: pipeline register overhead and the added control logic keep the slowest stage well above the ideal per-stage delay.
- Average CPI: Accounting for stalls and flushes based on the instruction mix and assumed stall/misprediction rates, the average CPI was calculated to be approximately 1.15. This is higher than the ideal CPI of 1.0 for a fully utilized pipeline, reflecting the performance impact of stalls and flushes.
- Execution Time: With a total of 100 billion instructions in the program, the execution time comes to (100 x 10^9 instructions) * (1.15 CPI) * (550 x 10^-12 s/cycle) ≈ 63 seconds.
The three calculations, step by step:
- CPI calculation: average CPI = sum over instruction classes of (fraction of class) * (CPI of class), where stalls and flushes push a class's CPI above 1 (e.g., loads: 1 + fraction stalled * 1 stall cycle; branches: 1 + misprediction rate * flush penalty). With the assumed mix and penalties, this sums to ≈ 1.15.
- Cycle time calculation: T_c = (delay of the slowest pipeline stage) + (pipeline register overhead) = 550 ps.
- Execution time: time = #instructions * CPI * T_c = (100 x 10^9) * 1.15 * (550 x 10^-12 s) ≈ 63 seconds.
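A worked version of these steps, with a caveat: the instruction mix and the 40% load-use rate below are not stated in these notes; they are assumptions borrowed from the Harris & Harris example this lecture follows, and the jump penalty is assumed to be absorbed (CPI 1) so the numbers land on the quoted ~1.15.

```python
# Reproduce the lecture's performance numbers under assumed parameters.
# Mix and load-use rate: assumptions from the Harris & Harris example.

mix = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "rtype": 0.52}
cpi = {
    "load":   1 + 0.40 * 1,   # assumed: 40% of loads stall one cycle
    "branch": 1 + 0.25 * 2,   # 25% mispredicted, 2-cycle flush (as above)
    "store":  1, "jump": 1, "rtype": 1,   # jump penalty assumed hidden
}

avg_cpi = sum(mix[k] * cpi[k] for k in mix)     # 1.155, i.e. ~1.15
t_c = 550e-12                                   # cycle time in seconds
exec_time = 100e9 * avg_cpi * t_c               # ~63.5 s, i.e. ~63 s
print(f"CPI = {avg_cpi:.3f}, execution time = {exec_time:.1f} s")
print(f"speedup vs. single-cycle (92.5 s): {92.5 / exec_time:.2f}x")
```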
Comparing the execution times:
- Single-cycle: ~92.5 seconds (100 x 10^9 instructions * 1.0 CPI * 925 x 10^-12 s/cycle)
- Multi-cycle (basic, from readings): 133 seconds
- Pipelined (with forwarding, stalls, early branch resolution): 63 seconds
In this example, the pipelined implementation is the fastest. However, the speedup over the single-cycle design (92.5 / 63 ≈ 1.5X) is far below the ideal 5X one might expect from a 5-stage pipeline. The gap has two causes: the average CPI rose from 1.0 to about 1.15 because of stalls and flushes, and the clock cycle time did not shrink to one-fifth of the single-cycle time because of pipeline register overhead and control logic delays. The speedup over the basic multi-cycle design (133 / 63 ≈ 2.1X) is likewise below the theoretical maximum. This reinforces that handling pipeline issues well is crucial: their imperfect handling limits the achievable performance.
Hardware vs. Software Scheduling (Review)
This discussion leads back to the fundamental trade-off between handling issues in hardware versus software.
- Software-based Scheduling (Static Scheduling): Compiler determines instruction order at compile time. Hardware executes instructions in that fixed order. Compiler tries to minimize stalls by reordering independent instructions and inserting NOPs (bubbles).
- Hardware-based Scheduling (Dynamic Scheduling): Hardware determines instruction execution order at run time, potentially reordering instructions to avoid stalls.
The compiler’s ability to minimize stalls through static scheduling is limited because it doesn’t know everything that happens at run time (e.g., variable memory latency, actual branch outcomes, cache hits/misses). Profiling can help the compiler make better guesses, but they are still predictions.
Hardware, on the other hand, knows the exact run-time conditions, but adding hardware to handle all potential reorderings and dependencies dynamically increases complexity and power consumption.
Fine-Grained Multithreading: A Different Approach
Fine-grained multithreading is another approach to handling pipeline latency and dependencies.
- Idea: Instead of executing instructions from a single thread sequentially (even if pipelined), fetch instructions from different, independent threads in consecutive cycles and interleave their execution in the pipeline.
- Mechanism: The hardware must support multiple thread contexts (each with its own PC and set of registers). Thread selection logic chooses which thread's instruction to fetch each cycle (a round-robin sketch follows this list).
- Advantage: Since instructions from different threads are generally independent, there’s no need for complex data dependence forwarding or control dependence prediction/stalling logic between instructions of different threads. Latency for one thread’s instruction is hidden by executing useful work from other threads. This improves pipeline utilization and overall thread-level throughput.
- Disadvantage: A single thread’s performance suffers because its instructions are no longer processed back-to-back in the pipeline; they are interleaved with instructions from other threads. This increases the latency for that individual thread. It also adds hardware complexity for managing multiple thread contexts and the thread selection logic.
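A minimal sketch of round-robin thread selection (the Context structure, the thread count, and the fixed 4-byte instruction size are illustrative assumptions):

```python
from itertools import cycle

# Each thread context has its own PC and register state; the pipeline
# fetches from a different context every cycle.

class Context:
    def __init__(self, pc):
        self.pc = pc
        self.regs = [0] * 32

contexts = [Context(pc=0x1000 * t) for t in range(4)]   # 4 hardware threads
rr = cycle(range(len(contexts)))                        # round-robin order

def fetch_next():
    # Consecutive cycles fetch from different threads, so back-to-back
    # instructions in the pipeline are independent by construction.
    t = next(rr)
    pc = contexts[t].pc
    contexts[t].pc += 4        # assume fixed 4-byte instructions
    return t, pc

for _ in range(5):
    print(fetch_next())        # (0, 0x0), (1, 0x1000), (2, 0x2000), ...
```

With at least as many threads as pipeline stages, no two adjacent stages ever hold instructions from the same thread, so the forwarding and stall logic between them can be removed entirely; the price is that each individual thread advances only every Nth cycle.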
Fine-grained multithreading is a prominent technique in throughput-oriented architectures like modern GPUs, where high thread-level parallelism is available and maximizing overall throughput is more critical than minimizing the latency of a single thread.
Pipelining and Precise Exceptions: Preserving Sequential Semantics
One of the most significant complexities introduced by pipelining and out-of-order execution is handling Exceptions and Interrupts while maintaining a precise architectural state.
- Exceptions: Events originating within an instruction’s execution (e.g., division by zero, overflow, page fault, undefined opcode).
- Interrupts: Events external to the current instruction stream, needing immediate attention (e.g., I/O device signaling completion, timer expiration, power failure).
We’ll continue this next time…
Conclusion
Today, we solidified our understanding of pipelined processor design by examining the critical issues of data and control dependencies and how they are handled using techniques like stalling, data forwarding, and branch prediction.
This concludes our lecture before the Easter break. After the break, we will dive into Out-of-Order Execution, a powerful paradigm that builds upon pipelining, forwarding, and the Reorder Buffer to achieve even higher performance by allowing instructions to execute and complete out of their original program order.
Continue here: 14 Precise Exceptions & Register Renaming