Lecture from: 31.03.2023 | Video: YT
Today, we build upon our previous discussions on microarchitecture by introducing Pipelining, a fundamental technique for dramatically improving processor performance through increased concurrency. We will understand the basic idea, its benefits, the issues it introduces, and fundamental techniques for handling some of these issues.
Review: Single-Cycle Microprocessor Limitations
Let’s quickly revisit the single-cycle microarchitecture we discussed.
We established that while simple in concept, the single-cycle design’s clock cycle time is limited by the longest instruction’s execution time through the combinational logic. This makes it slow and inefficient, as all instructions, even simple ones, take as long as the slowest instruction.
We then improved this with the multi-cycle microarchitecture by breaking instruction execution into multiple cycles, reducing the clock cycle time.
While the multi-cycle approach allows for a faster clock and component reuse, a key limitation remains: Limited Concurrency. At any given clock cycle, only one instruction is actively progressing through its execution phases in the datapath. Many hardware resources are idle while other parts of the datapath are busy (e.g., the fetch logic is idle during execute/memory stages, the ALU is idle during memory access stages).
Pipelining: The Basic Idea
The core idea of Pipelining is to exploit this idle hardware to increase the overall throughput of the processor.
- Goal: Achieve more concurrency in instruction processing, leading to higher instruction throughput (more instructions completed per unit of time), with a relatively small increase in hardware cost.
- Idea: Divide the instruction processing cycle (Fetch, Decode, Execute, Memory, Writeback) into distinct “stages”. Design the hardware such that there are enough resources to process one instruction in each stage concurrently. Then, process a different instruction in each stage simultaneously. Instructions consecutive in program order are processed in consecutive stages of the pipeline.
This is analogous to an assembly line. Instead of one person building an entire car from start to finish (like a single-cycle or multi-cycle processor), multiple people work on different cars in parallel at different stations along the assembly line.
Analogies for Pipelining
Let’s use a familiar laundry analogy to understand the concept:
- Steps: Wash, Dry, Fold, Put Away. Assume each step takes 30 minutes.
- Sequential (Non-pipelined): Processing 4 loads takes 4 loads * (4 steps/load * 30 mins/step) = 8 hours. You finish one load every 2 hours.
- Pipelined: Once the first load finishes washing, you move it to the dryer and start washing the second load. Once the first load finishes drying, you move it to folding, move the second load to the dryer, and start washing the third load.
- Result (Ideal Pipelined): After an initial fill-up period, you finish one load every 30 minutes (the time of one stage). Sequential throughput is 1 load / 2 hours = 0.5 loads/hour; pipelined steady-state throughput is 1 load / 30 mins = 2 loads/hour. Throughput increases by 4X, the number of stages, while the latency of a single load remains 2 hours. (The arithmetic is sketched in code after this list.)
- In Practice (Slowest Stage Bottleneck): If the dryer takes 1 hour instead of 30 minutes, it becomes the bottleneck. In the sequential case, each load now takes 2.5 hours. In the pipelined case, you cannot start drying the next load until the previous one finishes, so the dryer dictates the pace: you finish one load every hour, not every 30 minutes. The slowest stage determines the pipeline’s throughput.
- Solution to Bottleneck: To restore throughput, you can either make the slowest stage faster (buy a faster dryer: expensive, potentially technologically limited) or add more resources to the slowest stage (buy another dryer: increases hardware cost, but restores throughput by parallelizing the bottleneck stage).
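To make the throughput arithmetic concrete, here is a minimal sketch in Python. The function names and the synchronous model (every stage advances at the slowest stage’s pace) are illustrative assumptions, not from the lecture:

```python
# Stage times in minutes for the laundry analogy: Wash, Dry, Fold, Put Away.
stage_times = {"wash": 30, "dry": 30, "fold": 30, "put_away": 30}

def sequential_minutes(num_loads, stage_times):
    """Each load runs all four steps to completion before the next starts."""
    return num_loads * sum(stage_times.values())

def pipelined_minutes(num_loads, stage_times):
    """Synchronous pipeline: every stage advances at the slowest stage's pace,
    so total time = (num_stages + num_loads - 1) * bottleneck."""
    bottleneck = max(stage_times.values())
    return (len(stage_times) + num_loads - 1) * bottleneck

print(sequential_minutes(4, stage_times))  # 480 minutes (8 hours)
print(pipelined_minutes(4, stage_times))   # (4 + 3) * 30 = 210 minutes

# A 60-minute dryer becomes the bottleneck and sets the steady-state pace:
stage_times["dry"] = 60
print(pipelined_minutes(4, stage_times))   # (4 + 3) * 60 = 420 minutes
```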
Pipelining Instruction Processing
We apply these same principles to the instruction processing cycle.
- Instruction Processing Cycle: We typically divide it into stages: Fetch, Decode, Execute, Memory, Writeback (IF, ID, EX, MEM, WB).
- Pipelined Processor: We design dedicated hardware for each stage. Instructions move from one stage to the next every clock cycle. While Instruction 1 is in the EX stage, Instruction 2 is in the ID stage, and Instruction 3 is in the IF stage.
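A quick way to visualize this overlap is to print the stage each instruction occupies in each cycle. A small sketch, assuming an ideal 5-stage pipeline with no stalls:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(num_instructions):
    """Print which stage each instruction occupies in each cycle,
    assuming an ideal pipeline with no stalls: instruction i enters
    IF in cycle i and advances one stage per cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    print("      " + " ".join(f"C{c:<3}" for c in range(1, total_cycles + 1)))
    for i in range(num_instructions):
        cells = []
        for c in range(total_cycles):
            s = c - i  # stage index occupied by instruction i in cycle c
            cells.append(f"{STAGES[s]:<4}" if 0 <= s < len(STAGES) else "    ")
        print(f"I{i + 1:<4} " + " ".join(cells))

pipeline_diagram(3)
# I1 runs IF ID EX MEM WB in cycles 1-5; I2 is one cycle behind; I3, two.
```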
Dividing the Single-Cycle Datapath Into Stages
We can conceptually divide our single-cycle datapath into pipeline stages. For instance, a 5-stage pipeline might correspond roughly to:
- IF (Instruction Fetch): Access Instruction Memory using PC.
- ID (Instruction Decode / Register Fetch): Decode instruction, read source registers from Register File.
- EX (Execute / Address Generation): ALU performs operation or calculates memory address.
- MEM (Memory Access): Access Data Memory for loads/stores.
- WB (Write Back): Write result to Register File.
- Is this the correct partitioning? As seen in the single-cycle analysis, the delays for these conceptual stages are often not uniform (e.g., memory access might take longer than register read). Balancing these stage delays is crucial but difficult.
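To see why balancing matters, consider a back-of-the-envelope comparison; the per-stage delays below are hypothetical numbers chosen only for illustration:

```python
# Hypothetical per-stage delays in picoseconds (not measured values).
stage_delays = {"IF": 200, "ID": 100, "EX": 120, "MEM": 250, "WB": 100}

single_cycle_time = sum(stage_delays.values())     # 770 ps: one long cycle
pipelined_cycle_time = max(stage_delays.values())  # 250 ps: slowest stage

print(single_cycle_time / pipelined_cycle_time)    # ~3.1x, not the ideal 5x
# The ideal 5x speedup would require all five stages to take 770/5 = 154 ps.
```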
Enabling Pipelined Processing: Pipeline Registers
To allow multiple instructions to be in different stages simultaneously, we need to isolate the stages. This is done by adding Pipeline Registers between stages. These registers latch all the relevant data and control signals associated with an instruction at the end of a stage, and pass them to the next stage on the next clock cycle.
These pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) hold the state necessary for an instruction to progress through the pipeline independently of the instructions in other stages. For example, the IF/ID register holds the fetched instruction and the incremented PC, allowing the Fetch stage to immediately start fetching the next instruction in the next cycle, while the Decode stage works on the instruction stored in IF/ID.
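Behaviorally, a pipeline register is just a clocked latch for everything a stage hands to its successor. A minimal sketch of the IF/ID register, with illustrative field names and a write-enable input (used later for stalling):

```python
from dataclasses import dataclass

@dataclass
class IFIDRegister:
    """Latches the Fetch stage's outputs for the Decode stage."""
    instruction: int = 0   # instruction word (0 here stands in for a bubble/NOP)
    pc_plus_4: int = 0     # incremented PC, needed downstream (e.g., by branches)

    def tick(self, write_enable: bool, new_instruction: int, new_pc_plus_4: int):
        """Clock edge: capture new values unless a stall disables the write."""
        if write_enable:
            self.instruction = new_instruction
            self.pc_plus_4 = new_pc_plus_4

    def flush(self):
        """Clear to a bubble (e.g., when a stalled or misfetched slot is squashed)."""
        self.instruction = 0
        self.pc_plus_4 = 0

# Fetch can latch instruction i+1 while Decode consumes instruction i:
if_id = IFIDRegister()
if_id.tick(write_enable=True, new_instruction=0x00A00093, new_pc_plus_4=4)
```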
Pipelined Operation: Data Flow
As an instruction moves through the pipeline, the data it needs (e.g., register values, immediate operands, memory addresses, memory data) is passed along in the pipeline registers to the stage where it is needed.
Pipelined Operation: Control Flow
Similarly, the control signals needed for an instruction are generated (typically in the ID stage, like in our multi-cycle FSM control) and passed down the pipeline registers along with the instruction’s data. Each stage uses the control signals relevant to that stage from the pipeline register arriving at its input.
The pipeline registers ensure that the correct control signals for a given instruction arrive at the correct stage at the correct time.
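One common organization, shown here as a sketch rather than the lecture’s exact design, is to generate the full control word in ID and let each pipeline register carry forward only the signals still needed downstream:

```python
# Control word generated in ID; each stage strips off the group it consumes.
control = {
    "EX":  {"alu_op": "add", "alu_src": "imm"},
    "MEM": {"mem_read": False, "mem_write": False},
    "WB":  {"reg_write": True, "mem_to_reg": False},
}

# ID/EX carries all three groups; EX/MEM drops the EX group once the
# Execute stage has used it; MEM/WB keeps only the WB group.
id_ex  = dict(control)
ex_mem = {k: v for k, v in id_ex.items() if k in ("MEM", "WB")}
mem_wb = {k: v for k, v in ex_mem.items() if k == "WB"}
```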
Instruction Pipeline Issues
As we hinted earlier, the instruction pipeline is not perfectly ideal like the laundry or automobile assembly line. Several properties are violated, leading to performance challenges:
- Identical Operations: Not all instructions are identical. They have different operations (ALU, Load, Store, Branch, Jump) and thus utilize different stages and resources. This leads to external fragmentation, where some stages might be idle for certain instructions (e.g., a Jump instruction doesn’t use the MEM or WB stages for data access/writeback). Forcing all instructions through all stages (to simplify control) means some stages are wasted for some instructions.
- Uniform Suboperations: The work in each stage might not have uniform latency. As seen in our single-cycle analysis, memory access is often slower than ALU operations or register access. In a synchronous pipeline, the clock cycle time is set by the slowest stage. This leads to internal fragmentation, where faster stages sit idle for part of the clock cycle, waiting for the slowest stage to complete.
- Independent Operations: Instructions are not always independent. They often have dependencies, particularly data dependencies (one instruction needs a value produced by a prior instruction) and control dependencies (a branch instruction determines the address of the next instruction to fetch). These dependencies disrupt the smooth flow of the pipeline and lead to pipeline stalls.
Issues in Pipeline Design (Detailed)
These violations give rise to the core issues in pipeline design:
- Balancing Work in Pipeline Stages: Designing the stages such that their latencies are as close as possible to minimize the clock cycle time.
- Keeping the Pipeline Correct, Moving, and Full: This is the biggest challenge, especially in the presence of events that disrupt flow. It requires handling:
  - Dependencies: Data and Control dependencies.
  - Resource Contention: When multiple instructions need the same resource in the same cycle.
  - Long-Latency Operations: Multi-cycle operations like memory accesses.
  - Handling Exceptions and Interrupts: Events that require stopping normal execution and switching to a handler.
- Advanced Techniques: Improving throughput further by minimizing stalls and other inefficiencies.
Causes of Pipeline Stalls
As listed earlier, a pipeline may stall due to:
- Resource Contention: Multiple instructions need the same hardware resource concurrently, and that resource is not duplicated or multi-ported sufficiently.
- Dependencies: An instruction needs a result from a prior instruction that is not yet available.
- Long-Latency (Multi-Cycle) Operations: An instruction in a stage takes longer than one clock cycle, blocking the progress of subsequent instructions into that stage.
Dependencies and Their Types (Revisited)
Dependencies, sometimes called hazards, are crucial to understand. They dictate the necessary ordering of instructions.
- Data Dependencies:
  - Flow Dependence (RAW - Read After Write): A later instruction reads a register that a prior instruction writes. Example: R2 = R1 + 1, then R3 = R2 + 2. Instruction 2 depends on Instruction 1’s result in R2. This is a true dependence on the value.
  - Anti Dependence (WAR - Write After Read): A later instruction writes a register that a prior instruction reads. Example: R3 = R1 + 1, then R1 = R4 + 2. Instruction 2 writes R1 after Instruction 1 reads R1. This is a false dependence due to register name reuse.
  - Output Dependence (WAW - Write After Write): A later instruction writes a register that a prior instruction also writes. Example: R3 = R1 + 1, then R3 = R4 + 2. Instruction 2 writes R3 after Instruction 1 writes R3. This is also a false dependence due to register name reuse. (A mechanical check for all three cases is sketched after this list.)
- Control Dependence: The execution of an instruction depends on the outcome of a prior control-flow instruction (e.g., whether a branch is taken or not). This determines which instructions are fetched next.
- Resource Contention: While sometimes called “resource dependence,” this is distinct as it’s about hardware limitations, not the inherent program semantics.
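All three data-dependence cases can be checked mechanically by comparing the register sets of two instructions. A minimal sketch, with a made-up instruction representation:

```python
def classify_dependences(earlier, later):
    """Return the dependence types between two instructions, each given as
    (writes, reads) register sets, e.g. ({'R2'}, {'R1'}) for R2 = R1 + 1."""
    earlier_writes, earlier_reads = earlier
    later_writes, later_reads = later
    deps = []
    if earlier_writes & later_reads:
        deps.append("RAW (flow, true)")
    if earlier_reads & later_writes:
        deps.append("WAR (anti, false)")
    if earlier_writes & later_writes:
        deps.append("WAW (output, false)")
    return deps

# R2 = R1 + 1  followed by  R3 = R2 + 2  ->  RAW on R2
print(classify_dependences(({"R2"}, {"R1"}), ({"R3"}, {"R2"})))
```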
Handling Resource Contention (Revisited)
Resource contention (or structural hazards) occurs when the hardware doesn’t have enough copies of a resource to support concurrent instructions needing it in the same cycle.
- Solution 1: Eliminate the Cause: Duplicate the resource (e.g., separate Instruction and Data caches/memories, multiple register file ports, multiple ALUs if needed concurrently in different stages).
- Solution 2: Detect and Stall: If duplication is too costly, detect when contention occurs and stall one of the contending instructions. Typically the instruction in the earlier pipeline stage (the younger one in program order) waits while the older instruction uses the resource (illustrated in the sketch below).
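As a toy illustration of detect-and-stall, suppose instruction fetch and data access share a single memory port. A simplified first-order count, assuming a 5-stage pipeline with no other stalls (so instruction i is in MEM exactly when instruction i+3 wants to fetch):

```python
def fetch_stalls(is_mem_instr):
    """Count cycles in which Fetch must wait because the single shared
    memory port is occupied by an older load/store in its MEM stage.
    is_mem_instr[i] is True if instruction i is a load or store."""
    return sum(
        1
        for i, uses_mem in enumerate(is_mem_instr)
        if uses_mem and i + 3 < len(is_mem_instr)
    )

# A load at position 0 collides with the fetch of instruction 3:
print(fetch_stalls([True, False, False, False, False]))  # 1
```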
Handling Data Dependencies (Focus on Flow/RAW)
Flow dependencies (RAW) are critical because they involve the correct transfer of values. Anti (WAR) and Output (WAW) dependencies are often easier to handle, especially in out-of-order processors, by techniques like Register Renaming (which we will touch on later). For now, we focus on RAW.
RAW Dependence Problem: A later instruction needs a value produced by an earlier instruction, but the value isn’t available when the later instruction reaches the stage where it needs that value (e.g., ID/EX stage for register operands).
Solutions for RAW Dependencies
- Detect and Wait (Stall):
  - Detect the RAW dependence (e.g., in the ID stage, compare source register IDs of the current instruction with destination register IDs of instructions in later stages).
  - If a dependence is detected where the value is not yet available, stall the dependent instruction in its current stage (e.g., ID).
  - This stall must also stop the fetch of new instructions into the pipeline (disable PC update and IF/ID register write enable).
  - The stages after the stalled instruction continue to operate and drain, introducing “bubbles” (empty or invalid pipeline slots) into the pipeline.
  - Hardware for Stalling: Implementing stalls requires adding control logic (Dependence Detection Logic) and hardware support in the pipeline registers (enable/clear inputs). When a stall is asserted for a stage (e.g., ID), the write enable for the register before that stage (IF/ID) is disabled, and the PC update is disabled. An “invalid” signal or flush mechanism clears the pipeline register after the stalled stage to insert bubbles.
- Detect and Forward/Bypass:
  - Waiting until the value is written back to the register file (as in the simple stall approach) can lead to significant stalls (e.g., 3 cycles for a dependence on the immediately preceding instruction in our 5-stage pipeline).
  - Observation: The correct value is often available before it is written back to the register file (e.g., at the output of the ALU for R-type instructions and address calculations, or at the output of the Data Memory for loads).
  - Idea: Add data forwarding paths (wires and multiplexers) to bypass the register file and send the value directly from the stage where it becomes available (ALU output, Memory output) to the stage where it is needed (ALU input).
  - Mechanism: The Dependence Detection Logic checks for RAW dependencies. If one is detected, it determines which later pipeline register holds the correct value. Instead of stalling the dependent instruction, it asserts control signals on the multiplexers at the dependent instruction’s required input (e.g., the ALU inputs) to select the value from the appropriate forwarding path rather than from the Register File output.
  - Forwarding is not always possible: Some values are not available early enough to be forwarded without lengthening the clock cycle. For example, a Load instruction reads data from memory in the MEM stage, so the value is not available until the end of that stage. If the very next instruction needs this value in its EX stage, it cannot reach the EX stage’s ALU inputs in time without a same-cycle bypass path that would lengthen the clock. This load-use (MEM-to-EX) dependence typically requires a one-cycle stall even with forwarding.
- Combined Approach: A real pipeline uses a combination of forwarding (to resolve most RAW dependencies) and stalling (for dependencies that forwarding cannot cover, like the load-use case, or for resource contention). A behavioral sketch of such a combined hazard/forwarding unit follows this list.
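Putting forwarding and stalling together, here is a behavioral sketch of a combined hazard/forwarding unit for a 5-stage pipeline. The signal names follow common textbook conventions, and the register-0-is-hardwired-to-zero assumption is mine, not necessarily the lecture’s datapath:

```python
def forward_select(rs, ex_mem_rd, ex_mem_reg_write, mem_wb_rd, mem_wb_reg_write):
    """Pick the source for one ALU operand that nominally reads register rs.
    The youngest older instruction with a matching destination wins.
    (Assumes register 0 is hardwired to zero and never forwarded.)"""
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == rs:
        return "EX/MEM"   # ALU result of the instruction one ahead
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == rs:
        return "MEM/WB"   # writeback value of the instruction two ahead
    return "REGFILE"      # no hazard: use the register file output

def load_use_stall(id_ex_mem_read, id_ex_rd, if_id_rs1, if_id_rs2):
    """A load's value appears only at the end of MEM, so an immediately
    following consumer stalls one cycle even with full forwarding."""
    return id_ex_mem_read and id_ex_rd != 0 and id_ex_rd in (if_id_rs1, if_id_rs2)

# lw R2, 0(R1) followed by add R4, R2, R3:
print(load_use_stall(True, 2, 2, 3))         # True -> insert one bubble
# One cycle later the load is in WB, so its value forwards from MEM/WB:
print(forward_select(2, 0, False, 2, True))  # 'MEM/WB'
```

After the single bubble, the consumer’s EX stage overlaps the load’s WB stage, which is exactly the MEM/WB forwarding case the first function selects.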
Conclusion
Today, we introduced the powerful concept of Pipelining, demonstrating how overlapping instruction execution in different stages can significantly increase processor throughput. We identified critical issues that prevent pipelines from achieving ideal performance, including dependencies (data and control), resource contention, and non-uniform stage latencies.
Continue here: 13 Pipelined Processor Design - Data & Control Dependence Handling